An Empirical Evaluation of Deep Learning Architectures for Music Emotion Recognition Using Audio Data


dc.contributor.author Amrithaa, T.
dc.contributor.author Himaathri, P.
dc.contributor.author Gunawardhana, M.U.K.
dc.contributor.author Senthooran, V.
dc.date.accessioned 2026-03-24T12:01:32Z
dc.date.available 2026-03-24T12:01:32Z
dc.date.issued 2026
dc.identifier.uri http://drr.vau.ac.lk/handle/123456789/2024
dc.description.abstract Music Emotion Recognition (MER) systems aim to automatically recognize the emotional content of music. MER has gained significant interest in recent years due to its applicability in music recommendation systems, intelligent music composition, and music therapy. Although recent studies have shown promising results, emotion classification remains challenging, especially for Tamil songs. In this paper, a comparative analysis of deep learning methods for audio-based music emotion recognition is conducted on a carefully constructed, balanced dataset with four emotion classes (happy, calm, angry, and sad). Each class comprises 400 three-second audio clips extracted from 20 unique Tamil movie songs, resulting in a total of 1,600 samples. Stratified splitting was used to divide the data into training, validation, and test sets. Log-mel spectrograms were used as the input representation, and each audio sample was standardized individually. On-the-fly masking and noise augmentation were applied during training to evaluate the effectiveness of data augmentation in MER systems. The transfer learning-based approach, using the pretrained YAMNet_Dense model, was compared against three other deep learning-based models: a convolutional neural network (CNN_Spec), a convolutional recurrent neural network (CRNN_Spec), and a CNN with a self-attention mechanism (CNN_Attention_Spec), all of which use the spectrogram representation. The experimental results demonstrate that the transfer learning-based approach outperforms the other three deep learning-based models, achieving an accuracy of 81.67% on the test set and a macro-F1 measure of 0.8165. The CNN, CRNN, and attention-based models performed poorly, achieving macro-F1 measures below 0.38, indicating that these models are highly sensitive to the limited amount of training data and to the spectral overlap between classes. The results also indicate that data augmentation does not consistently enhance the performance of the evaluated deep learning models. Overall, the findings highlight the effectiveness of transfer learning for music emotion recognition under limited-data conditions and emphasize the importance of robust feature representations and comprehensive evaluation beyond overall accuracy. en_US
dc.language.iso en en_US
dc.publisher Korea Database Strategy Society (KDSS) en_US
dc.subject Audio databases en_US
dc.subject Emotion recognition en_US
dc.subject Feature extraction en_US
dc.subject Music information retrieval en_US
dc.subject Music emotion en_US
dc.title An Empirical Evaluation of Deep Learning Architectures for Music Emotion Recognition Using Audio Data en_US
dc.type Conference abstract en_US
dc.identifier.proceedings 32nd International Conference on IT Applications and Management en_US
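
The audio front end described in the abstract (log-mel spectrogram input, per-clip standardization, and on-the-fly mask and noise augmentation) could be sketched as below. The sample rate, FFT size, hop length, mel-band count, mask widths, and noise level are illustrative assumptions and are not values reported by the authors.

    # Minimal sketch of the preprocessing and augmentation pipeline described in
    # the abstract. All numeric parameters below are assumptions for illustration.
    import numpy as np
    import librosa

    SR = 22050        # assumed sample rate
    N_MELS = 128      # assumed number of mel bands
    N_FFT = 2048      # assumed FFT window
    HOP = 512         # assumed hop length

    def log_mel(path, duration=3.0):
        """Load a 3-second clip and return a per-clip standardized log-mel spectrogram."""
        y, _ = librosa.load(path, sr=SR, duration=duration)
        mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                             hop_length=HOP, n_mels=N_MELS)
        spec = librosa.power_to_db(mel, ref=np.max)
        # Standardize each audio sample individually, as stated in the abstract.
        return (spec - spec.mean()) / (spec.std() + 1e-8)

    def augment(spec, rng=np.random.default_rng()):
        """On-the-fly augmentation: random frequency/time masking plus additive noise."""
        aug = spec.copy()
        f0 = rng.integers(0, max(1, aug.shape[0] - 16))
        aug[f0:f0 + rng.integers(1, 16), :] = 0.0       # frequency mask
        t0 = rng.integers(0, max(1, aug.shape[1] - 16))
        aug[:, t0:t0 + rng.integers(1, 16)] = 0.0       # time mask
        aug += rng.normal(0.0, 0.05, size=aug.shape)    # additive noise
        return aug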
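Likewise, a minimal sketch of a YAMNet-based transfer-learning classifier in the spirit of the YAMNet_Dense model named in the abstract: the pretrained YAMNet model is loaded from TensorFlow Hub and used as a frozen embedding extractor, while the pooling strategy, dense-head architecture, and training settings are assumptions for illustration only.

    # Hedged sketch of transfer learning with pretrained YAMNet embeddings.
    # The dense head and training setup are assumed, not taken from the paper.
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    import librosa

    yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
    EMOTIONS = ["happy", "calm", "angry", "sad"]

    def embed(path):
        """Return a clip-level embedding: mean of YAMNet's 1024-d frame embeddings."""
        # YAMNet expects mono float32 audio at 16 kHz in [-1, 1].
        waveform, _ = librosa.load(path, sr=16000, duration=3.0)
        _, embeddings, _ = yamnet(waveform.astype(np.float32))
        return tf.reduce_mean(embeddings, axis=0).numpy()

    # Small dense classification head on top of the frozen embeddings (assumed).
    head = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1024,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(len(EMOTIONS), activation="softmax"),
    ])
    head.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
    # head.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30)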



This item appears in the following Collection(s)

  • IITAMS - 2026 [39]
    International Conference on IT Applications and Management
