An Empirical Evaluation of Deep Learning Architectures for Music Emotion Recognition Using Audio Data


dc.contributor.author Amrithaa, T.
dc.contributor.author Himaathri, P.
dc.contributor.author Gunawardhana, M.U.K.
dc.contributor.author Senthooran, V.
dc.date.accessioned 2026-03-24T12:01:32Z
dc.date.available 2026-03-24T12:01:32Z
dc.date.issued 2026
dc.identifier.uri http://drr.vau.ac.lk/handle/123456789/2024
dc.description.abstract Music Emotion Recognition (MER) systems aim to automatically recognize the emotional content of music. MER has gained significant interest in recent years due to its applicability in music recommendation systems, intelligent music composition, and music therapy. Although recent studies have shown promising results, emotion classification remains challenging, especially for Tamil songs. In this paper, a comparative analysis of deep learning methods for audio-based music emotion recognition is conducted on a carefully constructed, balanced dataset with four emotion classes (happy, calm, angry, and sad). Each class comprises 400 three-second audio clips extracted from 20 unique Tamil movie songs, resulting in a total of 1,600 samples. Stratified splitting was used to divide the data into training, validation, and test sets. Log-mel spectrograms were used as the input representation, and each audio sample was standardized individually. On-the-fly masking and noise augmentation were applied during training to evaluate the effectiveness of data augmentation in MER systems. The transfer learning-based approach, using the pretrained YAMNet_Dense model, was compared against three other deep learning-based models: a convolutional neural network (CNN_Spec), a convolutional recurrent neural network (CRNN_Spec), and a CNN with a self-attention mechanism (CNN_Attention_Spec), all of which use the spectrogram representation. The experimental results demonstrate that the transfer learning-based approach outperforms the other three deep learning-based models, achieving an accuracy of 81.67% on the test set and a macro-F1 measure of 0.8165. The CNN, CRNN, and attention-based models performed poorly, achieving macro-F1 measures below 0.38, indicating that these models are highly sensitive to the limited amount of training data and to the spectral overlap between classes. The results also indicate that data augmentation does not consistently enhance the performance of the evaluated deep learning models. Overall, the findings highlight the effectiveness of transfer learning for music emotion recognition under limited-data conditions and emphasize the importance of robust feature representations and comprehensive evaluation beyond overall accuracy. en_US
dc.language.iso en en_US
dc.publisher Korea Database Strategy Society (KDSS) en_US
dc.subject Audio databases en_US
dc.subject Emotion recognition en_US
dc.subject Feature extraction en_US
dc.subject Music information retrieval en_US
dc.subject Music emotion en_US
dc.title An Empirical Evaluation of Deep Learning Architectures for Music Emotion Recognition Using Audio Data en_US
dc.type Conference abstract en_US
dc.identifier.proceedings 32nd International Conference on IT Applications and Management en_US
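
The audio front end described in the abstract (log-mel spectrogram input, per-clip standardization, and on-the-fly mask and noise augmentation) could be sketched as below. The sample rate, FFT size, hop length, mel-band count, mask widths, and noise level are illustrative assumptions and are not values reported by the authors.

    # Minimal sketch of the preprocessing and augmentation pipeline described in
    # the abstract. All numeric parameters below are assumptions for illustration.
    import numpy as np
    import librosa

    SR = 22050        # assumed sample rate
    N_MELS = 128      # assumed number of mel bands
    N_FFT = 2048      # assumed FFT window
    HOP = 512         # assumed hop length

    def log_mel(path, duration=3.0):
        """Load a 3-second clip and return a per-clip standardized log-mel spectrogram."""
        y, _ = librosa.load(path, sr=SR, duration=duration)
        mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                             hop_length=HOP, n_mels=N_MELS)
        spec = librosa.power_to_db(mel, ref=np.max)
        # Standardize each audio sample individually, as stated in the abstract.
        return (spec - spec.mean()) / (spec.std() + 1e-8)

    def augment(spec, rng=np.random.default_rng()):
        """On-the-fly augmentation: random frequency/time masking plus additive noise."""
        aug = spec.copy()
        f0 = rng.integers(0, max(1, aug.shape[0] - 16))
        aug[f0:f0 + rng.integers(1, 16), :] = 0.0       # frequency mask
        t0 = rng.integers(0, max(1, aug.shape[1] - 16))
        aug[:, t0:t0 + rng.integers(1, 16)] = 0.0       # time mask
        aug += rng.normal(0.0, 0.05, size=aug.shape)    # additive noise
        return aug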
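Likewise, a minimal sketch of a YAMNet-based transfer-learning classifier in the spirit of the YAMNet_Dense model named in the abstract: the pretrained YAMNet model is loaded from TensorFlow Hub and used as a frozen embedding extractor, while the pooling strategy, dense-head architecture, and training settings are assumptions for illustration only.

    # Hedged sketch of transfer learning with pretrained YAMNet embeddings.
    # The dense head and training setup are assumed, not taken from the paper.
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    import librosa

    yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
    EMOTIONS = ["happy", "calm", "angry", "sad"]

    def embed(path):
        """Return a clip-level embedding: mean of YAMNet's 1024-d frame embeddings."""
        # YAMNet expects mono float32 audio at 16 kHz in [-1, 1].
        waveform, _ = librosa.load(path, sr=16000, duration=3.0)
        _, embeddings, _ = yamnet(waveform.astype(np.float32))
        return tf.reduce_mean(embeddings, axis=0).numpy()

    # Small dense classification head on top of the frozen embeddings (assumed).
    head = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(1024,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(len(EMOTIONS), activation="softmax"),
    ])
    head.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
    # head.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30)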



This item appears in the following Collection(s)

  • IITAMS - 2026 [39]
    International Conference on IT Applications and Management
