Abstract:
Music Emotion Recognition (MER) aims to automatically recognize the emotional content of music. MER
has gained significant interest in recent years owing to its applications in music recommendation systems, intelligent
music composition, and music therapy. Although recent studies have shown promising results, emotion
classification remains challenging, particularly for Tamil songs. In this paper, a comparative analysis of deep learning
methods for audio-based music emotion recognition is conducted using a carefully constructed, balanced dataset. An MER
dataset was constructed with four emotion classes (happy, calm, angry, and sad). Each class comprises 400 three-second
audio clips extracted from 20 unique Tamil movie songs, resulting in a total of 1,600 samples. Stratified splitting was
applied to split the data into training, validation, and testing sets. Log mel spectrograms were used as input representations
for audio data, and each audio sample was standardized individually. On-the-fly masking and noise augmentation were applied during
training to evaluate the effectiveness of data augmentation in MER systems.
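The paper does not specify an implementation, but the preprocessing it describes (per-sample standardization of log mel spectrograms, plus on-the-fly time/frequency masking and additive noise) can be sketched as follows. All parameter values here (mask widths, noise level, a 64-mel by ~130-frame clip shape) are illustrative assumptions, not values from the study; in practice the log mel spectrogram itself would come from an audio library such as librosa.

```python
import numpy as np

def standardize(log_mel):
    """Per-sample standardization: zero mean, unit variance within one clip."""
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

def augment(log_mel, rng, f_width=8, t_width=16, noise_std=0.05):
    """On-the-fly augmentation: one frequency mask, one time mask, Gaussian noise.
    Widths and noise level are illustrative, not taken from the paper."""
    spec = log_mel.copy()
    f0 = rng.integers(0, spec.shape[0] - f_width)
    spec[f0:f0 + f_width, :] = 0.0          # frequency mask (rows = mel bands)
    t0 = rng.integers(0, spec.shape[1] - t_width)
    spec[:, t0:t0 + t_width] = 0.0          # time mask (columns = frames)
    return spec + rng.normal(0.0, noise_std, spec.shape)

rng = np.random.default_rng(0)
clip = rng.normal(size=(64, 130))  # stand-in for a log mel spectrogram of a ~3 s clip
x = standardize(clip)
x_aug = augment(x, rng)
```

Because the masks and noise are sampled fresh each time `augment` is called, every training epoch sees a different perturbation of the same clip, which is what "on-the-fly" augmentation refers to.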
The transfer learning-based approach, using the pretrained YAMNet_Dense model, was compared and tested against
three different deep learning-based models, including the convolutional neural network (CNN_Spec), a convolutional
recurrent neural network (CRNN_Spec), and a CNN with a self-attention mechanism (CNN_Attention_Spec), all of which
utilize the spectrogram representation. The experimental results demonstrate that the proposed transfer learning-based
approach outperforms the other three deep learning-based models, achieving an accuracy of 81.67% and a
macro-F1 measure of 0.8165 on the test set. The CNN, CRNN, and attention-based models
performed poorly, with macro-F1 measures below 0.38, indicating that these models are highly sensitive to
the limited amount of training data and the spectral overlap between classes. The results indicate that data augmentation
does not consistently enhance the performance of the evaluated deep learning models. In contrast, the findings highlight
the effectiveness of transfer learning for music emotion recognition under limited data conditions and emphasize the
importance of robust feature representations and comprehensive evaluation beyond overall accuracy.
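The macro-F1 measure reported above is the unweighted mean of per-class F1 scores, which is why it exposes weak per-class performance that overall accuracy can hide. A minimal numpy sketch of that computation (a standalone illustration, not the paper's evaluation code; libraries such as scikit-learn provide an equivalent `f1_score(..., average="macro")`):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=4):
    """Macro-F1: average the per-class F1 scores with equal weight per class."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return float(np.mean(scores))

# Toy example with the four emotion classes (0=happy, 1=calm, 2=angry, 3=sad):
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 1])
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally regardless of support, a model that collapses onto one or two classes scores a low macro-F1 even when its raw accuracy looks acceptable.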