Abstract:
Music Emotion Recognition (MER) aims to automatically recognize the emotional content of music. MER
has gained significant interest in recent years owing to its applications in music recommendation systems, intelligent
music composition, and music therapy. Although recent studies have shown promising results, emotion
classification remains challenging, particularly for Tamil songs. In this paper, a comparative analysis of deep learning
methods for audio-based music emotion recognition is conducted using a carefully constructed, balanced dataset. An MER
dataset was constructed with four emotion classes (happy, calm, angry, and sad). Each class comprises 400 three-second
audio clips extracted from 20 unique Tamil movie songs, resulting in a total of 1,600 samples. Stratified splitting was
applied to split the data into training, validation, and testing sets. Log mel spectrograms were used as input representations
for audio data, and each audio sample was standardized individually. On-the-fly masking and noise augmentation were applied during
training to evaluate the effectiveness of data augmentation in MER systems.
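The paper does not specify an implementation, but the preprocessing it describes (per-sample standardization of log mel spectrograms, plus on-the-fly time/frequency masking and additive noise) can be sketched as follows. All parameter values here (mask widths, noise level, a 64-mel by ~130-frame clip shape) are illustrative assumptions, not values from the study; in practice the log mel spectrogram itself would come from an audio library such as librosa.

```python
import numpy as np

def standardize(log_mel):
    """Per-sample standardization: zero mean, unit variance within one clip."""
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

def augment(log_mel, rng, f_width=8, t_width=16, noise_std=0.05):
    """On-the-fly augmentation: one frequency mask, one time mask, Gaussian noise.
    Widths and noise level are illustrative, not taken from the paper."""
    spec = log_mel.copy()
    f0 = rng.integers(0, spec.shape[0] - f_width)
    spec[f0:f0 + f_width, :] = 0.0          # frequency mask (rows = mel bands)
    t0 = rng.integers(0, spec.shape[1] - t_width)
    spec[:, t0:t0 + t_width] = 0.0          # time mask (columns = frames)
    return spec + rng.normal(0.0, noise_std, spec.shape)

rng = np.random.default_rng(0)
clip = rng.normal(size=(64, 130))  # stand-in for a log mel spectrogram of a ~3 s clip
x = standardize(clip)
x_aug = augment(x, rng)
```

Because the masks and noise are sampled fresh each time `augment` is called, every training epoch sees a different perturbation of the same clip, which is what "on-the-fly" augmentation refers to.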
The transfer learning-based approach, using the pretrained YAMNet_Dense model, was compared and tested against
three different deep learning-based models, including the convolutional neural network (CNN_Spec), a convolutional
recurrent neural network (CRNN_Spec), and a CNN with a self-attention mechanism (CNN_Attention_Spec), all of which
utilize the spectrogram representation. The experimental results demonstrate that the proposed transfer learning-based
approach outperforms the other three deep learning-based models, achieving an accuracy of 81.67% and a
macro-F1 measure of 0.8165 on the test set. The CNN, CRNN, and attention-based models
performed poorly, with macro-F1 measures below 0.38, indicating that these models are highly sensitive to
the limited amount of training data and the spectral overlap between classes. The results indicate that data augmentation
does not consistently enhance the performance of the evaluated deep learning models. In contrast, the findings highlight
the effectiveness of transfer learning for music emotion recognition under limited data conditions and emphasize the
importance of robust feature representations and comprehensive evaluation beyond overall accuracy.
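The macro-F1 measure reported above is the unweighted mean of per-class F1 scores, which is why it exposes weak per-class performance that overall accuracy can hide. A minimal numpy sketch of that computation (a standalone illustration, not the paper's evaluation code; libraries such as scikit-learn provide an equivalent `f1_score(..., average="macro")`):

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=4):
    """Macro-F1: average the per-class F1 scores with equal weight per class."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return float(np.mean(scores))

# Toy example with the four emotion classes (0=happy, 1=calm, 2=angry, 3=sad):
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 1])
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally regardless of support, a model that collapses onto one or two classes scores a low macro-F1 even when its raw accuracy looks acceptable.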