Music Genre Classification using Machine Learning Techniques (1804.01149v1)
Abstract: Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end, to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this multi-class classification task are identified. The experiments are conducted on the Audio set data set and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.
Summary
- The paper demonstrates that CNN-based approaches significantly outperform traditional classifiers by learning features directly from MEL spectrogram images.
- It evaluates two CNN strategies—transfer learning and fine tuning—highlighting robust performance improvements over hand-crafted feature methods.
- An ensemble of CNN and XGBoost models achieves the highest AUC and accuracy, underscoring the benefit of combining diverse techniques.
This paper, "Music Genre Classification using Machine Learning Techniques" (Music Genre Classification using Machine Learning Techniques, 2018), explores different machine learning approaches for automatically categorizing music based on its genre. The practical motivation behind this work is to help organize large music databases and enhance features in audio streaming services like Spotify and iTunes. The paper compares two main strategies: an end-to-end deep learning approach using Convolutional Neural Networks (CNNs) on audio spectrograms and a traditional machine learning approach using hand-crafted audio features fed into various classifiers.
The dataset used is a subset of the Audio Set dataset [gemmeke2017audio], specifically focusing on audio clips tagged with one of seven music genres: Pop, Rock, Hip Hop, Techno, Rhythm Blues, Vocal, and Reggae. The dataset contains a total of 40,540 10-second clips sourced from YouTube videos. A practical challenge in using this dataset is that the raw audio is not provided directly; instead, YouTube IDs and timestamps are given. Implementing this requires a data retrieval pipeline using tools like `youtube-dl` to download the videos and `ffmpeg` to extract the audio segments in `.wav` format. Even for 10-second clips, this results in a significant amount of data (around 34 GB). The audio is then pre-processed with a pre-emphasis filter, y(t) = x(t) − α·x(t−1) with α = 0.97, to boost high frequencies and improve the signal-to-noise ratio.
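As a concrete reference, here is a minimal NumPy sketch of the pre-emphasis step; the function name and signature are illustrative and not taken from the paper's released code:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Apply the pre-emphasis filter y(t) = x(t) - alpha * x(t-1)."""
    # Keep the first sample unchanged; subtract the scaled previous sample elsewhere.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```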
The first approach leverages deep learning by treating audio spectrograms as images. A spectrogram is a visual representation of the audio signal's frequency content over time. The paper specifically uses MEL spectrograms, generated with parameters like a sampling rate of 22050 Hz, a window size (`n_fft`) of 2048, a hop size of 512 (75% overlap), a Hann window function, and 96 MEL bins. These 216x216 spectrogram images are then used as input to a CNN. The architecture employed is based on VGG-16 [simonyan2014very], utilizing its convolutional base layers pretrained on ImageNet, followed by a custom feed-forward network with a 512-unit hidden layer.
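A minimal `librosa` sketch of the spectrogram computation with the stated parameters is shown below; resizing the resulting array to the 216x216 image input described in the paper (e.g., by rendering and rescaling) is omitted, and `pre_emphasis` refers to the hypothetical helper sketched above:

```python
import numpy as np
import librosa

def mel_spectrogram(path: str, sr: int = 22050) -> np.ndarray:
    """Compute a log-scaled MEL spectrogram (96 bins, n_fft=2048, hop=512, Hann window)."""
    y, sr = librosa.load(path, sr=sr)      # resample to 22050 Hz
    y = pre_emphasis(y)                    # hypothetical helper from the sketch above
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=512, window="hann", n_mels=96
    )
    return librosa.power_to_db(S, ref=np.max)  # log scale for the image-like input
```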
Two implementation settings for the CNN are explored:
- Transfer Learning: The weights of the VGG-16 convolutional base are frozen, and only the weights of the new feed-forward layers are trained on the music genre classification task.
- Fine Tuning: The pretrained VGG-16 weights are used as an initialization, but all weights in both the convolutional base and the new layers are allowed to be updated during training.
To prevent overfitting, L2 regularization (λ=0.001) and Dropout (rate of 0.3) are applied. The models are trained using the ADAM optimizer with a batch size of 32 for 10 epochs, using a 90/5/5 train/validation/test split. Model selection is based on performance on the validation set, typically choosing the epoch with the lowest validation loss and highest validation accuracy (found to be epoch 4 in their experiments). A simple feed-forward network taking flattened spectrogram pixels as input serves as a baseline to demonstrate the advantage of CNNs for this image-like data.
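The following Keras sketch illustrates both CNN settings under the stated hyperparameters (512-unit hidden layer, L2 λ=0.001, Dropout 0.3, Adam, batch size 32, 10 epochs); the exact head architecture and preprocessing in the authors' code may differ:

```python
import tensorflow as tf

NUM_GENRES = 7  # Pop, Rock, Hip Hop, Techno, Rhythm Blues, Vocal, Reggae

def build_model(fine_tune: bool = False) -> tf.keras.Model:
    """VGG-16 convolutional base plus a small feed-forward head.

    fine_tune=False freezes the base (transfer learning);
    fine_tune=True lets all weights update (fine tuning).
    """
    base = tf.keras.applications.VGG16(
        weights="imagenet", include_top=False, input_shape=(216, 216, 3)
    )
    base.trainable = fine_tune
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(
            512, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(0.001),
        ),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_GENRES, activation="softmax"),
    ])

# model = build_model(fine_tune=False)
# model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=32, epochs=10,
#           validation_data=(val_images, val_labels))
```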
The second approach relies on extracting hand-crafted features from the audio signals using the `librosa` Python library. These features are categorized into time-domain and frequency-domain features:
- Time-Domain: Central moments (mean, standard deviation, skewness, kurtosis of amplitude), Zero Crossing Rate (average and standard deviation across frames), Root Mean Square Energy (average and standard deviation across frames), and Tempo (Beats Per Minute).
- Frequency-Domain: Mel-Frequency Cepstral Coefficients (MFCCs - 20 coefficients), Chroma Features (mean and standard deviation of energy across 12 pitch classes), Spectral Centroid (mean and standard deviation), Spectral Bandwidth (mean and standard deviation), Spectral Contrast (mean and standard deviation across frequency bands), and Spectral Roll-off (mean and standard deviation).
For each spectral feature, the mean and standard deviation computed across fixed-size frames (2048 samples with a hop size of 512) are used as the final features. The resulting feature vector is then fed into various traditional machine learning classifiers (a feature-extraction sketch follows the list below):
- Logistic Regression (LR): Implemented using a one-vs-rest strategy for multi-class classification.
- Random Forest (RF): An ensemble method using bootstrap aggregation and random feature subsets for individual decision trees.
- Extreme Gradient Boosting (XGBoost): A sequential ensemble method that iteratively builds decision trees, focusing on errors from previous trees.
- Support Vector Machines (SVM): Uses an RBF kernel to map data into a higher dimension for linear separation, also implemented as one-vs-rest.
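A condensed sketch of this feature-engineering path, using `librosa` and the scikit-learn-style XGBoost API, is given below; the exact per-band aggregation (and hence the paper's 97-dimensional vector) may differ, and the data-loading names are placeholders:

```python
import numpy as np
import librosa
from scipy.stats import skew, kurtosis
from xgboost import XGBClassifier

SR, N_FFT, HOP = 22050, 2048, 512

def extract_features(y: np.ndarray, sr: int = SR) -> np.ndarray:
    """Framewise means/standard deviations of time- and frequency-domain features."""
    def mean_std(x):
        return [float(np.mean(x)), float(np.std(x))]

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=N_FFT, hop_length=HOP)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    feats = (
        [float(np.mean(y)), float(np.std(y)), float(skew(y)), float(kurtosis(y))]  # central moments
        + mean_std(librosa.feature.zero_crossing_rate(y, hop_length=HOP))
        + mean_std(librosa.feature.rms(y=y, hop_length=HOP))
        + [float(tempo)]                                                            # BPM
        + list(np.mean(mfcc, axis=1)) + list(np.std(mfcc, axis=1))                  # 20 MFCCs
        + list(np.mean(chroma, axis=1)) + list(np.std(chroma, axis=1))              # 12 pitch classes
        + mean_std(librosa.feature.spectral_centroid(y=y, sr=sr))
        + mean_std(librosa.feature.spectral_bandwidth(y=y, sr=sr))
        + mean_std(librosa.feature.spectral_contrast(y=y, sr=sr))
        + mean_std(librosa.feature.spectral_rolloff(y=y, sr=sr))
    )
    return np.asarray(feats, dtype=np.float32)

# X_train = np.stack([extract_features(clip) for clip in train_clips])  # placeholder data
# clf = XGBClassifier()
# clf.fit(X_train, y_train)          # y_train: integer genre labels 0..6
# proba = clf.predict_proba(X_test)  # per-genre probabilities for evaluation (AUC, etc.)
```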
Model performance is evaluated using Accuracy, F-score, and Area Under the Receiver Operating Characteristic curve (AUC). The results demonstrate that the CNN-based models (VGG-16 Transfer Learning and Fine Tuning) significantly outperform the traditional machine learning models trained on hand-crafted features, achieving AUC values around 0.89 compared to the best feature-based model (XGBoost) at 0.865. This highlights the power of CNNs to learn relevant features directly from the spectrogram representation. The baseline feed-forward network performs poorly (AUC 0.759), confirming the benefit of convolutional layers for processing the image-like spectrogram data. Among the feature-engineered models, XGBoost yields the best results.
An analysis of feature importance using the XGBoost model reveals that frequency-domain features, particularly MFCCs, are the most significant contributors to classification performance. Spectral contrast and tempo are also identified as important features. An ablation study shows that using only the top 30 most important features still achieves performance (AUC 0.845, Accuracy 0.55) close to that of all 97 hand-crafted features (AUC 0.865, Accuracy 0.59), suggesting potential for dimensionality reduction in feature-based approaches. Furthermore, a comparison confirms that frequency-domain features (AUC 0.857) are substantially more effective than time-domain features (AUC 0.731) for this task.
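A brief sketch of how such an importance ranking and top-30 subset could be obtained from a fitted XGBoost model (assuming `clf`, `feature_names`, and the training matrices from the previous sketch):

```python
import numpy as np
from xgboost import XGBClassifier

def top_k_features(clf: XGBClassifier, feature_names: list, k: int = 30):
    """Return the k most important (name, importance) pairs of a fitted model."""
    importances = clf.feature_importances_
    order = np.argsort(importances)[::-1][:k]
    return [(feature_names[i], float(importances[i])) for i in order]

# top30 = top_k_features(clf, feature_names)
# idx = [feature_names.index(name) for name, _ in top30]
# clf_small = XGBClassifier().fit(X_train[:, idx], y_train)  # ablation on the reduced feature set
```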
Analyzing the confusion matrices shows that both the best CNN and XGBoost models perform well on genres like 'Rock' but struggle to differentiate between similar genres such as 'Hip Hop' and 'Pop'. This suggests that some musical pieces may genuinely blend characteristics of multiple genres, making hard classification difficult.
Finally, an ensemble classifier combining the best CNN model (VGG-16 Transfer Learning) and the best feature-engineered model (XGBoost) by averaging their predicted probabilities achieves the highest AUC of 0.894 and Accuracy of 0.65, demonstrating the benefit of combining diverse models.
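A minimal sketch of this probability-averaging ensemble, assuming a fitted Keras CNN (`cnn_model`) and XGBoost classifier (`clf`) evaluated on aligned test clips:

```python
import numpy as np

def ensemble_proba(cnn_proba: np.ndarray, xgb_proba: np.ndarray) -> np.ndarray:
    """Average the per-genre probabilities of the two models (shape: [n_clips, 7])."""
    return (cnn_proba + xgb_proba) / 2.0

# cnn_proba = cnn_model.predict(spectrogram_batch)   # softmax outputs from the CNN
# xgb_proba = clf.predict_proba(feature_matrix)      # class probabilities from XGBoost
# genre_pred = np.argmax(ensemble_proba(cnn_proba, xgb_proba), axis=1)
```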
For practical implementation, this research suggests a pipeline involving:
- Data Acquisition: Retrieving audio segments from sources using tools like `youtube-dl` and `ffmpeg`.
- Pre-processing: Applying filters (like pre-emphasis) and converting audio to a suitable representation (spectrograms for CNNs, or hand-crafted features for the traditional classifiers). Spectrogram generation requires careful selection of parameters (`sr`, `n_fft`, `hop_size`, etc.), while feature extraction can be done efficiently using libraries like `librosa`.
- Model Training:
- For the deep learning path: Implementing a CNN architecture (e.g., based on VGG-16 or similar), potentially using transfer learning from image classification tasks. This requires GPU resources for efficient training. Regularization techniques are crucial.
- For the feature engineering path: Extracting the defined features and training classifiers like XGBoost or SVM. This might be less computationally intensive during training than CNNs if feature extraction is optimized, but requires domain knowledge to select and engineer features.
- Inference: Feeding new audio data through the chosen pipeline (spectrogram + CNN or features + classifier) to get genre predictions. The ensemble approach would require running both the CNN and feature-based inference and averaging probabilities.
The paper's findings indicate that while hand-crafted features are valuable (especially frequency-domain ones like MFCCs), the end-to-end deep learning approach using CNNs on spectrograms is more powerful, likely due to its ability to learn complex patterns directly from the data representation. The ensemble provides a marginal but notable improvement, suggesting that the two approaches capture slightly different aspects of the music. The use of noisy, real-world data from YouTube highlights the robustness of the evaluated methods but also points to potential areas for future improvement, such as more advanced noise reduction techniques. The open-sourced code provides a valuable resource for developers looking to implement these techniques.
Related Papers
- Music Genre Classification: A Comparative Analysis of CNN and XGBoost Approaches with Mel-frequency cepstral coefficients and Mel Spectrograms (2024)
- MATT: A Multiple-instance Attention Mechanism for Long-tail Music Genre Classification (2022)
- Enriched Music Representations with Multiple Cross-modal Contrastive Learning (2021)
- Multi-label Music Genre Classification from Audio, Text, and Images Using Deep Features (2017)
- Music Genre Classification: Training an AI model (2024)