musicnn: Pre-trained Convolutional Neural Networks for Music Audio Tagging
The paper introduces "musicnn," a library of pre-trained convolutional neural networks (CNNs) specifically designed for music audio tagging tasks. This research contributes to the field of Music Information Retrieval (MIR) by providing a set of musically motivated CNN models and vgg-like baseline models for music feature extraction and transfer learning.
Overview of the musicnn Library
The musicnn library includes several pre-trained models: MTT_musicnn, MSD_musicnn, MSD_musicnn_big, MTT_vgg, and MSD_vgg. These models are trained on two datasets: the MagnaTagATune dataset, comprising around 19,000 songs, and the Million Song Dataset, with approximately 200,000 songs. The availability of both musically motivated CNNs and baseline vgg-like networks lets researchers pick the model that fits their use case, whether as an audio tagger, as a feature extractor, or as a starting point for transfer learning.
Functionality and Applications
1. Music Audio Tagging: The library allows efficient tag estimation directly from an audio file. Using the Python or command-line interface, users can extract the top-N tags associated with a given audio input (see the first sketch after this list).
2. Feature Extraction: The models expose a range of intermediate features, such as timbral and temporal features from the musicnn models and pooling-layer outputs from the vgg models. These features can be valuable for downstream music analysis tasks (see the second sketch after this list).
3. Transfer Learning: The research demonstrates transfer learning by training SVM classifiers on features extracted from the pre-trained models. The approach shows competitive performance on the GTZAN dataset, with MSD_musicnn features reaching 77.24% accuracy, approaching the results of models trained on significantly larger datasets (see the final sketch after this list).
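As a concrete illustration of the tagging workflow (item 1), the following minimal sketch uses the top_tags function documented in the musicnn repository; the audio path, model choice, and topN value are placeholders, and the return value is assumed to be the list of estimated tags.

```python
# Minimal tag-estimation sketch using musicnn's Python interface.
from musicnn.tagger import top_tags

# 'song.mp3' is a placeholder path; estimate the 5 most likely tags
# with the MagnaTagATune-trained musicnn model.
tags = top_tags('song.mp3', model='MTT_musicnn', topN=5)
print(tags)
```

The repository also documents a command-line interface that mirrors this functionality, so the same tags can be obtained without writing any Python code.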
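For feature extraction (item 2), the library's extractor function returns a taggram, the tag vocabulary, and a dictionary of intermediate representations. The sketch below follows the interface described in the repository; the specific dictionary keys ('timbral', 'temporal') are taken from the documentation for musicnn models and should be treated as assumptions.

```python
# Sketch of feature extraction with a musicnn model.
from musicnn.extractor import extractor

# taggram: per-frame tag activations; tags: the tag vocabulary;
# features: dictionary of intermediate layer outputs (musicnn models expose
# timbral/temporal front-end features; vgg models expose pooling layers).
taggram, tags, features = extractor('song.mp3', model='MTT_musicnn',
                                    extract_features=True)

timbral = features['timbral']    # assumed key name: timbral front-end features
temporal = features['temporal']  # assumed key name: temporal front-end features
```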
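For the transfer-learning use case (item 3), the paper trains SVM classifiers on features from the pre-trained models. The snippet below is an illustrative sketch only: the choice of the penultimate-layer features, the mean-pooling over time, the default scikit-learn SVM settings, and the placeholder path/label variables are assumptions, not the paper's exact pipeline.

```python
# Illustrative transfer-learning sketch: an SVM on pooled musicnn features.
import numpy as np
from sklearn.svm import SVC
from musicnn.extractor import extractor

def song_vector(path, model='MSD_musicnn'):
    # Average frame-level features into one vector per song
    # ('penultimate' is an assumed feature key; the pooling choice is illustrative).
    _, _, features = extractor(path, model=model, extract_features=True)
    return features['penultimate'].mean(axis=0)

# Placeholders: fill with your own audio paths and genre labels (e.g. GTZAN).
train_paths, train_labels = ['blues.00000.wav'], ['blues']
test_paths = ['rock.00000.wav']

X_train = np.stack([song_vector(p) for p in train_paths])
clf = SVC()  # default RBF-kernel SVM from scikit-learn
clf.fit(X_train, train_labels)
y_pred = clf.predict(np.stack([song_vector(p) for p in test_paths]))
```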
Training and Performance
The training of the musicnn models is openly documented and the code is available, emphasizing reproducibility and accessibility. The models achieve state-of-the-art performance on the MagnaTagATune and Million Song datasets, with the following ROC-AUC and PR-AUC scores:
- MTT_musicnn: 90.69 ROC-AUC / 38.44 PR-AUC
- MSD_musicnn_big: 88.41 ROC-AUC / 30.02 PR-AUC
Additionally, an attention-based architecture variant is noted to slightly outperform the original setup, suggesting potential areas for architectural exploration.
Implications and Future Directions
The implications of this work are significant within the MIR community, offering robust tools for music tagging and feature extraction that are both high-performing and accessible. The pre-trained models lower the barrier for engaging with sophisticated audio processing tasks, catering to diverse research and industry applications.
In terms of future developments, further exploration into larger and more diverse datasets for training could enhance model generalizability. Additionally, the integration of novel neural architectures, such as attention mechanisms, may offer pathways for improving accuracy and broadening the scope of audio-related tasks achievable by these models.
The public release of the training framework paves the way for an experimental ecosystem, fostering community involvement in optimizing and customizing models for targeted research objectives. This approach aligns with broader trends in machine learning, where community-driven contributions play a vital role in advancing the state of the art.