musicnn: Pre-trained Convolutional Neural Networks for Music Audio Tagging
The paper introduces "musicnn," a library of pre-trained convolutional neural networks (CNNs) specifically designed for music audio tagging tasks. This research contributes to the field of Music Information Retrieval (MIR) by providing a set of musically motivated CNN models and vgg-like baseline models for music feature extraction and transfer learning.
Overview of the musicnn Library
The musicnn library includes several pre-trained models: MTT_musicnn, MSD_musicnn, MSD_musicnn_big, MTT_vgg, and MSD_vgg. These models are trained on two datasets: the MagnaTagATune dataset, comprising around 19,000 songs, and the Million Song Dataset, with approximately 200,000 songs. The availability of both musically motivated CNNs and baseline vgg-like networks lets researchers pick the model that fits their use case, whether as an audio tagger, as a feature extractor, or as a starting point for transfer learning.
Functionality and Applications
1. Music Audio Tagging: The library allows efficient tag estimation directly from an audio file. Using the Python or command-line interface, users can extract the top-N tags associated with a given audio input (see the first sketch after this list).
2. Feature Extraction: The models expose a range of intermediate features, such as timbral and temporal features from the musicnn models and pooling-layer outputs from the vgg models. These features can be valuable for downstream music analysis tasks (see the second sketch after this list).
3. Transfer Learning: The research demonstrates transfer learning by training SVM classifiers on features extracted from the pre-trained models. The approach shows competitive performance on the GTZAN dataset, with MSD_musicnn features reaching 77.24% accuracy, approaching the results of models trained on significantly larger datasets (see the final sketch after this list).
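As a concrete illustration of the tagging workflow (item 1), the following minimal sketch uses the top_tags function documented in the musicnn repository; the audio path, model choice, and topN value are placeholders, and the return value is assumed to be the list of estimated tags.

```python
# Minimal tag-estimation sketch using musicnn's Python interface.
from musicnn.tagger import top_tags

# 'song.mp3' is a placeholder path; estimate the 5 most likely tags
# with the MagnaTagATune-trained musicnn model.
tags = top_tags('song.mp3', model='MTT_musicnn', topN=5)
print(tags)
```

The repository also documents a command-line interface that mirrors this functionality, so the same tags can be obtained without writing any Python code.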
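For feature extraction (item 2), the library's extractor function returns a taggram, the tag vocabulary, and a dictionary of intermediate representations. The sketch below follows the interface described in the repository; the specific dictionary keys ('timbral', 'temporal') are taken from the documentation for musicnn models and should be treated as assumptions.

```python
# Sketch of feature extraction with a musicnn model.
from musicnn.extractor import extractor

# taggram: per-frame tag activations; tags: the tag vocabulary;
# features: dictionary of intermediate layer outputs (musicnn models expose
# timbral/temporal front-end features; vgg models expose pooling layers).
taggram, tags, features = extractor('song.mp3', model='MTT_musicnn',
                                    extract_features=True)

timbral = features['timbral']    # assumed key name: timbral front-end features
temporal = features['temporal']  # assumed key name: temporal front-end features
```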
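For the transfer-learning use case (item 3), the paper trains SVM classifiers on features from the pre-trained models. The snippet below is an illustrative sketch only: the choice of the penultimate-layer features, the mean-pooling over time, the default scikit-learn SVM settings, and the placeholder path/label variables are assumptions, not the paper's exact pipeline.

```python
# Illustrative transfer-learning sketch: an SVM on pooled musicnn features.
import numpy as np
from sklearn.svm import SVC
from musicnn.extractor import extractor

def song_vector(path, model='MSD_musicnn'):
    # Average frame-level features into one vector per song
    # ('penultimate' is an assumed feature key; the pooling choice is illustrative).
    _, _, features = extractor(path, model=model, extract_features=True)
    return features['penultimate'].mean(axis=0)

# Placeholders: fill with your own audio paths and genre labels (e.g. GTZAN).
train_paths, train_labels = ['blues.00000.wav'], ['blues']
test_paths = ['rock.00000.wav']

X_train = np.stack([song_vector(p) for p in train_paths])
clf = SVC()  # default RBF-kernel SVM from scikit-learn
clf.fit(X_train, train_labels)
y_pred = clf.predict(np.stack([song_vector(p) for p in test_paths]))
```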
Training and Performance
The training of the musicnn models is openly documented and the code is available, emphasizing reproducibility and accessibility. The models achieve state-of-the-art performance on the MagnaTagATune and Million Song datasets, with the following ROC-AUC and PR-AUC scores:
- MTT_musicnn: 90.69 ROC-AUC / 38.44 PR-AUC
- MSD_musicnn_big: 88.41 ROC-AUC / 30.02 PR-AUC
Additionally, an attention-based architecture variant is noted to slightly outperform the original setup, suggesting potential areas for architectural exploration.
Implications and Future Directions
The implications of this work are significant within the MIR community, offering robust tools for music tagging and feature extraction that are both high-performing and accessible. The pre-trained models lower the barrier for engaging with sophisticated audio processing tasks, catering to diverse research and industry applications.
In terms of future developments, further exploration into larger and more diverse datasets for training could enhance model generalizability. Additionally, the integration of novel neural architectures, such as attention mechanisms, may offer pathways for improving accuracy and broadening the scope of audio-related tasks achievable by these models.
The public release of the training framework paves the way for an experimental ecosystem, fostering community involvement in optimizing and customizing models for targeted research objectives. This approach aligns with broader trends in machine learning, where community-driven contributions play a vital role in advancing the state of the art.