
End-to-end learning for music audio tagging at scale (1711.02520v4)

Published 7 Nov 2017 in cs.SD and eess.AS

Abstract: The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to unrestrictedly explore two different design paradigms for music auto-tagging: assumption-free models - using waveforms as input with very small convolutional filters; and models that rely on domain knowledge - log-mel spectrograms with a convolutional neural network designed to learn timbral and temporal features. Our work focuses on studying how these two types of deep architectures perform when datasets of variable size are available for training: the MagnaTagATune (25k songs), the Million Song Dataset (240k songs), and a private dataset of 1.2M songs. Our experiments suggest that music domain assumptions are relevant when not enough training data are available, thus showing how waveform-based models outperform spectrogram-based ones in large-scale data scenarios.

Citations (176)

Summary

  • The paper demonstrates that waveform models with minimal assumptions outperform spectrogram models on million-scale music datasets.
  • It evaluates two model designs across varying dataset sizes, revealing domain-specific advantages under limited data conditions.
  • The findings advocate a data-driven approach to choosing between assumption-free and domain-informed architectures for optimal tagging performance.

End-to-End Learning for Music Audio Tagging at Scale

The paper presents an in-depth examination of end-to-end learning architectures for music audio tagging, focusing on the impact of dataset scale on model performance. The authors address the data constraints that typically limit deep learning systems, particularly those processing raw audio waveforms. By exploiting a substantial dataset of 1.2 million tracks annotated with musical labels, they evaluate two distinct design strategies: assumption-free models that process waveforms with small convolutional filters, and models leveraging domain knowledge, which employ log-mel spectrograms with convolutional neural networks (CNNs) designed to learn timbral and temporal features.
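To make the spectrogram-based input concrete, the sketch below shows one way to compute a log-mel spectrogram with librosa; the sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative log-mel spectrogram extraction for the domain-knowledge models.
# Parameter values (sample rate, FFT size, mel bands) are assumptions for exposition.
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_fft=512, hop_length=256, n_mels=96):
    """Load an audio file and return a log-scaled mel spectrogram (n_mels x frames)."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)
```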

Contributions and Methodology

The research divides the architectures into two families: waveform-based models that make essentially no domain-specific assumptions, and spectrogram-based models that encode knowledge of musical signal attributes in their design. The paper then examines how these architectures behave across different dataset scales: MagnaTagATune (MTT) with 25k songs, the Million Song Dataset (MSD) with 240k songs, and the private 1.2M-song dataset.

For the waveform models, the paper highlights sample-level front-ends that make minimal assumptions about the signal's temporal or spectral structure. The spectrogram models, by contrast, encode domain knowledge through convolutional filter shapes designed to capture musically relevant timbral and temporal patterns in the spectrogram representation; a sketch of both front-end styles follows.
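The following PyTorch sketch illustrates the two front-end styles under discussion. The layer counts, filter shapes, and channel widths are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# Minimal sketch of the two front-end styles, assuming PyTorch.
# All sizes below are illustrative, not the paper's exact configuration.
import torch
import torch.nn as nn

class SampleLevelFrontEnd(nn.Module):
    """Waveform front-end: stacked 1-D convolutions with very small filters."""
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(n_blocks):  # depth is illustrative
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=3, stride=3),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, waveform):          # waveform: (batch, 1, samples)
        return self.net(waveform)         # -> (batch, channels, frames)

class SpectrogramFrontEnd(nn.Module):
    """Spectrogram front-end: filter shapes chosen to capture timbral
    (frequency-spanning) and temporal (time-spanning) patterns."""
    def __init__(self, n_mels=96, channels=32):
        super().__init__()
        # Tall filter spanning most mel bands (timbre), wide filter spanning time.
        self.timbral = nn.Conv2d(1, channels, kernel_size=(int(0.9 * n_mels), 7))
        self.temporal = nn.Conv2d(1, channels, kernel_size=(1, 64))

    def forward(self, spec):              # spec: (batch, 1, n_mels, frames)
        t = torch.relu(self.timbral(spec))
        h = torch.relu(self.temporal(spec))
        # Max-pool the remaining frequency axis; a back-end would pool over time
        # and merge the two branches before the tag classifier.
        return torch.amax(t, dim=2), torch.amax(h, dim=2)
```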

Numerical Results

The results split clearly by dataset size. On the large-scale 1.2M-song collection, the waveform models deliver the strongest results, surpassing the spectrogram-based models across the reported metrics: for instance, the waveform model trained on 1 million songs reaches a ROC AUC of 92.50%, compared to 92.17% for the spectrogram model. These findings affirm that, when sufficient data are available, models making fewer a priori assumptions about the input can surpass those that tightly integrate domain-specific knowledge.
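For readers unfamiliar with the metric, the snippet below shows how a macro-averaged ROC AUC (and the commonly co-reported PR AUC) can be computed for multi-label tagging with scikit-learn; the arrays are toy placeholders, not the paper's data.

```python
# Illustrative multi-label ROC AUC / PR AUC computation with scikit-learn.
# The random arrays stand in for real tag annotations and model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_tracks, n_tags = 1000, 50
y_true = rng.integers(0, 2, size=(n_tracks, n_tags))   # ground-truth tag matrix
y_score = rng.random(size=(n_tracks, n_tags))           # predicted tag probabilities

# Macro averaging computes the AUC per tag and then averages across tags.
roc_auc = roc_auc_score(y_true, y_score, average="macro")
pr_auc = average_precision_score(y_true, y_score, average="macro")
print(f"ROC AUC: {roc_auc:.4f}  PR AUC: {pr_auc:.4f}")
```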

In contrast, on smaller datasets such as MTT, the spectrogram models outperform their waveform counterparts, aligning with previous findings that domain-specific model designs are beneficial when data are limited. Moreover, these spectrogram models achieve state-of-the-art performance, validating the merit of domain knowledge in model design under constrained data settings.

Implications and Future Directions

The paper’s outcomes have significant implications for future research and model development in music information retrieval and beyond. Practically, they suggest that the choice between leveraging domain-specific knowledge and adopting a more versatile, assumption-free approach should be driven by the amount of training data available. This insight extends to the broader AI research community, underscoring the importance of large-scale datasets for advancing model capacity and supporting models that generalize well without extensive pre-defined constraints.

Moreover, the research invites further inquiry into balancing model complexity, resource consumption, and data availability. Future work might explore hybrid approaches in which domain knowledge complements assumption-free strategies, potentially reducing computational overhead without sacrificing generalizability. Empirical validation of such models across various audio tasks could help solidify these findings and drive new innovations in deep learning for audio and other data-intensive applications.

In conclusion, this paper provides significant insights into the comparative performance of waveform and spectrogram models for music audio tagging, emphasizing dataset size as a critical factor. This research paves the way for a more nuanced understanding of model design choices in the context of varying data scales and invites further exploration into optimal hybrid architectures and training strategies.