- The paper demonstrates that a ConvNet architecture can outperform traditional methods, achieving 19.6% and 16.4% improvements in micro and macro F1 scores.
- The study employs single-labeled training data with variable-length audio segments and sliding window aggregation to capture spectral characteristics.
- The findings highlight the potential of deep learning to advance music information retrieval tasks such as music search, transcription, and genre identification.
Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music
The paper "Deep Convolutional Neural Networks for Predominant Instrument Recognition in Polyphonic Music," by Yoonchang Han, Jaehun Kim, and Kyogu Lee, evaluates convolutional neural networks (ConvNets) for identifying the predominant musical instruments in complex, polyphonic audio. The work belongs to the broader field of Music Information Retrieval (MIR), which seeks to enable machines to analyze and understand music effectively, a domain that includes tasks such as music search and automatic music transcription.
Methodology
The authors designed a ConvNet architecture that learns spectral characteristics from audio recordings, training on single-labeled data and testing on multi-labeled, real-world Western music recordings. The network was trained on fixed-length audio excerpts but evaluated on variable-length segments for flexibility. Key to this approach is the effective aggregation of outputs, accomplished by sliding windows over the audio signal and combining the per-window predictions with one of two strategies: averaging predictions class-wise, or summing them per class and normalizing.
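The two aggregation strategies described above can be sketched as follows. This is a minimal illustration, not the authors' code: `aggregate_predictions` is a hypothetical helper, and the input is assumed to be a matrix of per-window sigmoid outputs.

```python
import numpy as np

def aggregate_predictions(window_probs, strategy="sum_norm"):
    """Aggregate per-window class probabilities into one clip-level vector.

    window_probs: (num_windows, num_classes) array of per-window
    sigmoid outputs, one row per sliding-window position.
    """
    if strategy == "mean":
        # Strategy 1: average predictions class-wise across windows.
        return window_probs.mean(axis=0)
    elif strategy == "sum_norm":
        # Strategy 2: sum per class, then normalize so the strongest
        # class maps to 1.0.
        summed = window_probs.sum(axis=0)
        return summed / summed.max()
    raise ValueError(f"unknown strategy: {strategy}")

# Three sliding-window positions, three instrument classes.
probs = np.array([[0.9, 0.2, 0.1],
                  [0.8, 0.4, 0.0],
                  [0.7, 0.3, 0.2]])
print(aggregate_predictions(probs, "mean"))      # class-wise average
print(aggregate_predictions(probs, "sum_norm"))  # normalized class-wise sum
```

Instruments whose aggregated score exceeds an identification threshold are then reported as present in the clip.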
Experimental Findings
This paper is underpinned by extensive experimentation. The authors use a dataset of 10,000 audio excerpts spanning 11 instruments, allowing a robust evaluation of the network. The proposed model significantly outperformed traditional baselines that combine spectral features and source separation with support vector machines, achieving micro and macro F1 scores of 0.602 and 0.503. These results represent respective performance improvements of 19.6% and 16.4%, showcasing the potential of ConvNets in this domain.
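The micro/macro distinction matters here because the instrument classes are imbalanced: micro F1 pools true/false positives over all classes, while macro F1 averages per-class F1 so rare instruments weigh as much as common ones. A minimal NumPy sketch of both (not the paper's evaluation code) for multi-label 0/1 prediction matrices:

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Micro- and macro-averaged F1 for (num_examples, num_classes)
    binary label matrices."""
    tp = (y_true * y_pred).sum(axis=0)        # per-class true positives
    fp = ((1 - y_true) * y_pred).sum(axis=0)  # per-class false positives
    fn = (y_true * (1 - y_pred)).sum(axis=0)  # per-class false negatives
    # Micro: pool the counts across classes, then compute F1 once.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: compute F1 per class, then average (equal class weights).
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    macro = per_class.mean()
    return micro, macro

# Toy example: three clips, two instrument classes.
y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1]])
micro, macro = f1_scores(y_true, y_pred)
```

Here the second class has a missed detection, so the macro score (5/6) dips below the micro score (6/7).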
The analysis examined several independent variables, including the size of the analysis window, the identification threshold, and activation functions such as tanh, ReLU, and its leaky variants. The optimal configuration used a 1-second analysis window, LReLU activation with a specific leakage parameter, and class-wise sum with normalization for aggregation.
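For reference, leaky ReLU (LReLU) differs from standard ReLU only in scaling negative inputs by a small slope rather than zeroing them, which keeps a gradient flowing through inactive units. A one-line sketch; the `alpha` value below is illustrative, not the leakage parameter the paper settled on:

```python
import numpy as np

def leaky_relu(x, alpha=0.33):
    """LReLU: pass positives through unchanged, scale negatives by alpha.

    alpha is the leakage parameter; 0.33 here is a stand-in value,
    not the one reported in the paper.
    """
    return np.where(x > 0, x, alpha * x)

out = leaky_relu(np.array([-1.0, 0.0, 2.0]))
```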
Implications and Future Directions
The practical implications of this research span improving music search mechanisms and enhancing genre identification, source separation, and transcription processes. As the authors noted, the ability to recognize predominant instruments allows for tailored approaches to MIR problems.
Theoretically, this work contributes to the understanding of ConvNets applied to time-frequency representations, affirming their strength in learning hierarchical audio data structures. The paper opens avenues for future work, especially in refining the aggregation strategies and incorporating adaptive thresholds per instrument to accommodate the diverse sonic characteristics prevalent in music.
The potential for further advancements using these methodologies is significant. As deep learning continues to evolve, particularly with advancements in neural architectures and more comprehensive datasets, the efficacy and application scope in MIR tasks such as polyphonic music transcription and automatic genre classification are expected to expand. Moreover, future explorations could focus on comparative evaluations across genres and integrating domain-specific features that further leverage the unique properties of music signals.