
Multi-scale Multi-band DenseNets for Audio Source Separation (1706.09588v1)

Published 29 Jun 2017 in cs.SD, cs.CL, and cs.MM

Abstract: This paper deals with the problem of audio source separation. To handle the complex and ill-posed nature of the problems of audio source separation, the current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connection and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms state-of-the-art results on SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time compared with other methods.

Citations (148)

Summary

  • The paper proposes a novel Multi-scale Multi-band DenseNet (MMDenseNet) architecture for audio source separation.
  • MMDenseNet achieves state-of-the-art audio source separation performance on the DSD100 dataset, outperforming previous methods like BLEND in SDR.
  • The MMDenseNet architecture significantly reduces parameter count and training time compared to existing models, offering a more efficient solution for audio source separation applications.

Multi-scale Multi-band DenseNets for Audio Source Separation

The paper by Naoya Takahashi and Yuki Mitsufuji explores a novel architecture for tackling the complex problem of audio source separation. It leverages convolutional neural networks (CNNs), in the form of an extended DenseNet, to separate audio components from a mixture. DenseNet, originally successful in the domain of image classification, is adapted in this research to meet the particular demands of audio signal processing.

Audio source separation has seen various methodological advancements, with traditional approaches including Gaussian modeling, non-negative matrix factorization, and kernel additive modeling. However, the recent success of deep neural networks (DNNs) in this area has inspired further innovation. The authors identify deficiencies in current approaches, such as the long training times and high parameter counts of LSTM-based methods, and respond with an architecture that is parameter-efficient and faster to train.

Key Innovations and Architecture

The proposed architecture, termed the Multi-scale Multi-band DenseNet (MMDenseNet), introduces several modifications to the DenseNet framework. The core innovations include:

  1. Multi-scale DenseNet: Down-sampling and up-sampling layers, linked by block skip connections, let dense blocks operate at multiple scales. The network can thereby capture both long contexts and detailed time-frequency structures, efficiently handling the high-dimensional inputs and outputs typical of audio signals (a code sketch follows this list).
  2. Band-specific Modeling: Given that different frequency bands in audio signals exhibit distinct characteristics, the authors adopt a multi-band approach: dedicated dense blocks process each frequency band, alongside blocks that cover the entire spectrum. This band-specific processing enhances the network's ability to model the distinct features of each band.
  3. Improved Computational Efficiency: Despite the expanded capability, MMDenseNet requires significantly fewer parameters and less training time than other state-of-the-art models, such as BLSTM and BLEND, making it a more deployable solution.
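
To make these ideas concrete, here is a minimal PyTorch sketch of the three ingredients: a dense block, a two-scale encoder-decoder with a block skip connection, and band-dedicated subnetworks combined with a full-band path. This is not the authors' implementation; the growth rate, block depths, band split bin, and the final 1x1 combination layer are illustrative assumptions (the paper merges the band and full-band paths with a final dense block).

```python
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """DenseNet block: each layer sees the concatenation of all earlier outputs."""
    def __init__(self, in_ch: int, growth_rate: int = 12, n_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, kernel_size=3, padding=1)))
            ch += growth_rate
        self.out_ch = ch

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity
        return x


class MultiScaleDenseNet(nn.Module):
    """Dense blocks at two scales joined by a block skip connection."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.enc1 = DenseBlock(in_ch)
        self.down = nn.MaxPool2d(2)                             # down-sampling layer
        self.enc2 = DenseBlock(self.enc1.out_ch)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")   # up-sampling layer
        # The decoder block sees the up-sampled deep features concatenated
        # with the same-scale encoder output (block skip connection).
        self.dec = DenseBlock(self.enc2.out_ch + self.enc1.out_ch)
        self.out = nn.Conv2d(self.dec.out_ch, 1, kernel_size=1)

    def forward(self, x):  # x: (batch, ch, freq, time); freq and time even
        s1 = self.enc1(x)
        s2 = self.enc2(self.down(s1))
        d = self.dec(torch.cat([self.up(s2), s1], dim=1))
        return self.out(d)


class MMDenseNet(nn.Module):
    """Band-dedicated multi-scale nets plus a full-band net, combined at the end."""
    def __init__(self, split_bin: int = 384):  # split frequency is an assumption
        super().__init__()
        self.split = split_bin
        self.low = MultiScaleDenseNet()
        self.high = MultiScaleDenseNet()
        self.full = MultiScaleDenseNet()
        # 1x1 conv is a simplification of the paper's final dense block.
        self.combine = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, spec):  # spec: (batch, 1, freq, time) magnitude spectrogram
        low = self.low(spec[:, :, :self.split])
        high = self.high(spec[:, :, self.split:])
        banded = torch.cat([low, high], dim=2)   # re-stack bands along frequency
        full = self.full(spec)
        return self.combine(torch.cat([banded, full], dim=1))
```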

Experimental Outcomes

The experimental evaluation on the DSD100 dataset, the benchmark of the SiSEC 2016 campaign, demonstrates the superiority of the proposed method. MMDenseNet outperforms all previous methods, including the BLEND approach, with substantial gains in signal-to-distortion ratio (SDR) across several audio sources. When trained with additional data, a variant referred to as MMDenseNet+ achieves even greater separation accuracy.
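
The reported SDR figures come from the BSS Eval methodology used in SiSEC. As a point of reference, a simplified SDR that treats the entire residual as distortion (BSS Eval additionally decomposes it into interference, noise, and artifact terms) can be computed as follows; the function name is illustrative:

```python
import numpy as np

def simple_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Simplified SDR in dB: target energy over residual energy.

    BSS Eval's SDR decomposes the residual into interference, noise,
    and artifact components; this sketch lumps them all together.
    """
    residual = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))
```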

Furthermore, the model exhibits a drastic reduction in parameters and training duration. For instance, MMDenseNet surpasses the BLEND model while using just under 4% of its parameter count, highlighting the advantage of the proposed approach in resource-constrained settings.
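
For readers reproducing such comparisons, parameter counts are straightforward to tally; a one-line PyTorch helper (applicable to any nn.Module, including the sketch above):

```python
def count_parameters(model) -> int:
    """Total number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```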

Implications and Future Directions

The paper has significant implications for audio signal processing, particularly in applications demanding efficient and accurate source separation, such as music remixing, audio enhancement, and automatic transcription systems. The successful adaptation of DenseNet, traditionally an image-domain model, to the audio domain could inspire similar domain-transcending approaches.

The work also opens avenues for future research focused on exploiting multi-scale and multi-band architectures in broader contexts, including real-time audio processing and the integration of such models into consumer electronics. Incorporating adaptive architectures based on signal characteristics and exploring deeper integration of machine learning with audio domain knowledge can further augment the capabilities demonstrated by MMDenseNet.

In summary, the paper makes a meaningful contribution to the state of the art in audio source separation, with a robust, more efficient architectural proposition that leverages the strengths of DenseNet. This paper provides a solid foundation for ongoing research aiming to push the envelope in the application of neural networks for complex audio tasks.
