- The paper introduces Demucs, a deep waveform-to-waveform model for music source separation that markedly improves SDR over prior waveform-based methods.
- It combines a convolutional encoder/decoder with GLU activations and a bidirectional LSTM to extract instrumental stems robustly.
- A remixing scheme for unlabeled data enlarges the training set, bringing performance close to that of spectrogram-based methods.
Analysis of Demucs: Deep Extractor for Music Sources with Extra Unlabeled Data Remixed
The paper introduces Demucs, a novel architecture for music source separation that demonstrates notable improvements in extracting instrumental stems directly from waveform data. Music source separation is the task of isolating individual sound sources, such as drums, bass, vocals, and other accompaniment, from a mixed audio track. Traditional methods operate on spectrograms, estimating a time-frequency mask for each source and reusing the mixture's phase for reconstruction, whereas Demucs works directly on the raw waveform and aims to overcome the limitations that have historically held waveform-domain methods back.
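To make the contrast concrete, here is a minimal sketch of the two pipelines. `mask_net` and `model` are hypothetical stand-in networks, not the paper's actual components; the point is that spectrogram masking requires an STFT round trip and borrows the mixture's phase, while a waveform model maps samples directly to samples.

```python
import torch

def spectrogram_mask_separate(mix: torch.Tensor, mask_net) -> torch.Tensor:
    """Classic approach: estimate a mask on the magnitude spectrogram,
    then invert using the mixture's phase."""
    window = torch.hann_window(4096)
    spec = torch.stft(mix, n_fft=4096, hop_length=1024, window=window,
                      return_complex=True)
    mask = mask_net(spec.abs())          # values in [0, 1], one per T-F bin
    masked = mask * spec                 # keeps the mixture phase
    return torch.istft(masked, n_fft=4096, hop_length=1024, window=window)

def waveform_separate(mix: torch.Tensor, model) -> torch.Tensor:
    """Demucs-style approach: the network consumes and emits raw samples,
    so no STFT or phase-reconstruction step is needed."""
    return model(mix)
```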
Key Contributions
- Architectural Improvements: Demucs pairs a convolutional encoder/decoder with a bidirectional LSTM, aiming to outperform the leading waveform-based model, Wave-U-Net, through greater depth and gated linear unit (GLU) activations. Rather than predicting masks, the decoder synthesizes each source's waveform directly (see the architecture sketch after this list).
- Utilization of Unlabeled Data: The paper incorporates unlabeled music through a novel remixing approach, constructing weakly supervised training examples by adding labeled source segments on top of unlabeled excerpts in which that source is silent. This strategy substantially extends the training set without requiring full stem annotations for each new track (see the remixing sketch after this list).
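The following is a minimal sketch of a Demucs-style separator, assuming illustrative layer sizes rather than the paper's exact hyperparameters: a strided convolutional encoder with GLU activations, a bidirectional LSTM over the downsampled sequence, and a transposed-convolution decoder with U-Net-style skip connections that synthesizes one stereo waveform per source.

```python
import torch
import torch.nn as nn

class DemucsLikeSeparator(nn.Module):
    """Sketch of a Demucs-style waveform separator (sizes illustrative)."""

    def __init__(self, sources: int = 4, channels: int = 64, depth: int = 4):
        super().__init__()
        self.encoder = nn.ModuleList()
        self.decoder = nn.ModuleList()
        in_ch = 2  # stereo mixture
        for i in range(depth):
            out_ch = channels * 2 ** i
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.Conv1d(out_ch, 2 * out_ch, kernel_size=1),
                nn.GLU(dim=1),  # gates and halves channels back to out_ch
            ))
            # The decoder mirrors the encoder; its outermost layer emits
            # `sources` stereo waveforms stacked along the channel axis.
            dec_out = in_ch if i > 0 else sources * 2
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(out_ch, 2 * out_ch, kernel_size=1),
                nn.GLU(dim=1),
                nn.ConvTranspose1d(out_ch, dec_out, kernel_size=8, stride=4),
            ))
            in_ch = out_ch
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.lstm_proj = nn.Linear(2 * in_ch, in_ch)

    def forward(self, mix: torch.Tensor) -> torch.Tensor:
        x, skips = mix, []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        # BiLSTM over time: (batch, ch, T) <-> (batch, T, ch).
        y, _ = self.lstm(x.permute(0, 2, 1))
        x = self.lstm_proj(y).permute(0, 2, 1)
        for dec in self.decoder:
            skip = skips.pop()
            n = min(x.shape[-1], skip.shape[-1])  # crude length alignment;
            x = dec(x[..., :n] + skip[..., :n])   # the real model pads instead
        return x.view(x.shape[0], -1, 2, x.shape[-1])  # (batch, source, ch, T)
```

For example, `DemucsLikeSeparator()(torch.randn(1, 2, 44100))` returns a tensor holding four estimated stereo stems for one second of 44.1 kHz audio (trimmed slightly at the edges by the crude alignment above).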
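And here is a minimal sketch of the remixing idea, assuming an upstream detector (not shown) has already flagged excerpts of unlabeled tracks where a given source, say drums, is silent. Adding a labeled drum stem on top yields a mixture whose drum target is known exactly, even though the remaining stems stay unlabeled; the training loss is then applied only to that known source.

```python
import random
import torch

def make_weakly_supervised_example(unlabeled_excerpt: torch.Tensor,
                                   labeled_source: torch.Tensor,
                                   gain_range=(0.7, 1.3)):
    """Combine an unlabeled excerpt (with the target source silent) and a
    labeled stem into a new mixture with a known target for that stem.
    Both inputs are (channels, time) tensors of equal length."""
    gain = random.uniform(*gain_range)   # simple gain augmentation
    target = gain * labeled_source
    mixture = unlabeled_excerpt + target
    return mixture, target               # supervise only the target stem
```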
Numerical Results
Demucs improves SDR (Signal-to-Distortion Ratio) by 1.6 points over Wave-U-Net on the MusDB benchmark, indicating substantially better separation of musical components from waveform inputs. The analysis also shows near parity with spectrogram-based methods, a notable result for a waveform-domain approach.
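The official MusDB evaluation uses the BSSEval metrics (e.g., via the museval package), which account for certain allowed distortions; the simplified energy-ratio form sketched below is common for quick sanity checks.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Basic SDR in dB: reference energy over residual-error energy."""
    error = estimate - reference
    return 10 * np.log10((np.sum(reference ** 2) + eps) /
                         (np.sum(error ** 2) + eps))
```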
Implications and Future Directions
The research suggests that waveform-based source separation can close the performance gap with spectrogram methods, at least when labeled data is limited. Practically, Demucs could simplify separation workflows by removing spectrogram computation and phase reconstruction from the pipeline, which benefits music-production applications where direct manipulation of raw audio is advantageous. Furthermore, expanding the dataset through unlabeled-data remixing could yield models that generalize better across diverse musical genres and structures.
In terms of theoretical implications, this approach motivates further exploration of convolutional and recurrent components tailored to audio processing. Such developments could improve the granularity and fidelity of separations produced by direct waveform synthesis, with possible spillover into sound generation and processing more broadly. Future research might scale to larger unlabeled collections or combine adversarial training with the remixing strategy to push performance further.
In conclusion, the paper makes significant contributions to music source separation, pairing a deep waveform architecture with a remixing scheme that harnesses unlabeled data effectively. Demucs represents a step forward for waveform-based processing and points to promising avenues for further exploration within AI-driven audio applications.