- The paper demonstrates that Demucs, with its encoder-decoder and LSTM design, outperforms an adapted Conv-Tasnet in separating musical sources.
- The methodology leverages a synthesis-oriented, U-Net-inspired structure to bypass phase issues typical of spectrogram-based approaches.
- Empirical evaluation on MusDB shows Demucs achieving an SDR of 6.3 (up to 6.8 with additional training), highlighting its promise for real-time audio applications.
Analysis of "Music Source Separation in the Waveform Domain"
Separating musical sources directly in the waveform domain is a significant open problem in audio processing. The paper "Music Source Separation in the Waveform Domain" presents a comprehensive analysis of two distinct architectures, Conv-Tasnet and Demucs, both adapted to the task of music source separation. In what follows, we provide an in-depth critique of the paper's methodologies, results, and implications, focusing primarily on their impact within the domain.
Overview of Methodologies
The paper contrasts two architectures: an adapted version of Conv-Tasnet, a model initially conceptualized for speech separation, and Demucs, a novel U-Net-inspired structure leveraging bidirectional LSTMs to process and separate music directly in the waveform domain. The adaptation of Conv-Tasnet to stereophonic music significantly diversifies its application beyond monophonic speech, extending its convolutional layers to accommodate the higher sampling rates typical of music.
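To make the masking philosophy concrete, the sketch below walks a toy mixture through the encode / mask / decode pipeline that Conv-Tasnet-style models apply in the waveform domain. This is an illustrative pure-Python sketch, not the paper's code: real models learn the encoder, mask estimator, and decoder, whereas here they are fixed toy functions so the flow of the data stays visible.

```python
# Hedged sketch of waveform-domain mask-based separation (Conv-Tasnet style).
# All functions are toy stand-ins for learned network components.

def encode(mixture, frame=4):
    """Split the waveform into non-overlapping frames ("latent" features)."""
    return [mixture[i:i + frame] for i in range(0, len(mixture), frame)]

def apply_masks(frames, masks):
    """Element-wise masking of latent frames, one set of masks per source."""
    return [
        [[m * x for m, x in zip(mask, fr)] for mask, fr in zip(src_masks, frames)]
        for src_masks in masks
    ]

def decode(frames):
    """Inverse of encode: concatenate frames back into a waveform."""
    return [x for fr in frames for x in fr]

# Toy 8-sample mixture and two hand-made masks (values in [0, 1]).
mixture = [0.5, -0.2, 0.1, 0.4, -0.3, 0.6, 0.0, -0.1]
frames = encode(mixture)
masks = [
    [[1.0, 1.0, 0.0, 0.0]] * len(frames),  # "source A" keeps the first half of each frame
    [[0.0, 0.0, 1.0, 1.0]] * len(frames),  # "source B" keeps the rest
]
sources = [decode(f) for f in apply_masks(frames, masks)]

# Because the masks sum to one everywhere, the estimates add back to the mixture.
residual = [a + b - m for a, b, m in zip(sources[0], sources[1], mixture)]
```

The key design choice this illustrates: a masking model can only rescale what the encoder extracted from the mixture, which is exactly the constraint Demucs relaxes by synthesizing each source's waveform instead.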
Demucs, proposed as an alternative, incorporates a convolutional encoder-decoder framework intertwined with LSTM layers, akin to structures used in music synthesis. Notably, while Conv-Tasnet adheres closely to the philosophy of masking waveform features, Demucs embraces a synthesis-oriented perspective: its decoder regenerates each source's waveform rather than rescaling features of the mixture. This approach allows Demucs to surpass spectrogram-based methods by modeling audio waveforms directly, thus bypassing the phase-reconstruction limitations associated with frequency-domain methods.
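One practical consequence of the strided encoder-decoder design is shape bookkeeping: an arbitrary waveform length does not survive the U-Net round trip without padding. The helper below is an illustrative sketch of that arithmetic, assuming the 6 encoder layers with kernel size 8 and stride 4 described for Demucs; it is not the authors' implementation.

```python
import math

# Shape bookkeeping for a Demucs-style U-Net: 6 strided 1-D conv layers
# (kernel 8, stride 4) mirrored by 6 transposed convs. Illustrative sketch.
DEPTH, KERNEL, STRIDE = 6, 8, 4

def valid_length(length):
    """Length the waveform must be padded to so the encoder/decoder
    round trip reproduces a whole number of samples without truncation."""
    for _ in range(DEPTH):                       # encoder: strided convolutions
        length = math.ceil((length - KERNEL) / STRIDE) + 1
        length = max(length, 1)
    for _ in range(DEPTH):                       # decoder: transposed convolutions
        length = (length - 1) * STRIDE + KERNEL
    return length

# A 10-second excerpt at 44.1 kHz must be padded slightly before the U-Net.
print(valid_length(441000))  # → 443732
```

Running the helper on its own output returns the same number, i.e. a padded length passes through all twelve layers exactly.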
Results and Performance
The paper details a robust empirical evaluation conducted on the MusDB dataset. Notably, Demucs demonstrates marked improvements over existing methods across the dataset, achieving an average SDR of 6.3, and even reaching 6.8 when incorporating additional training data. These results underscore Demucs' efficacy in outperforming spectrogram-based architectures, including the Ideal Ratio Mask (IRM) oracle in certain instances, particularly for the bass source—an outcome of substantial significance given the historical challenges associated with accurately isolating low-frequency components in mixed signals.
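For readers unfamiliar with the metric behind these numbers, the snippet below sketches the signal-to-distortion ratio in its simplest form: the energy of the reference source over the energy of the estimation error, in decibels. This is a simplified illustration with toy values, not the full BSS Eval / museval protocol used on MusDB (which adds filtering and decomposition of the error into interference and artifact terms).

```python
import math

# Simplified SDR: 10 * log10(||reference||^2 / ||reference - estimate||^2).
# Toy signals only; the benchmark numbers in the paper use the full BSS Eval protocol.

def sdr(reference, estimate):
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10 * math.log10(signal / error)

reference = [0.5, -0.3, 0.8, -0.1]
estimate = [0.45, -0.25, 0.75, -0.05]   # a close but imperfect separation
print(round(sdr(reference, estimate), 2))
```

Higher is better, and the scale is logarithmic: the gap between an SDR of 6.3 and 6.8 dB corresponds to a roughly 12% reduction in distortion energy.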
By contrast, Conv-Tasnet, while competitive, encounters artifacts such as broadband noise and clarity deficiencies, particularly in drums and bass segments. The inclusion of human evaluations further substantiates the quality and naturalness advantages presented by Demucs, despite observable shortcomings like source bleeding.
Practical and Theoretical Implications
The implications of this paper are twofold. Practically, the development of Demucs highlights the potential for real-time music separation in applications ranging from audio restoration and archival to creative remixing tools. Equally relevant is the framework's ability to perform with significant accuracy while maintaining a manageably compact model size through quantization.
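The compactness claim rests on post-training weight quantization: storing float32 weights as small integers plus a scale. The sketch below shows the idea with a toy uniform 8-bit quantizer on made-up weight values; it is a generic illustration of the technique, not the specific scheme used to compress Demucs.

```python
# Illustrative uniform 8-bit quantization: float weights -> integer codes + scale.
# Toy values throughout; not the Demucs compression code.

def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return [c * scale + lo for c in codes]

weights = [-0.51, 0.12, 0.33, -0.08, 0.49]       # hypothetical layer weights
codes, scale, lo = quantize(weights)
restored = dequantize(codes, scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# 8-bit codes take 4x less storage than float32, at the cost of
# at most half a quantization step (scale / 2) of rounding error per weight.
```

The trade-off is exactly the one the paper exploits: a fourfold storage reduction for a rounding error small enough to leave separation quality essentially intact.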
Theoretically, this research contributes to the ongoing discourse on whether waveform-based techniques can outperform traditional spectrogram-centric approaches. The findings challenge the preconception that spectrogram-based separation must dominate by virtue of its mature, well-tuned pipelines, despite its inherent phase-reconstruction problem, and suggest that advances in time-domain processing may be driving a paradigm shift.
Future Prospects
Looking ahead, refining the Demucs model to mitigate inter-source bleeding and to improve separation fidelity on other difficult sources presents a promising avenue for exploration. Expanding dataset diversity or exploring semi-supervised techniques may further bolster its adaptability and generalization. Moreover, cross-pollination between the waveform and spectrogram domains might yield hybrid architectures that combine the strengths of both to address current limitations.
In conclusion, the paper provides a substantive contribution to music source separation. Demucs stands as a testament to the burgeoning potential of direct waveform operations, opening novel pathways for future research and practical application in the field of audio signal processing.