
Benchmarks and leaderboards for sound demixing tasks (2305.07489v2)

Published 12 May 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Music demixing is the task of separating different tracks from the given single audio signal into components, such as drums, bass, and vocals from the rest of the accompaniment. Separation of sources is useful for a range of areas, including entertainment and hearing aids. In this paper, we introduce two new benchmarks for the sound source separation tasks and compare popular models for sound demixing, as well as their ensembles, on these benchmarks. For the models' assessments, we provide the leaderboard at https://mvsep.com/quality_checker/, giving a comparison for a range of models. The new benchmark datasets are available for download. We also develop a novel approach for audio separation, based on the ensembling of different models that are suited best for the particular stem. The proposed solution was evaluated in the context of the Music Demixing Challenge 2023 and achieved top results in different tracks of the challenge. The code and the approach are open-sourced on GitHub.

Authors (3)
  1. Roman Solovyev (8 papers)
  2. Alexander Stempkovskiy (3 papers)
  3. Tatiana Habruseva (3 papers)
Citations (5)

Summary

  • The paper introduces new benchmarks and leaderboards for evaluating sound demixing models using the Signal-to-Distortion Ratio, enhancing model comparisons.
  • It demonstrates that ensemble approaches often outperform single models in vocal separation, underscoring the benefit of combined strategies.
  • The proposed Synth MVSep and Multisong MVSep datasets provide diverse, randomized testbeds for advancing audio signal processing in entertainment and assistive tech.

An Overview of "Benchmarks and Leaderboards for Sound Demixing Tasks"

The paper "Benchmarks and Leaderboards for Sound Demixing Tasks" addresses music demixing, a core problem in audio signal processing. Its focus is the development of new benchmarks and a comparative analysis of existing and novel approaches to sound source separation. The work is relevant to areas such as entertainment and assistive hearing technologies.

Introduction

The authors address the task of separating a mixed audio signal into component tracks such as vocals, drums, bass, and accompaniment, which has diverse applications, for instance improving the intelligibility of movie dialogue or enhancing karaoke experiences. The paper contributes two new datasets, Synth MVSep and Multisong MVSep, as benchmarks for evaluating sound demixing models and their ensembles. In addition, the research explores ensemble methodologies and maintains a public leaderboard for comparing model performance.

Sound Demixing Models

The paper surveys popular models such as Demucs, MDX-Net, Ultimate Vocal Remover, and Spleeter, highlighting their architectures and core functionalities. Most of these models adopt variations of encoder-decoder architectures and differ in their capacity for vocal versus instrumental separation. The authors also note that the datasets used in previous challenges have been heavily exploited during model development, underscoring the need for fresh benchmarks.

New Benchmarks

The Synth MVSep dataset is synthesized by mixing free vocal and instrumental samples, offering a randomized testbed for audio separation algorithms. In contrast, the Multisong MVSep dataset comprises real compositions from various music genres, presenting a wide variety of test cases for model evaluation. The existence of these datasets and an open leaderboard facilitates the benchmarking of different algorithms, encouraging unbiased comparisons among models.

Evaluation Metrics

The evaluation primarily uses the Signal-to-Distortion Ratio (SDR), a standard metric in audio separation studies, to quantify model performance across the separated audio stems. The paper provides a detailed evaluation of both single models and ensembles, and proposes a novel ensemble approach that combines predictions from several models to improve separation quality.
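To make the metric concrete, here is a minimal sketch of the global SDR in its commonly used simplified form, `10 * log10(||s||^2 / ||s - s_hat||^2)`. This is an illustrative implementation, not the paper's exact evaluation code; challenge toolkits typically compute SDR with additional windowing and alignment steps.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Global Signal-to-Distortion Ratio in dB (simplified form).

    SDR = 10 * log10(||s||^2 / ||s - s_hat||^2), with eps guarding
    against division by zero for a perfect estimate.
    """
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

# A perfect estimate yields a very high SDR; added noise lowers it.
rng = np.random.default_rng(0)
ref = np.sin(np.linspace(0.0, 100.0, 44100))
noisy = ref + 0.01 * rng.standard_normal(44100)
print(sdr(ref, ref))    # very high (bounded only by eps)
print(sdr(ref, noisy))  # markedly lower for the noisy estimate
```

Higher SDR means the estimate is closer to the reference stem, which is why the leaderboard ranks models by SDR per stem.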

Results and Discussion

The results indicate that ensemble models generally outperform single models in terms of SDR. In particular, configurations involving MDX variations demonstrate superior performance in vocal separation tasks. Synth MVSep and Multisong MVSep benchmarks reveal that different model configurations achieve optimal results depending on the specific separation tasks, whether for vocals or instrumental stems like bass and drums.
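A simple form of such ensembling is a weighted average of the waveform predictions from several models, with weights tuned per stem on a validation set. The sketch below is an assumption about the general technique, not the authors' exact combination scheme; the weight values are hypothetical.

```python
import numpy as np

def ensemble_stems(predictions: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Weighted average of per-model waveform predictions for one stem.

    predictions: arrays of identical shape (channels, samples), one per model.
    weights: per-model weights (hypothetical here; in practice tuned per
    stem, e.g. higher weight for models that score better on vocals).
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()                          # normalize to preserve level
    stacked = np.stack(predictions)          # (models, channels, samples)
    return np.tensordot(w, stacked, axes=1)  # (channels, samples)
```

Because different models win on different stems, a separate weight vector (or even a different model subset) can be used for vocals, bass, and drums.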

Sound Demixing Challenge 2023

The paper discusses the authors' submission to the Sound Demixing Challenge 2023, where their ensemble approach achieved top results in different tracks. A distinctive step in their methodology was separating vocals first and then processing the remaining stems, highlighting the effectiveness of the proposed strategy in a competitive evaluation setting.
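The vocal-first idea can be sketched as a two-stage pipeline: estimate vocals from the full mixture, subtract them, and split the residual into the remaining stems. The function and model interfaces below are assumptions for illustration, not the authors' actual code.

```python
import numpy as np

def two_stage_separation(mixture, vocal_model, instrumental_model):
    """Sketch of a vocal-first demixing pipeline (assumed interfaces).

    1. vocal_model(mixture) -> vocal waveform, same shape as mixture.
    2. Subtract the vocals to get an instrumental residual.
    3. instrumental_model(residual) -> dict of remaining stems.
    """
    vocals = vocal_model(mixture)
    residual = mixture - vocals
    other_stems = instrumental_model(residual)  # e.g. {"bass": ..., "drums": ...}
    return {"vocals": vocals, **other_stems}

# Toy stand-ins for real separation models, just to show the data flow.
mix = np.ones((2, 8))
stems = two_stage_separation(
    mix,
    vocal_model=lambda m: 0.5 * m,
    instrumental_model=lambda m: {"bass": 0.5 * m, "drums": 0.5 * m},
)
```

Removing vocals before separating the accompaniment can simplify the second stage, since the instrumental models no longer have to suppress vocal leakage themselves.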

Implications and Future Work

The implications of this paper are twofold: practically, it offers immediate utility for developing more nuanced applications in entertainment and assistive technologies. Theoretically, the new benchmarks and documented results lay groundwork for future research in model generalization and domain-specific adaptations.

The open-sourcing of their code and approaches on GitHub and the ongoing updates to their leaderboards suggest a commitment to transparency and community engagement. This work adds valuable resources to the audio signal processing research community and sets the stage for future advancements in AI-driven sound separation technologies.
