The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks (2110.09958v2)

Published 19 Oct 2021 in eess.AS and cs.SD

Abstract: The cocktail party problem aims at isolating any source of interest within a complex acoustic scene, and has long inspired audio source separation research. Recent efforts have mainly focused on separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. However, separating an audio mixture (e.g., movie soundtrack) into the three broad categories of speech, music, and sound effects (understood to include ambient noise and natural sound events) has been left largely unexplored, despite a wide range of potential applications. This paper formalizes this task as the cocktail fork problem, and presents the Divide and Remaster (DnR) dataset to foster research on this topic. DnR is built from three well-established audio datasets (LibriSpeech, FMA, FSD50k), taking care to reproduce conditions similar to professionally produced content in terms of source overlap and relative loudness, and made available at CD quality. We benchmark standard source separation algorithms on DnR, and further introduce a new multi-resolution model to better address the variety of acoustic characteristics of the three source types. Our best model produces SI-SDR improvements over the mixture of 11.0 dB for music, 11.2 dB for speech, and 10.8 dB for sound effects.

Citations (34)

Summary

  • The paper introduces a novel MRX architecture that leverages multi-resolution STFT to achieve SI-SDR improvements of up to 11.2 dB across speech, music, and sound effects.
  • It constructs the Divide and Remaster (DnR) dataset by combining high-fidelity audio from LibriSpeech, FMA, and FSD50K under realistic mixing conditions.
  • The study’s findings advance single-channel audio separation techniques, paving the way for improved automated editing and interactive media experiences.

Essay on "The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks"

The paper "The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks" addresses a notable challenge in the field of audio source separation that has been relatively underexplored: differentiating a complex audio mixture into its core components—speech, music, and sound effects. This problem, aptly named the "cocktail fork problem," stems from the more generalized cocktail party problem but narrows its focus to the distinctive categories pertinent to many audio-visual media forms, such as films and podcasts.

Dataset and Problem Formulation

The researchers introduce the Divide and Remaster (DnR) dataset, which is constructed by amalgamating three publicly available datasets: LibriSpeech for speech, Free Music Archive (FMA) for music, and FSD50K for sound effects. Each dataset contributes audio at a 44.1 kHz sampling rate, which is crucial for applications where high-fidelity output is desired. The authors emphasize the importance of maintaining realistic mixing conditions in DnR by matching the source overlap and relative loudness of professionally produced content, factors critical for transferring models trained on DnR to practical use.
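
Below is a minimal sketch of how such a three-stem mixture might be assembled, assuming three mono stems already resampled to 44.1 kHz; the relative loudness targets are illustrative placeholders, not the exact levels specified for DnR.

```python
import numpy as np

def rms_db(x, eps=1e-8):
    """Root-mean-square level of a signal in dB."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) + eps)

def scale_to_db(x, target_db):
    """Apply a gain so the signal reaches the target RMS level in dB."""
    return x * 10 ** ((target_db - rms_db(x)) / 20)

def make_mixture(speech, music, sfx, speech_db=-17.0, music_db=-24.0, sfx_db=-21.0):
    """Mix three stems at illustrative relative loudness levels
    (placeholder values, not DnR's published specification)."""
    stems = {
        "speech": scale_to_db(speech, speech_db),
        "music": scale_to_db(music, music_db),
        "sfx": scale_to_db(sfx, sfx_db),
    }
    mixture = stems["speech"] + stems["music"] + stems["sfx"]
    return mixture, stems

# Example with random placeholder audio (10 s at 44.1 kHz)
sr = 44100
speech, music, sfx = (np.random.randn(10 * sr) * 0.1 for _ in range(3))
mix, stems = make_mixture(speech, music, sfx)
```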

Methodology and Architectural Innovations

The paper benchmarks existing source separation algorithms and introduces a novel multi-resolution CrossNet architecture (denoted as MRX). This architecture stands out as it integrates multiple time-frequency resolutions via Short-Time Fourier Transform (STFT), allowing the model to better accommodate the diverse acoustic features of speech, music, and sound effects. By employing multiple STFT resolutions concurrently, MRX effectively handles audio features that manifest differently across temporal and spectral domains.
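
To make the multi-resolution idea concrete, the sketch below computes magnitude spectrograms of the same waveform at several STFT window lengths while keeping a common hop size so the frame rates stay aligned; the window lengths here are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def multi_resolution_stft(x, win_lengths=(256, 1024, 4096), hop=256):
    """Compute STFT magnitudes of one waveform at several window lengths.
    A shared hop size keeps the number of frames identical across
    resolutions, so features can later be combined frame by frame."""
    mags = []
    for win in win_lengths:
        spec = torch.stft(
            x,
            n_fft=win,
            hop_length=hop,
            win_length=win,
            window=torch.hann_window(win),
            return_complex=True,
        )
        mags.append(spec.abs())  # shape: (win // 2 + 1, num_frames)
    return mags

# Example: 6 seconds of audio at 44.1 kHz
x = torch.randn(6 * 44100)
mags = multi_resolution_stft(x)
```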

MRX utilizes multiple BLSTM stacks and decoding paths, enhancing its ability to separate sources by leveraging the complementary temporal and spectral information captured at the different resolutions. This design choice has yielded tangible improvements over single-resolution counterparts, as evidenced by SI-SDR gains.
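
The following is a rough, self-contained sketch of this general pattern (multi-resolution magnitude features feeding a BLSTM stack that predicts per-source, per-resolution masks); it illustrates the structure under stated assumptions rather than reproducing the paper's exact MRX implementation, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MultiResSeparator(nn.Module):
    """Toy separator: concatenate multi-resolution magnitudes, run a BLSTM
    stack, and predict one mask per (source, resolution) pair."""
    def __init__(self, freq_bins=(129, 513, 2049), n_sources=3, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(sum(freq_bins), hidden, num_layers=3,
                             batch_first=True, bidirectional=True)
        # One decoding head per source and per STFT resolution.
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden, f)
            for _ in range(n_sources) for f in freq_bins
        )

    def forward(self, mags):
        # mags: list of magnitude spectrograms, each (batch, frames, freq)
        feats = torch.cat(mags, dim=-1)
        h, _ = self.blstm(feats)
        return [torch.sigmoid(head(h)) for head in self.heads]

# Example: batch of 2 clips, 100 frames, three STFT resolutions
mags = [torch.rand(2, 100, f) for f in (129, 513, 2049)]
masks = MultiResSeparator()(mags)
```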

Experimental Results

The numerical results underscore the superiority of MRX, which demonstrates significant SI-SDR improvements compared to other models such as the single-resolution X-UMX and Conv-TasNet. The authors report SI-SDR improvements of 11.0 dB for music, 11.2 dB for speech, and 10.8 dB for sound effects when using the MRX architecture on the DnR dataset. These results suggest that multi-resolution strategies can markedly enhance source separation accuracy in complex audio environments.
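
For reference, SI-SDR and the improvement over the unprocessed mixture (SI-SDRi) can be computed as in the sketch below; this follows the standard scale-invariant definition and is a small illustration, not the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB. The estimate is
    projected onto the reference so gain differences do not affect the score."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: gain over simply using the mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```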

Importantly, the paper also analyzes scenarios where the sources only partially overlap, clarifying the conditions under which separation performance varies. This examination helps elucidate the strengths and limits of the proposed approach, offering clear insights into model performance across different audio scenes.

Implications and Future Directions

The implications of this research are considerable. Successful separation of speech, music, and sound effects in real-world, single-channel audio offers promising avenues for augmented reality, automated content editing, and enhanced interactive media experiences. The structured approach to modeling and dataset development can facilitate future advances in transcription, remixing, and sound description systems, with applications ranging from personal media consumption to professional audio editing.

Looking forward, an integrative approach combining advanced source separation techniques with downstream tasks like acoustic scene understanding and semantic audio-visual modeling could yield systems capable of intricate auditory content manipulation and analysis. The continuous refinement of datasets such as DnR and models like MRX will likely spur further innovation in these domains, advancing the intersection of artificial intelligence and auditory signal processing.
