Universal Sound Separation: An Overview of Methodology and Findings
The paper "Universal Sound Separation" presents a paper aimed at addressing the challenge of discriminating between different sounds in complex acoustic mixtures using advanced deep learning techniques. Recent accomplishments in speech enhancement and separation, primarily focused on speech signals, have highlighted the potential and limitations of these methods when applied to a broader category of non-speech audio tasks. The concept of universal sound separation is crucial for enabling machines to distinguish between arbitrary sounds, encapsulating hundreds of sound types, which surpasses the field of speech-based separation.
Dataset Creation and Methodology
The authors built a comprehensive dataset from the Pro Sound library, covering a wide variety of real-world sounds; ambiences and environments were excluded because of their inherently composite nature. The dataset consists of monaural recordings, eliminating directional cues so that separation must rely on spectro-temporal structure alone. Each mixture was assembled from short clips under a randomized combination protocol, ensuring diversity and providing a robust testbed for experimentation.
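To make the mixing protocol concrete, here is a minimal sketch of how such monaural mixtures might be assembled; the clip length, number of sources, and gain range are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def make_mixture(sources, rng, gain_range_db=(-5.0, 5.0)):
    """Mix monaural source clips at randomized relative levels.

    sources: list of equal-length 1-D arrays (single-channel clips).
    Returns the mixture and the rescaled sources used as training targets.
    """
    scaled = []
    for s in sources:
        rms = np.sqrt(np.mean(s ** 2)) + 1e-8   # normalize each clip to unit RMS...
        gain_db = rng.uniform(*gain_range_db)   # ...then draw a random relative level
        scaled.append(s / rms * 10 ** (gain_db / 20.0))
    mixture = np.sum(scaled, axis=0)
    return mixture, scaled

rng = np.random.default_rng(0)
clips = [rng.standard_normal(3 * 16000) for _ in range(3)]  # stand-ins for short clips
mix, targets = make_mixture(clips, rng)
```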
The paper explores mask-based separation systems built on deep neural networks, experimenting with combinations of network architectures, specifically convolutional LSTMs and time-dilated convolutional networks (TDCN), and analysis-synthesis bases, namely the STFT and learnable bases. These explorations focus on tuning the window size to improve performance across tasks, particularly speech/non-speech separation versus arbitrary sound separation. Notably, the TDCN++ model introduces several architectural modifications that enhance separation, including improved initialization, normalization techniques, and iterative processing strategies.
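A minimal sketch of the mask-based pipeline these systems share is shown below: an analysis transform, per-source mask estimation, and synthesis. The `dummy_mask_net` stand-in replaces the paper's ConvLSTM/TDCN mask networks, and the window size is an arbitrary placeholder.

```python
import numpy as np
from scipy.signal import stft, istft

def mask_based_separation(mixture, mask_net, fs=16000, win=256, n_src=2):
    """Mask-based pipeline: STFT analysis -> per-source masks -> iSTFT synthesis."""
    _, _, Z = stft(mixture, fs=fs, nperseg=win)      # complex spectrogram
    masks = mask_net(np.abs(Z), n_src)               # (n_src, F, T), values in [0, 1]
    estimates = []
    for m in masks:
        _, x_hat = istft(m * Z, fs=fs, nperseg=win)  # apply mask to the complex STFT
        estimates.append(x_hat[: len(mixture)])
    return np.stack(estimates)

# Stand-in for a trained network: uniform masks that simply split the mixture.
def dummy_mask_net(mag, n_src):
    return np.ones((n_src,) + mag.shape) / n_src

mix = np.random.default_rng(1).standard_normal(3 * 16000)
est = mask_based_separation(mix, dummy_mask_net, n_src=2)
```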
Key Findings and Implications
Several notable insights emerge from the findings. Contrary to expectations, on the universal sound separation task the STFT outperforms learned bases, a marked deviation from speech/non-speech separation, where learned bases perform better. The optimal window size also diverges considerably between the two task domains, with short windows generally preferred for arbitrary sounds regardless of basis type.
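To make the STFT-versus-learned-basis contrast concrete, the sketch below implements a learnable analysis/synthesis pair as 1-D convolutions, in the spirit of TasNet-style front ends; the filter count, window, and hop values are illustrative guesses, not the paper's tuned settings.

```python
import torch
import torch.nn as nn

class LearnedBasis(nn.Module):
    """Learnable analysis/synthesis transform, an alternative to the fixed STFT."""
    def __init__(self, n_filters=256, win=40, hop=20):
        super().__init__()
        self.analysis = nn.Conv1d(1, n_filters, kernel_size=win, stride=hop, bias=False)
        self.synthesis = nn.ConvTranspose1d(n_filters, 1, kernel_size=win, stride=hop, bias=False)

    def forward(self, x):                          # x: (batch, 1, time)
        coeffs = torch.relu(self.analysis(x))      # nonnegative, spectrogram-like features
        return self.synthesis(coeffs)              # overlap-add resynthesis of the waveform

basis = LearnedBasis()
x = torch.randn(1, 1, 3 * 16000)
y = basis(x)  # same length as x for these window/hop choices
```

Unlike the fixed STFT, both transforms here are trained jointly with the separation network, which is one plausible reason they adapt well to structured domains like speech.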
The iTDCN++ models proved especially effective, achieving substantial gains in scale-invariant signal-to-distortion ratio improvement (SI-SDRi). For sporadic and arbitrary sounds, better results are observed with short frames, suggesting that they better capture the transient nature of many non-speech audio components. Iterative processing with iTDCN++ and deeper model structures further improves separation quality, pointing to promising avenues for future exploration.
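For reference, SI-SDR and its improvement over the unprocessed mixture (SI-SDRi) can be computed as in the small numpy sketch below, following the standard definition of the metric.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (higher is better)."""
    # Project the estimate onto the reference to find the optimal scaling,
    # making the metric invariant to the estimate's overall gain.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDRi: gain over using the unprocessed mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```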
Theoretical and Practical Implications
This research opens pathways toward more generalized approaches to sound separation. The success of iterative processing, together with the superiority of learned bases in structured sound domains such as speech, raises questions about the interplay between architectural complexity and sound diversity. Recognizing distinctions between speech and other sound categories may improve universal sound separation systems, potentially enabling applications ranging from automated audio editing to surveillance in non-speech-dominated environments.
Future studies may pursue iterative network designs and embedding-based techniques that transcend the limitations of time-frequency strategies, working toward robust universal sound separation in the long term. Additionally, evolving neural architectures to handle large-scale, multi-class sound mixtures remains a pivotal challenge for advancing machine hearing technologies.
The paper lays a solid foundation for ongoing research, providing essential insight into the delicate balances required in network design and data preprocessing. These explorations are vital steps toward the ultimate vision of a comprehensive sound separation paradigm applicable across diverse auditory landscapes.