Sudo rm -rf: Efficient Networks for Universal Audio Source Separation

Published 14 Jul 2020 in eess.AS, cs.CL, cs.LG, cs.SD, and stat.ML | (2007.06833v1)

Abstract: In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality audio source separation with limited number of floating point operations, memory requirements, number of parameters and latency. Our experiments on both speech and environmental sound separation datasets show that SuDoRMRF performs comparably and even surpasses various state-of-the-art approaches with significantly higher computational resource requirements.

Citations (117)

Summary

  • The paper introduces SuDoRM-RF, an efficient convolutional neural network architecture using successive downsampling and resampling for multi-resolution audio feature extraction.
  • Experiments show SuDoRM-RF achieves performance comparable to state-of-the-art models with significantly reduced computational demands (FLOPs and parameters).
  • The architecture's low memory usage and latency make it highly suitable for deployment in resource-constrained environments like mobile and embedded devices.

Efficient Networks for Universal Audio Source Separation

The paper "Sudo rm -rf: Efficient Networks for Universal Audio Source Separation" introduces SuDoRM-RF, a convolutional neural network architecture for end-to-end audio source separation. Its goal is high-quality separation with fewer floating point operations, lower memory requirements, fewer parameters, and lower latency than is typical for audio separation models, which tend to be resource-intensive.

Network Architecture and Methodology

The model's core innovation is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF), which extracts features at several temporal resolutions and aggregates them using simple one-dimensional convolutions. This lets the network capture multi-resolution features while remaining computationally efficient. Rather than relying on depth-wise separable convolutions with large dilation factors, which often introduce artifacts, SuDoRM-RF combines depth-wise convolutions with repeated downsampling and resampling to enlarge the temporal receptive field.
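To illustrate why successive downsampling can stand in for large dilation factors, the sketch below computes the receptive field of a stack of 1-D convolutions using the standard recurrence (each layer adds `(kernel - 1) * dilation` steps, scaled by the product of all earlier strides). The kernel size of 5 and depth of 4 are illustrative assumptions, not values taken from the summary above, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """Receptive field, in input samples, of stacked 1-D convolutions.

    layers: list of (kernel, stride, dilation) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump = spacing of this layer's inputs, in input samples
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Four stride-2 layers (successive downsampling), no dilation.
strided = receptive_field([(5, 2, 1)] * 4)

# Four stride-1 layers, no dilation: the receptive field grows only linearly.
plain = receptive_field([(5, 1, 1)] * 4)

# Four stride-1 layers with dilations 1, 2, 4, 8.
dilated = receptive_field([(5, 1, 1), (5, 1, 2), (5, 1, 4), (5, 1, 8)])

print(strided, plain, dilated)  # 61 17 61
```

Under these assumptions the strided stack reaches the same receptive field as the dilated one, while every layer after the first operates on a shorter sequence, which is where the savings in computation would come from.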

The architecture includes an adaptive encoder and decoder, so the model operates end to end on the waveform. The encoder transforms the input audio into a latent representation, which is passed to the separation module that estimates one mask per audio source. The masks are applied to the latent representation, and the decoder reconstructs each separated source from the masked result.
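The encoder, mask, and decoder stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the fixed `basis` filters stand in for learned encoder and decoder weights, the `scores` array stands in for the output of the separation module, and normalizing the masks with a softmax across sources is an assumption made here for the example.

```python
import math

def encode(signal, basis, stride):
    """Strided 1-D convolution: project each frame onto every basis filter."""
    k = len(basis[0])
    return [[sum(w * signal[t + i] for i, w in enumerate(f)) for f in basis]
            for t in range(0, len(signal) - k + 1, stride)]

def softmax_masks(scores):
    """scores[s][t][c] -> masks[s][t][c] that sum to 1 across sources s."""
    n_src, n_frames, n_ch = len(scores), len(scores[0]), len(scores[0][0])
    masks = [[[0.0] * n_ch for _ in range(n_frames)] for _ in range(n_src)]
    for t in range(n_frames):
        for c in range(n_ch):
            exps = [math.exp(scores[s][t][c]) for s in range(n_src)]
            total = sum(exps)
            for s in range(n_src):
                masks[s][t][c] = exps[s] / total
    return masks

def decode(latent, basis, stride, length):
    """Transposed 1-D convolution: overlap-add the weighted basis filters."""
    out = [0.0] * length
    for t, frame in enumerate(latent):
        for coeff, f in zip(frame, basis):
            for i, w in enumerate(f):
                out[t * stride + i] += coeff * w
    return out

def separate(mixture, enc_basis, dec_basis, scores, stride):
    """Encode the mixture, mask the latent once per source, decode each result."""
    latent = encode(mixture, enc_basis, stride)
    sources = []
    for mask in softmax_masks(scores):
        masked = [[m * v for m, v in zip(m_row, v_row)]
                  for m_row, v_row in zip(mask, latent)]
        sources.append(decode(masked, dec_basis, stride, len(mixture)))
    return sources

# Hypothetical toy inputs: 8 samples, 2 filters of length 4, stride 2, 2 sources.
mixture = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25, 0.75, 0.0]
basis = [[0.5, 0.5, 0.5, 0.5], [0.5, -0.5, 0.5, -0.5]]
scores = [[[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]],
          [[1.0, 0.0], [-2.0, 1.0], [0.5, -0.5]]]
sources = separate(mixture, basis, basis, scores, stride=2)
```

Because the masks in this sketch sum to one and the decoder is linear, the separated sources add up to the decoding of the unmasked latent representation.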

Experimental Validation

The efficacy of SuDoRM-RF was validated across multiple datasets, including speech separation tasks on WSJ0-2mix and environmental sound separation using the ESC50 dataset. The results underscore that SuDoRM-RF can match or even surpass the performance of more computationally intensive models like ConvTasNet and DPRNN. Notably, SuDoRM-RF delivers these results with a fraction of the computational load, rendering it suitable for deployment in resource-constrained environments.

Computational Efficiency

A significant focus of the paper is on optimizing computational resources:

  • FLOPs and Parameters: SuDoRM-RF models require substantially fewer floating point operations (FLOPs) than other state-of-the-art architectures while keeping the parameter count low, which makes them efficient for both training and inference.
  • Memory and Latency: The network demonstrates low memory usage and latency during processing, essential for applications on mobile and embedded devices, where computational resources are limited.
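As a rough illustration of where such savings can come from, the sketch below counts parameters and multiply-accumulate operations (MACs) for a bias-free 1-D convolution, and compares a standard convolution with a depth-wise convolution followed by a pointwise one. The channel count, kernel size, and sequence length are hypothetical values chosen for the example and are not figures reported in the paper; `conv1d_cost` is likewise a hypothetical helper.

```python
def conv1d_cost(c_in, c_out, kernel, length, stride=1, groups=1):
    """Return (parameters, MACs) for a bias-free 1-D convolution.

    Assumes padding that yields ceil(length / stride) output positions.
    """
    params = (c_in // groups) * c_out * kernel
    out_len = -(-length // stride)  # ceiling division
    return params, params * out_len

C, K, T = 512, 5, 4000  # hypothetical channels, kernel size, sequence length

standard = conv1d_cost(C, C, K, T)
depthwise = conv1d_cost(C, C, K, T, groups=C)  # one filter per channel
pointwise = conv1d_cost(C, C, 1, T)            # 1x1 convolution mixes channels

separable_params = depthwise[0] + pointwise[0]
separable_macs = depthwise[1] + pointwise[1]

# Downsampling with stride 2 halves the number of output positions,
# and therefore the MACs, of the depth-wise convolution.
depthwise_strided = conv1d_cost(C, C, K, T, stride=2, groups=C)

print(standard[0], separable_params)  # 1310720 264704
```

With these illustrative numbers the depth-wise plus pointwise pair uses roughly a fifth of the parameters of the standard convolution, and the strided depth-wise layer costs half the MACs of its stride-1 counterpart.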

Theoretical and Practical Implications

The SuDoRM-RF architecture highlights a critical shift towards resource-efficient deep learning models capable of maintaining high performance. The approach of leveraging multi-resolution feature extraction without extensive computational overhead suggests a broader applicability in other domains requiring real-time processing on edge devices.

Future Directions

The paper acknowledges the potential for further refinement and application of SuDoRM-RF in real-time audio processing tasks. There is scope for extending this architecture to more complex separation challenges and integrating it with other machine learning paradigms, such as meta-learning, to dynamically adjust network configurations based on specific computational constraints and target tasks.

In conclusion, the SuDoRM-RF architecture represents a significant step forward in audio source separation, offering a practical balance between performance and efficiency. Its deployment capability in constrained environments signals a promising avenue for broader applications in audio processing and beyond.
