The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network (2305.07855v2)

Published 13 May 2023 in eess.AS and cs.SD

Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. It consists of three components: (i) multi-domain loss (MDL), (ii) a bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL makes it possible to take advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of the X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.

Citations (4)

Summary

  • The paper presents the X-scheme that combines multi-domain loss, bridging operations, and combination loss to leverage both time- and frequency-domain audio signals.
  • It demonstrates improved performance with higher SDR and SIR metrics on models like Open-Unmix and D3Net without increasing the parameter count.
  • The approach scales to larger training sets and generalizes across several architectures, making it a promising method for advancing DNN-based music source separation.

Overview of "The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network"

This paper presents a methodology, termed the X-scheme, to enhance deep neural network (DNN)-based music source separation (MSS). The approach has three key components: multi-domain loss (MDL), a bridging operation, and combination loss (CL).

Key Components

  1. Multi-Domain Loss (MDL): MDL enables the network to leverage both frequency- and time-domain representations of audio signals by appending an additional differentiable STFT or inverse STFT layer during training. This allows for loss calculations in both domains, thereby enhancing the model's performance without affecting the inference step.
  2. Bridging Operation: This operation introduces paths that connect individual instrument networks, facilitating interaction and information sharing among them. Importantly, this is achieved without increasing the learnable parameters, thus maintaining the efficiency of the network.
  3. Combination Loss (CL): CL computes loss functions based on combinations of output sources, helping networks learn the correlations between different instrument estimations. This enables a more effective separation process by considering interactions among sources during training.
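
The three components can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, a naive DFT stands in for the differentiable STFT layer, mean squared error is assumed in both domains, the bridge is assumed to be an element-wise average across instrument branches (one parameter-free way to share information), and the combination loss is assumed to cover every subset of one to N-1 sources.

```python
import cmath
import itertools
import math


def mse(a, b):
    """Mean squared error between two equal-length real sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def magnitude_spectrum(signal):
    """Magnitudes of a naive DFT (assumed stand-in for an STFT layer)."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def multi_domain_loss(estimate, target, time_weight=1.0):
    """MDL sketch: frequency-domain loss plus a weighted time-domain loss."""
    freq_loss = mse(magnitude_spectrum(estimate), magnitude_spectrum(target))
    time_loss = mse(estimate, target)
    return freq_loss + time_weight * time_loss


def bridge(features_by_source):
    """Bridging sketch: every branch receives the average of all branches.

    Averaging is an assumption; it adds no learnable parameters.
    """
    names = list(features_by_source)
    length = len(features_by_source[names[0]])
    avg = [sum(features_by_source[n][i] for n in names) / len(names)
           for i in range(length)]
    return {n: list(avg) for n in names}


def source_combinations(names):
    """All subsets of size 1..N-1 (assumed choice of combinations)."""
    combos = []
    for size in range(1, max(len(names), 2)):
        combos.extend(itertools.combinations(names, size))
    return combos


def combination_loss(estimates, targets, time_weight=1.0):
    """CL sketch: average MDL over summed sources for each combination."""
    combos = source_combinations(sorted(targets))
    total = 0.0
    for combo in combos:
        mixed_estimate = [sum(v) for v in zip(*(estimates[n] for n in combo))]
        mixed_target = [sum(v) for v in zip(*(targets[n] for n in combo))]
        total += multi_domain_loss(mixed_estimate, mixed_target, time_weight)
    return total / len(combos)
```

Because `multi_domain_loss` and `combination_loss` are only evaluated during training, dropping them at inference leaves the separation network itself unchanged, which matches the claim that MDL and CL do not affect the inference step.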

Experimental Results

The X-scheme was tested on existing MSS models, namely Open-Unmix (UMX), D3Net, and Conv-TasNet, yielding X-UMX, X-D3Net, and X-Conv-TasNet, respectively. Experiments demonstrated notable improvements:

  • Performance Improvements: The X-UMX and X-D3Net models showed superior performance compared to their original versions, validated through metrics like Signal-to-Distortion Ratio (SDR) and Source-to-Interference Ratio (SIR).
  • Scalability and Generalization: The X-scheme proved effective across different data scales. X-UMXL, trained on a larger internal dataset, significantly outperformed baseline models, indicating the scheme's robustness to data size variations.
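
For readers unfamiliar with the SDR metric mentioned above, the following sketch computes the basic energy-ratio form of signal-to-distortion ratio in decibels. It is an illustration only: the function name `sdr_db` and the `eps` stabilizer are assumptions, and the evaluation toolkits typically used for MSS benchmarks compute SDR and SIR with additional decomposition steps that are not reproduced here.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Basic SDR in dB: reference energy over error energy.

    eps is an assumed small constant that avoids division by zero
    and log of zero for silent or perfectly matched signals.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# An estimate that is 10% too quiet: the error has 1/100 of the
# reference energy, so the ratio is about 20 dB.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
print(round(sdr_db(reference, estimate), 3))
```

A higher value means the estimate is closer to the reference source, so an improvement from the X-scheme shows up as an increase in this number.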

Implications and Future Work

The X-scheme offers a significant enhancement to existing DNN-based MSS techniques while maintaining computational efficiency. The ability to improve performance without altering the inference stage is particularly advantageous for practical applications. Future research could expand the application of X-scheme to other MSS architectures and explore its potential in real-time MSS systems.

Conclusion

By introducing components that allow for the integration and cross-utilization of source information, the X-scheme represents a practical improvement in music source separation. Its adaptability to different network types and training-set sizes makes it a promising approach for further advances in audio signal processing.
