The Whole Is Greater than the Sum of Its Parts: Improving Music Source Separation by Bridging Network

Published 13 May 2023 in eess.AS and cs.SD | (2305.07855v2)

Abstract: This paper presents the crossing scheme (X-scheme) for improving the performance of deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. It consists of three components: (i) multi-domain loss (MDL), (ii) bridging operation, which couples the individual instrument networks, and (iii) combination loss (CL). MDL makes it possible to take advantage of both the frequency- and time-domain representations of audio signals. We modify the target network, i.e., the network architecture of the original DNN-based MSS, by adding bridging paths for each output instrument to share their information. MDL is then applied to the combinations of the output sources as well as each independent source; hence, we call it CL. MDL and CL can easily be applied to many DNN-based separation methods as they are merely loss functions that are only used during training and do not affect the inference step. The bridging operation does not increase the number of learnable parameters in the network. Experimental results showed the validity of Open-Unmix (UMX), densely connected dilated DenseNet (D3Net) and convolutional time-domain audio separation network (Conv-TasNet) extended with our X-scheme, respectively called X-UMX, X-D3Net and X-Conv-TasNet, by comparing them with their original versions. We also verified the effectiveness of X-scheme in a large-scale data regime, showing its generality with respect to data size. X-UMX Large (X-UMXL), which was trained on large-scale internal data and used in our experiments, is newly available at https://github.com/asteroid-team/asteroid/tree/master/egs/musdb18/X-UMX.


Summary

The paper introduces the X-Scheme, integrating multi-domain loss to leverage both time and frequency representations for improved music source separation.
The method employs a bridging operation and combination loss to enable inter-network communication, effectively addressing mutual interferences among audio sources.
Empirical validation on X-UMX, X-D3Net, and X-Conv-TasNet shows improved separation over the original networks, including in a large-scale data regime, with almost no added computational cost.

Improving Music Source Separation with the X-Scheme

Introduction

This paper introduces the X-scheme, a methodology designed to improve deep neural network (DNN)-based music source separation (MSS) with almost no increase in computational cost. The X-scheme has three components: multi-domain loss (MDL), a bridging operation, and combination loss (CL). MDL uses both frequency- and time-domain representations of the audio signal, while the bridging operation connects the networks dedicated to separating individual instruments. CL applies MDL to combinations of output sources as well as to each independent source. Together, these components address two limitations of existing MSS networks: they typically learn from a single signal domain, and they do not sufficiently account for mutual effects among the output sources.

Multi-Domain Loss

MDL is implemented by appending a differentiable short-time Fourier transform (STFT) or inverse STFT (ISTFT) layer to the target network during training, so that the output is available in both domains regardless of the domain in which the network operates. The loss is then computed in both domains: a mean squared error (MSE) term in the frequency domain and a weighted signal-to-distortion ratio (wSDR) term in the time domain. A scaling parameter $\alpha$ balances the two terms. Because the added layers and losses are used only during training, MDL improves separation quality without affecting inference cost.
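The following is a minimal sketch of how such a two-domain loss could be evaluated, not the authors' implementation. It uses NumPy only to show the arithmetic; actual training would need an automatic-differentiation framework with a differentiable STFT. The function names, the STFT settings (n_fft, hop), the wSDR formulation (taken from its usual definition in the speech-enhancement literature), and the default value of alpha are all assumptions made for illustration.

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=64):
    """Magnitude STFT of a 1-D signal with a Hann window, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def mse_freq(est, ref):
    """Frequency-domain term: MSE between magnitude spectrograms."""
    return float(np.mean((stft_mag(est) - stft_mag(ref)) ** 2))

def neg_cos(a, b, eps=1e-8):
    """Negative cosine similarity; -1 means the two signals are perfectly aligned."""
    return -float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def wsdr_time(mix, est, ref, eps=1e-8):
    """Time-domain term: weighted SDR loss in [-1, 1], lower is better."""
    resid_ref, resid_est = mix - ref, mix - est  # everything that is not the target
    e_ref, e_resid = np.sum(ref ** 2), np.sum(resid_ref ** 2)
    rho = e_ref / (e_ref + e_resid + eps)        # energy-based weight
    return rho * neg_cos(ref, est) + (1.0 - rho) * neg_cos(resid_ref, resid_est)

def multi_domain_loss(mix, est, ref, alpha=10.0):
    """Frequency-domain MSE plus alpha times time-domain wSDR (alpha is a placeholder)."""
    return mse_freq(est, ref) + alpha * wsdr_time(mix, est, ref)
```

With a perfect estimate the MSE term is zero and the wSDR term approaches -1, so the total approaches -alpha; any degradation of the estimate raises the loss.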

Bridging Operation and Combination Loss

The bridging operation lets the networks for individual instruments exchange information, so that each network can take the mutual relationships among sources into account. This matters when the separation of one source affects that of the others. Combination loss (CL) complements this by computing the loss not only for each source independently but also for combinations of output sources, so that errors shared between sources are penalized jointly during training.
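A combination loss of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact formulation: it assumes that the combinations are all subsets of one up to N-1 sources, that a combination is formed by summing the signals, and that the per-combination losses are averaged. The function name combination_loss and the base_loss argument are hypothetical; base_loss stands in for MDL.

```python
from itertools import combinations

import numpy as np

def combination_loss(estimates, targets, base_loss):
    """Average base_loss over every combination of 1..N-1 sources.

    estimates, targets: dicts mapping source name to a NumPy array.
    base_loss: callable (estimate, target) -> float, a stand-in for MDL.
    Returns (mean loss, number of combinations evaluated).
    """
    names = sorted(estimates)
    total, count = 0.0, 0
    for k in range(1, len(names)):            # k = 1 covers each independent source
        for group in combinations(names, k):
            est_sum = sum(estimates[n] for n in group)  # summed estimated sources
            ref_sum = sum(targets[n] for n in group)    # summed reference sources
            total += base_loss(est_sum, ref_sum)
            count += 1
    return total / count, count
```

For four sources this evaluates 4 + 6 + 4 = 14 combinations. The full four-source sum is left out here because it would simply compare the reconstructed mixture with the original mixture.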

Application and Validation

The X-scheme was applied to three architectures: Open-Unmix (UMX), densely connected dilated DenseNet (D3Net), and the convolutional time-domain audio separation network (Conv-TasNet), yielding X-UMX, X-D3Net, and X-Conv-TasNet. In the reported experiments, each extended model was compared with its original version and outperformed it, and the effectiveness of the scheme was also verified in a large-scale data regime.

Scalability and Extension

The X-scheme also holds up in a large-scale data regime: the gains of X-UMX persisted in experiments with training sets of different sizes, including one about five times larger, which indicates generality with respect to data size. Bridging operations were found to be effective when inserted between semantic blocks of the network, such as fully connected layers and long short-term memory (LSTM) modules, yielding improved signal-to-distortion ratios (SDR).
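The placement of bridges between blocks can be illustrated with a toy forward pass. This sketch assumes that a bridge is a plain average of the hidden activations across the per-instrument branches, which is one parameter-free way to share information; the weight names (w_in, w_mid, w_out) are hypothetical, and the middle dense layer is only a stand-in for a recurrent block such as an LSTM.

```python
import numpy as np

def bridge(hidden_per_source):
    """Average hidden activations across branches and hand the result to every branch.

    The operation has no learnable parameters.
    """
    shared = np.mean(np.stack(list(hidden_per_source.values())), axis=0)
    return {name: shared for name in hidden_per_source}

def forward_with_bridges(x, branches):
    """Toy per-instrument networks with a bridge after each of the first two blocks."""
    # Block 1: per-branch fully connected layer.
    h = {n: np.tanh(x @ p["w_in"]) for n, p in branches.items()}
    h = bridge(h)
    # Block 2: per-branch dense layer standing in for the recurrent block.
    h = {n: np.tanh(h[n] @ p["w_mid"]) for n, p in branches.items()}
    h = bridge(h)
    # Block 3: per-branch output layer, so the outputs still differ per source.
    return {n: h[n] @ p["w_out"] for n, p in branches.items()}
```

Each branch keeps its own weights, so the parameter count is the same as for the unbridged networks; only the activations passed between blocks are shared.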

Conclusion

The X-scheme is an efficient way to improve DNN-based MSS. By combining MDL, the bridging operation, and CL, it improves separation performance at almost no additional computational cost: MDL and CL are loss functions used only during training, and the bridging operation adds no learnable parameters. Because MDL and CL leave the inference step unchanged, they can be applied to many DNN-based separation methods, and the scheme remained effective across training sets of different sizes. These results suggest practical benefits for audio processing applications, with possible future extensions to other multi-task settings.
