X-UMX Models for Music Separation

Updated 28 November 2025
  • X-UMX based models are advanced deep neural architectures for music source separation that integrate cross-target feature sharing and multi-domain loss optimization.
  • They leverage bridging operations and combination losses to reduce correlated errors, achieving up to a +0.46 dB improvement in average SDR on benchmarks like MUSDB18.
  • Variants such as xumx-sliCQ and X-UMXL highlight the framework's adaptability and scalability, though non-STFT adaptations can struggle with ragged time-frequency representations.

X-UMX based models constitute a class of deep neural architectures for music source separation that extend or adapt the Open-Unmix (UMX) family by incorporating explicit cross-target information sharing, multi-domain objectives, or advanced training paradigms. These models are designed to address limitations of conventional independent-branch mask-based separators by enabling branches to share feature-level context, leveraging both spectrogram and time-domain signal representations, and enforcing disentanglement through losses that penalize correlated errors across sources. The X-UMX framework encompasses the original CrossNet-Open-Unmix (X-UMX), its numerous variants and training recipes, and several application-driven or exploratory models integrating cross-module operations.

1. Core Concepts: X-UMX Architecture and the X-Scheme

The baseline Open-Unmix (UMX) architecture predicts source-wise masks in the STFT domain using independent deep sub-networks, one per source (bass, drums, other, vocals). Each branch consists of affine blocks, a bidirectional LSTM (BLSTM), and a linear layer producing a time-frequency mask. The X-UMX paradigm modifies this through three principal innovations (Sawata et al., 2023, Sawata et al., 2020):

  1. Bridging Operations (Cross-Connections): At selected feature points (typically after the affine blocks, before the BLSTM), the feature maps from all source branches are averaged and injected as a bias into each branch (a minimal code sketch appears after this list):

$$\tilde h^{(j)} = h^{(j)} + \frac{1}{J}\sum_{k=1}^{J} h^{(k)}$$

where $h^{(j)}$ is the feature map for source $j$ and $J$ is the number of sources.

  2. Multi-Domain Loss (MDL): During training, both frequency-domain magnitude MSE and time-domain weighted-SDR (wSDR) losses are combined:

$$\mathcal{L}_{\mathrm{MDL}} = \mathcal{L}_{\mathrm{MSE}} + \alpha\,\mathcal{L}_{\mathrm{wSDR}}$$

with $\alpha$ typically set to 10. The wSDR term incorporates a differentiable ISTFT on the predicted spectrograms.

  3. Combination Loss (CL): To penalize correlated errors, MDL is also computed over all nontrivial subsets of sources, enforcing correct separation for every possible source combination.

The loss-based enhancements (MDL and CL) are confined to the training loop, while bridging adds only parameter-free averaging; together they introduce no trainable parameters and maintain the original inference speed and model size (Sawata et al., 2023).
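As a rough illustration, here is a minimal PyTorch sketch of UMX-style branches with the two bridging points; the class names, layer sizes, activations, and dropout value are assumptions for exposition, not the reference Open-Unmix or Asteroid code:

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """One UMX-style source branch: affine encoder -> BLSTM -> mask head."""
    def __init__(self, n_bins=2049, hidden=512):
        super().__init__()
        self.encoder = nn.Linear(n_bins, hidden)   # stand-in for affine blocks
        self.blstm = nn.LSTM(hidden, hidden // 2, num_layers=3, dropout=0.4,
                             bidirectional=True, batch_first=True)
        self.mask_head = nn.Linear(hidden, n_bins)

def bridge(h):
    """Average features over the source axis and add as a bias (equation above).

    h: (n_sources, batch, frames, hidden), one feature map per branch.
    """
    return h + h.mean(dim=0, keepdim=True)

class XUMX(nn.Module):
    """J branches with bridging after the encoder and again after the BLSTM."""
    def __init__(self, n_sources=4, n_bins=2049, hidden=512):
        super().__init__()
        self.branches = nn.ModuleList(Branch(n_bins, hidden)
                                      for _ in range(n_sources))

    def forward(self, mag):                         # mag: (batch, frames, bins)
        h = torch.stack([torch.tanh(b.encoder(mag)) for b in self.branches])
        h = bridge(h)                               # first cross-connection
        h = torch.stack([b.blstm(h[j])[0]
                         for j, b in enumerate(self.branches)])
        h = bridge(h)                               # second cross-connection
        masks = torch.stack([torch.sigmoid(b.mask_head(h[j]))
                             for j, b in enumerate(self.branches)])
        return masks * mag                          # per-source spectrograms
```

Note that bridge is a plain stacked mean with no trainable parameters, consistent with the unchanged model size noted above.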

2. Summary of X-UMX Variants and Extensions

Several derivatives and adaptations of X-UMX exist, including:

  • xumx-sliCQ:

Integrates the invertible sliCQT (nonstationary Gabor transform, variable-Q filterbank) in place of the STFT front-end and back-end. The network is organized using per-group CDAEs (convolutional denoising autoencoders) with learned de-overlap transposed-conv layers, but omits the full cross-branch BLSTM module. Performance is lower than baseline X-UMX, likely due to spectral raggedness and the absence of full cross-connections (Hanssian, 2021).

  • Modified X-UMX in Danna-Sep:

Retains the U-Net+BLSTM mask estimation pipeline but changes the loss to complex-domain MSE and incorporates a differentiable multichannel Wiener filter (MWF) as a final layer. Used in an ensemble, it demonstrates further gains when fused with waveform-based separators (Yu et al., 2021).

  • X-UMXL / UMXL:

Large-scale training with thousands of hours of data, retaining architectural and loss innovations, shows continued improvements and scalability (Sawata et al., 2023).

3. Algorithmic Details and Loss Formulations

The central methodological departure of X-UMX is the bridging operation, which injects the cross-source feature mean twice (in the canonical variant): after the shared pre-BLSTM encoder and again after the BLSTM layer. The combination loss is explicitly given by:

$$\mathcal{L}_{\mathrm{total}} = \frac{1}{J}\sum_{j=1}^{J}\mathcal{L}_{\mathrm{MDL}}^{\{j\}} + \frac{1}{N}\sum_{S\subsetneq \{1,\dots,J\}} \mathcal{L}_{\mathrm{MDL}}^{S}$$

where $N=\sum_{i=1}^{J-1} \binom{J}{i} = 2^J - 2$, i.e., the second sum runs over all nonempty proper subsets $S$ of the sources.

The loss for an individual source is:

$$\mathcal{L}_{\mathrm{MDL}}^{\{j\}} = \underbrace{\sum_{t=1}^{T}\sum_{f=1}^{F}\bigl(|Y_j(t,f)| - |\hat Y_j(t,f)|\bigr)^2}_{\mathcal{L}_{\mathrm{MSE}}} + \alpha\,\underbrace{\left[-\rho_j\,\frac{y_j^\top \hat y_j}{\lVert y_j\rVert\,\lVert \hat y_j\rVert} - (1-\rho_j)\,\frac{(x-y_j)^\top (x-\hat y_j)}{\lVert x-y_j\rVert\,\lVert x-\hat y_j\rVert}\right]}_{\mathcal{L}_{\mathrm{wSDR}}}$$

with $\rho_j = \lVert y_j\rVert^2 / \bigl[\lVert y_j\rVert^2 + \lVert x-y_j\rVert^2\bigr]$, where $x$ is the mixture waveform and $y_j$, $\hat y_j$ are the reference and estimated waveforms of source $j$.
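A minimal PyTorch sketch of this per-source loss follows; tensor shapes and function names are assumptions, and $\hat y_j$ is assumed to come from a differentiable ISTFT of the masked spectrogram (this is a sketch of the formula above, not the reference implementation):

```python
import torch

def wsdr(x, y, y_hat, eps=1e-8):
    """Weighted-SDR term on waveforms of shape (batch, samples)."""
    def neg_cos(a, b):                        # negative cosine similarity
        return -(a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + eps)
    e_y = y.pow(2).sum(-1)                    # target energy
    e_r = (x - y).pow(2).sum(-1)              # residual (non-target) energy
    rho = e_y / (e_y + e_r + eps)
    return (rho * neg_cos(y, y_hat)
            + (1 - rho) * neg_cos(x - y, x - y_hat)).mean()

def mdl(Y_mag, Y_hat_mag, x, y, y_hat, alpha=10.0):
    """Magnitude MSE over (t, f) plus alpha-weighted time-domain wSDR.

    y_hat is assumed to be produced by a differentiable ISTFT, so
    gradients flow through both the frequency- and time-domain terms.
    """
    return (Y_mag - Y_hat_mag).pow(2).sum() + alpha * wsdr(x, y, y_hat)
```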

Notably, these loss augmentations are applied during training only; test-time inference reverts to the original feedforward mask estimation and signal reconstruction.
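Building on the mdl sketch above, the full training objective can be sketched by summing targets and estimates within each nonempty proper subset before scoring; whether the reference code combines complex spectrograms or magnitudes for a subset is a detail we gloss over here by summing magnitudes:

```python
from itertools import combinations

def combination_loss(Y_mags, Y_hat_mags, x, ys, y_hats, alpha=10.0):
    """MDL averaged over all nonempty proper subsets of the J sources."""
    J = len(ys)
    subsets = [s for r in range(1, J) for s in combinations(range(J), r)]
    total = 0.0
    for s in subsets:                          # sum within the subset, then score
        total += mdl(sum(Y_mags[i] for i in s),
                     sum(Y_hat_mags[i] for i in s),
                     x,
                     sum(ys[i] for i in s),
                     sum(y_hats[i] for i in s),
                     alpha)
    return total / len(subsets)                # len(subsets) == 2**J - 2 == N

def total_loss(Y_mags, Y_hat_mags, x, ys, y_hats, alpha=10.0):
    """L_total: per-source average plus subset average, as in the formula above."""
    J = len(ys)
    singles = sum(mdl(Y_mags[j], Y_hat_mags[j], x, ys[j], y_hats[j], alpha)
                  for j in range(J)) / J
    return singles + combination_loss(Y_mags, Y_hat_mags, x, ys, y_hats, alpha)
```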

4. Quantitative Performance and Empirical Outcomes

On the MUSDB18 dataset, X-UMX improves over the baseline UMX consistently across all sources and evaluation settings. Representative SDR values, in dB (Sawata et al., 2023, Sawata et al., 2020):

Model     Bass   Drums   Other   Vocals   Avg
UMX       5.23   5.73    4.02    6.32     5.33
X-UMX     5.43   6.47    4.64    6.61     5.79
UMXL      5.79   6.93    4.50    6.71     5.98
X-UMXL    6.28   7.39    4.83    7.57     6.52

The full X-UMX (MDL+CL+Bridging) gives a +0.46 dB SDR average improvement over UMX on MUSDB18 and +0.54 dB on large-scale internal data. SAR improvements due to CL average +0.14 dB (Sawata et al., 2023, Sawata et al., 2020).

Ablations demonstrate that while each feature (MDL, CL, Bridging) improves separation quality, combined application yields the highest scores. For example, bridging alone increases SDR by up to +0.37 dB over MDL or CL individually (Sawata et al., 2020).

Specialized variants such as xumx-sliCQ underperform the STFT-based X-UMX, with median SDR dropping by about 2 dB; the loss is attributed to the challenges of ragged time-frequency representations and the absence of cross-target information sharing (Hanssian, 2021).

5. Implementation, Training Protocol, and Practical Considerations

X-UMX models are distributed as part of the Asteroid toolkit, supporting reproducibility and extensibility. Key protocol details include (Sawata et al., 2023):

  • Dataset: MUSDB18 or large-scale internal multitrack datasets.
  • Input preprocessing: STFT with a 4096-point Hann window and 75% overlap (hop size 1024; see the sketch after this list).
  • Input to each branch: mixture magnitude spectrogram.
  • Batch size: up to 14 (MUSDB18) or 28 (large-scale).
  • Optimizer: Adam, weight decay 1e-5, with learning-rate scheduling.
  • Losses: frequency-domain MSE, time-domain wSDR, and CL; $\alpha \approx 10$.
  • Dropout is used in BLSTM layers; $\ell_2$ weight decay for regularization.
  • Early stopping is based on validation loss stability.
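A minimal sketch of the preprocessing and optimizer settings listed above; the hop follows from 75% overlap of a 4096-sample window, while the learning rate, scheduler, and stand-in model are placeholders rather than the published recipe:

```python
import torch

n_fft, hop = 4096, 1024                       # 4096-point window, 75% overlap
window = torch.hann_window(n_fft)

def mixture_magnitude(x):
    """STFT magnitude of a mono mixture waveform of shape (samples,)."""
    X = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                   return_complex=True)
    return X.abs().T                          # (frames, n_fft // 2 + 1)

# Adam with 1e-5 weight decay, plus an assumed plateau-based LR schedule.
model = torch.nn.Linear(n_fft // 2 + 1, n_fft // 2 + 1)  # stand-in for X-UMX
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.3, patience=80)
```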

Bridging operations require minimal computational overhead (vector additions only) and do not affect inference cost or model size. All multi-domain and combination loss operations are strictly training-time augmentations.

6. Limitations, Non-STFT Variants, and Future Directions

xumx-sliCQ illustrates that naively replacing STFT front-ends/back-ends with advanced variable-Q filterbanks can degrade separation if the base architecture does not facilitate effective cross-band or cross-target information flow. The time-frequency raggedness, loss of receptive field uniformity, and absence of cross-branch features are primary causes. Suggested remedial directions include designing attention or transformer models that span bands and time, joint ragged-to-unified encoders, learnable filterbanks, hybrid parallel STFT+sliCQT feature paths, and deeper inversion networks for improved coefficient synthesis (Hanssian, 2021).

A plausible implication is that multi-resolution representations demand architectural innovations that preserve global musical structure awareness and cross-target context, reinforcing the design motivation for X-UMX and related bridging schemes.

7. Generality and Transferability

The X-scheme is model-agnostic and applies to any DNN separator with per-target estimating branches. MDL and CL extend directly to alternative network families (e.g., D3Net, Conv-TasNet) and large-data training regimes, consistently conferring 0.3–0.5 dB average SDR improvements without increased inference cost (Sawata et al., 2023). Furthermore, bridging operations are highly effective for architectures with per-target subnetworks but are redundant for fully joint architectures where information sharing is implicit.

In summary, X-UMX based models provide a robust, scalable paradigm for DNN-based music source separation by enforcing cross-target context, multi-domain optimization, and disentanglement, setting state-of-the-art benchmarks across a range of datasets and experimental configurations. Performance improvements are achieved via architectural and training loss innovations, without additional model complexity at inference (Sawata et al., 2023, Sawata et al., 2020).
