QSCNet: Query-Based Audio Source Separation
- QSCNet is a query-conditioned deep learning architecture for audio source separation that enables extraction of arbitrary instruments via audio queries.
- It employs banded downsampling/upsampling and dual-path BiLSTM modules to effectively capture both temporal and spectral features.
- The architecture achieves higher SNR and greater parameter efficiency than traditional fixed-head music source separation models.
QSCNet designates a family of deep learning architectures for music source separation based on a conditioned UNet framework, integrating query-based instrument extraction with advanced network modules to achieve state-of-the-art separation performance on large-vocabulary instrument datasets. Unlike architectures that enforce a fixed stem taxonomy, QSCNet extracts sources via audio queries, vastly generalizing the range of possible targets while achieving high efficiency at a significantly reduced parameter count compared to prior methods (O'Hanlon et al., 17 Dec 2025).
1. Problem Motivation and Conditioning Paradigm
Traditional music source separation (MSS) models rely on either multi-output neural networks with fixed heads for a prespecified set of instrument stems (e.g., vocals, bass, drums, others), or one-network-per-instrument configurations. Both approaches strictly limit generalization to new or arbitrary instruments.
Conditioned MSS reframes the problem: the model receives a mixture signal $x$ and a corresponding audio query $q$ associated with the desired instrument, and outputs the separated stem $\hat{s}$. This approach is formalized as:

$$\hat{s} = f_\theta(x, q)$$

Conditioned models thus enable extraction of any instrument, instrument grouping, or sub-stem for which an example query can be provided. Initial progress in conditioned architectures was limited by the lack of suitable large-vocabulary datasets; this bottleneck was addressed by the MoisesDb dataset, comprising 11 high-level and 30 fine-grained stems.
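To make the conditioned interface concrete, the following minimal PyTorch sketch shows the $\hat{s} = f_\theta(x, q)$ signature: a single network consumes a (mixture, query-embedding) pair and emits one stem, rather than a fixed bank of output heads. The toy backbone, gating mechanism, and all dimensions are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ToyConditionedSeparator(nn.Module):
    """Illustrative f(x, q): one output stem selected by the query."""
    def __init__(self, feat_dim: int = 64, query_dim: int = 16):
        super().__init__()
        self.backbone = nn.Linear(feat_dim, feat_dim)    # stand-in for the UNet
        self.query_proj = nn.Linear(query_dim, feat_dim)

    def forward(self, x: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) mixture features
        # q: (batch, query_dim) audio-query embedding
        h = self.backbone(x)
        gate = torch.sigmoid(self.query_proj(q)).unsqueeze(1)  # (batch, 1, feat)
        return gate * h  # the query decides which features pass through

x = torch.randn(2, 100, 64)             # two mixtures, 100 frames each
q = torch.randn(2, 16)                  # one query embedding per mixture
stem = ToyConditionedSeparator()(x, q)  # one stem per (mixture, query) pair
```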
Banquet, built on a Bandsplit RNN design, was the first system to exploit MoisesDb at scale, though its authors claimed that UNet architectures lack the expressive information flow needed for effective conditioning. QSCNet demonstrates that, with suitable architectural modifications, the conditioned UNet not only matches but surpasses Bandsplit RNNs in parameter efficiency and SNR (O'Hanlon et al., 17 Dec 2025).
2. Architectural Overview
QSCNet is structured as a UNet with banded encoder/decoder blocks and an intervening "neck" composed of dual-path bidirectional LSTM (BiLSTM) modules inherited from the Sparse Compressed Network (SCNet) backbone. Its primary features are:
- Banded Downsampling/Upsampling: Incoming time-frequency data are split into low, mid, and high frequency bands, each processed and resampled at a different rate, then concatenated (a minimal sketch follows this overview). This supports hierarchical feature learning and respects spectral diversity across instrument timbres.
- Dual-path LSTM Neck: The central latent representation is processed through alternating BiLSTM layers in both channel-time and channel-frequency domains, enabling joint temporal and spectral modeling.
- Skip Connections: Standard UNet skip connections inject encoder outputs directly into corresponding decoder layers, allowing information preservation across scales.
A single Feature-wise Linear Modulation (FiLM) layer, positioned at the encoder terminus, enables efficient and effective conditioning by instrument query.
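The banded encoder idea can be sketched as follows; band boundaries, strides, and channel counts are illustrative assumptions rather than the SCNet/QSCNet configuration. Each band is downsampled along frequency at its own rate, with low frequencies retaining the most resolution.

```python
import torch
import torch.nn as nn

class BandedDownsample(nn.Module):
    """Split the frequency axis into bands and downsample each separately."""
    def __init__(self, channels: int = 4, splits=(0.2, 0.5), strides=(1, 2, 4)):
        super().__init__()
        self.splits = splits
        # One strided conv over frequency per band; smaller stride keeps
        # more resolution, so the low band is downsampled the least.
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(s, 1), stride=(s, 1))
            for s in strides
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) time-frequency features
        f = x.shape[2]
        lo, mid = int(f * self.splits[0]), int(f * self.splits[1])
        bands = (x[:, :, :lo], x[:, :, lo:mid], x[:, :, mid:])
        out = [conv(b) for conv, b in zip(self.convs, bands)]
        return torch.cat(out, dim=2)  # re-join along the frequency axis

x = torch.randn(2, 4, 256, 50)
y = BandedDownsample()(x)  # fewer frequency bins; high band shrunk most
```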
3. Query Embedding and Conditioning Mechanism
Conditioning is achieved via a FiLM modulation applied immediately prior to the dual-path LSTM neck:
- Query Embedding: The instrument audio query is embedded via a pretrained PASST instrument-ID network, producing a fixed-length embedding vector $q$ from a 10 s query segment.
- FiLM Affine Modulation: The query embedding is processed by two MLPs to output FiLM affine parameters $\gamma, \beta \in \mathbb{R}^{C}$, where $C$ is the encoder output channel dimension. These parameters scale and translate the encoded tensor $z$ along its channel axis:

$$\tilde{z}_{c} = \gamma_{c}\, z_{c} + \beta_{c}, \qquad c = 1, \dots, C$$
This setup enables the network to adaptively extract features relevant to the query instrument at the latent representation stage.
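A minimal sketch of this FiLM layer, assuming a 768-dimensional query embedding (typical of PASST-style transformer outputs), 128 encoder channels, and a hypothetical MLP width; none of these values is taken from the paper.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Two MLPs map the query embedding to per-channel scale and shift."""
    def __init__(self, query_dim: int = 768, channels: int = 128, hidden: int = 256):
        super().__init__()
        self.gamma_mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.ReLU(), nn.Linear(hidden, channels))
        self.beta_mlp = nn.Sequential(
            nn.Linear(query_dim, hidden), nn.ReLU(), nn.Linear(hidden, channels))

    def forward(self, z: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # z: (batch, channels, freq, time) encoder output
        # q: (batch, query_dim) query embedding
        gamma = self.gamma_mlp(q)[:, :, None, None]  # broadcast over freq/time
        beta = self.beta_mlp(q)[:, :, None, None]
        return gamma * z + beta  # channel-wise affine modulation

z = torch.randn(2, 128, 32, 50)
q = torch.randn(2, 768)
z_cond = FiLM()(z, q)  # same shape as z, now conditioned on the query
```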
4. Mathematical Formulation
The QSCNet forward pass is expressed as

$$\hat{S} = M \odot X, \qquad M = f_\theta(X, q),$$

where $X$ is the complex spectrogram of the mixture, $M$ is the predicted time-frequency mask, and $\hat{S}$ is the separated stem spectrogram (from which the waveform $\hat{s}$ is recovered via the inverse STFT).
The network is trained to minimize the spectrogram RMSE loss

$$\mathcal{L} = \sqrt{\frac{1}{N} \sum_{t,f} \bigl| \hat{S}(t,f) - S(t,f) \bigr|^{2}},$$

where $S$ is the ground-truth stem spectrogram and $N$ is the number of time-frequency bins. Quality is evaluated using per-track SNR:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{n} s(n)^{2}}{\sum_{n} \bigl( s(n) - \hat{s}(n) \bigr)^{2}}$$
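Both quantities translate directly to code; a minimal sketch, assuming the RMSE is taken over all time-frequency bins and the SNR over whole time-domain tracks:

```python
import torch

def spectrogram_rmse(S_hat: torch.Tensor, S: torch.Tensor) -> torch.Tensor:
    # S_hat, S: (complex) spectrograms of the predicted / ground-truth stem
    return torch.sqrt(torch.mean(torch.abs(S_hat - S) ** 2))

def snr_db(s_hat: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # s_hat, s: time-domain predicted / ground-truth stem waveforms
    return 10 * torch.log10(s.pow(2).sum() / (s - s_hat).pow(2).sum())

s = torch.randn(44100 * 10)           # 10 s ground-truth stem at 44.1 kHz
s_hat = s + 0.1 * torch.randn_like(s) # estimate with 10% noise
print(snr_db(s_hat, s))               # roughly 20 dB for this noise level
```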
5. Training Protocol and Dataset Regimes
QSCNet is trained on the MoisesDb dataset, with stems re-aggregated into six targets (bass, vocals, drums, guitar, piano, others). Training comprises:
- 10 s stereo input clips; STFT with a 4096-sample window and 1024-sample hop at 44.1 kHz
- “Cacophony” data mixing: independent clips sampled for each stem per mixture (see the sketch after this list)
- Data augmentations: channel/signal flips and random gain scaling
- Per-batch random selection of queries with >20% nonzero energy
- Adam optimizer (lr=), EMA model selection, 300 epochs, batch size 8
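A sketch of the cacophony mixing and query-selection steps; the helper names and the exact reading of the “>20% nonzero energy” test are assumptions.

```python
import random
import torch

def cacophony_mixture(stem_pools: dict[str, list[torch.Tensor]]):
    # Draw an independent 10 s clip for every stem, then sum into a mixture.
    stems = {name: random.choice(clips) for name, clips in stem_pools.items()}
    mixture = torch.stack(list(stems.values())).sum(dim=0)
    return mixture, stems

def pick_query_stem(stems: dict[str, torch.Tensor], min_active: float = 0.2):
    # Accept only stems with >20% nonzero samples (one reading of the
    # ">20% nonzero energy" criterion; the exact test is an assumption).
    active = [name for name, s in stems.items()
              if (s.abs() > 0).float().mean().item() > min_active]
    return random.choice(active) if active else None

# Toy pools: two stems, three candidate stereo clips each (2 ch x 1000 samples).
pools = {"bass": [torch.randn(2, 1000) for _ in range(3)],
         "vocals": [torch.randn(2, 1000) for _ in range(3)]}
mix, stems = cacophony_mixture(pools)
query = pick_query_stem(stems)  # stem whose audio will serve as the query
```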
6. Quantitative Results and Parameter Efficiency
Performance benchmarking demonstrates the parameter efficiency and separation quality of QSCNet relative to both conditioned and non-conditioned models. On the 6-stem MoisesDb task, QSCNet achieves higher SNR than the Banquet Bandsplit RNN with less than half the parameter count (all SNR values in dB):
| Model | Bass | Vocals | Drums | Guitar | Piano | Others | Avg (5 stems) | Params (M) |
|---|---|---|---|---|---|---|---|---|
| Banquet (BSRNN) | 11.0 | 8.0 | 9.5 | 3.3 | 2.5 | – | 6.9 | 24.9 |
| QSCNet | 11.9 | 9.8 | 11.7 | 5.7 | 3.4 | 1.3 | 8.5 | 10.2 |
| SCNet6 (uncond.) | 12.8 | 10.5 | 12.4 | 6.3 | 4.0 | 2.8 | 9.2 | 26.6 |
| SCNet6 (Large) | 13.5 | 12.2 | 13.4 | 7.0 | 4.6 | 3.4 | 10.1 | – |
QSCNet also outperforms Banquet on the extended 6→10 stem task, further validating the architecture's scalability (O'Hanlon et al., 17 Dec 2025).
7. Discussion and Future Directions
QSCNet demonstrates that conditioned UNet architectures equipped with sparse compressed modules and a strategically situated FiLM layer can achieve both parameter efficiency and superior instrument separation, contradicting previous assertions regarding UNet limitations for the conditioned setting. The backbone's banded information path design and dual-path RNNs provide sufficient information flow to enable high-performance conditional separation.
Potential avenues for further advancement include ablations on FiLM location, end-to-end query embedding training, extension to the full 11–30 stem MoisesDb partition, and exploration of perceptual loss objectives or mask–phase decoupling. A plausible implication is that further architectural scaling and additional conditioning layers may close the remaining SNR gap with large multi-output models at continued parameter savings (O'Hanlon et al., 17 Dec 2025).