FastSAG: Diffusion-Based Singing Accompaniment
- FastSAG is a diffusion-based model that achieves a >30× speedup over autoregressive approaches while generating high-quality instrumental accompaniments.
- It employs probability-flow ODEs and semantic conditioning via MERT-derived features to directly generate Mel spectrograms aligned with vocal inputs.
- FastSAG shows significant improvements in FAD metrics and subjective MOS scores, making it well suited to real-time and interactive music production applications.
FastSAG is a non-autoregressive, diffusion-based framework for singing accompaniment generation (SAG) that enables rapid, high-quality creation of instrumental accompaniments aligned with an input vocal track. In contrast to previous token-based autoregressive models such as SingSong, FastSAG generates Mel spectrograms directly and conditions the diffusion process on semantic features extracted from the vocal input, yielding a significant speed-up and improved coherence between vocals and accompaniment. FastSAG achieves faster-than-real-time throughput, over 30 times faster than state-of-the-art autoregressive approaches, while delivering both objective and subjective improvements in sample quality and semantic alignment (Chen et al., 2024).
1. Problem Definition and Motivation
Singing accompaniment generation aims to synthesize instrumental tracks conditioned on a given vocal input (waveform). This task is central to human–AI symbiotic art creation and automatic music production systems. Prior state-of-the-art approaches, particularly SingSong, rely on multi-stage autoregressive (AR) pipelines that sequentially generate semantic and acoustic token representations before decoding to the final audio. These methods suffer from high latency: the two-stage SingSong variant yields a real-time factor (RTF) of approximately 10.5, while a three-stage variant degrades to RTF ≈ 47.8, precluding their use in real-time or interactive applications (Chen et al., 2024).
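As a quick aid for interpreting these latency numbers, the real-time factor is simply wall-clock generation time divided by the duration of the audio produced; the helper below is an illustrative sketch, not code from the paper.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock generation time / duration of audio produced.

    RTF < 1 means the system generates faster than real time.
    """
    return generation_seconds / audio_seconds

# Hypothetical timings consistent with the reported RTFs for a 10 s clip:
rtf_singsong = real_time_factor(105.0, 10.0)  # 10.5 (two-stage SingSong)
rtf_fastsag = real_time_factor(3.23, 10.0)    # ~0.323 (FastSAG)
```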
2. Diffusion-Based Non-Autoregressive Modeling
FastSAG builds upon the EDM framework ("Elucidating the Design Space of Diffusion-Based Generative Models") and score-based generative modeling, employing a variance-exploding stochastic differential equation (SDE) for denoising and sample generation. The model operates on continuous time $t$, adding noise via

$$x_t = x_0 + \sigma(t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

with corresponding probability-flow ODE

$$\mathrm{d}x = -\dot{\sigma}(t)\,\sigma(t)\,\nabla_x \log p\big(x; \sigma(t)\big)\,\mathrm{d}t,$$

where $\sigma(t)$ specifies the noise schedule. During training, the model samples the noise level $\sigma$ from a log-normal distribution, $\ln\sigma \sim \mathcal{N}(P_{\mathrm{mean}}, P_{\mathrm{std}}^2)$; at sampling, the noise schedule is discretized across $N$ steps as

$$\sigma_i = \Big(\sigma_{\max}^{1/\rho} + \tfrac{i}{N-1}\big(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho}\big)\Big)^{\rho}, \qquad i = 0, \dots, N-1,$$

with endpoints $\sigma_{\min}$ and $\sigma_{\max}$ and warping exponent $\rho$.
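For concreteness, the step-wise noise levels can be generated with the standard EDM discretization. This is a minimal NumPy sketch; the defaults (sigma_min = 0.002, sigma_max = 80, rho = 7) are the published EDM values, and FastSAG's exact settings may differ.

```python
import numpy as np

def edm_sigma_schedule(n_steps: int, sigma_min: float = 0.002,
                       sigma_max: float = 80.0, rho: float = 7.0) -> np.ndarray:
    """EDM noise-level discretization: interpolate from sigma_max down to
    sigma_min in sigma^(1/rho) space, then append 0 for the final step."""
    i = np.arange(n_steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (n_steps - 1)
              * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)

sched = edm_sigma_schedule(10)  # 10 denoising levels plus the terminal 0
```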
The core denoiser is realized as a 2D U-Net $F_\theta$ wrapped in time-dependent skip and output coefficients:

$$D_\theta(x; \sigma) = c_{\mathrm{skip}}(\sigma)\,x + c_{\mathrm{out}}(\sigma)\,F_\theta\big(c_{\mathrm{in}}(\sigma)\,x;\, \sigma\big),$$

where $c_{\mathrm{skip}}(\sigma) = \frac{\sigma_{\mathrm{data}}^2}{\sigma^2 + \sigma_{\mathrm{data}}^2}$ and $c_{\mathrm{out}}(\sigma) = \frac{\sigma\,\sigma_{\mathrm{data}}}{\sqrt{\sigma^2 + \sigma_{\mathrm{data}}^2}}$, with input scaling $c_{\mathrm{in}}(\sigma) = \frac{1}{\sqrt{\sigma^2 + \sigma_{\mathrm{data}}^2}}$ and $\sigma_{\mathrm{data}}$ the standard deviation of the data. The denoiser is conditioned on a frame-aligned prior derived from the vocal input (see Section 3).
At inference, the probability-flow ODE is integrated with a first-order solver, yielding the Mel spectrogram of the generated accompaniment in a highly efficient manner.
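A compact sketch of this sampling loop, using the standard EDM preconditioning and a first-order (Euler) integrator. Here sigma_data = 0.5 is EDM's published default, the placeholder network `f_theta` and the helper names are illustrative assumptions, and in the real model `x` is a Mel spectrogram and `cond` the vocal-derived prior.

```python
import numpy as np

SIGMA_DATA = 0.5  # EDM default; the paper's value may differ

def c_skip(sigma):
    """Weight on the noisy input in the denoiser output."""
    return SIGMA_DATA ** 2 / (sigma ** 2 + SIGMA_DATA ** 2)

def c_out(sigma):
    """Scale of the raw network output."""
    return sigma * SIGMA_DATA / np.sqrt(sigma ** 2 + SIGMA_DATA ** 2)

def c_in(sigma):
    """Scaling applied to the network input."""
    return 1.0 / np.sqrt(sigma ** 2 + SIGMA_DATA ** 2)

def denoise(f_theta, x, sigma, cond):
    """D_theta(x; sigma) = c_skip * x + c_out * F_theta(c_in * x; sigma, cond)."""
    return c_skip(sigma) * x + c_out(sigma) * f_theta(c_in(sigma) * x, sigma, cond)

def euler_pf_ode(f_theta, x, sigmas, cond):
    """First-order probability-flow ODE integration:
    dx/dsigma = (x - D(x; sigma)) / sigma, one Euler step per noise level."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - denoise(f_theta, x, s_cur, cond)) / s_cur
        x = x + (s_next - s_cur) * d
    return x
```

Because each Euler step covers one discretized noise level, the number of steps trades quality against speed, which is the lever behind FastSAG's low RTF.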
3. Semantic and Prior Conditioning Architecture
FastSAG eschews coarse acoustic token conditioning in favor of high-level semantic features extracted from the vocal input using a pre-trained MERT model:
- Semantic projection: the MERT embedding of the vocal is mapped to a compact semantic representation.
- A WaveNet-based encoder learns to predict the accompaniment's semantic features from the vocal's semantic features.
- Prior projection: the projected vocal semantics and the predicted accompaniment semantics are fused and resampled to the Mel-spectrogram shape via either:
  - bilinear interpolation, or
  - Perceiver-IO (a cross-attention, self-attention, cross-attention stack) with learnable input/output queries.
A second WaveNet refines the resampled features into the final prior, which conditions the diffusion U-Net.
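The bilinear-interpolation variant of the resampling step can be sketched as follows. The function name and the use of `scipy.ndimage.zoom` are illustrative choices, and the dimensions (1024-d semantic frames, 80 Mel bins) are assumptions for the example, not values from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_bilinear(feats: np.ndarray, t_mel: int, n_mels: int) -> np.ndarray:
    """Resample a (T_sem, D) semantic feature map to the (T_mel, n_mels)
    shape of the target Mel spectrogram with bilinear interpolation."""
    t_sem, d = feats.shape
    return zoom(feats, (t_mel / t_sem, n_mels / d), order=1)  # order=1: bilinear

# Example: 75 semantic frames of 1024-d features -> 200 Mel frames x 80 bins
prior = resample_bilinear(np.random.randn(75, 1024), t_mel=200, n_mels=80)
```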
Throughout, only semantic features derived from MERT are used for conditioning; the coarse acoustic waveform or Mel features utilized by SingSong are not employed. Normalizing the Mel spectrogram to a fixed range is also applied and empirically improves all FAD metrics.
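A minimal sketch of such a range normalization. The target range [-1, 1] and the assumed log-Mel dynamic range are illustrative assumptions; the paper's exact normalization constants are not reproduced here.

```python
import numpy as np

def normalize_mel(mel: np.ndarray, lo: float = -12.0, hi: float = 2.0) -> np.ndarray:
    """Clip log-Mel values to an assumed dynamic range [lo, hi],
    then linearly map that range to [-1, 1]."""
    mel = np.clip(mel, lo, hi)
    return 2.0 * (mel - lo) / (hi - lo) - 1.0
```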
4. Loss Functions and Training Objectives
The complete objective jointly trains the semantic/prior projections and the diffusion denoiser using reconstruction losses:
- Semantic coherence: $\mathcal{L}_{\mathrm{sem}} = \lVert \hat{S}_a - S_a \rVert^2$, where $S_a$ is the MERT embedding of the ground-truth accompaniment and $\hat{S}_a$ its prediction.
- Prior (rhythm) alignment: $\mathcal{L}_{\mathrm{prior}} = \lVert p - M_a \rVert^2$, with $M_a$ as the real accompaniment Mel and $p$ the predicted prior.
- Diffusion reconstruction: $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{\sigma,\epsilon}\big[\lVert D_\theta(x_0 + \sigma\epsilon; \sigma) - x_0 \rVert^2\big]$, the denoising objective on the clean Mel target $x_0$.
The total loss aggregates all three terms with equal weight: $\mathcal{L} = \mathcal{L}_{\mathrm{sem}} + \mathcal{L}_{\mathrm{prior}} + \mathcal{L}_{\mathrm{diff}}$.
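The equal-weight aggregation can be sketched directly. The use of mean-squared error and the helper names are assumptions for illustration, since the exact distance functions are not reproduced here.

```python
import numpy as np

def l2(a: np.ndarray, b: np.ndarray) -> float:
    """Mean-squared error between two feature arrays (assumed distance)."""
    return float(np.mean((a - b) ** 2))

def total_loss(sem_pred, sem_true, prior_pred, mel_true, denoised, mel_clean,
               weights=(1.0, 1.0, 1.0)) -> float:
    """L = w1 * L_sem + w2 * L_prior + w3 * L_diff, with equal weights
    per the paper's equal aggregation of the three terms."""
    w1, w2, w3 = weights
    return (w1 * l2(sem_pred, sem_true)        # semantic coherence
            + w2 * l2(prior_pred, mel_true)    # prior (rhythm) alignment
            + w3 * l2(denoised, mel_clean))    # diffusion reconstruction
```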
5. Quantitative Performance and Comparative Evaluation
Comprehensive evaluation on the MUSDB18 test set (zero-shot) yields the following; the three FAD columns are computed with different embedding backbones, and lower is better for FAD and RTF while higher is better for MOS:

| Model | FAD ↓ | FAD ↓ | FAD ↓ | RTF ↓ | MOS ↑ (Harmony/Coherence) |
|---|---|---|---|---|---|
| SingSong (semantic + coarse) | 0.8632 | 3.1589 | 0.0878 | 10.5 | 2.36 |
| SingSong (three-stage) | — | — | — | 47.8 | — |
| FastSAG (bilinear interp.) | 0.7595 | 1.5059 | 0.0648 | 0.323 | 2.78 |
| FastSAG (Perceiver-IO) | — | — | — | — | 3.13 |
| Human-composed | — | — | — | — | 4.15 |
FastSAG reduces the first FAD variant by 12% and the second by 52% relative to SingSong, while achieving a >30× speedup (RTF ≈ 0.323 vs. 10.5). The Perceiver-IO variant attains a higher subjective MOS, indicating improved perceived alignment. Ablations further show that semantic-only conditioning outperforms variants that feed raw Mel features or mix semantic and acoustic features.
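The headline figures follow directly from the table; a quick arithmetic check:

```python
def pct_reduction(baseline: float, new: float) -> float:
    """Relative reduction of a metric, in percent."""
    return 100.0 * (baseline - new) / baseline

fad1_drop = round(pct_reduction(0.8632, 0.7595), 1)  # 12.0 (first FAD column)
fad2_drop = round(pct_reduction(3.1589, 1.5059), 1)  # 52.3 (second FAD column)
speedup = round(10.5 / 0.323, 1)                     # 32.5x over two-stage SingSong
```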
6. Architectural Insights and Design Implications
FastSAG's architecture demonstrates that non-AR diffusion modeling, when conditioned on robust semantic representations of vocal inputs, circumvents the latency bottlenecks inherent in AR tokenization schemes. The two-stage design, partitioning the problem into "semantic→rough rhythm" via prior projection and "denoise→fine detail" via diffusion refinement, is both computationally efficient and essential for high audio quality. Comparative analysis indicates that bilinear interpolation offers marginally better FAD metrics, whereas Perceiver-IO yields superior human ratings—in line with prior observations on perceptual vs. instrumental model selection.
The exclusive use of MERT-derived semantic features for conditioning, alongside Mel normalization, is empirically superior to approaches blending lower-level audio features, reinforcing the critical role of high-level representations in conditional music generation.
7. Broader Impact and Future Directions
FastSAG establishes a new baseline for rapid, high-quality singing accompaniment generation suitable for real-time and interactive applications. Its architectural design suggests further research into advanced semantic conditioning, alternative attention-based prior projection mechanisms, and direct end-to-end waveform generation via diffusion. The substantial speedup over AR models, coupled with improved harmonic and rhythmic coherence, expands the practical feasibility of SAG systems in both artistic and commercial domains (Chen et al., 2024).