AnyAccomp: Robust Singing Accompaniment Framework

Updated 16 April 2026

AnyAccomp is a framework for singing accompaniment generation that decouples artifact cues from the vocal melody using a quantized, timbre-invariant bottleneck.
It leverages a VQ-VAE to extract robust melodic tokens and a conditional flow-matching transformer to generate high-quality instrumental accompaniments.
The approach improves Accompaniment Prompt Adherence and audio fidelity over the FastSAG baseline on clean vocals and solo instrumental tracks.

AnyAccomp is a framework for Singing Accompaniment Generation (SAG) that addresses the critical generalization failures of prior SAG models by disentangling the accompaniment generation process from source-dependent artifacts. Unlike standard approaches, which are trained on vocals separated via Music Source Separation (MSS) and subsequently overfit to MSS-specific artifacts, AnyAccomp employs a quantized, timbre-invariant melodic bottleneck followed by a conditional flow-matching model, enabling robust accompaniment generation for clean vocals and solo instrumental tracks. This methodology demonstrates a significant advance in generalizability, achieving strong results across both artifact-rich and artifact-free inputs (Zhang et al., 17 Sep 2025).

1. Problem Formulation and Limitations of Prior Art

Singing Accompaniment Generation (SAG) seeks to generate a full-band instrumental accompaniment given a reference vocal melody (clean or separated), typically by mapping a mel-spectrogram input to an output waveform. The task supports creative tools for rapid prototyping and interactive music co-creation, for both producers and amateurs.

State-of-the-art SAG systems, including SingSong and FastSAG, rely on vocals extracted by MSS for training. Vocal embeddings based on mel-spectrograms or self-supervised learning (SSL) retain source-separation artifacts. This induces a pronounced train–test mismatch: models learn to depend on artifact cues that are present in separated vocals but absent from clean, studio-recorded vocals and solo instrument recordings. The result is a collapse in prompt adherence on clean or instrumental test data, reflected in the Accompaniment Prompt Adherence (APA) metric dropping to zero.

2. Quantized Melodic Bottleneck

AnyAccomp introduces a two-stage pipeline, the first of which extracts a discrete, timbre-invariant, melody-centric representation using a chromagram and a Vector Quantized Variational Autoencoder (VQ-VAE).

2.1 Chromagram Extraction

The input waveform is resampled to 24 kHz.
Short-Time Fourier Transform (STFT) is computed with a 1024-sample window and 256-sample hop.
The magnitude spectrum is projected onto 24 pitch classes (C, C♯, …, B), generating a dense chromagram $x \in \mathbb{R}^{T \times 24}$ at 50 Hz frame rate.
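
The three steps above can be sketched in plain NumPy. This is an illustrative sketch rather than the authors' extraction code: the Hann window, the reference frequency `f_ref` (C4 at 261.63 Hz), the layout of the 24 bins as equal divisions of the octave, and the per-frame peak normalisation are all assumptions that the text does not specify.

```python
import numpy as np

def chromagram(wave, sr=24000, n_fft=1024, hop=256, n_chroma=24, f_ref=261.63):
    """Fold STFT magnitudes into n_chroma octave-equivalent bins (assumed layout)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))        # (n_frames, n_fft // 2 + 1)

    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)       # centre frequency of each FFT bin
    valid = freqs > 0                                # skip DC, where log2 is undefined
    # Position of each FFT bin inside the octave, measured in chroma bins.
    idx = np.round(n_chroma * np.log2(freqs[valid] / f_ref)).astype(int) % n_chroma

    chroma = np.zeros((n_frames, n_chroma))
    for b in range(n_chroma):
        chroma[:, b] = mag[:, valid][:, idx == b].sum(axis=1)
    # Assumed normalisation: scale each frame by its peak value.
    return chroma / np.maximum(chroma.max(axis=1, keepdims=True), 1e-8)

# Usage: one second of a 440 Hz tone (A4) sampled at 24 kHz.
t = np.arange(24000) / 24000.0
x = chromagram(np.sin(2 * np.pi * 440.0 * t))
print(x.shape, x.argmax(axis=1)[:5])
```

Because the bins are folded over octaves, the same note sung in different registers or by different voices activates the same chroma bin, which is the property the bottleneck relies on.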

2.2 VQ-VAE Architecture

The encoder $z_e(\cdot)$ is a 44M-parameter convolutional network, mapping $x \in \mathbb{R}^{T \times 24} \to z_e \in \mathbb{R}^{T \times D}$ with $D=64$.
Codebook: $E \in \mathbb{R}^{K \times D}$ with $K=512$ learned embeddings $e_k$.
For each frame $t$, quantization is performed:

$k^* = \underset{1 \leq k \leq K}{\mathrm{argmin}}\, \| z_e(x)_t - e_k \|_2$

$z_q(x)_t = e_{k^*}$

Decoder: a symmetric convolutional network reconstructs the chromagram $\hat{x} \in \mathbb{R}^{T \times 24}$ from $z_q(x)$.
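
The nearest-neighbour lookup can be illustrated with a short NumPy sketch. The codebook here is random and the "encoder output" is a stand-in built by perturbing known codebook rows; only the sizes $K=512$ and $D=64$ come from the text.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each frame of z_e (T, D) to its nearest codebook row (K, D)."""
    # Squared Euclidean distance between every frame and every code: (T, K).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    k_star = d2.argmin(axis=1)       # token indices k*, shape (T,)
    z_q = codebook[k_star]           # quantized vectors e_{k*}, shape (T, D)
    return k_star, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))                  # K = 512, D = 64
# Stand-in encoder output: small perturbations of codes 3, 3 and 500.
z_e = codebook[[3, 3, 500]] + 0.01 * rng.normal(size=(3, 64))
tokens, z_q = quantize(z_e, codebook)
print(tokens)
```

Small perturbations of the encoder output map to the same token, which illustrates how the discrete bottleneck can discard low-level detail such as separation artifacts.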

2.3 Bottleneck Losses

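Reconstruction loss (with $\hat{x}$ the decoder's reconstruction of the chromagram $x$):

$\mathcal{L}_{\mathrm{rec}} = \| x - \hat{x} \|_2^2$

Commitment loss:

$\mathcal{L}_{\mathrm{commit}} = \| z_e(x) - \mathrm{sg}[z_q(x)] \|_2^2$

($\mathrm{sg}[\cdot]$: stop-gradient.)

Codebook updated via exponential moving averages [Van Den Oord et al., 2017].
Total bottleneck loss, with commitment weight $\beta$:

$\mathcal{L}_{\mathrm{VQ}} = \mathcal{L}_{\mathrm{rec}} + \beta\, \mathcal{L}_{\mathrm{commit}}$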
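
A minimal NumPy sketch of these quantities follows, assuming the standard VQ-VAE formulation (mean-squared reconstruction error, a commitment term pulling the encoder output toward its code, and an exponential-moving-average codebook update). The values `beta=0.25`, `decay=0.99` and `eps=1e-5` are common defaults from the VQ-VAE literature, not values stated in this article, and the helper names are hypothetical.

```python
import numpy as np

def bottleneck_losses(x, x_hat, z_e, z_q, beta=0.25):
    """Reconstruction + beta * commitment. z_q is treated as a constant (stop-gradient)."""
    rec = np.mean((x - x_hat) ** 2)
    commit = np.mean((z_e - z_q) ** 2)
    return rec, commit, rec + beta * commit

def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, k_star,
                        decay=0.99, eps=1e-5):
    """One EMA step: move each used code toward the mean of the frames assigned to it."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[k_star]                                   # (T, K) assignments
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(axis=0)
    embed_sum = decay * embed_sum + (1 - decay) * (one_hot.T @ z_e)
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n           # Laplace smoothing
    return embed_sum / smoothed[:, None], cluster_size, embed_sum

# Toy usage: 4 codes in 2-D, five frames all assigned to code 0.
cb = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
z_e = np.full((5, 2), 0.2)
k_star = np.zeros(5, dtype=int)
new_cb, size, total = ema_codebook_update(cb, np.ones(4), cb.copy(), z_e, k_star)
rec, commit, loss = bottleneck_losses(np.zeros(3), np.full(3, 0.5), z_e, cb[k_star])
```

With EMA updates the codebook receives no gradient, so only the reconstruction and commitment terms are back-propagated through the encoder and decoder.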

3. Flow-Matching Accompaniment Generation

Upon obtaining discrete melodic tokens $c = (k^*_1, \dots, k^*_T)$, AnyAccomp conditions an FM-Transformer on these codes to generate the accompaniment.

3.1 Flow Matching Formulation

Output: target accompaniment mel-spectrogram $x_1$
Prior: $x_0 \sim \mathcal{N}(0, I)$
Linear interpolation, for $t \in [0, 1]$:

$x_t = (1 - t)\, x_0 + t\, x_1$

Velocity:

$v = \frac{d x_t}{d t} = x_1 - x_0$
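
The construction of a training example can be sketched as follows, under the common flow-matching convention in which $x_0$ is Gaussian noise, $x_1$ is the data sample, and the path between them is a straight line. Sampling $t$ uniformly and the function names are assumptions; the target here is random data standing in for an 80-bin mel-spectrogram.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Sample (x_t, t, target velocity) for one flow-matching training step."""
    x0 = rng.standard_normal(x1.shape)     # prior sample, x0 ~ N(0, I)
    t = rng.uniform()                      # time in [0, 1); uniform sampling is assumed
    x_t = (1.0 - t) * x0 + t * x1          # point on the straight line from x0 to x1
    v = x1 - x0                            # velocity of that line, constant in t
    return x_t, t, v

def fm_loss(v_pred, v_target):
    """Mean-squared error between predicted and target velocity."""
    return np.mean((v_pred - v_target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((100, 80))        # stand-in for a (frames, 80) mel target
x_t, t, v = fm_training_pair(x1, rng)
# Following the true velocity for the remaining time (1 - t) reaches the target.
print(np.allclose(x_t + (1.0 - t) * v, x1))
```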

3.2 FM-Transformer Model

Model $v_\theta$ predicts the velocity $v$ from the noisy latent $x_t$, time $t$, and melodic codes $c$.
Architecture: 10 LLaMA-style decoder layers, hidden dimension $1024$, totaling $\sim 220$M parameters.
Conditioning: codes $c$ embedded via learnable lookup (size $512 \times 1024$) and prepended as tokens.
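
The conditioning path can be pictured with a small shape-level sketch, assuming a 512 × 1024 lookup table whose rows are prepended to the latent sequence. How the noisy mel frames are projected to the hidden width and how the time step is injected are not described here, so the sketch simply takes already-projected latent tokens as input; `build_input` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H = 512, 1024                                 # codebook size and assumed hidden width
embed = rng.normal(scale=0.02, size=(K, H))      # stand-in for the learnable lookup table

def build_input(tokens, latent_tokens):
    """Prepend embedded melodic codes to the (already projected) noisy latent tokens."""
    cond = embed[tokens]                         # (T_cond, H)
    return np.concatenate([cond, latent_tokens], axis=0)

tokens = np.array([3, 3, 500, 17])               # melodic code indices from the VQ-VAE
latent = rng.normal(size=(6, H))                 # noisy latent frames, assumed width H
seq = build_input(tokens, latent)
print(seq.shape)
```

Prepending keeps the condition in the same sequence as the latent, so self-attention can relate every output frame to every melodic code without a separate cross-attention module.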

3.3 Training Objectives

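Flow-matching MSE:

$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\, x_0,\, x_1} \big\| v_\theta(x_t, t, c) - (x_1 - x_0) \big\|_2^2$

where $x_t = (1 - t)\, x_0 + t\, x_1$ interpolates between the prior sample $x_0$ and the target $x_1$, and $c$ denotes the melodic codes.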

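Representation Alignment (REPA) loss [Yu et al., 2025]:

$\mathcal{L}_{\mathrm{REPA}} = -\,\mathbb{E}\Big[ \tfrac{1}{N} \sum_{n=1}^{N} \mathrm{sim}\big( g_\phi(h^{(\ell)}_n),\, m^{(j)}_n \big) \Big]$

(matches layer-$\ell$ activations $h^{(\ell)}$ of the FM-Transformer, passed through a projection $g_\phi$, to the $j$-th layer features $m^{(j)}$ of a pre-trained MERT model; $\mathrm{sim}$ is a similarity function such as cosine similarity.)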

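Total generation loss, with $\lambda$ weighting the alignment term:

$\mathcal{L}_{\mathrm{gen}} = \mathcal{L}_{\mathrm{FM}} + \lambda\, \mathcal{L}_{\mathrm{REPA}}$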

Classifier-free guidance: drop the melodic condition $c$ with probability $p_{\mathrm{drop}}$ during training; inference at CFG scale = 3.
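
Both halves of classifier-free guidance can be sketched in a few lines. The null token index, the whole-sequence dropout, and the guidance convention `v_uncond + scale * (v_cond - v_uncond)` are assumptions: other formulations use a `1 + w` weighting, and the article does not say which one applies. The drop probability is left as a required argument because its value is not given here.

```python
import numpy as np

def maybe_drop_condition(tokens, p_drop, null_token, rng):
    """Training: replace the whole condition by a null token with probability p_drop."""
    if rng.uniform() < p_drop:
        return np.full_like(tokens, null_token)
    return tokens

def guided_velocity(v_cond, v_uncond, scale=3.0):
    """Inference: extrapolate from the unconditional toward the conditional prediction."""
    return v_uncond + scale * (v_cond - v_uncond)

rng = np.random.default_rng(0)
tokens = np.array([3, 3, 500])
dropped = maybe_drop_condition(tokens, p_drop=1.0, null_token=512, rng=rng)
kept = maybe_drop_condition(tokens, p_drop=0.0, null_token=512, rng=rng)
v = guided_velocity(np.ones(4), np.zeros(4), scale=3.0)
```

Under this convention a scale of 1 recovers the purely conditional prediction, and larger scales push the sample further toward the melodic condition.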

4. Training Regime and Evaluation Protocol

4.1 Datasets

Training: roughly 8,000 h of singing–accompaniment pairs from the SingNet pipeline (MSS-based separation, 3–30 s clips).
Evaluation:
- YuE (3,000 separated vocals; in-domain)
- MUSDB18 (2,777 clean vocal stems)
- MoisesDB (2,500 solo instrumental tracks)

All test material consists of 10 s clips at 24 kHz.

4.2 Preprocessing and Hyperparameters

Audio resampled to 24 kHz.
Compute 80-bin mel-spectrograms (accompaniment target) and 24-bin chromagrams (conditioning).
VQ-VAE: 0.5M steps, batch size = 200 s (aggregate), AdamW, lr = …, no warmup.
FM-Transformer: 1M steps, per-GPU batch = 100 s, AdamW, lr = …, 32k warmup steps, CFG condition dropout = ….
Inference: 50-step Euler sampler, CFG=3.
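
A 50-step Euler sampler for a flow-matching model can be sketched as below. The velocity function is a toy stand-in: for a straight-line path toward a fixed target it has the closed form `(target - x) / (1 - t)`, which lets the sketch run without a trained network. In the actual system the velocity would come from the guided FM-Transformer output, which is not reproduced here.

```python
import numpy as np

def euler_sample(velocity_fn, shape, n_steps=50, rng=None):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)          # start from the Gaussian prior at t = 0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)      # one explicit Euler step
    return x

# Toy velocity field that transports any starting point to a fixed target.
target = np.ones((4, 80))

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

out = euler_sample(toy_velocity, target.shape, n_steps=50, rng=np.random.default_rng(0))
print(np.allclose(out, target))
```

Each step costs at least one network evaluation, so the step count directly scales inference time, which is why fewer flow steps appear among the acceleration options in Section 6.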
Vocoder: fine-tuned from Vevo on project music data.

5. Empirical Results

5.1 Objective Evaluation

Metrics: Accompaniment Prompt Adherence (APA↑), Fréchet Audio Distance (FAD↓), and the audiobox-aesthetics scores Content Enjoyment (CE↑), Content Usefulness (CU↑), Production Quality (PQ↑), and Production Complexity (PC).

Dataset	Model	APA↑	FAD↓	CE↑	CU↑	PQ↑	PC
YuE	FastSAG	0.444	0.598	6.351	6.821	6.814	6.321
	FM-Mel	0.806	0.416	6.964	7.725	7.758	5.614
	FM-Chroma	0.633	0.418	7.151	7.801	7.909	5.436
	AnyAccomp	0.713	0.414	7.283	7.903	7.989	5.742
MUSDB18	FastSAG	0.000	1.115	4.853	5.789	6.315	5.778
	FM-Mel	0.167	0.999	5.202	6.616	6.841	4.090
	FM-Chroma	0.704	0.798	7.017	7.598	7.744	5.104
	AnyAccomp	0.710	0.788	7.277	7.804	7.891	5.498
MoisesDB	FastSAG	0.000	0.904	5.966	6.507	6.696	5.952
	FM-Mel	0.000	0.936	5.424	6.923	7.151	3.804
	FM-Chroma	0.157	0.849	6.308	7.377	7.508	4.110
	AnyAccomp	0.203	0.890	6.660	7.581	7.581	4.798

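On clean vocals (MUSDB18), AnyAccomp keeps APA at 0.710, essentially its in-domain value of 0.713, together with the best FAD and audiobox-aesthetics scores. On solo instrumental tracks (MoisesDB), its APA is lower (0.203) but still the highest of the compared systems. The mel-conditioned baselines collapse on both sets: FastSAG reaches APA = 0.000 on each, and FM-Mel falls to 0.167 and 0.000 (APA $\approx 0$).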
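
The size of the collapse can be made concrete by dividing each model's out-of-domain APA by its in-domain (YuE) APA. This "retention" ratio is a derived quantity computed here from the table above for illustration; it is not a metric reported by the authors.

```python
# APA values copied from the objective evaluation table.
apa = {
    "FastSAG":   {"YuE": 0.444, "MUSDB18": 0.000, "MoisesDB": 0.000},
    "FM-Mel":    {"YuE": 0.806, "MUSDB18": 0.167, "MoisesDB": 0.000},
    "FM-Chroma": {"YuE": 0.633, "MUSDB18": 0.704, "MoisesDB": 0.157},
    "AnyAccomp": {"YuE": 0.713, "MUSDB18": 0.710, "MoisesDB": 0.203},
}

def retention(model, dataset, reference="YuE"):
    """Fraction of in-domain APA that survives on an out-of-domain test set."""
    return apa[model][dataset] / apa[model][reference]

for model in apa:
    ratios = {d: round(retention(model, d), 3) for d in ("MUSDB18", "MoisesDB")}
    print(model, ratios)
```

The chroma-based systems retain most or all of their in-domain adherence on clean vocals, while the mel-based systems lose most of it. All systems lose a large share on solo instrumental input.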

5.2 Subjective Evaluation

Mean Opinion Scores (MOS) (scale: 1–5; 20 listeners, 20 clips each):

Dataset	Model	Quality↑	Coherency↑
YuE	FastSAG	1.98	1.82
	AnyAccomp	3.12	3.05
MUSDB18	FastSAG	1.73	1.48
	AnyAccomp	3.23	2.75
MoisesDB	FastSAG	1.62	1.52
	AnyAccomp	3.00	2.70

5.3 Ablation and Analysis

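The FM-Mel (white-noised mel) and FM-Chroma (raw chroma) baselines share the flow-matching generator and differ only in the conditioning signal, which isolates the effect of the VQ bottleneck.
FM-Chroma recovers much of the generalization (APA 0.704 on MUSDB18, against 0.167 for FM-Mel), and adding the VQ bottleneck (AnyAccomp) gives the highest APA and audiobox-aesthetics scores on both MUSDB18 and MoisesDB.
Removing the REPA loss worsens FAD by roughly 0.05 and APA by roughly 0.02, indicating that representation alignment contributes to both fidelity and prompt adherence.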

6. Limitations and Future Directions

Limitations include bottleneck resolution (50 Hz, 24 bins), which may miss fine pitch ornaments, and lack of explicit harmonic modeling beyond the melody. Computational demands are pronounced: a 220M-parameter transformer and iterative flow-matching with 50 sampling steps.

Potential future work includes adjusting bottleneck parameters for a trade-off between representation fidelity and compactness, exploring alternative front-ends such as Constant-Q Transform or pitch-contour encoders, enriching conditioning with chord/harmony or rhythm descriptors, and accelerating generation via distillation or fewer flow steps.

By decoupling accompaniment generation from separation artifacts, AnyAccomp is the first SAG model to generalize robustly to clean vocal and solo instrumental inputs, enabling universal co-creation tools that respond flexibly to a range of musical prompts (Zhang et al., 17 Sep 2025).


References

Zhang et al. (2025). AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck.
