AnyAccomp: Robust Singing Accompaniment Framework
- AnyAccomp is a framework for singing accompaniment generation that decouples artifact cues from the vocal melody using a quantized, timbre-invariant bottleneck.
- It leverages a VQ-VAE to extract robust melodic tokens and a conditional flow-matching transformer to generate high-quality instrumental accompaniments.
- The approach achieves improved Accompaniment Prompt Adherence and audio fidelity across clean vocals and solo instrumental tracks.
AnyAccomp is a framework for Singing Accompaniment Generation (SAG) that addresses the critical generalization failures of prior SAG models by disentangling the accompaniment generation process from source-dependent artifacts. Unlike standard approaches, which are trained on vocals separated via Music Source Separation (MSS) and subsequently overfit to MSS-specific artifacts, AnyAccomp employs a quantized, timbre-invariant melodic bottleneck followed by a conditional flow-matching model, enabling robust accompaniment generation for clean vocals and solo instrumental tracks. This methodology demonstrates a significant advance in generalizability, achieving strong results across both artifact-rich and artifact-free inputs (Zhang et al., 17 Sep 2025).
1. Problem Formulation and Limitations of Prior Art
Singing Accompaniment Generation (SAG) seeks to generate a full-band instrumental accompaniment given a reference vocal melody (clean or separated), typically by mapping a mel-spectrogram input to an output waveform. This task underpins creative tools such as rapid prototyping and interactive music co-creation for both producers and amateurs.
State-of-the-art SAG systems, including SingSong and FastSAG, rely on vocals extracted by MSS for training. Mel-spectrogram or self-supervised learning (SSL) based vocal embeddings retain source-separation artifacts. This induces a pronounced train–test mismatch: models learn to depend on artifact cues present in separated vocals, but not in clean, studio-recorded vocals or solo instrument recordings. The result is a collapse in prompt adherence—reflected in the Accompaniment Prompt Adherence (APA) metric dropping to zero—on clean or instrumental test data.
2. Quantized Melodic Bottleneck
AnyAccomp introduces a two-stage pipeline, the first of which extracts a discrete, timbre-invariant, melody-centric representation using a chromagram and a Vector Quantized Variational Autoencoder (VQ-VAE).
2.1 Chromagram Extraction
- The input waveform is resampled to 24 kHz.
- Short-Time Fourier Transform (STFT) is computed with a 1024-sample window and 256-sample hop.
- The magnitude spectrum is projected onto 24 pitch classes (C, C♯, …, B), generating a dense chromagram at 50 Hz frame rate.
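The extraction steps above can be sketched in NumPy. This is an illustrative sketch, not the authors' code: the Hann window, the folding rule (two bins per semitone with C at bin 0), and the per-frame peak normalisation are all assumptions. Note that a 256-sample hop at 24 kHz gives 93.75 frames per second, so the 50 Hz rate quoted above would need a 480-sample hop or a later resampling step; the sketch keeps the stated hop and does not guess which applies.

```python
import numpy as np

SR = 24_000      # sample rate stated in the text
N_FFT = 1024     # STFT window length stated in the text
HOP = 256        # STFT hop stated in the text
N_CHROMA = 24    # number of chroma bins stated in the text

def chromagram(wave, sr=SR, n_fft=N_FFT, hop=HOP, n_chroma=N_CHROMA):
    """Return a (frames, n_chroma) chromagram; a sketch under stated assumptions."""
    window = np.hanning(n_fft)  # assumed window shape
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))[:, 1:]   # drop the DC bin

    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)[1:]
    # Fractional MIDI pitch of each FFT bin (A4 = MIDI 69 = 440 Hz; MIDI 0 is a C).
    midi = 69.0 + 12.0 * np.log2(freqs / 440.0)
    bins = np.round(midi * n_chroma / 12.0).astype(int) % n_chroma

    chroma = np.zeros((n_frames, n_chroma))
    for b in range(n_chroma):
        chroma[:, b] = mag[:, bins == b].sum(axis=1)
    # Assumed normalisation: scale each frame so its largest bin equals 1.
    peak = chroma.max(axis=1, keepdims=True)
    return chroma / np.maximum(peak, 1e-8)

# One second of a 440 Hz sine: energy should concentrate in pitch class A.
t = np.arange(SR) / SR
c = chromagram(np.sin(2 * np.pi * 440.0 * t))
strongest_bin = int(c.mean(axis=0).argmax())   # A is 9 semitones above C
```

With two bins per semitone, pitch class A maps to bin 18, which is where the test tone lands.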
2.2 VQ-VAE Architecture
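- The encoder $E$ is a 44M-parameter convolutional network, mapping the chromagram $c$ to a sequence of continuous latents $z_e = E(c)$ with $z_{e,t} \in \mathbb{R}^{D}$.
- Codebook $\mathcal{C} = \{e_k\}_{k=1}^{K}$ with learned embeddings $e_k \in \mathbb{R}^{D}$.
- For each frame $t$, quantization is performed: $q_t = \arg\min_{k} \lVert z_{e,t} - e_k \rVert_2$, $z_{q,t} = e_{q_t}$.
- Decoder: symmetric convolutional network $G$ reconstructs $\hat{c} = G(z_q)$.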
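The quantization step is a nearest-neighbour lookup, which a few lines of NumPy can illustrate. The function name `quantize`, the codebook size of 512, and the latent width of 16 are assumptions chosen for the example; the encoder and decoder networks are omitted.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder frame to its nearest codebook entry (Euclidean distance)."""
    # Squared distances between every frame (T, D) and every code (K, D).
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)     # (T,) discrete melodic tokens
    z_q = codebook[tokens]         # (T, D) quantized latents
    return tokens, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))   # assumed K=512; D=16 is illustrative
# Three frames lying very close to codes 3, 3 and 41.
z_e = codebook[[3, 3, 41]] + 0.01 * rng.normal(size=(3, 16))
tokens, z_q = quantize(z_e, codebook)
```

Because only the code indices are passed on, small perturbations of the encoder output (for example, separation artifacts) that do not change the nearest code leave the token sequence unchanged.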
2.3 Bottleneck Losses
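- Reconstruction loss (with $c$ the input chromagram and $\hat{c}$ its reconstruction):
$$\mathcal{L}_{\text{rec}} = \lVert c - \hat{c} \rVert_2^2$$
- Commitment loss (with $z_e$ the encoder output and $z_q$ its quantized version):
$$\mathcal{L}_{\text{commit}} = \lVert z_e - \operatorname{sg}[z_q] \rVert_2^2$$
($\operatorname{sg}[\cdot]$: stop-gradient.)
- Codebook updated via exponential moving averages [Van Den Oord et al., 2017].
- Total bottleneck loss, with commitment weight $\beta$:
$$\mathcal{L}_{\text{VQ}} = \mathcal{L}_{\text{rec}} + \beta\,\mathcal{L}_{\text{commit}}$$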
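The loss terms and the moving-average codebook update can be sketched as follows. This follows the standard VQ-VAE recipe rather than any confirmed detail of AnyAccomp: the squared-error reconstruction term, the weight `beta=0.25`, the decay `0.99`, and the Laplace smoothing constant are all assumed values.

```python
import numpy as np

def vq_losses(c, c_hat, z_e, z_q, beta=0.25):
    """Reconstruction + commitment terms; beta=0.25 is an assumed weight."""
    rec = np.mean((c - c_hat) ** 2)
    # z_q is treated as a constant here, which is the role stop-gradient
    # plays for the commitment term in an autodiff framework.
    commit = np.mean((z_e - z_q) ** 2)
    return rec, commit, rec + beta * commit

def ema_codebook_update(codebook, counts, sums, z_e, tokens, decay=0.99, eps=1e-5):
    """One exponential-moving-average codebook step (standard VQ-VAE style)."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[tokens]                            # (T, K) assignments
    counts = decay * counts + (1 - decay) * one_hot.sum(axis=0)
    sums = decay * sums + (1 - decay) * one_hot.T @ z_e
    # Laplace smoothing keeps rarely used codes from dividing by ~0.
    n = counts.sum()
    smoothed = (counts + eps) / (n + K * eps) * n
    return sums / smoothed[:, None], counts, sums

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
counts, sums = np.ones(2), codebook.copy()
z_e = np.array([[0.2, 0.0], [0.9, 1.1]])
tokens = np.array([0, 1])
rec, commit, total = vq_losses(np.zeros((2, 3)), np.ones((2, 3)), z_e, codebook[tokens])
new_codebook, counts, sums = ema_codebook_update(codebook, counts, sums, z_e, tokens)
```

Each code drifts slightly toward the mean of the encoder frames assigned to it, so the codebook needs no gradient of its own.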
3. Flow-Matching Accompaniment Generation
Upon obtaining the discrete melodic tokens $q$, AnyAccomp conditions an FM-Transformer on these codes to generate the accompaniment.
3.1 Flow Matching Formulation
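- Output: target accompaniment mel-spectrogram $x_1$
- Prior $x_0 \sim \mathcal{N}(0, I)$
- Linear interpolation:
$$x_t = (1 - t)\,x_0 + t\,x_1, \quad t \in [0, 1]$$
- Velocity:
$$v_t = \frac{\mathrm{d}x_t}{\mathrm{d}t} = x_1 - x_0$$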
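A short NumPy sketch of how one training sample is formed under this formulation. It assumes the common convention that noise sits at $t = 0$ and data at $t = 1$, and that $t$ is drawn uniformly from $[0, 1)$; the function name and the array shape are illustrative.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Sample (x_t, t, target velocity, x0) for one flow-matching step."""
    x0 = rng.standard_normal(x1.shape)   # draw from the Gaussian prior
    t = rng.uniform()                    # assumed: t ~ U(0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation
    v = x1 - x0                          # constant velocity along the path
    return x_t, t, v, x0

rng = np.random.default_rng(0)
x1 = rng.standard_normal((500, 80))      # stand-in for an 80-bin mel-spectrogram
x_t, t, v, x0 = fm_training_pair(x1, rng)
# Following the velocity from x_t for the remaining time (1 - t) lands on x1.
endpoint = x_t + (1.0 - t) * v
```

Because the path is a straight line, the regression target is the same for every $t$, which keeps training simple and lets a coarse ODE solver work at inference.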
3.2 FM-Transformer Model
- Model $v_\theta$ predicts the velocity $v_t$ from noisy latent $x_t$, time $t$, and melodic codes $q$.
- Architecture: 10 LLaMA-style decoder layers, hidden dimension $d = 1024$, totaling approximately 220M parameters.
- Conditioning: codes $q$ embedded via learnable lookup (size $512 \times 1024$) and prepended as tokens.
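To make the "prepended as tokens" conditioning concrete, the following sketch builds the transformer input sequence. It assumes a lookup table of 512 codes by 1024 dimensions and that the noisy mel frames have already been projected to the same width; that projection and all names here are hypothetical.

```python
import numpy as np

K, D = 512, 1024                         # assumed lookup size (codes x width)
rng = np.random.default_rng(0)
# Random values stand in for a table that would be learned during training.
code_embedding = rng.normal(scale=0.02, size=(K, D))

def build_sequence(tokens, noisy_frames):
    """Prepend embedded melodic codes to the (already projected) noisy frames."""
    cond = code_embedding[tokens]                      # (T_cond, D)
    return np.concatenate([cond, noisy_frames], axis=0)

tokens = np.array([7, 7, 130, 511])
noisy_frames = rng.standard_normal((6, D))   # hypothetical projection of x_t
seq = build_sequence(tokens, noisy_frames)   # (4 + 6, D), condition tokens first
```

Placing the condition in the same sequence lets self-attention relate every noisy frame to every melodic token without a separate cross-attention module.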
3.3 Training Objectives
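- Flow-matching MSE (with $x_t = (1 - t)\,x_0 + t\,x_1$ the interpolated sample and $q$ the melodic codes):
$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\,x_0,\,x_1} \big\lVert v_\theta(x_t, t, q) - (x_1 - x_0) \big\rVert_2^2$$
- Representation Alignment (REPA) loss [Yu et al., 2025]:
$$\mathcal{L}_{\text{REPA}} = -\,\mathbb{E}\big[\operatorname{sim}\big(\phi(h^{(l)}),\, m^{(k)}\big)\big]$$
(matches layer-$l$ activations $h^{(l)}$ of the FM-Transformer, passed through a projection $\phi$, to the features $m^{(k)}$ of the $k$-th layer of a pre-trained MERT model; $\operatorname{sim}$ denotes cosine similarity.)
- Total generation loss, with weight $\lambda$:
$$\mathcal{L}_{\text{gen}} = \mathcal{L}_{\text{FM}} + \lambda\,\mathcal{L}_{\text{REPA}}$$
- Classifier-free guidance: drop $q$ with probability $p_{\text{drop}}$ during training; inference at CFG scale=3.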
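The training objective and the condition dropping can be sketched together. Everything specific here is an assumption: the negative-cosine form of the alignment term follows the general REPA recipe, `lam=1.0` and the drop probability are placeholder values, and `NULL_TOKEN` is a hypothetical extra index meaning "no conditioning".

```python
import numpy as np

NULL_TOKEN = 512   # hypothetical index reserved for the unconditional case

def maybe_drop_condition(tokens, p_drop, rng):
    """Replace the whole code sequence with NULL_TOKEN with probability p_drop."""
    if rng.uniform() < p_drop:
        return np.full_like(tokens, NULL_TOKEN)
    return tokens

def generation_loss(v_pred, v_target, h, feat, lam=1.0):
    """Flow-matching MSE plus a REPA-style negative cosine alignment term."""
    fm = np.mean((v_pred - v_target) ** 2)
    cos = np.sum(h * feat, axis=-1) / (
        np.linalg.norm(h, axis=-1) * np.linalg.norm(feat, axis=-1) + 1e-8)
    repa = -np.mean(cos)   # perfectly aligned features give -1
    return fm + lam * repa, fm, repa

rng = np.random.default_rng(0)
tokens = np.array([5, 5, 17, 300])
dropped = maybe_drop_condition(tokens, p_drop=1.0, rng=rng)
kept = maybe_drop_condition(tokens, p_drop=0.0, rng=rng)

h = rng.standard_normal((5, 8))   # stand-in for projected transformer activations
total, fm, repa = generation_loss(np.zeros((5, 8)), np.ones((5, 8)), h, h)
```

Training on both conditioned and unconditioned inputs is what later allows the two velocity predictions to be mixed at inference.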
4. Training Regime and Evaluation Protocol
4.1 Datasets
- Training: approximately 8,000 h of singing–accompaniment pairs from the SingNet pipeline (MSS-based separation, 3–30 s clips).
- Evaluation:
  - YuE (3,000 separated vocals; in-domain)
  - MUSDB18 (2,777 clean vocal stems)
  - MoisesDB (2,500 solo instrumental tracks)
  - (All test material: 10 s clips, 24 kHz.)
4.2 Preprocessing and Hyperparameters
- Audio resampled to 24 kHz.
- Compute 80-bin mel-spectrograms (accompaniment target) and 24-bin chromagrams (conditioning).
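- VQ-VAE: 0.5M steps, batch size = 200 s (aggregate), AdamW (learning rate not stated here), no warmup.
- FM-Transformer: 1M steps, per-GPU batch = 100 s, AdamW (learning rate not stated here), 32k warmup steps, classifier-free-guidance condition dropout (probability not stated here).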
- Inference: 50-step Euler sampler, CFG=3.
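The inference setting can be illustrated with a minimal Euler sampler. The guidance rule used here, `v_uncond + scale * (v_cond - v_uncond)`, is one common classifier-free-guidance form and is an assumption, as are the function names; `velocity_fn` stands in for the trained FM-Transformer, and the toy field below exists only to exercise the loop.

```python
import numpy as np

def euler_sample(velocity_fn, tokens, shape, steps=50, cfg_scale=3.0, rng=None):
    """Integrate dx/dt = v from t=0 (noise) to t=1 with fixed-step Euler."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)          # start from the Gaussian prior
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(x, t, tokens)  # conditioned on melodic codes
        v_uncond = velocity_fn(x, t, None)  # condition dropped
        v = v_uncond + cfg_scale * (v_cond - v_uncond)   # assumed CFG rule
        x = x + dt * v
    return x

# Toy velocity field that ignores the condition and points straight at a
# fixed target; for a single target this is the exact straight-path velocity.
target = np.full((4, 3), 2.0)

def toy_velocity(x, t, tokens):
    return (target - x) / (1.0 - t)

out = euler_sample(toy_velocity, tokens=np.array([1, 2]), shape=target.shape,
                   rng=np.random.default_rng(0))
```

Each step costs two forward passes (conditional and unconditional), so 50 steps correspond to 100 model evaluations unless the two are batched together.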
- Vocoder: fine-tuned from Vevo on project music data.
5. Empirical Results
5.1 Objective Evaluation
Metrics: Accompaniment Prompt Adherence (APA↑), Fréchet Audio Distance (FAD↓), audiobox-aesthetics scores (CE↑, CU↑, PQ↑), and Production Complexity (PC).
| Dataset | Model | APA↑ | FAD↓ | CE↑ | CU↑ | PQ↑ | PC |
|---|---|---|---|---|---|---|---|
| YuE | FastSAG | 0.444 | 0.598 | 6.351 | 6.821 | 6.814 | 6.321 |
| YuE | FM-Mel | 0.806 | 0.416 | 6.964 | 7.725 | 7.758 | 5.614 |
| YuE | FM-Chroma | 0.633 | 0.418 | 7.151 | 7.801 | 7.909 | 5.436 |
| YuE | AnyAccomp | 0.713 | 0.414 | 7.283 | 7.903 | 7.989 | 5.742 |
| MUSDB18 | FastSAG | 0.000 | 1.115 | 4.853 | 5.789 | 6.315 | 5.778 |
| MUSDB18 | FM-Mel | 0.167 | 0.999 | 5.202 | 6.616 | 6.841 | 4.090 |
| MUSDB18 | FM-Chroma | 0.704 | 0.798 | 7.017 | 7.598 | 7.744 | 5.104 |
| MUSDB18 | AnyAccomp | 0.710 | 0.788 | 7.277 | 7.804 | 7.891 | 5.498 |
| MoisesDB | FastSAG | 0.000 | 0.904 | 5.966 | 6.507 | 6.696 | 5.952 |
| MoisesDB | FM-Mel | 0.000 | 0.936 | 5.424 | 6.923 | 7.151 | 3.804 |
| MoisesDB | FM-Chroma | 0.157 | 0.849 | 6.308 | 7.377 | 7.508 | 4.110 |
| MoisesDB | AnyAccomp | 0.203 | 0.890 | 6.660 | 7.581 | 7.581 | 4.798 |
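On clean vocals (MUSDB18), AnyAccomp holds APA at 0.710, close to its in-domain 0.713, while FastSAG falls to 0.000 and FM-Mel to 0.167. On solo instruments (MoisesDB), APA is lower for every system, but AnyAccomp still reaches 0.203 where FastSAG and FM-Mel collapse (APA ≈ 0). Its audiobox-aesthetics scores (CE, CU, PQ) are the highest on all three test sets.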
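For readers unfamiliar with the FAD column, the metric is the Fréchet distance between two Gaussians fitted to audio embeddings of reference and generated clips. The sketch below implements that closed form; which embedding model the evaluation used is not stated in this text, so the inputs here are random stand-ins and the function names are illustrative.

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    diff = mu1 - mu2
    # Tr((cov1 cov2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov1 @ cov2, which are real and non-negative for PSD inputs.
    eig = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

def fad(emb_ref, emb_gen):
    """Fit a Gaussian to each embedding set (rows = clips) and compare them."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    return frechet_distance(mu_r, cov_r, mu_g, cov_g)

rng = np.random.default_rng(0)
emb = rng.standard_normal((200, 4))      # stand-in embeddings, 4 dimensions
same = fad(emb, emb)                     # identical sets
shifted = fad(emb, emb + 2.0)            # same covariance, mean moved by 2 per dim
```

Identical sets give a distance of zero, and a pure mean shift contributes its squared length, which is why lower FAD indicates generated audio whose embedding statistics sit closer to the reference set.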
5.2 Subjective Evaluation
Mean Opinion Scores (MOS) (scale: 1–5; 20 listeners, 20 clips each):
| Dataset | Model | Quality↑ | Coherency↑ |
|---|---|---|---|
| YuE | FastSAG | 1.98 | 1.82 |
| YuE | AnyAccomp | 3.12 | 3.05 |
| MUSDB18 | FastSAG | 1.73 | 1.48 |
| MUSDB18 | AnyAccomp | 3.23 | 2.75 |
| MoisesDB | FastSAG | 1.62 | 1.52 |
| MoisesDB | AnyAccomp | 3.00 | 2.70 |
5.3 Ablation and Analysis
- FM-Mel (white-noised mel) and FM-Chroma (raw chroma) baselines isolate the VQ bottleneck's effect.
- FM-Chroma recovers some generalization, but only VQ-Chroma (AnyAccomp) fully prevents overfitting.
- Removing the REPA loss degrades FAD by about 0.05 and APA by about 0.02, indicating that representation alignment contributes to both fidelity and prompt adherence.
6. Limitations and Future Directions
Limitations include bottleneck resolution (50 Hz, 24 bins), which may miss fine pitch ornaments, and lack of explicit harmonic modeling beyond the melody. Computational demands are pronounced: a 220M-parameter transformer and iterative flow-matching with 50 sampling steps.
Potential future work includes adjusting bottleneck parameters for a trade-off between representation fidelity and compactness, exploring alternative front-ends such as Constant-Q Transform or pitch-contour encoders, enriching conditioning with chord/harmony or rhythm descriptors, and accelerating generation via distillation or fewer flow steps.
By decoupling accompaniment generation from separation artifacts, AnyAccomp is the first SAG model to generalize robustly to clean vocal and solo instrumental inputs, enabling universal co-creation tools that respond flexibly to a range of musical prompts (Zhang et al., 17 Sep 2025).