
AnyAccomp: Robust Singing Accompaniment Framework

Updated 16 April 2026
  • AnyAccomp is a framework for singing accompaniment generation that decouples artifact cues from the vocal melody using a quantized, timbre-invariant bottleneck.
  • It leverages a VQ-VAE to extract robust melodic tokens and a conditional flow-matching transformer to generate high-quality instrumental accompaniments.
  • The approach achieves improved Accompaniment Prompt Adherence and audio fidelity across clean vocals and solo instrumental tracks.

AnyAccomp is a framework for Singing Accompaniment Generation (SAG) that addresses the generalization failures of prior SAG models by disentangling accompaniment generation from source-dependent artifacts. Standard approaches are trained on vocals separated via Music Source Separation (MSS) and overfit to MSS-specific artifacts. AnyAccomp instead employs a quantized, timbre-invariant melodic bottleneck followed by a conditional flow-matching model, enabling accompaniment generation for clean vocals and solo instrumental tracks as well as separated vocals. In the reported evaluation, its Accompaniment Prompt Adherence (APA) is nearly unchanged between separated vocals (0.713 on YuE) and clean vocals (0.710 on MUSDB18) (Zhang et al., 17 Sep 2025).

1. Problem Formulation and Limitations of Prior Art

Singing Accompaniment Generation (SAG) seeks to generate a full-band instrumental accompaniment given a reference vocal melody (clean or separated), typically by mapping a mel-spectrogram input to an output waveform. This task underpins creative workflows such as rapid prototyping and interactive music co-creation for both producers and amateurs.

State-of-the-art SAG systems, including SingSong and FastSAG, rely on vocals extracted by MSS for training. Mel-spectrogram or self-supervised learning (SSL) based vocal embeddings retain source-separation artifacts. This induces a pronounced train–test mismatch: models learn to depend on artifact cues present in separated vocals, but not in clean, studio-recorded vocals or solo instrument recordings. The result is a collapse in prompt adherence—reflected in the Accompaniment Prompt Adherence (APA) metric dropping to zero—on clean or instrumental test data.

2. Quantized Melodic Bottleneck

AnyAccomp introduces a two-stage pipeline, the first of which extracts a discrete, timbre-invariant, melody-centric representation using a chromagram and a Vector Quantized Variational Autoencoder (VQ-VAE).

2.1 Chromagram Extraction

  • The input waveform is resampled to 24 kHz.
  • Short-Time Fourier Transform (STFT) is computed with a 1024-sample window and 256-sample hop.
  • The magnitude spectrum is projected onto 24 pitch classes (C, C♯, …, B), generating a dense chromagram $x \in \mathbb{R}^{T \times 24}$ at a 50 Hz frame rate.
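The projection step above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the A4 = 440 Hz tuning reference, the hard nearest-bin assignment, and the helper names `chroma_bin` and `chroma_frame` are assumptions, and bin 0 here is the pitch class of the reference frequency rather than C.

```python
import math

SAMPLE_RATE = 24_000  # Hz, from the text
N_FFT = 1024          # STFT window length in samples, from the text
N_CHROMA = 24         # chroma bins per octave, from the text
REF_HZ = 440.0        # assumption: tuning reference, not stated in the text


def chroma_bin(freq_hz: float) -> int:
    """Map a frequency to the nearest of N_CHROMA octave-folded bins."""
    steps = N_CHROMA * math.log2(freq_hz / REF_HZ)
    return round(steps) % N_CHROMA


def chroma_frame(magnitudes: list[float]) -> list[float]:
    """Fold one STFT magnitude frame (N_FFT // 2 + 1 bins) into N_CHROMA bins."""
    out = [0.0] * N_CHROMA
    # Skip the DC bin (k = 0): it has no pitch and log2(0) is undefined.
    for k in range(1, len(magnitudes)):
        freq = k * SAMPLE_RATE / N_FFT  # center frequency of FFT bin k
        out[chroma_bin(freq)] += magnitudes[k]
    return out
```

With 24 bins per octave each bin spans a quarter tone, so two notes a semitone apart land two bins apart, and octave-related frequencies fall into the same bin.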

2.2 VQ-VAE Architecture

  • The encoder $z_e(\cdot)$ is a 44M-parameter convolutional network mapping $x \in \mathbb{R}^{T \times 24}$ to $z_e(x) \in \mathbb{R}^{T \times D}$ with $D = 64$.
  • The codebook $E \in \mathbb{R}^{K \times D}$ holds $K = 512$ learned embeddings $e_k$.
  • For each frame $t$, quantization selects the nearest codebook entry:

$$k^* = \underset{1 \leq k \leq K}{\mathrm{argmin}}\, \| z_e(x)_t - e_k \|_2$$

$$z_q(x)_t = e_{k^*}$$

  • Decoder: a symmetric convolutional network reconstructs the chromagram $\hat{x}$ from $z_q(x)$.
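The per-frame quantization rule can be illustrated with a small pure-Python sketch. The function name `quantize` and the toy sizes are hypothetical; indices are zero-based here, whereas the formula above counts from 1. Squared distance is used because it yields the same argmin as the L2 norm.

```python
def quantize(z_e, codebook):
    """Nearest-neighbour vector quantization.

    z_e:      list of T encoder frames, each a list of D floats
    codebook: list of K embeddings, each a list of D floats
    Returns (indices, z_q): the chosen index and embedding per frame.
    """
    indices, z_q = [], []
    for frame in z_e:
        # Squared Euclidean distance from this frame to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(frame, e)) for e in codebook]
        k_star = min(range(len(codebook)), key=dists.__getitem__)
        indices.append(k_star)
        z_q.append(codebook[k_star])
    return indices, z_q


# Toy example with K = 3 entries of dimension D = 2.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
toy_indices, toy_z_q = quantize([[0.9, 1.2], [3.0, 0.1]], toy_codebook)
```

Only the index sequence needs to be passed downstream, which is what makes the bottleneck discrete: any timbre or artifact detail not captured by the chosen codebook entry is discarded.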

2.3 Bottleneck Losses

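  • Reconstruction loss:

$$\mathcal{L}_{\text{rec}} = \| x - \hat{x} \|_2^2$$

  • Commitment loss:

$$\mathcal{L}_{\text{commit}} = \| z_e(x) - \mathrm{sg}[z_q(x)] \|_2^2$$

($\mathrm{sg}[\cdot]$: stop-gradient.)

$$\mathcal{L}_{\text{VQ}} = \mathcal{L}_{\text{rec}} + \beta\, \mathcal{L}_{\text{commit}}$$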
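A minimal numeric sketch of the two bottleneck terms, assuming squared-error forms for both and a weighted sum as the combined objective. The helper names are hypothetical, and `beta=0.25` is the default from the original VQ-VAE paper, not a value stated for AnyAccomp. The stop-gradient only affects backpropagation, so the forward value of the commitment term is a plain squared distance.

```python
def mean_squared_error(a, b):
    """Mean squared error between two equally shaped lists of frames."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for u, v in zip(frame_a, frame_b):
            total += (u - v) ** 2
            count += 1
    return total / count


def bottleneck_loss(x, x_hat, z_e, z_q, beta=0.25):
    """Return (total, reconstruction, commitment) for one example.

    beta is an assumed weighting; the source does not give its value.
    """
    rec = mean_squared_error(x, x_hat)
    # Forward value only: sg[z_q] would block gradients into the codebook.
    commit = mean_squared_error(z_e, z_q)
    return rec + beta * commit, rec, commit
```

In an autodiff framework the same computation would wrap `z_q` in the framework's stop-gradient (detach) operation so that the commitment term updates only the encoder.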

3. Flow-Matching Accompaniment Generation

Upon obtaining the discrete melodic tokens $c = (k^*_1, \dots, k^*_T)$, AnyAccomp conditions an FM-Transformer on these codes to generate the accompaniment.

3.1 Flow Matching Formulation

  • Output: target accompaniment mel-spectrogram $x_1$
  • Prior $x_0 \sim \mathcal{N}(0, I)$
  • Linear interpolation:

$$x_t = (1 - t)\, x_0 + t\, x_1, \quad t \in [0, 1]$$

  • Velocity:

$$v_t = \frac{d x_t}{d t} = x_1 - x_0$$
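The formulation can be made concrete with a short sketch, assuming the standard linear (rectified-flow style) path from a Gaussian prior sample to the target. The function names are hypothetical and vectors are flattened to plain lists for brevity.

```python
import random


def sample_prior(n, rng):
    """Draw an n-dimensional standard Gaussian sample x_0."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def interpolate(x0, x1, t):
    """Point on the straight path between x0 (t = 0) and x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


rng = random.Random(0)
x1_demo = [1.0, -2.0, 0.5]           # stand-in for a flattened mel-spectrogram
x0_demo = sample_prior(len(x1_demo), rng)
xt_demo = interpolate(x0_demo, x1_demo, 0.3)
v_demo = target_velocity(x0_demo, x1_demo)
```

Because the path is a straight line, the regression target for the network does not depend on `t`; only the input `x_t` does.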

3.2 FM-Transformer Model

  • Model $v_\theta$ predicts $v_t$ from the noisy latent $x_t$, time $t$, and melodic codes $c$.
  • Architecture: 10 LLaMA-style decoder layers, hidden dimension $1024$, totaling $\approx 220$M parameters.
  • Conditioning: codes $c$ embedded via a learnable lookup table (size $512 \times 1024$) and prepended as tokens.
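The conditioning scheme (embedding lookup, then prepending) can be sketched as below. This is illustrative only: the function name `build_input_sequence` is hypothetical, the toy table is 4 × 3 instead of 512 × 1024, and details such as positional encoding or how the time step is injected are not covered by the source and are omitted.

```python
def build_input_sequence(codes, embedding_table, latent_tokens):
    """Embed melodic codes by table lookup and prepend them to the latent tokens.

    codes:           list of integer code indices (one per frame)
    embedding_table: list of K rows, each a list of H floats
    latent_tokens:   list of noisy latent frames, each a list of H floats
    """
    condition_tokens = [embedding_table[k] for k in codes]
    # The transformer would attend over condition tokens followed by latents.
    return condition_tokens + latent_tokens


toy_table = [[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
toy_sequence = build_input_sequence([2, 0], toy_table, [[9.0, 9.0, 9.0]])
```

Since the table rows would be learned jointly with the transformer, the model sees only what the code indices carry, not the raw vocal signal.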

3.3 Training Objectives

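  • Flow-matching MSE:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2$$

  • Representation alignment (REPA) loss:

$$\mathcal{L}_{\text{REPA}} = -\,\mathbb{E} \left[ \mathrm{sim}\!\left( \phi(h_\ell),\, f_m \right) \right]$$

(matches layer-$\ell$ activations $h_\ell$ of the FM-Transformer to the features $f_m$ of the $m$-th layer of a pre-trained MERT model; $\phi$ is a projection head and $\mathrm{sim}$ a similarity function.)

  • Total generation loss:

$$\mathcal{L}_{\text{gen}} = \mathcal{L}_{\text{FM}} + \lambda\, \mathcal{L}_{\text{REPA}}$$

  • Classifier-free guidance: drop $c$ with probability $p_{\text{drop}}$ during training; inference at CFG scale = 3.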
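A small sketch of the flow-matching regression term and the condition dropout used for classifier-free guidance, assuming the linear-path velocity target $x_1 - x_0$. The function names are hypothetical, `None` is an assumed way of signalling "no condition", and no particular dropout probability is implied.

```python
import random


def flow_matching_mse(v_pred, x0, x1):
    """Mean squared error between a predicted velocity and the target x1 - x0."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)


def maybe_drop_condition(codes, p_drop, rng):
    """Replace the melodic codes by None with probability p_drop.

    Training on both conditioned and unconditioned inputs is what lets a
    single network provide both branches needed for guidance at inference.
    """
    return None if rng.random() < p_drop else codes
```

For example, `flow_matching_mse([1.0, 1.0], [0.0, 0.0], [1.0, 1.0])` evaluates to `0.0` because the prediction equals the target velocity exactly.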

4. Training Regime and Evaluation Protocol

4.1 Datasets

  • Training: approximately 8,000 h of singing–accompaniment pairs from the SingNet pipeline (MSS-based separation, 3–30 s clips).
  • Evaluation:
    • YuE (3,000 separated vocals; in-domain)
    • MUSDB18 (2,777 clean vocal stems)
    • MoisesDB (2,500 solo instrumental tracks)
    • (All test material: 10 s clips, 24 kHz.)

4.2 Preprocessing and Hyperparameters

  • Audio resampled to 24 kHz.
  • Compute 80-bin mel-spectrograms (accompaniment target) and 24-bin chromagrams (conditioning).
  • VQ-VAE: 0.5M steps, batch size = 200 s (aggregate), AdamW, lr = […], no warmup.
  • FM-Transformer: 1M steps, per-GPU batch = 100 s, AdamW, lr = […], 32k warmup steps, CFG condition dropout probability […].
  • Inference: 50-step Euler sampler, CFG=3.
  • Vocoder: fine-tuned from Vevo on project music data.
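The inference settings above (50 Euler steps, CFG scale 3) can be sketched as a generic sampler. This is an assumption-laden illustration: the `velocity_fn(x, t, codes)` signature is hypothetical, passing `codes=None` is an assumed convention for the unconditional branch, and the guidance rule `v_uncond + scale * (v_cond - v_uncond)` is one common formulation that the source does not spell out.

```python
def euler_sample(velocity_fn, x0, codes, num_steps=50, cfg_scale=3.0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler.

    velocity_fn: callable (x, t, codes_or_None) -> list of floats
    x0:          prior sample as a flat list of floats
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = velocity_fn(x, t, codes)
        v_uncond = velocity_fn(x, t, None)
        # Classifier-free guidance: extrapolate away from the unconditional field.
        v = [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Each step costs two network evaluations (conditional and unconditional), so 50 steps correspond to 100 forward passes unless the two branches are batched together.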

5. Empirical Results

5.1 Objective Evaluation

Metrics: Accompaniment Prompt Adherence (APA↑), Fréchet Audio Distance (FAD↓), audiobox-aesthetics scores (CE↑, CU↑, PQ↑), and Production Complexity (PC).

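| Dataset | Model | APA↑ | FAD↓ | CE↑ | CU↑ | PQ↑ | PC |
|---|---|---|---|---|---|---|---|
| YuE | FastSAG | 0.444 | 0.598 | 6.351 | 6.821 | 6.814 | 6.321 |
| YuE | FM-Mel | 0.806 | 0.416 | 6.964 | 7.725 | 7.758 | 5.614 |
| YuE | FM-Chroma | 0.633 | 0.418 | 7.151 | 7.801 | 7.909 | 5.436 |
| YuE | AnyAccomp | 0.713 | 0.414 | 7.283 | 7.903 | 7.989 | 5.742 |
| MUSDB18 | FastSAG | 0.000 | 1.115 | 4.853 | 5.789 | 6.315 | 5.778 |
| MUSDB18 | FM-Mel | 0.167 | 0.999 | 5.202 | 6.616 | 6.841 | 4.090 |
| MUSDB18 | FM-Chroma | 0.704 | 0.798 | 7.017 | 7.598 | 7.744 | 5.104 |
| MUSDB18 | AnyAccomp | 0.710 | 0.788 | 7.277 | 7.804 | 7.891 | 5.498 |
| MoisesDB | FastSAG | 0.000 | 0.904 | 5.966 | 6.507 | 6.696 | 5.952 |
| MoisesDB | FM-Mel | 0.000 | 0.936 | 5.424 | 6.923 | 7.151 | 3.804 |
| MoisesDB | FM-Chroma | 0.157 | 0.849 | 6.308 | 7.377 | 7.508 | 4.110 |
| MoisesDB | AnyAccomp | 0.203 | 0.890 | 6.660 | 7.581 | 7.581 | 4.798 |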

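On clean and instrumental data (MUSDB18, MoisesDB), AnyAccomp retains the highest prompt adherence (APA 0.710 and 0.203, respectively) and the highest audiobox-aesthetics scores (CE, CU, PQ), whereas FastSAG collapses to APA $= 0$ on both datasets and FM-Mel falls to 0.167 and 0.000.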

5.2 Subjective Evaluation

Mean Opinion Scores (MOS) (scale: 1–5; 20 listeners, 20 clips each):

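| Dataset | Model | Quality↑ | Coherency↑ |
|---|---|---|---|
| YuE | FastSAG | 1.98 | 1.82 |
| YuE | AnyAccomp | 3.12 | 3.05 |
| MUSDB18 | FastSAG | 1.73 | 1.48 |
| MUSDB18 | AnyAccomp | 3.23 | 2.75 |
| MoisesDB | FastSAG | 1.62 | 1.52 |
| MoisesDB | AnyAccomp | 3.00 | 2.70 |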

5.3 Ablation and Analysis

  • FM-Mel (white-noised mel) and FM-Chroma (raw chroma) baselines isolate the VQ bottleneck's effect.
  • FM-Chroma recovers some generalization, but only VQ-Chroma (AnyAccomp) fully prevents overfitting.
  • Removing the REPA loss degrades FAD by roughly 0.05 and APA by roughly 0.02, indicating that representation alignment contributes to both audio fidelity and prompt adherence.

6. Limitations and Future Directions

Limitations include bottleneck resolution (50 Hz, 24 bins), which may miss fine pitch ornaments, and lack of explicit harmonic modeling beyond the melody. Computational demands are pronounced: a 220M-parameter transformer and iterative flow-matching with 50 sampling steps.

Potential future work includes adjusting bottleneck parameters for a trade-off between representation fidelity and compactness, exploring alternative front-ends such as Constant-Q Transform or pitch-contour encoders, enriching conditioning with chord/harmony or rhythm descriptors, and accelerating generation via distillation or fewer flow steps.

By decoupling accompaniment generation from separation artifacts, AnyAccomp is the first SAG model to generalize robustly to clean vocal and solo instrumental inputs, enabling universal co-creation tools that respond flexibly to a range of musical prompts (Zhang et al., 17 Sep 2025).
