MIDI-Informed Singing Accompaniment (MIDI-SAG)
- MIDI-SAG is the process of generating instrumental backing tracks conditioned on symbolic vocal-melody MIDI, offering clear rhythmic and harmonic priors.
- It leverages MIDI-derived chord progressions and vocal pitch contours to enhance controllability, reduce data requirements, and improve system efficiency.
- Evaluations using metrics like APA and Rhythm F1 demonstrate that MIDI-SAG outperforms audio-only methods in achieving accurate rhythm and key alignment.
MIDI-Informed Singing Accompaniment Generation (MIDI-SAG) is the task of generating an instrumental backing track for singing while conditioning explicitly on the symbolic vocal-melody MIDI rather than relying only on vocal audio. In the compositional formulation introduced in 2026, it appears as one stage of a song-generation pipeline whose variables are the lyrics, a text description, the vocal-melody MIDI score, the singing voice waveform, a chord progression derived from the MIDI score, and the final song audio. In that formulation, MIDI-SAG is the accompaniment-generation stage, and its central premise is that symbolic melody and chord information provide explicit rhythmic and harmonic priors that are difficult to recover reliably from vocal audio alone (Tsai et al., 24 Feb 2026).
1. Definition, scope, and motivation
The defining distinction of MIDI-SAG is between conditioning on vocal audio alone and conditioning on vocal audio together with symbolic melody. In the compositional song-generation formulation, conventional audio-SAG is treated as learning the distribution of the accompaniment given the vocal audio alone, whereas MIDI-SAG learns the distribution of the accompaniment given the vocal audio together with the symbolic melody and its derived chords or, when ground-truth MIDI is unavailable at inference, the same distribution with the symbolic inputs extracted automatically from vocals. The explicit rationale is fourfold: MIDI provides a clear rhythmic grid; melody-conditioned harmonization yields explicit harmonic structure through AccoMontage2; symbolic priors reduce data and compute requirements; and symbolic intermediates improve controllability and editability (Tsai et al., 24 Feb 2026).
This motivation is partly a response to limitations reported for end-to-end text-to-song systems. The compositional pipeline argues that many open-source end-to-end systems require tens or hundreds of thousands of hours of audio and many high-end GPUs of the A100/H100/H800 class, whereas the newly trained MIDI-SAG component in the compositional pipeline uses a much smaller training set and a single RTX 3090. The same comparison is linked to finer editability: melody, harmony, structure, and voice can be modified at intermediate stages instead of being entangled inside a monolithic waveform model (Tsai et al., 24 Feb 2026).
The broader literature situates MIDI-SAG at the intersection of three pre-existing problem families. Lead-conditional symbolic accompaniment generation had already been developed in multi-track MIDI systems such as PopMAG, where accompaniment is generated from a lead melody in MuMIDI representation (Ren et al., 2020). Audio-conditioned singing accompaniment generation had been formulated directly in waveform-token space by systems such as SingSong (Donahue et al., 2023). Real-time computer accompaniment had independently addressed score following, tempo adaptation, and expressive accompaniment in systems such as ACCompanion and SongDriver (Cancino-Chacón et al., 2017, Wang et al., 2022). MIDI-SAG inherits from all three traditions, but combines them around the symbolic vocal melody as the primary structural control variable.
2. Symbolic conditioning and representational foundations
Across the literature, three conditioning regimes recur.
| Conditioning regime | Typical inputs | Representative works |
|---|---|---|
| Symbolic lead-conditioned generation | vocal/lead MIDI, bar/position tokens, chord tracks | PopMAG; Calliope/MMM |
| Audio-conditioned accompaniment generation | vocal waveform, semantic/acoustic tokens, text prompts | SingSong; Melodist; Llambada; AnyAccomp |
| Hybrid symbolic-audio accompaniment | vocal MIDI, chords, vocal audio, text, reference audio | ComposerFlow MIDI-SAG; MelodyLM |
The most explicit symbolic accompaniment representation in the cited work is MuMIDI. PopMAG defines accompaniment generation as a conditional seq2seq problem over MuMIDI, in which a set of conditional tracks is given and a set of target tracks is generated; the core modeling objective is the likelihood of the target-track tokens given the conditional-track tokens. MuMIDI merges multiple tracks into a single token stream using bar tokens, position tokens, and note attributes, with notes emitted as pitch–velocity–duration triples. The representation is explicitly bar-aware and position-aware, and PopMAG augments it with encoder and decoder memories for long-context modeling. Its design is described as applying almost directly to a MIDI-SAG setup if the lead melody is a vocal melody transcribed to MIDI (Ren et al., 2020).
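To make the bar- and position-aware idea concrete, the following is a minimal sketch of a merged multi-track token stream in the spirit of MuMIDI. It is not PopMAG's actual vocabulary: the token names, the 16-steps-per-bar grid, and the `tokenize` helper are all illustrative assumptions.

```python
from typing import List, Tuple

# (track, onset_in_steps, pitch, velocity, duration_in_steps)
Note = Tuple[str, int, int, int, int]

STEPS_PER_BAR = 16  # assumption: 4/4 time at 16th-note resolution


def tokenize(notes: List[Note]) -> List[str]:
    """Merge notes from all tracks into one bar/position-aware token stream."""
    tokens: List[str] = []
    current_bar, current_pos, current_track = -1, -1, None
    # Sort by onset first so that tracks are interleaved in time order.
    for track, onset, pitch, vel, dur in sorted(notes, key=lambda n: (n[1], n[0], n[2])):
        bar, pos = divmod(onset, STEPS_PER_BAR)
        if bar != current_bar:
            # One "Bar" token per bar boundary crossed, so empty bars stay visible.
            tokens.extend(["Bar"] * (bar - current_bar))
            current_bar, current_pos, current_track = bar, -1, None
        if pos != current_pos:
            tokens.append(f"Pos_{pos}")
            current_pos, current_track = pos, None
        if track != current_track:
            tokens.append(f"Track_{track}")
            current_track = track
        # Each note is emitted as a pitch-velocity-duration triple.
        tokens.extend([f"Pitch_{pitch}", f"Vel_{vel}", f"Dur_{dur}"])
    return tokens


example = tokenize([
    ("melody", 0, 60, 90, 4),
    ("piano", 0, 48, 70, 8),
    ("melody", 16, 62, 90, 4),
])
```

In a MIDI-SAG reading of this sketch, the `melody` track would play the role of the conditional (vocal) track and the remaining tracks would be the generation targets.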
A second symbolic formulation appears in MelodyLM, where MIDI is modeled directly as an event sequence built from pitch tokens and duration tokens. MelodyLM further expands MIDI pitches over durations into per-frame pitch sequences before vocal generation, and its ablation results show that unexpanded MIDI degrades prosody relative to per-frame symbolic conditioning. This establishes a concrete precedent for treating MIDI not only as a score-level control but as a temporally dense conditioning signal (Li et al., 2024).
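The expansion from note events to a frame-level pitch sequence can be sketched as follows. This is an illustration of the general idea only: the `(pitch, duration_seconds)` event format, the use of pitch `0` for rests, and the default frame rate are assumptions, not details taken from MelodyLM.

```python
from typing import List, Tuple


def expand_to_frames(events: List[Tuple[int, float]], frame_rate: float = 50.0) -> List[int]:
    """Expand (pitch, duration_seconds) events into one pitch value per frame."""
    frames: List[int] = []
    elapsed = 0.0
    for pitch, duration in events:
        elapsed += duration
        # Round the cumulative end time so per-note rounding errors do not accumulate.
        end_frame = round(elapsed * frame_rate)
        frames.extend([pitch] * (end_frame - len(frames)))
    return frames


# Two notes separated by a short rest, at a coarse 8 frames per second.
per_frame = expand_to_frames([(60, 0.5), (0, 0.25), (62, 0.25)], frame_rate=8)
```

The result has one entry per frame, so it can be aligned directly with frame-level acoustic features, which is what makes it a temporally dense control.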
The compositional MIDI-SAG pipeline extends symbolic conditioning beyond melody. At training, vocal pitch contour, rhythm, chord, structure, key, and reference audio are extracted from ground-truth audio; at inference, rhythm comes from vocal MIDI, chords from AccoMontage2 harmonization of the melody, structure from user-provided section tags, and key from user input or vocal MIDI. Chords are represented as time-varying 12-bin chromagrams; beats and downbeats become binary time series that are Gaussian-smoothed into rhythm activation curves; and all controls are interpolated to match the diffusion model’s latent timeline (Tsai et al., 24 Feb 2026).
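The three control transformations described above (a time-varying 12-bin chord chromagram, Gaussian-smoothed beat activations, and interpolation to the latent timeline) can be sketched with NumPy. The function names, the frame rate, the smoothing width, and the latent length are assumptions chosen for illustration; the source does not specify them here.

```python
import numpy as np


def chord_chromagram(chords, n_frames, frame_rate):
    """chords: list of (start_sec, end_sec, pitch_classes) -> (n_frames, 12) binary matrix."""
    chroma = np.zeros((n_frames, 12), dtype=np.float32)
    for start, end, pitch_classes in chords:
        a, b = int(round(start * frame_rate)), int(round(end * frame_rate))
        chroma[a:b, list(pitch_classes)] = 1.0
    return chroma


def rhythm_activation(beat_times, n_frames, frame_rate, sigma_frames=2.0):
    """Turn beat times into a binary impulse train, then smooth it with a Gaussian kernel."""
    impulses = np.zeros(n_frames, dtype=np.float32)
    idx = np.round(np.asarray(beat_times) * frame_rate).astype(int)
    impulses[idx[(idx >= 0) & (idx < n_frames)]] = 1.0
    radius = int(np.ceil(3 * sigma_frames))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma_frames) ** 2)  # unit height at the beat itself
    return np.convolve(impulses, kernel, mode="same")


def resample_to_latent(control, n_latent):
    """Linearly interpolate a (T,) or (T, D) control onto n_latent time steps."""
    control = np.asarray(control, dtype=np.float64)
    if control.ndim == 1:
        control = control[:, None]
    src = np.linspace(0.0, 1.0, control.shape[0])
    dst = np.linspace(0.0, 1.0, n_latent)
    return np.stack([np.interp(dst, src, control[:, d]) for d in range(control.shape[1])], axis=1)


# Two seconds of controls at an assumed 50 Hz: C major, then G major, with three beats.
chroma = chord_chromagram([(0.0, 1.0, (0, 4, 7)), (1.0, 2.0, (7, 11, 2))], 100, 50)
rhythm = rhythm_activation([0.5, 1.0, 1.5], 100, 50)
chroma_latent = resample_to_latent(chroma, 40)   # assumed latent length of 40 steps
rhythm_latent = resample_to_latent(rhythm, 40)
```

The downbeat curve would be produced the same way from downbeat times; keeping each control as its own time series makes it straightforward to edit one of them without touching the others.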
Audio-only systems also contain representation ideas relevant to MIDI-SAG. AnyAccomp introduces a quantized melodic bottleneck built from a 24-bin chromagram at 50 Hz and a VQ-VAE with codebook size 512, producing a discrete, timbre-invariant melody representation. The paper explicitly frames this as a way to isolate melody from source-dependent artifacts. A plausible implication is that such learned discrete melodic codes can serve as an intermediate representation when symbolic MIDI is unavailable or unreliable, while true MIDI remains the cleaner and more controllable source when it is available (Zhang et al., 17 Sep 2025).
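The quantization step behind such a bottleneck amounts to a nearest-neighbour lookup in a codebook. The sketch below shows only that lookup, under loud simplifications: the codebook is random rather than learned, and the lookup is applied directly to 24-bin chroma frames, whereas a VQ-VAE applies it to learned encoder outputs. The `quantize` helper is hypothetical.

```python
import numpy as np


def quantize(frames, codebook):
    """Map each row of frames (T, D) to the index of its nearest codebook row (K, D)."""
    # Squared Euclidean distance between every frame and every code vector.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


rng = np.random.default_rng(0)
codebook = rng.random((512, 24))   # 512 codes over 24 chroma bins (untrained stand-in)
chroma = rng.random((100, 24))     # 2 seconds of frames at 50 Hz (random stand-in)
codes = quantize(chroma, codebook) # one integer code per frame
```

Because each frame is reduced to a single integer, detail that does not change the nearest code is discarded, which is the sense in which the bottleneck can suppress source-dependent artifacts.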
3. Architectural families and generation mechanisms
The symbolic multi-track lineage is represented by PopMAG and Calliope/MMM. PopMAG uses a recurrent Transformer encoder–decoder with memory, token embedding dimension 512, 4 encoder layers, 8 decoder layers, 8 attention heads, dropout 0.1, and 49.01M parameters. Its encoder consumes conditional MuMIDI tokens, and its decoder generates accompaniment tokens bar by bar with top-k temperature-controlled stochastic sampling (Ren et al., 2020). Calliope wraps a Transformer-based bar-level symbolic model, MMM, inside a web-based co-creative workflow that supports MIDI upload, piano-roll editing, bar in-filling, batch generation, and DAW streaming. In a MIDI-SAG setting, the vocal melody is treated as a fixed track while accompaniment bars are in-filled around it (Tchemeube et al., 18 Apr 2025).
The audio-token lineage is represented by SingSong, Melodist, and Llambada. SingSong adapts AudioLM to accompaniment generation by modeling instrumental audio tokens conditioned on vocal audio tokens, with a T5.1.1 encoder–decoder for semantic plus coarse acoustic tokens and a decoder-only model for fine acoustic tokens. Its best-performing conditioning is semantic-only noisy vocal input, denoted S-SA, which was introduced to reduce overfitting to source-separation artifacts (Donahue et al., 2023). Melodist formulates vocal-to-accompaniment synthesis as a second-stage autoregressive acoustic-token model conditioned on vocal acoustic tokens and text-prompt embeddings, using a multi-scale Transformer over neural codec tokens and a tri-tower contrastive text encoder (Hong et al., 2024). Llambada similarly uses a two-stage T5 design over discrete semantic and acoustic tokens, factorizing generation into a semantic stage followed by an acoustic stage and thereby separating semantic accompaniment planning from acoustic realization (Trinh et al., 2024).
The most direct MIDI-SAG architecture in the cited material is the 2026 compositional pipeline. It adapts Stable Audio Open, a latent diffusion text-to-music model, using MuseControlLite. Conditioning streams include vocal pitch contour, rhythm, chord, key, structure, and reference audio. The denoising objective is the standard diffusion loss. MuseControlLite adapters inject the time-varying controls into the diffusion model’s cross-attention layers; the controls are interpolated to latent resolution and concatenated along the cross-attention feature dimension; and selected self-attention layers in the Stable Audio Open backbone are partially unfrozen to improve continuity across section boundaries (Tsai et al., 24 Feb 2026).
MelodyLM supplies a closely related hybrid design. It uses a three-stage pipeline in which a MIDI LLM produces melody, a vocal LLM renders singing conditioned on expanded MIDI and lyrics, and a latent diffusion model generates accompaniment conditioned on vocal acoustic units and text prompts. Its hybrid conditioning mechanism first fuses vocal acoustic features with noisy accompaniment latents channel-wise, then concatenates text features temporally. This demonstrates a concrete route by which frame-aligned symbolic MIDI can be brought into accompaniment diffusion, either by replacing generated MIDI with user-supplied MIDI or by adding accompaniment-side MIDI as an extra control stream (Li et al., 2024).
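The two-step hybrid conditioning can be illustrated at the level of tensor shapes. The sketch below rests on several assumptions that are not confirmed by the source: channel-wise fusion is read as concatenation on the channel axis, the text features are projected by a hypothetical linear layer so that widths match, the text tokens are placed before the frames, and all sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_LATENT, C_VOCAL = 200, 64, 32   # frames and channel sizes (assumed)
N_TEXT, D_TEXT = 16, 768             # text tokens and text width (assumed)

noisy_latent = rng.standard_normal((T, C_LATENT))
vocal_feats = rng.standard_normal((T, C_VOCAL))   # frame-aligned with the latent
text_feats = rng.standard_normal((N_TEXT, D_TEXT))

# Step 1: channel-wise fusion, read here as concatenation on the channel axis.
fused = np.concatenate([noisy_latent, vocal_feats], axis=1)   # (T, C_LATENT + C_VOCAL)

# Step 2: project text to the fused width (hypothetical linear layer), then
# concatenate along the time axis so text tokens precede the frame sequence.
w_text = rng.standard_normal((D_TEXT, fused.shape[1])) / np.sqrt(D_TEXT)
denoiser_input = np.concatenate([text_feats @ w_text, fused], axis=0)   # (N_TEXT + T, width)
```

Under this reading, a frame-aligned MIDI control such as a per-frame pitch sequence would enter at step 1 as additional channels, since it shares the latent's time axis.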
Taken together, these systems suggest two main MIDI-SAG implementation regimes. One is symbolic-first: generate or in-fill accompaniment in multi-track MIDI and then render it. The other is audio-first with symbolic control: keep the accompaniment generator in acoustic token or latent-audio space, but use vocal MIDI and derived harmony as explicit conditioning. Both regimes are directly instantiated in the cited work.
4. Alignment, intermittent vocals, and live interaction
A defining challenge for MIDI-SAG is that realistic songs contain intermittent vocals. The compositional MIDI-SAG pipeline addresses this by keeping rhythm and chord controls active even in non-vocal regions and by generating section by section with audio continuation. Its section-anchored slicing aligns windows to intro, verse, chorus, bridge, or outro boundaries rather than arbitrary 47-second chunks. For intros, it introduces backward continuation: when a target window begins with an intro, the reference audio is sometimes replaced with a later verse section during training, teaching the model to make intros that are stylistically and harmonically consistent with later material. At inference, verse is generated first, intro is generated using verse audio as backward reference, and later sections are generated conditioned on previously generated audio (Tsai et al., 24 Feb 2026).
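The inference order described above can be written as a small planning routine. This is a sketch under assumptions: the `generation_plan` helper is hypothetical, every section takes its immediate neighbour as reference audio, and a song without a verse simply falls back to front-to-back generation.

```python
from typing import List, Optional, Tuple


def generation_plan(sections: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Return (section_index, reference_index) pairs in generation order.

    reference_index points at an already generated section whose audio is used
    as the reference; it is None for the first section generated.
    """
    if not sections:
        return []
    # Anchor on the first verse; fall back to the first section if there is none.
    anchor = sections.index("verse") if "verse" in sections else 0
    plan: List[Tuple[int, Optional[int]]] = [(anchor, None)]
    for i in range(anchor - 1, -1, -1):
        plan.append((i, i + 1))   # backward continuation: reference is the later section
    for i in range(anchor + 1, len(sections)):
        plan.append((i, i - 1))   # forward continuation: reference is the earlier section
    return plan


plan = generation_plan(["intro", "verse", "chorus", "outro"])
```

For the four-section example, the verse is generated first, the intro second with the verse as its backward reference, and the chorus and outro follow in order.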
For audio-only accompaniment generation, SingSong uses a simpler sliding-window continuation policy for inputs longer than 10 seconds: the model first generates accompaniment for the first 10 seconds, then for each subsequent 5-second segment conditions on the corresponding vocal window and prompts the decoder with the last 5 seconds of generated accompaniment tokens. This preserves some continuity across windows, but it does not supply the explicit section structure or symbolic harmonic plan that MIDI-SAG later adds (Donahue et al., 2023).
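A sliding-window schedule of this kind can be sketched as a list of time spans. The exact vocal window used at each step is not stated here, so the sketch assumes it is the 10-second span covering the 5-second prompt plus the 5 seconds being generated; the `sliding_windows` helper and its tuple layout are illustrative.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float]


def sliding_windows(total_sec: float, window: float = 10.0,
                    hop: float = 5.0) -> List[Tuple[Span, Optional[Span], Span]]:
    """Return (vocal_span, prompt_span, new_span) for each generation step, in seconds."""
    first_end = min(window, total_sec)
    steps: List[Tuple[Span, Optional[Span], Span]] = [((0.0, first_end), None, (0.0, first_end))]
    t = window
    while t < total_sec:
        new_end = min(t + hop, total_sec)
        prompt_start = t - (window - hop)   # reuse the last (window - hop) seconds as prompt
        steps.append(((prompt_start, new_end), (prompt_start, t), (t, new_end)))
        t += hop
    return steps


schedule = sliding_windows(22.0)
```

For a 22-second input this yields one initial 10-second step followed by three continuation steps, the last of which covers only the remaining 2 seconds.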
Live accompaniment introduces a distinct problem: the system must track or anticipate performance timing. “Streaming Generation for Music Accompaniment” formalizes streaming accompaniment in terms of two parameters, future visibility and output chunk duration. The paper shows two trade-offs: increasing effective future visibility improves coherence by reducing the recency gap but requires faster inference to stay within the latency budget, and increasing the chunk duration improves throughput but degrades accompaniment because the update rate drops. It also reports that naive maximum-likelihood streaming training is insufficient when future context is unavailable, especially for the realistic negative-visibility regimes required for live jamming (Wu et al., 25 Oct 2025).
In symbolic real-time accompaniment, SongDriver addresses latency with a two-phase design. A Transformer arrangement phase reads a quantized monophonic melody every beat and caches chords for past beats; a CRF prediction phase uses the previous 8 beats of arranged chords, longest notes, and bar indices to predict the chord for the upcoming beat, which is immediately converted into multi-track texture and played. The paper calls this zero logical latency and argues that exposure bias is avoided because the CRF never conditions on its own previously predicted chords, only on the Transformer’s cached arrangement and recent melody (Wang et al., 2022).
Real-time singing accompaniment also depends on score following and tempo adaptation. ACCompanion uses a monophonic HMM-based score follower with a switching Kalman filter over beat period, then applies a Basis-Mixer model to generate expressive dynamics, timing, and articulation in accompaniment MIDI (Cancino-Chacón et al., 2017). HeurMiT replaces classical score following with neural latent template matching between a score context and a performance window, followed by heuristic smoothing, but extensive experiments conclude that its current form is not practical in real-world score-following scenarios, especially under tempo mismatch (Pillay, 8 Mar 2025). This suggests that live MIDI-SAG is likely to require hybrid systems in which symbolic planning, score following, and audio- or MIDI-derived performance tracking are jointly engineered rather than treated as independent modules.
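The tempo-adaptation idea can be illustrated with a plain scalar Kalman filter over the beat period. This is a deliberately simplified stand-in, not ACCompanion's switching model: it uses a single random-walk tempo model, and the noise variances, the initial period, and the `track_beat_period` helper are all assumptions.

```python
from typing import List


def track_beat_period(observed_periods: List[float], initial_period: float = 0.5,
                      process_var: float = 1e-4, obs_var: float = 1e-2) -> List[float]:
    """Filter noisy beat-period observations (seconds per beat) with a scalar Kalman filter."""
    period, var = initial_period, 1.0   # large initial variance: weak prior on tempo
    estimates: List[float] = []
    for z in observed_periods:
        var += process_var              # predict: tempo follows a random walk
        gain = var / (var + obs_var)    # Kalman gain
        period += gain * (z - period)   # update toward the observed period
        var *= 1.0 - gain
        estimates.append(period)
    return estimates


# A performer settling at 0.6 s per beat (100 BPM) while the prior assumed 0.5 s.
estimates = track_beat_period([0.6] * 50)
```

The ratio of `process_var` to `obs_var` controls how quickly the estimate follows tempo changes versus how much it smooths timing noise, which is the basic trade-off any accompaniment tempo tracker has to resolve.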
5. Evaluation protocols and reported performance
The cited literature evaluates accompaniment generation with several metric families. Symbolic or harmony-oriented studies use metrics such as pitch class histogram, tonal distance, chord inference correctness, note density, syncopation, CTnCTR, MCTD, WMCH, CS, and HS (Ren et al., 2020, Wang et al., 2022). Audio-generation systems emphasize FAD, NLL, CLAP similarity, Audiobox-Aesthetics, SongEval, and listening tests (Donahue et al., 2023, Hong et al., 2024, Trinh et al., 2024, Tsai et al., 24 Feb 2026). The compositional MIDI-SAG pipeline adds Accompaniment Prompt Adherence (APA), Rhythm F1, Key Accuracy, Chord F1, and PER. This suggests that MIDI-SAG is typically assessed along four axes: harmonic fit to melody, rhythmic alignment, prompt/style adherence, and perceptual quality (Tsai et al., 24 Feb 2026).
| Setting | Metric(s) | Reported result |
|---|---|---|
| PopMAG vs ground truth | Preference votes | 42% / 38% / 40% on LMD / FreeMidi / CPMD |
| SingSong-XL vs Retrieval | Pairwise preference | 66% |
| MIDI-SAG Experiment 1 (no GT MIDI at inference) | APA | 0.595 vs AnyAccomp 0.457 and FastSAG 0.000 |
| MIDI-SAG Experiment 2 | Rhythm F1 / Key Acc | Audio-SAG: 0.64 / 0.55; MIDI-SAG w/ GT MIDI: 0.91 / 0.93; MIDI-SAG w/o GT MIDI: 0.77 / 0.91 |
The most direct evidence for MIDI-SAG comes from the 2026 experiments. In 10-second accompaniment generation without ground-truth MIDI at inference, the MIDI-SAG system reports APA 0.595, compared with AnyAccomp at 0.457 and FastSAG at 0.000. In 47-second experiments using the same Stable Audio Open backbone, adding MIDI conditioning raises Rhythm F1 from 0.64 in audio-SAG to 0.91 with ground-truth MIDI, while Key Accuracy rises from 0.55 to 0.93; even automatically extracted MIDI retains 0.77 Rhythm F1 and 0.91 Key Accuracy (Tsai et al., 24 Feb 2026).
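The exact definition of Rhythm F1 is not given here. One plausible reading is a beat-level F-measure in which an estimated beat counts as correct when it falls within a tolerance window of a reference beat. The sketch below implements that reading only; the 70 ms tolerance is a common convention in beat-tracking evaluation rather than a value taken from the source, and `beat_f1` is a hypothetical helper.

```python
from typing import Sequence


def beat_f1(reference: Sequence[float], estimated: Sequence[float],
            tolerance: float = 0.07) -> float:
    """F-measure between two beat-time lists, using one-to-one matching within a tolerance."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    # Two-pointer sweep: each reference beat is matched to at most one estimate.
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1   # spurious estimated beat
        else:
            i += 1   # missed reference beat
    if hits == 0:
        return 0.0
    precision, recall = hits / len(est), hits / len(ref)
    return 2 * precision * recall / (precision + recall)


score = beat_f1([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.7, 2.03])
```

In the example, three of the four estimated beats fall within tolerance of a reference beat, so precision and recall are both 0.75.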
Ablation results in the same work localize the contribution of symbolic controls. Removing chord conditioning lowers both Chord F1 and Key Accuracy, while removing key has minimal effect because chords already encode key information; removing vocal pitch contour also heavily degrades chord and rhythm metrics. These observations support the claim that MIDI-derived rhythm and harmony are not merely auxiliary metadata but primary structural controls in the model (Tsai et al., 24 Feb 2026).
The full compositional pipeline evaluation extends the picture beyond accompaniment alone. It reports the best PER among the compared systems and a voice naturalness MOS stated to be the best score among open-source models. In subjective evaluation, the system trails Suno overall but surpasses or approaches several open-source baselines in lyrics adherence and voice naturalness (Tsai et al., 24 Feb 2026).
Outside direct MIDI-SAG, other results remain methodologically relevant. PopMAG’s multi-track symbolic generation outperforms other accompaniment models and wins nontrivial preference shares even against ground truth (Ren et al., 2020). SingSong significantly outperforms a retrieval baseline conditioned on key and tempo, with SingSong-XL preferred in 66% of trials (Donahue et al., 2023). SongDriver outperforms latency and bias baselines in melody–accompaniment harmoniousness, coherence of chord progression, and melody–harmonic synchronization in long-form real-time tests (Wang et al., 2022). While these results are not numerically comparable across datasets or modalities, they collectively indicate that explicit structure, whether symbolic or learned, is decisive for accompaniment quality.
6. Limitations, misconceptions, and research directions
Several limitations recur across the literature. The compositional MIDI-SAG pipeline still exhibits a gap between ground-truth and automatically extracted MIDI, most visibly in Rhythm F1 falling from 0.91 to 0.77, and it remains subject to error propagation across melody composition, singing voice synthesis, harmonization, and accompaniment generation. Its training data are mainly Mandarin pop, and the pipeline is described as supporting only Mandarin lyrics; richer arrangements and longer forms are identified as open areas (Tsai et al., 24 Feb 2026). These are limitations of current systems, not of symbolic conditioning in principle.
A common misconception is that vocal audio alone supplies sufficient structure for accompaniment. The cited audio-only work argues otherwise. SingSong reports that generated instrumentals often have strong percussive elements but weaker harmonic content, and notes the lack of user control over genre, style, rhythmic density, and instrumentation (Donahue et al., 2023). AnyAccomp goes further, identifying a train–test mismatch caused by source-separated-vocal conditioning and arguing that prior systems overfit to separation artifacts; it introduces a quantized melodic bottleneck precisely to decouple accompaniment generation from source-dependent artifacts (Zhang et al., 17 Sep 2025). MIDI-SAG can be understood as a stronger symbolic remedy to the same structural problem.
The opposite misconception is that MIDI alone solves accompaniment generation. The reported no-ground-truth-MIDI gap already shows that extracted symbolic melody is imperfect (Tsai et al., 24 Feb 2026). More broadly, symbolic MIDI does not encode timbre, studio production, or all vocal expressivity. ACCompanion explicitly separates score following, tempo tracking, and expressive performance generation, underscoring that accompaniment quality depends not only on harmonic correctness but also on timing, articulation, and dynamics (Cancino-Chacón et al., 2017). A plausible implication is that strong MIDI-SAG systems will remain hybrid rather than purely symbolic: MIDI provides harmonic and rhythmic scaffolding, while audio features preserve phrasing and expression.
Real-time MIDI-SAG remains particularly unresolved. Streaming accompaniment studies show that pure MLE training collapses in realistic negative-visibility regimes, which motivates anticipatory and agentic objectives (Wu et al., 25 Oct 2025). HeurMiT shows that lightweight neural score following can be computationally efficient but still fail under real-world tempo variability (Pillay, 8 Mar 2025). SongDriver demonstrates that symbolic two-phase planning can eliminate logical latency in grid-based accompaniment (Wang et al., 2022), but its formulation assumes discrete-time, fixed-meter, quantized melody streams. Extending that logic to live singing will require robust beat tracking, phrase-aware quantization, or score-following layers that tolerate rubato and ornamentation.
Future directions are explicit in several papers. PopMAG points toward richer conditional symbolic generation with longer memory and unified multi-track modeling (Ren et al., 2020). Calliope indicates practical workflows around bar in-filling, batch generation, ranking, and DAW integration, all of which transfer naturally to a fixed-vocal-track accompaniment setting (Tchemeube et al., 18 Apr 2025). SingSong calls for better input featurization, multi-source accompaniment, structured control, and higher-fidelity codecs (Donahue et al., 2023). The streaming work calls for anticipatory and agentic objectives for live jamming (Wu et al., 25 Oct 2025). Taken together, the literature suggests that the likely trajectory of MIDI-SAG is toward multimodal systems in which vocal audio, vocal MIDI, chord plans, structure tags, and user prompts coexist as complementary controls rather than competing representations.