Anticipatory Music Transformer
- The paper introduces novel anticipatory mechanisms for symbolic music generation by interleaving control tokens with musical events.
- It employs flexible tokenization and multidimensional relative attention to robustly handle infilling and asynchronous conditioning.
- Experimental results show improvements in event-level perplexity and human evaluation, highlighting enhanced controllability and fidelity.
The Anticipatory Music Transformer is a family of controllable autoregressive models for symbolic music generation, addressing the challenge of asynchronous conditioning and infilling within temporal point processes. These architectures, initially formalized by Thickstun et al. (2023) and extended in subsequent large-scale foundation models such as Moonbeam (Guo et al., 21 May 2025), enable conditional sequence generation given partially specified musical structure.
1. Formal Model and Tokenization
A symbolic music piece is modeled as a marked temporal point process $e_{1:N}$ with events $e_i = (t_i, m_i)$, where $t_i$ is the onset time and the mark $m_i$ contains duration and note identity. The canonical tokenization employs 10 ms quantization for times and durations, packing each event into three tokens: quantized onset $t_i$, duration $d_i$, and note/instrument code $n_i$, resulting in a vocabulary of approximately 27,512 types for event tokens in the original formulation.
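The three-token packing can be sketched as follows. This is a minimal illustration, not the authors' code: the vocabulary layout (10,000 time bins, 1,000 duration bins, 129 instruments times 128 pitches), the offsets, and the name `tokenize_event` are all assumptions, chosen so that the total comes to the 27,512 event token types quoted above.

```python
# Hypothetical vocabulary layout (assumed; chosen to total 27,512 types).
TIME_RESOLUTION = 100      # bins per second, i.e. 10 ms quantization
MAX_TIME_BINS = 10_000     # assumed: onsets up to 100 s
MAX_DUR_BINS = 1_000       # assumed: durations up to 10 s
NUM_PITCHES = 128          # MIDI pitch range
NUM_INSTRUMENTS = 129      # assumed: 128 MIDI programs plus drums

TIME_OFFSET = 0
DUR_OFFSET = TIME_OFFSET + MAX_TIME_BINS
NOTE_OFFSET = DUR_OFFSET + MAX_DUR_BINS
VOCAB_SIZE = NOTE_OFFSET + NUM_INSTRUMENTS * NUM_PITCHES


def tokenize_event(onset_s, duration_s, instrument, pitch):
    """Pack one note event into (time, duration, note) token ids."""
    # Quantize to 10 ms bins and clip to the assumed maximum ranges.
    time_bin = min(round(onset_s * TIME_RESOLUTION), MAX_TIME_BINS - 1)
    dur_bin = min(round(duration_s * TIME_RESOLUTION), MAX_DUR_BINS - 1)
    # Instrument and pitch share a single joint code.
    note_code = instrument * NUM_PITCHES + pitch
    return (TIME_OFFSET + time_bin, DUR_OFFSET + dur_bin, NOTE_OFFSET + note_code)
```

Giving each token type its own id range keeps the three components distinguishable by value alone, which is also how a separate range could mark control tokens.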
Anticipatory models introduce a control process $u_{1:K}$ (e.g., a fixed melody or chords) and generate a single interleaved token sequence $a_{1:N+K}$ by inserting control tokens following stopping times defined with respect to the event times and an anticipation interval $\delta$. This asynchronous interleaving provides the transformer prefix with control information strictly up to a bounded lookahead, facilitating "anticipation" of upcoming constraints while maintaining tractable causal sampling. Event and control tokens are differentiated through dedicated token ranges, and a global control code encodes the operational mode.
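The interleaving can be illustrated with a short sketch. It assumes one particular insertion rule, namely that a control at time s is emitted immediately after the first event whose onset is at least s minus the anticipation interval (written `delta` in the code). The function name `interleave` and the tuple layout are hypothetical.

```python
def interleave(events, controls, delta):
    """Merge time-sorted events and controls into one sequence.

    events, controls: lists of (time_seconds, label), sorted by time.
    Assumed rule: a control at time s is emitted right after the first
    event with onset t >= s - delta.
    """
    merged = []
    k = 0  # index of the next control that has not been emitted yet
    for t, label in events:
        merged.append(("event", t, label))
        # Emit every control that lies within delta seconds of this event.
        while k < len(controls) and controls[k][0] - delta <= t:
            merged.append(("control", controls[k][0], controls[k][1]))
            k += 1
    # Controls that no event ever reached are appended at the end.
    merged.extend(("control", s, label) for s, label in controls[k:])
    return merged


events = [(0.0, "C4"), (2.0, "E4"), (4.0, "G4"), (8.0, "C5")]
controls = [(7.0, "chord:F")]
sequence = interleave(events, controls, delta=5.0)
```

In this example the control at 7.0 s appears right after the event at 2.0 s, so the events at 4.0 s and 8.0 s are generated with the upcoming chord already in the prefix.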
Moonbeam (Guo et al., 21 May 2025) generalizes tokenization with domain-informed compound events (onset, duration, octave, pitch, instrument, velocity), using continuous Fundamental Music Embedding (FME) for all but the instrument attribute and storing absolute onsets to enable infilling alignment.
2. Training Objectives and Data Pipeline
The learning objective is to maximize the likelihood of interleaved sequences, encompassing both event and control tokens:
$$\max_{\theta} \sum_{i} \log p_{\theta}\left(a_i \mid a_{i-M}, \ldots, a_{i-1}\right)$$
where $a_i$ denotes the $i$-th element of the interleaved sequence and $M$ denotes the context window. Data originates from large MIDI corpora (Lakh MIDI: 8,940 hours/164k files (Thickstun et al., 2023), Moonbeam: 81.6K hours/18 billion tokens (Guo et al., 21 May 2025)), processed to filter out extreme-length pieces and sparse instrumentations and to ensure at least one event per second via REST token insertion.
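The REST insertion step can be sketched as below. This is a simplified illustration under assumptions: rests are placed at fixed `max_gap` intervals inside any silent stretch longer than `max_gap`, and the name `insert_rests` and the `(time, kind)` layout are hypothetical rather than taken from the cited pipelines.

```python
def insert_rests(onsets, max_gap=1.0):
    """Return (time, kind) pairs with REST markers added so that
    consecutive entries are never more than max_gap seconds apart.

    onsets: note onset times in seconds, sorted ascending.
    """
    out = []
    prev = 0.0  # time of the most recent entry (sequence starts at 0)
    for t in onsets:
        # Fill any long silence with evenly spaced REST markers.
        while t - prev > max_gap:
            prev += max_gap
            out.append((prev, "REST"))
        out.append((t, "NOTE"))
        prev = t
    return out
```

For onsets at 0.5 s and 3.0 s this yields rests at 1.5 s and 2.5 s, so the model never sees more than one second without a token that advances time.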
Infilling and anticipation are taught via data augmentation: designating random or semantically meaningful subsets of events as controls under span-based, instrument-based, or random masking regimes, with substantial augmentation to promote control invariance. Moonbeam includes metadata attributes in conditions.
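A minimal sketch of the three masking regimes follows. The event layout `(onset_seconds, instrument, pitch)`, the function name `split_events`, and the convention that selected events become controls are assumptions made for illustration; the cited papers may define the regimes differently.

```python
import random


def split_events(events, mode, rng, span=(0.0, 0.0), instrument=0, rate=0.1):
    """Partition events into (kept_events, controls).

    events: list of (onset_seconds, instrument, pitch) tuples.
    mode: "span" selects events with onset in [span[0], span[1]),
          "instrument" selects all events of one instrument,
          "random" selects each event independently with probability rate.
    Selected events become controls (assumed convention).
    """
    if mode == "span":
        is_control = lambda e: span[0] <= e[0] < span[1]
    elif mode == "instrument":
        is_control = lambda e: e[1] == instrument
    elif mode == "random":
        is_control = lambda e: rng.random() < rate
    else:
        raise ValueError(f"unknown mode: {mode}")

    kept, controls = [], []
    for e in events:
        (controls if is_control(e) else kept).append(e)
    return kept, controls
```

Applying a different regime to each training example exposes the model to many control patterns over the same underlying music, which is one way to encourage the control invariance described above.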
Optimization employs AdamW with standard schedules, dropout, and weight decay. Model capacities range from 128M to 780M parameters (12–36 layers) in Thickstun et al., and up to 839M for Moonbeam (S=309M, M=839M).
3. Anticipatory Sampling and Attention Mechanisms
At inference, users specify a subset of fixed control events and activate anticipatory mode. The sampling procedure interleaves controls according to the anticipation interval $\delta$: after observing each event, the model "looks ahead" by $\delta$ to insert any controls becoming visible, while the remaining tokens are sampled autoregressively with nucleus sampling. Non-beam methods suffice due to the explicit locality bias.
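The nucleus (top-p) sampling step can be sketched generically as follows. This shows the standard technique, not the authors' sampler: the function names are hypothetical, and the `top_p` value in the usage line is illustrative, not the setting used in the cited work.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose total
    mass reaches top_p, then renormalize. Returns {token_id: prob}."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}


def sample_nucleus(probs, top_p, rng):
    """Draw one token id from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Illustrative call: the low-probability tail (tokens 2 and 3) is cut.
token = sample_nucleus([0.5, 0.25, 0.125, 0.125], top_p=0.7, rng=random.Random(0))
```

In an anticipatory sampler this draw would run once per generated token, with control tokens inserted deterministically between draws rather than sampled.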
Moonbeam's architecture introduces Multidimensional Relative Attention (MRA), extending rotary position encoding (RoPE) along five intrinsic event dimensions (onset, duration, octave, pitch, velocity). Attention heads are partitioned by axis; within each head group, the queries and keys are rotated by the scalar position associated with the corresponding attribute, using fixed sinusoidal encodings. This makes attention sensitive to multidimensional relative musical relationships without increasing trainable parameters.
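The per-axis rotation can be illustrated with a small sketch. It assumes the standard RoPE formulation (pairs of coordinates rotated by position times a fixed frequency, base 10000) and a single head per axis. The names `rotate` and `mra_scores` are hypothetical, and this is not Moonbeam's actual implementation.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary encoding to an even-length vector: the j-th pair
    (x, y) is rotated by the angle position * base**(-2j/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):  # i = 2j indexes the start of each pair
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def mra_scores(q_heads, k_heads, q_pos, k_pos):
    """Per-axis attention logits. Each axis (e.g. "onset", "pitch") has
    its own query/key head, rotated by that axis's scalar position."""
    scores = {}
    for axis, q in q_heads.items():
        rq = rotate(q, q_pos[axis])
        rk = rotate(k_heads[axis], k_pos[axis])
        scores[axis] = sum(a * b for a, b in zip(rq, rk))
    return scores
```

Because both vectors are rotated, each logit depends only on the difference between the query and key positions along that axis, and the rotation itself has no learned weights.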
During infilling, the attention mask is constructed such that target positions may attend to all control positions (regardless of temporal ordering), while the target segment enforces autoregressive masking internally; controls are fully visible to one another but cannot attend forward into the targets. This enables non-causal information flow from controls to generated segments and supports robust unconstrained infilling.
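The masking rule described above can be written out directly. The sketch below assumes each sequence position is labeled as a control (`"C"`) or a target (`"T"`); the function name `infilling_mask` and the boolean-matrix representation are illustrative choices.

```python
def infilling_mask(roles):
    """Build an attention mask from per-position roles.

    roles: list of "C" (control) or "T" (target) in sequence order.
    Returns mask where mask[i][j] is True when position i may attend
    to position j.
    """
    n = len(roles)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if roles[j] == "C":
                # Controls are visible to every position, whatever the order.
                mask[i][j] = True
            elif roles[i] == "T":
                # Targets see themselves and earlier targets only.
                mask[i][j] = j <= i
            # Otherwise i is a control and j a target: stays False.
    return mask
```

For roles `["C", "T", "T", "C"]`, the first target can see the trailing control even though it comes later in the sequence, which is the non-causal flow from controls to targets.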
4. Experimental Evaluation and Baselines
Automatic evaluation comprises event-level perplexity (global and per-component), bits-per-second (bps) cross-entropy, and downstream metrics such as velocity/pitch accuracy and timing fit. In Thickstun et al. (2023), anticipatory arrival-time encoding achieves event-level perplexities as low as 10.4 (medium model), improving over interarrival-time encoding by 55 bps.
Key benchmark table excerpt:
| Model / Steps | bps | ppl(event) | ppl(time) | ppl(dur) | ppl(note) |
|---|---|---|---|---|---|
| Arrival–Medium (800k) | 69.7 | 10.4 | 1.49 | 3.29 | 2.12 |
| FIGARO Transformer | — | — | — | — | — |
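Because each event is three tokens, the event-level perplexity factorizes into the product of the per-component perplexities, provided each component perplexity is averaged per event. The sketch below checks this against the Arrival–Medium row; the function names are illustrative.

```python
import math


def event_perplexity(component_ppls):
    """Event-level perplexity as the product of component perplexities."""
    return math.prod(component_ppls)


def perplexity_from_logprobs(logprobs_per_event):
    """Event-level perplexity from natural-log token probabilities.

    logprobs_per_event: list with one inner list per event, holding
    the log-probability of each of that event's tokens.
    """
    total = sum(sum(event) for event in logprobs_per_event)
    return math.exp(-total / len(logprobs_per_event))


# ppl(time) * ppl(dur) * ppl(note) from the Arrival-Medium row.
arrival_medium = event_perplexity([1.49, 3.29, 2.12])  # about 10.39
```

The product rounds to the reported 10.4, so the per-component columns show how the model's uncertainty is split across timing, duration, and note choice.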
Human evaluation utilizes pairwise listening tests for prompt continuation and accompaniment. Anticipatory models exhibit comparable musicality to human performance in 20 s clips and significantly outperform non-anticipatory or random-retrieval models in accompaniment tasks (e.g., in the accompaniment test: wins=18, ties=31, losses=11 vs. human).
Moonbeam benchmarks include music infilling on the CoMMU dataset, reporting objective accuracy gains for pitch and velocity and higher human ratings for fit and enjoyment.
5. Strengths, Limitations, and Theoretical Insights
Anticipatory Music Transformers achieve high controllability: arbitrary subsets of musical events, segments, or tracks can be specified as fixed, and the model completes the rest in a way consistent with fixed events and temporal dependencies. This yields fill-in-the-middle and flexible accompaniment abilities without sacrificing perplexity relative to standard autoregressive models. The locality bias induced by bounded anticipation windows regularizes attention and sampling scope.
Limitations arise in sparsely controlled regions, which may delay control satisfaction unless mitigated (e.g., REST tokens ensure minimum density). The anticipation horizon $\delta$ is user-tunable: a small $\delta$ reduces lookahead, while a large $\delta$ interpolates toward sequence-to-sequence behavior. Pretraining bias exists due to exclusive training on 12-tone MIDI, limiting support for microtones or non-Western music.
Moonbeam's multidimensional tokenization and attention further generalize the architecture, enabling richer relative encoding and scaling to larger and more diverse MIDI corpora while remaining computationally tractable via parameter-efficient inductive biases.
6. Directions for Extension
Immediate applications include real-time accompaniment (dynamic streaming of incoming control events), integration with DAWs for interactive conditional editing or selective infilling, and cross-modal extensions (conditioning on lyrics or text tokens). Scaling toward larger and more diverse symbolic datasets and models promises broader stylistic coverage.
A plausible implication is that anticipatory and asynchronous conditioning, as realized via explicit interleaving or advanced attention masking, will underpin future symbolic music generation frameworks, bridging melody, harmony, and structure in a controllable and sample-efficient manner (Thickstun et al., 2023, Guo et al., 21 May 2025).