
Anticipatory Music Transformer

Updated 24 April 2026
  • The paper introduces novel anticipatory mechanisms for symbolic music generation by interleaving control tokens with musical events.
  • It employs flexible tokenization and multidimensional relative attention to robustly handle infilling and asynchronous conditioning.
  • Experimental results show improvements in event-level perplexity and human evaluation, highlighting enhanced controllability and fidelity.

The Anticipatory Music Transformer is a family of controllable autoregressive models for symbolic music generation that addresses asynchronous conditioning and infilling within temporal point processes. These architectures, first formalized by Thickstun et al. (2023) and extended in subsequent large-scale foundation models such as Moonbeam (Guo et al., 21 May 2025), enable conditional sequence generation given partially specified musical structure.

1. Formal Model and Tokenization

A symbolic music piece is modeled as a marked temporal point process $\mathcal{E} = \{(t_i, m_i)\}_{i=1}^N$, with $t_i$ representing onset times and $m_i = (\Delta_i, \nu_i)$ containing duration and note identity. The canonical tokenization employs 10 ms quantization for times and durations, packing each event into three tokens: quantized onset $\tilde t_i$, duration $\tilde\Delta_i$, and note/instrument code $\nu_i$, resulting in a vocabulary of approximately 27,512 types for event tokens in the original formulation.
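The three-token packing can be sketched as follows. This is a minimal illustration, not the authors' code: the 100 s onset window, the 10 s duration cap, the 129 x 128 instrument/pitch grid, and the offset layout are all assumptions, chosen only because they add up to the 27,512 figure quoted above.

```python
from typing import List, Tuple

TICKS_PER_SECOND = 100                    # 10 ms quantization (from the text)
MAX_TIME_TICKS = 100 * TICKS_PER_SECOND   # assumption: onsets span 100 s
MAX_DUR_TICKS = 10 * TICKS_PER_SECOND     # assumption: durations span 10 s
NUM_INSTRUMENTS, NUM_PITCHES = 129, 128   # assumption: 128 programs + drums

# Hypothetical layout: each token type gets its own contiguous id range.
TIME_OFFSET = 0
DUR_OFFSET = TIME_OFFSET + MAX_TIME_TICKS
NOTE_OFFSET = DUR_OFFSET + MAX_DUR_TICKS
VOCAB_SIZE = NOTE_OFFSET + NUM_INSTRUMENTS * NUM_PITCHES

Note = Tuple[float, float, int, int]  # (onset_s, duration_s, instrument, pitch)


def quantize(seconds: float) -> int:
    """Round a time in seconds to the nearest 10 ms tick."""
    return round(seconds * TICKS_PER_SECOND)


def tokenize(notes: List[Note]) -> List[int]:
    """Flatten notes into (onset, duration, note) token triples, sorted by onset."""
    tokens: List[int] = []
    for onset, duration, instrument, pitch in sorted(notes):
        t = quantize(onset)
        if not 0 <= t < MAX_TIME_TICKS:
            raise ValueError("onset outside the assumed 100 s window")
        d = min(quantize(duration), MAX_DUR_TICKS - 1)  # clamp very long notes
        tokens += [TIME_OFFSET + t,
                   DUR_OFFSET + d,
                   NOTE_OFFSET + instrument * NUM_PITCHES + pitch]
    return tokens
```

Under these assumed constants, a middle C (pitch 60) on instrument 0 starting at 0.5 s and lasting 0.25 s becomes the triple `[50, 10025, 11060]`.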

Anticipatory models introduce a control process $u_{1:K} \subseteq \mathcal{E}$ (e.g., fixed melody or chords) and generate a single interleaved token sequence $a_{1:N+K}$ by inserting control tokens following stopping times with respect to event times and an anticipation interval $\delta$. This asynchronous interleaving provides the transformer prefix with control information strictly up to a bounded lookahead, facilitating "anticipation" of upcoming constraints while maintaining tractable causal sampling. Event and control tokens are differentiated through dedicated token ranges, and a domain-global control code ($z \in \{\mathrm{AR}, \mathrm{AAR}\}$) encodes the operational mode.
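One way to picture the interleaving is the sketch below. It assumes a specific insertion rule that the text only describes loosely: a control at time $s$ is emitted immediately after the first event whose onset $t$ satisfies $t \ge s - \delta$, and any controls never reached are appended at the end. The function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Timed = Tuple[float, str]            # (time in seconds, illustrative label)
Tagged = Tuple[str, float, str]      # ("event" | "control", time, label)


def interleave(events: List[Timed], controls: List[Timed], delta: float) -> List[Tagged]:
    """Merge time-sorted events and controls into one sequence.

    Assumed rule: a control at time s follows the first event with
    onset t >= s - delta, so the prefix never sees controls more than
    delta seconds ahead of the latest event.
    """
    out: List[Tagged] = []
    k = 0  # index of the next control still waiting to be inserted
    for t, label in events:
        out.append(("event", t, label))
        while k < len(controls) and controls[k][0] - delta <= t:
            out.append(("control", controls[k][0], controls[k][1]))
            k += 1
    # Controls beyond the reach of the last event go at the end.
    out.extend(("control", s, lab) for s, lab in controls[k:])
    return out


events = [(0.0, "e0"), (2.0, "e1"), (4.0, "e2"), (9.0, "e3")]
controls = [(6.0, "c0"), (7.0, "c1")]
early = [label for _, _, label in interleave(events, controls, delta=5.0)]
late = [label for _, _, label in interleave(events, controls, delta=0.0)]
```

With `delta=5.0` both controls appear right after `e1`, several seconds before they sound; with `delta=0.0` they appear only after `e3`, once their own onsets have passed.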

Moonbeam (Guo et al., 21 May 2025) generalizes tokenization with domain-informed compound events that carry six attributes (onset, duration, octave, pitch, instrument, velocity), using continuous Fundamental Music Embedding (FME) for all but the instrument attribute and storing absolute onsets to enable infilling alignment.

2. Training Objectives and Data Pipeline

The learning objective is to maximize the likelihood of interleaved sequences, encompassing both event and control tokens:

$$\max_\theta \sum_{j=1}^{N+K} \log p_\theta\left(a_j \mid a_{j-M:j-1}\right),$$

where $M$ denotes the context window. Training data comes from large MIDI corpora (Lakh MIDI: 8,940 hours across 164k files (Thickstun et al., 2023); Moonbeam: 81.6K hours and 18 billion tokens (Guo et al., 21 May 2025)). The corpora are filtered to remove pieces of extreme length or with sparse instrumentation, and REST tokens are inserted to ensure at least one event per second.
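The REST-insertion step can be sketched as below. The exact rule is an assumption: here a REST is emitted every second inside any gap longer than one second, with times expressed in 10 ms ticks. `insert_rests` and the `"NOTE"`/`"REST"` labels are hypothetical names.

```python
from typing import List, Tuple

REST = "REST"
NOTE = "NOTE"


def insert_rests(onsets: List[int], max_gap: int = 100) -> List[Tuple[int, str]]:
    """Pad a sorted list of onset ticks so no gap exceeds max_gap ticks.

    Assumption: time starts at tick 0 and a REST is placed every
    max_gap ticks (one second at 10 ms resolution) inside long gaps.
    """
    out: List[Tuple[int, str]] = []
    prev = 0
    for t in onsets:
        while t - prev > max_gap:
            prev += max_gap
            out.append((prev, REST))
        out.append((t, NOTE))
        prev = t
    return out


padded = insert_rests([50, 320])
```

For onsets at 0.5 s and 3.2 s this yields RESTs at 1.5 s and 2.5 s, so consecutive tokens are never more than one second apart.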

Infilling and anticipation are taught via data augmentation: designating random or semantically meaningful subsets of events as controls under span-based, instrument-based, or random masking regimes, with substantial augmentation to promote control invariance. Moonbeam includes metadata attributes in conditions.
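A minimal sketch of the three masking regimes follows. It assumes that the selected subset becomes the control set and the remainder stays as events to be generated; the function name, argument names, and the 0.5 default rate are illustrative, not taken from either paper.

```python
import random
from typing import List, Tuple

Note = Tuple[float, int, int]  # (onset_s, instrument, pitch)


def split_controls(notes: List[Note], mode: str, rng: random.Random,
                   span: Tuple[float, float] = (0.0, 0.0),
                   instrument: int = 0,
                   rate: float = 0.5) -> Tuple[List[Note], List[Note]]:
    """Partition notes into (events, controls) under one masking regime."""
    if mode == "span":            # everything inside a time window
        start, end = span
        chosen = [start <= n[0] < end for n in notes]
    elif mode == "instrument":    # every note of one instrument
        chosen = [n[1] == instrument for n in notes]
    elif mode == "random":        # independent coin flip per note
        chosen = [rng.random() < rate for _ in notes]
    else:
        raise ValueError(f"unknown mode: {mode}")
    controls = [n for n, c in zip(notes, chosen) if c]
    events = [n for n, c in zip(notes, chosen) if not c]
    return events, controls


notes = [(0.0, 0, 60), (1.0, 1, 40), (2.0, 0, 62), (3.0, 1, 43)]
rng = random.Random(0)
_, span_controls = split_controls(notes, "span", rng, span=(1.0, 3.0))
_, inst_controls = split_controls(notes, "instrument", rng, instrument=1)
rand_events, rand_controls = split_controls(notes, "random", rng)
```

Whatever the regime, events and controls always form a partition of the original notes, which is what lets the same likelihood objective cover every conditioning pattern.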

Optimization employs AdamW with standard schedules, dropout, and weight decay. Model capacities range from 128M to 780M parameters (12–36 layers) in Thickstun et al., and up to 839M for Moonbeam (S=309M, M=839M).

3. Anticipatory Sampling and Attention Mechanisms

At inference, users specify a subset of fixed control events and activate anticipatory mode ($z = \mathrm{AAR}$). The sampling procedure interleaves controls according to the anticipation interval $\delta$: after observing each event, the model "looks ahead" by $\delta$ to insert any controls that have become visible, while the remaining tokens are sampled autoregressively with nucleus sampling. Beam search is not required, owing to the explicit locality bias.
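For readers unfamiliar with nucleus (top-p) sampling, the filtering step is sketched below in plain Python. The `top_p` value in the example is purely illustrative and is not the setting used in either paper; `nucleus_filter` and `sample_nucleus` are hypothetical helper names.

```python
import random
from typing import Dict, List


def nucleus_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    then renormalize the kept probabilities to sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    total = 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return {i: probs[i] / total for i in kept}


def sample_nucleus(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Draw one token id from the truncated, renormalized distribution."""
    filtered = nucleus_filter(probs, top_p)
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
kept = nucleus_filter(probs, top_p=0.9)   # illustrative threshold
token = sample_nucleus(probs, 0.9, random.Random(0))
```

In an anticipatory sampler this draw would be applied only to positions the model generates; control tokens are copied into the sequence rather than sampled.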

Moonbeam's architecture introduces Multidimensional Relative Attention (MRA), extending rotary position encoding (RoPE) along five intrinsic event dimensions (onset, duration, octave, pitch, velocity). Attention heads are partitioned by axis; within each head group, the query and key vectors are rotated by the scalar position associated with the corresponding attribute, using fixed sinusoidal encodings. This makes attention sensitive to multidimensional relative musical relationships without increasing the number of trainable parameters.
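The per-axis rotation idea can be illustrated with a small pure-Python sketch. It uses the standard RoPE rotation with one head per axis; the base of 10000, the head size, and the function names are assumptions, and Moonbeam's actual frequency schedule and head grouping may differ.

```python
import math
from typing import Dict, List

AXES = ("onset", "duration", "octave", "pitch", "velocity")


def rotate(vec: List[float], position: float, base: float = 10000.0) -> List[float]:
    """Standard rotary embedding: rotate consecutive pairs of an
    even-length vector by angles proportional to a scalar position."""
    d = len(vec)
    out: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def mra_logits(q_heads: List[List[float]], k_heads: List[List[float]],
               q_event: Dict[str, float], k_event: Dict[str, float]) -> List[float]:
    """One attention logit per head; head g is rotated by attribute AXES[g]."""
    logits = []
    for axis, q, k in zip(AXES, q_heads, k_heads):
        q_rot = rotate(q, q_event[axis])
        k_rot = rotate(k, k_event[axis])
        logits.append(sum(a * b for a, b in zip(q_rot, k_rot)))
    return logits


q_heads = [[0.1 * (h + 1), -0.2, 0.3, 0.4] for h in range(5)]
k_heads = [[0.5, 0.1 * (h + 1), -0.3, 0.2] for h in range(5)]
ev_a = {"onset": 12.0, "duration": 0.5, "octave": 4, "pitch": 7, "velocity": 80}
ev_b = {"onset": 10.0, "duration": 1.0, "octave": 3, "pitch": 2, "velocity": 64}
shift = lambda ev: {name: value + 3 for name, value in ev.items()}
base_logits = mra_logits(q_heads, k_heads, ev_a, ev_b)
shifted_logits = mra_logits(q_heads, k_heads, shift(ev_a), shift(ev_b))
```

Shifting both events by the same amount on every axis leaves each logit unchanged up to floating-point error, which is the relative-position property the paragraph describes; the rotation itself has no trainable parameters.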

During infilling, the attention mask is constructed such that target events may attend to all control events (regardless of temporal ordering), while the target segment enforces autoregressive masking internally; controls are fully visible to one another but cannot attend forward into the targets. This enables non-causal information flow from controls to generated segments and supports robust unconstrained infilling.
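The masking rules above translate directly into a boolean matrix. The sketch below assumes a per-position flag marking controls and uses sequence order for the causal constraint among targets; `infilling_mask` is a hypothetical name.

```python
from typing import List


def infilling_mask(is_control: List[bool]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Rules taken from the text:
      * every position (control or target) sees all controls;
      * targets see earlier targets and themselves (causal);
      * controls never see targets.
    """
    n = len(is_control)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if is_control[j]:
                mask[i][j] = True
            elif not is_control[i]:
                mask[i][j] = j <= i
    return mask


mask = infilling_mask([True, False, False, True])
```

In this four-position example the first target (index 1) sees both controls, including the one at index 3 that comes later in the sequence, but not the later target at index 2.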

4. Experimental Evaluation and Baselines

Automatic evaluation comprises event-level perplexity (global and per-component), bits-per-second cross-entropy, and downstream metrics such as velocity/pitch accuracy and timing fit. In Thickstun et al. (2023), anticipatory arrival-time encoding achieves event-level perplexities as low as 10.4 (medium model), improving over interarrival-time encoding by approximately 5 bps.

Key benchmark table excerpt:

| Model / Steps | bps | ppl(event) | ppl(time) | ppl(dur) | ppl(note) |
|---|---|---|---|---|---|
| Arrival–Medium (800k) | 69.7 | 10.4 | 1.49 | 3.29 | 2.12 |
| FIGARO Transformer | — | — | — | — | — |
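The per-component columns relate to the event-level column through the chain rule: an event's log-loss is the sum of its time, duration, and note log-losses, so the event perplexity is the product of the three component perplexities. The short check below applies this to the Arrival–Medium row; the function names are illustrative.

```python
import math


def event_perplexity(ppl_time: float, ppl_dur: float, ppl_note: float) -> float:
    """Event perplexity as the product of its three token perplexities."""
    return ppl_time * ppl_dur * ppl_note


def bits_per_event(ppl_event: float) -> float:
    """Convert a perplexity into an average code length in bits."""
    return math.log2(ppl_event)


ppl = event_perplexity(1.49, 3.29, 2.12)   # values from the table row
bits = bits_per_event(ppl)
```

The product comes to about 10.39, which rounds to the reported 10.4, or a little under 3.4 bits per event.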

Human evaluation uses pairwise listening tests for prompt continuation and accompaniment. Anticipatory models exhibit musicality comparable to human performance in 20 s clips and significantly outperform non-anticipatory or random-retrieval models in accompaniment tasks. In the accompaniment test against human compositions, for example, the anticipatory model recorded 18 wins, 31 ties, and 11 losses.

Moonbeam benchmarks include music infilling on the CoMMU dataset, reporting objective accuracy gains for pitch and velocity and higher human ratings for fit and enjoyment.

5. Strengths, Limitations, and Theoretical Insights

Anticipatory Music Transformers achieve high controllability: arbitrary subsets of musical events, segments, or tracks can be specified as fixed, and the model completes the rest in a way consistent with fixed events and temporal dependencies. This yields fill-in-the-middle and flexible accompaniment abilities without sacrificing perplexity relative to standard autoregressive models. The locality bias induced by bounded anticipation windows regularizes attention and sampling scope.

Limitations arise in sparsely controlled regions, where control satisfaction may be delayed unless mitigated (e.g., REST tokens ensure a minimum density). The anticipation horizon $\delta$ is user-tunable: a small $\delta$ reduces lookahead, while a large $\delta$ interpolates toward sequence-to-sequence behavior. Pretraining bias exists due to exclusive training on 12-tone MIDI, limiting support for microtones or non-Western music.

Moonbeam's multidimensional tokenization and attention further generalize the architecture, enabling richer relative encoding and scaling to larger and more diverse MIDI corpora while remaining computationally tractable via parameter-efficient inductive biases.

6. Directions for Extension

Immediate applications include real-time accompaniment (dynamic streaming of incoming control events), integration with DAWs for interactive conditional editing or selective infilling, and cross-modal extensions (conditioning on lyrics or text tokens). Scaling toward larger and more diverse symbolic datasets and models promises broader stylistic coverage.

A plausible implication is that anticipatory and asynchronous conditioning, as realized via explicit interleaving or advanced attention masking, will underpin future symbolic music generation frameworks, bridging melody, harmony, and structure in a controllable and sample-efficient manner (Thickstun et al., 2023, Guo et al., 21 May 2025).

References (2)
