Anticipatory Music Transformer (AMT)
- Anticipatory Music Transformer (AMT) is an autoregressive model that interleaves event and control tokens, enabling controllable symbolic music generation.
- It modifies a standard GPT-2 architecture with arrival-time tokenization and vocabulary doubling to support asynchronous control through anticipation.
- Empirical results demonstrate that AMT maintains tractable causal generation with improved musical infilling and accompaniment quality.
The Anticipatory Music Transformer (AMT) is a class of autoregressive generative models for symbolic music that generalize the standard Transformer architecture to permit asynchronous control via a formally defined mechanism called "anticipation." This framework enables tractable, controllable generation of event sequences (such as musical notes) conditioned on arbitrary subsets of future events, a critical capability for musical infilling and accompaniment. AMT interleaves event and control tokens in a single causal stream using a precise stopping-time criterion, thus preserving both model tractability and causal generation. This approach is implemented via minor data-level modifications to a GPT-2 style Transformer, allowing standard architectures and optimization regimes to be used without core changes (Thickstun et al., 2023).
1. Anticipation and Temporal Point Process Modeling
AMT models two marked temporal point processes:
- The event process comprises arrival times $t_1 \le t_2 \le \dots$ and marks $m_i$ (e.g., pitch, velocity) from a finite vocabulary $\mathcal{V}$.
- The control process comprises control times $s_k$ and control marks $u_k$ from a second vocabulary $\mathcal{U}$, often events designated for infilling.
Anticipation models the joint conditional distribution $p(e \mid u)$ by interleaving the event and control sequences into a single sequence $a_1, a_2, \dots, a_{N+K}$. Each control $u_k$ with time $s_k$ is "anticipated" into the sequence just before the first event $e_i$ whose arrival time satisfies $t_i > s_k - \delta$; equivalently, events and controls are merged in order of the sort keys $t_i$ and $s_k - \delta$.
Here $\delta$ is the anticipation interval—a hyperparameter determining how far in advance controls manifest in the causal stream. The interleaving preserves the stopping-time property: whether position $j$ holds an event or a control depends only on the prefix $a_{<j}$, ensuring causal, tractable sampling.
After interleaving, the joint distribution is factorized autoregressively as
$$p(a_1, \dots, a_{N+K}) = \prod_{j=1}^{N+K} p(a_j \mid a_{<j}),$$
and, for conditional generation, the model masks the loss on control indices during training, so controls are conditioned on rather than predicted (Thickstun et al., 2023).
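The interleaving step can be sketched as a simple ordered merge. The `Token` structure and tie-breaking rule below are illustrative assumptions, not the paper's exact implementation:

```python
from dataclasses import dataclass

@dataclass
class Token:
    time: float      # arrival time in seconds
    mark: str        # e.g. a note label or control label
    is_control: bool

def anticipate(events, controls, delta):
    """Interleave controls into the event stream so that each control
    appears delta seconds ahead of its realization time.

    A control with time s is emitted just before the first event whose
    arrival time t satisfies t > s - delta, so the model already sees
    the control while generating the events it should influence."""
    out, i, j = [], 0, 0
    while i < len(events) or j < len(controls):
        take_control = j < len(controls) and (
            i == len(events) or controls[j].time - delta <= events[i].time
        )
        if take_control:
            out.append(controls[j]); j += 1
        else:
            out.append(events[i]); i += 1
    return out
```

Because the merge decision at each step depends only on tokens already emitted, the construction preserves the stopping-time property needed for causal sampling.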
2. Model Architecture and Token Sequence Construction
AMT employs a decoder-only (causally masked) Transformer architecture closely aligned with GPT-2, but introduces critical data-side modifications:
- Arrival-time tokenization: Each event is quantized into three tokens—onset $t_i$, duration $d_i$, and note $n_i$. This encoding yields self-contained triplets ($3N$ tokens for $N$ events) that are permutation-compatible for interleaving.
- Vocabulary doubling (control vs. event): Control and event tokens are assigned disjoint ID ranges, enabling the Transformer to distinguish "anticipated" (future control) tokens from historical event tokens of the same type. Every sequence is prepended with a single global control code $z$, indicating unconditional or anticipatory generation mode.
- Position embeddings: Uses standard absolute positional embeddings over sequence positions $1, \dots, T$. No special embeddings for anticipation or stopping times are required.
- Attention masking: Utilizes causal (decoder-only) masking; no need for bidirectional or custom sparse mechanisms, as asynchronous controls are already woven into the causal stream.
Other architecture components—multi-head self-attention, layer normalization, GeLU activations, and residual connections—are standard (Thickstun et al., 2023).
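The triplet encoding with a disjoint control ID range can be sketched as follows. The bin counts and 10 ms quantization step here are assumptions for illustration, not the paper's exact vocabulary parameters:

```python
# Illustrative token-ID layout; bin counts are assumed, not the paper's.
MAX_TIME_BINS = 10_000   # onset quantized to 10 ms bins (assumed)
MAX_DUR_BINS = 1_000     # duration bins (assumed)
MAX_NOTES = 128          # MIDI pitch range

EVENT_BASE = 0
# Controls live in a disjoint ID range: "vocabulary doubling".
CONTROL_BASE = MAX_TIME_BINS + MAX_DUR_BINS + MAX_NOTES

def tokenize_event(onset_s, dur_s, note, is_control=False):
    """Encode one note as an (onset, duration, note) triplet of token IDs.
    Control triplets get offset IDs so the model can tell anticipated
    controls apart from ordinary events of the same type."""
    base = CONTROL_BASE if is_control else EVENT_BASE
    t = min(int(onset_s / 0.01), MAX_TIME_BINS - 1)
    d = min(int(dur_s / 0.01), MAX_DUR_BINS - 1)
    return [base + t,
            base + MAX_TIME_BINS + d,
            base + MAX_TIME_BINS + MAX_DUR_BINS + note]
```

Because each triplet carries its absolute onset time, triplets remain interpretable regardless of where the interleaving places them in the sequence.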
3. Maximum Likelihood Training and Evaluation Metrics
Training proceeds with a maximum-likelihood objective. Given a packed example (a sequence of interleaved tokens $a_1, \dots, a_T$ plus control code $z$), the per-token cross-entropy loss is
$$\mathcal{L} = -\sum_{j=1}^{T} \log p(a_j \mid z, a_{<j}).$$
For event-wise analysis, tokens are grouped into triplets (onset, duration, note), so the per-event loss is the sum of its three token losses:
$$\mathcal{L}_{\mathrm{event}} = \mathcal{L}_{\mathrm{onset}} + \mathcal{L}_{\mathrm{dur}} + \mathcal{L}_{\mathrm{note}},$$
with the corresponding event, onset, duration, and note perplexities reported as the exponentiated mean losses ppl(e), ppl(t), ppl(d), and ppl(n), alongside bits per second (b/s). No auxiliary losses or discriminators are used—plain autoregressive cross-entropy suffices (Thickstun et al., 2023).
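A minimal sketch of the control-masked cross-entropy in NumPy (array shapes and the mask convention are illustrative assumptions):

```python
import numpy as np

def masked_nll(log_probs, targets, is_control):
    """Autoregressive cross-entropy that masks control positions.

    log_probs:  (T, V) predicted log-probabilities per position
    targets:    (T,) ground-truth token IDs
    is_control: (T,) bool, True where the token is an anticipated control

    Controls are given to the model rather than generated, so their
    positions contribute no loss; only event tokens are scored."""
    token_nll = -log_probs[np.arange(len(targets)), targets]
    event_mask = ~np.asarray(is_control)
    return token_nll[event_mask].mean()
```

Since the event loss is a sum of three token losses, ppl(e) factors as ppl(t)·ppl(d)·ppl(n)—consistent with the table in Section 5 (e.g., 1.52 × 3.64 × 2.24 ≈ 12.4).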
4. Dataset Processing, Model Scaling, and Training Regimes
AMT is trained on the filtered Lakh MIDI dataset (Raffel et al. 2016): 164,747 MIDI sequences, comprising 7,827h training, 555h validation, and 561h test data. Preprocessing steps include:
- Exclusion of sequences shorter than 10 s or longer than 1 h, and those with more than 16 instruments.
- Sparsity correction: REST tokens are inserted at 1 s intervals between sparse events to prevent the anticipation interval ($\delta$, up to 5 s) from overshooting the target time.
- Tokenization: Arrival-time encoding yields a 27,512-token vocabulary, including REST/SEP/control tokens.
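The sparsity correction can be sketched as filling gaps between arrival times with REST markers; the output representation here is a simplifying assumption:

```python
def insert_rests(onsets, max_gap=1.0):
    """Insert REST markers so that no gap between consecutive arrival
    times exceeds max_gap seconds. This keeps the anticipation interval
    from overshooting sparse regions of the piece.
    Returns (time, kind) pairs in temporal order."""
    out = []
    prev = 0.0
    for t in onsets:
        gap_t = prev + max_gap
        while gap_t < t:                 # fill the silent stretch
            out.append((gap_t, "REST"))
            gap_t += max_gap
        out.append((t, "NOTE"))
        prev = t
    return out
```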
Model configurations are as follows:
| Scale | Params | Layers (L) | Heads (H) | Dim (d) |
|---|---|---|---|---|
| Small | 128 M | 12 | 12 | 768 |
| Medium | 360 M | 24 | 16 | 1024 |
| Large | 780 M | 36 | 20 | 1280 |
Optimization uses AdamW with weight decay 0.1, dropout 0.1, and gradient clipping, under a linear warmup followed by cosine decay of the learning rate (peak rates set separately for Small, Medium, and Large). Training uses 512 sequences per step (context length 1024) on TPU v3-32 slices. Throughput is 0.7M tok/s (Small), 0.26M (Medium), and 0.14M (Large). Training steps vary by model and experiment; refer to Table 5.1 in (Thickstun et al., 2023).
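The warmup-plus-cosine learning-rate schedule can be sketched as below; the peak rate and step counts are per-scale choices from the paper's tables and appear here only as function parameters:

```python
import math

def lr_schedule(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay
    to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```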
5. Quantitative and Qualitative Evaluation
Quantitative evaluation on the test set is performed without anticipated controls, using bits per second (b/s) and event perplexity. Representative results:
| Config | Params | Encoding | Mode | b/s | ppl(e) | ppl(t) | ppl(d) | ppl(n) |
|---|---|---|---|---|---|---|---|---|
| Small interarrival | 112 M | interarrival | — | 85.9 | — | — | — | — |
| Small arrival (100k) | 128 M | arrival | AAR | 80.7 | 15.0 | 1.58 | 3.98 | 2.40 |
| Small arrival (800k) | 128 M | arrival | AAR | 75.0 | 12.4 | 1.52 | 3.64 | 2.24 |
| Medium (800k) | 360 M | arrival | AAR | 69.7 | 10.4 | 1.49 | 3.29 | 2.12 |
Findings:
- Arrival-time encoding consistently outperforms interarrival encoding.
- Anticipation (AAR) introduces negligible overhead on unconditional log-loss; overhead vanishes with longer training.
- Scaling model size and training duration reduces perplexity.
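The b/s metric normalizes test log-loss by audio duration rather than token count, which is what makes encodings with different token counts (arrival vs. interarrival) directly comparable. A minimal conversion sketch:

```python
import math

def bits_per_second(total_nll_nats, total_seconds):
    """Convert summed negative log-likelihood (in nats) over a test set
    into bits per second of music: divide by ln(2) to get bits, then by
    the total duration of the audio being modeled."""
    return total_nll_nats / math.log(2) / total_seconds
```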
Human evaluation via pairwise judgments was conducted on Amazon Mechanical Turk with 20 s synthesized audio clips generated by nucleus sampling:
Prompt continuation experiments (3-bar prompt + 14 s continuation):
- Human vs. Medium: humans slightly preferred.
- Medium vs. the Figaro baseline: Medium strongly preferred.
- Small arrival vs. Small interarrival: Small arrival preferred.
Accompaniment experiments (5 s melody prompt + 15 s accompaniment):
- Human vs. AAR: no significant preference.
- Human vs. AR baseline: the human compositions were overwhelmingly preferred.
- AAR vs. AR baseline: AAR vastly preferred.
- AAR vs. random retrieval: AAR strongly preferred.
These results indicate that AMT can produce accompaniments with musicality rivalling human-composed samples over 20-second timescales (Thickstun et al., 2023).
6. Functional Advantages and Significance
AMT’s key contribution is a principled method—anticipation—for $\delta$-second-ahead conditioning on asynchronous controls, achieved via data-side modifications alone. No structural or algorithmic changes to Transformer architectures are necessary, retaining compatibility with standard frameworks and optimization techniques. This construction enables tractable infilling and accompaniment tasks in symbolic music generation, matching or exceeding standard autoregressive baselines in both unconditional and conditional scenarios. Empirical results demonstrate negligible performance penalty for anticipation and strong human preference for AMT-generated accompaniments when compared to baselines.
A plausible implication is that anticipation as a general mechanism may extend to controllable sequence modeling outside music, whenever asynchronous or infilling control is desired within causal autoregressive models (Thickstun et al., 2023).