
Anticipatory Music Transformer (AMT)

Updated 9 February 2026
  • Anticipatory Music Transformer (AMT) is an autoregressive model that interleaves event and control tokens, enabling controllable symbolic music generation.
  • It pairs a standard GPT-2 architecture with arrival-time tokenization and vocabulary doubling to support asynchronous control through anticipation.
  • Empirical results demonstrate that AMT maintains tractable causal generation with improved musical infilling and accompaniment quality.

The Anticipatory Music Transformer (AMT) is a class of autoregressive generative models for symbolic music that generalize the standard Transformer architecture to permit asynchronous control via a formally defined mechanism called "anticipation." This framework enables tractable, controllable generation of event sequences (such as musical notes) conditioned on arbitrary subsets of future events, a critical capability for musical infilling and accompaniment. AMT interleaves event and control tokens in a single causal stream using a precise stopping-time criterion, thus preserving both model tractability and causal generation. This approach is implemented via minor data-level modifications to a GPT-2 style Transformer, allowing standard architectures and optimization regimes to be used without core changes (Thickstun et al., 2023).

1. Anticipation and Temporal Point Process Modeling

AMT models two marked temporal point processes:

  • The event process $\{e_j = (t_j, m_j)\}_{j=1}^N$ comprises arrival times $0 \leq t_1 \leq \dotsc \leq t_N$ and marks $m_j$ (e.g., pitch, velocity) from a finite vocabulary $\mathcal{V}$.
  • The control process $\{u_k = (s_k, c_k)\}_{k=1}^K$ comprises control times $0 \leq s_1 \leq \dotsc \leq s_K$ and control marks $c_k \in \mathcal{U}$, often with $\mathcal{U} = \mathcal{V}$ for infilling.

Anticipation models the conditional distribution $p(e_{1:N} \mid u_{1:K})$ by interleaving the event and control sequences with the operator $\mathrm{interleave}_\delta(e_{1:N}, u_{1:K})$, producing a single sequence $a_{1:N+K}$. Each control $u_k$ at time $s_k$ is “anticipated” into the sequence at index

$\tau_{u_k} = k + \min \{ j : t_j \geq s_k - \delta \}$

and each event eje_j is inserted at

$\tau_{e_j} = j + \max \{ k : t_{j-1} \geq s_k - \delta \}.$

Here, $\delta > 0$ is the anticipation interval, a hyperparameter determining how far in advance controls appear in the causal stream. The interleaving preserves the stopping-time property: whether position $i$ of the sequence holds an event or an anticipated control is determined by the prefix $a_{1:i-1}$ alone, ensuring causal, tractable sampling.
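The interleaving rule can be sketched in a few lines. This is an illustrative implementation, not the paper's code: events and controls are (time, mark) pairs sorted by time, and each control is emitted into the stream right after the first event whose arrival time reaches $s_k - \delta$.

```python
def interleave(events, controls, delta):
    """Merge events (t, mark) and controls (s, mark) into one causal stream.

    Each control (s, c) is anticipated: it is emitted right after the first
    event whose arrival time t satisfies t >= s - delta, so the model sees
    the control roughly delta seconds before it takes effect.
    """
    out = []
    k = 0  # index into the control sequence
    for t, m in events:
        out.append(("event", t, m))
        # flush every control whose anticipated time (s - delta) has arrived
        while k < len(controls) and controls[k][0] - delta <= t:
            s, c = controls[k]
            out.append(("control", s, c))
            k += 1
    # any remaining controls trail the final event
    for s, c in controls[k:]:
        out.append(("control", s, c))
    return out
```

With $\delta = 1$, a control at time 1.5 surfaces right after the event at time 1.0, one second ahead of its realization.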

After interleaving, the joint is factorized autoregressively:

$p(a_{1:N+K}) = \prod_{i=1}^{N+K} p(a_i \mid a_{1:i-1}),$

and, for conditional generation, the model masks the loss on control indices during training (Thickstun et al., 2023).
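The masked objective can be sketched as follows; the function and its inputs are illustrative, not the paper's implementation:

```python
def masked_nll(logprobs, is_control):
    """Autoregressive negative log-likelihood with the loss masked at
    control indices: logprobs[i] is the model's log p(a_i | a_{1:i-1})
    for the realized token, and positions flagged as controls do not
    contribute to the training objective."""
    return -sum(lp for lp, ctrl in zip(logprobs, is_control) if not ctrl)
```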

2. Model Architecture and Token Sequence Construction

AMT employs a decoder-only (causally masked) Transformer architecture closely aligned with GPT-2, but introduces critical data-side modifications:

  • Arrival-time tokenization: Each event $e_j = (t_j, d_j, n_j)$ is quantized into three tokens: onset $\tilde t_j = \lfloor t_j / 0.01 \rfloor$, duration $d_j$, and note $n_j$. This encoding yields self-contained triplet tokens ($3N$ in total) that are permutation-compatible for interleaving.
  • Vocabulary doubling (control vs. event): Control and event tokens are assigned disjoint ID ranges, enabling the Transformer to distinguish “anticipated” (future control) tokens from historical event tokens of the same type. Every sequence is prepended with a single global control code $z \in \{\mathrm{AR}, \mathrm{AAR}\}$, indicating unconditional or anticipatory generation mode.
  • Position embeddings: Standard absolute positional embeddings over sequence positions $1, \dotsc, 1024$; no special embeddings for anticipation or stopping times are required.
  • Attention masking: Utilizes causal (decoder-only) masking; no need for bidirectional or custom sparse mechanisms, as asynchronous controls are already woven into the causal stream.
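A sketch of the triplet tokenization and vocabulary doubling described above. The 10 ms grid follows the arrival-time encoding; the ID offset here is a placeholder (the real vocabulary has 27,512 tokens with disjoint control/event ranges):

```python
import math

TIME_RES = 0.01          # 10 ms quantization grid for onsets and durations
CONTROL_OFFSET = 10_000  # placeholder: controls live in a disjoint ID range

def tokenize_event(t, d, n, is_control=False):
    """Encode one event (onset t, duration d, note n) as a token triplet.
    Control tokens reuse the same layout but are shifted into a disjoint
    ID range so the model can tell anticipated controls from history."""
    triplet = [math.floor(t / TIME_RES), round(d / TIME_RES), n]
    if is_control:
        triplet = [tok + CONTROL_OFFSET for tok in triplet]
    return triplet
```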

Other architecture components—multi-head self-attention, layer normalization, GeLU activations, and residual connections—are standard (Thickstun et al., 2023).

3. Maximum Likelihood Training and Evaluation Metrics

Training proceeds with a maximum-likelihood objective. Given a packed example $x_{1:M}$ (a sequence of interleaved tokens plus the control code),

$\max_{\theta} \sum_{i=1}^M \log p_{\theta}(x_i \mid z, x_{1:i-1})$

with per-token cross-entropy loss

$L_{\mathrm{token}} = - \sum_{i=1}^M \log p(x_i \mid x_{<i}).$

For event-wise analysis, tokens are grouped into triplets (onset, duration, note), and the event loss is:

$L_{\mathrm{event}} = -\sum_{j=1}^{N} \left[ \log p(\tilde t_j \mid \cdots) + \log p(d_j \mid \cdots) + \log p(n_j \mid \cdots) \right]$

with corresponding event, onset, duration, and note perplexities reported as $\mathrm{ppl}(e) = \exp(L_{\mathrm{event}}/N)$, alongside $\mathrm{ppl}(\tilde t)$, $\mathrm{ppl}(d)$, and $\mathrm{ppl}(n)$. No auxiliary losses or discriminators are used; plain autoregressive cross-entropy suffices (Thickstun et al., 2023).
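The event-level perplexity follows directly from the per-token losses by grouping triplets; a small illustrative helper:

```python
import math

def event_perplexity(token_nlls, n_events):
    """ppl(e) = exp(L_event / N): sum the per-token NLLs (three tokens
    per event) and exponentiate the per-event average."""
    assert len(token_nlls) == 3 * n_events
    return math.exp(sum(token_nlls) / n_events)
```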

4. Dataset Processing, Model Scaling, and Training Regimes

AMT is trained on the filtered Lakh MIDI dataset (Raffel et al., 2016): 164,747 MIDI sequences, comprising 7,827 h of training, 555 h of validation, and 561 h of test data. Preprocessing steps include:

  • Exclusion of sequences shorter than 10 s or longer than 1 h, and of those with more than 16 instruments.
  • Sparsity correction: REST tokens are inserted every 1 s between sparse events to prevent the anticipation interval ($\delta$ up to 5 s) from overshooting the target time.
  • Tokenization: Arrival-time encoding yields a 27,512-token vocabulary, including REST/SEP/control tokens.
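The sparsity correction can be pictured as follows (an illustrative sketch; the paper's exact placement rule may differ): whenever consecutive onsets are more than 1 s apart, REST markers are emitted at 1 s intervals so the stream never falls silent for longer than the anticipation interval can tolerate.

```python
def insert_rests(onsets, max_gap=1.0):
    """Sparsity correction: emit a REST marker each second inside any gap
    between consecutive onsets, so no stretch of the token stream is more
    than ~1 s from its predecessor."""
    out = []
    prev = 0.0
    for t in onsets:
        rest_t = prev + max_gap
        while rest_t < t:
            out.append(("REST", rest_t))
            rest_t += max_gap
        out.append(("onset", t))
        prev = t
    return out
```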

Model configurations are as follows:

Scale    Params   Layers (L)   Heads (H)   Dim (d)
Small    128 M    12           12          768
Medium   360 M    24           16          1024
Large    780 M    36           20          1280

Optimization uses AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), weight decay $0.1$, dropout $0.1$, and gradient clipping ($\|g\| \leq 1$), with a linear warmup followed by cosine decay of the learning rate ($l_{\max} = 6 \times 10^{-4}$, $3 \times 10^{-4}$, and $2 \times 10^{-4}$ for Small, Medium, and Large, respectively). Training uses 512 sequences per step (context length 1024) on TPU v3-32 slices. Throughput is approximately 0.7M tok/s (Small), 0.26M (Medium), and 0.14M (Large). Training steps vary by model and experiment; refer to Table 5.1 in (Thickstun et al., 2023).
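The learning-rate schedule (linear warmup, then cosine decay) can be written compactly; the warmup and total step counts below are placeholders, not values from the paper:

```python
import math

def lr_at(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup from 0 to lr_max over warmup_steps, then cosine
    decay from lr_max down to lr_min over the remaining steps."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```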

5. Quantitative and Qualitative Evaluation

Quantitative evaluation on the test set without anticipation ($z = \mathrm{AR}$) uses bits per second (b/s) and event perplexity. Representative results:

Config                 Params   Encoding       AM?   b/s    ppl(e)   ppl(t̃)   ppl(d)   ppl(n)
Small (interarrival)   112 M    interarrival   –     85.9   –        –        –        –
Small arrival (100k)   128 M    arrival        AAR   80.7   15.0     1.58     3.98     2.40
Small arrival (800k)   128 M    arrival        AAR   75.0   12.4     1.52     3.64     2.24
Medium (800k)          360 M    arrival        AAR   69.7   10.4     1.49     3.29     2.12
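The b/s column is a compression rate: total log-loss converted to bits, divided by the audio duration of the evaluated data. A minimal conversion, assuming the model's loss is accumulated in nats:

```python
import math

def bits_per_second(total_nll_nats, total_seconds):
    """Convert a corpus negative log-likelihood (in nats) into the
    bits-per-second compression metric reported above."""
    return total_nll_nats / math.log(2.0) / total_seconds
```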

Findings:

  • Arrival-time encoding consistently outperforms interarrival encoding.
  • Anticipation (AAR) introduces negligible overhead on unconditional log-loss; overhead vanishes with longer training.
  • Scaling model size and training duration reduces perplexity.

Human evaluation via pairwise judgments was conducted on Amazon Mechanical Turk with 20 s synthesized audio clips (nucleus sampling, $p = 0.95$):

Prompt continuation experiments (3-bar prompt + 14 s continuation):

  • Human vs. Medium: humans slightly preferred ($p = 0.0027$).
  • Medium vs. Figaro baseline: Medium strongly preferred ($p < 10^{-8}$).
  • Small arrival vs. Small interarrival: Small arrival preferred ($p = 0.007$).

Accompaniment experiments (5 s melody prompt + 15 s accompaniment):

  • Human vs. AAR: no significant preference ($p = 0.19$).
  • Human vs. AR baseline: human overwhelmingly preferred ($p < 10^{-8}$).
  • AAR vs. AR baseline: AAR vastly preferred ($p < 10^{-7}$).
  • AAR vs. random retrieval: AAR strongly preferred ($p < 10^{-8}$).
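P-values of this kind typically come from an exact sign (binomial) test over the pairwise judgments; since the precise test is not stated here, the following is only an illustrative two-sided version:

```python
from math import comb

def binomial_two_sided_p(wins, trials, p=0.5):
    """Exact two-sided binomial test: sum the probabilities, under the
    null of no preference (p = 0.5), of all outcomes no more likely
    than the observed win count."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[wins]
    return min(1.0, sum(q for q in pmf if q <= observed + 1e-12))
```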

These results indicate that AMT can produce accompaniments with musicality rivalling human-composed samples over 20-second timescales (Thickstun et al., 2023).

6. Functional Advantages and Significance

AMT’s key contribution is a principled method, anticipation, for $\delta$-second-ahead conditioning on asynchronous controls, achieved via data-side modifications alone. No structural or algorithmic changes to Transformer architectures are necessary, retaining compatibility with standard frameworks and optimization techniques. This construction enables tractable infilling and accompaniment in symbolic music generation, matching or exceeding standard autoregressive baselines in both unconditional and conditional scenarios. Empirical results demonstrate a negligible performance penalty for anticipation and strong human preference for AMT-generated accompaniments over baselines.

A plausible implication is that anticipation as a general mechanism may extend to controllable sequence modeling outside music, whenever asynchronous or infilling control is desired within causal autoregressive models (Thickstun et al., 2023).
