
Anticipatory Music Transformer (AMT)

Updated 9 February 2026
  • Anticipatory Music Transformer (AMT) is an autoregressive model that interleaves event and control tokens, enabling controllable symbolic music generation.
  • It pairs a standard GPT-2 architecture with arrival-time tokenization and vocabulary doubling to support asynchronous control through anticipation.
  • Empirical results demonstrate that AMT maintains tractable causal generation with improved musical infilling and accompaniment quality.

The Anticipatory Music Transformer (AMT) is a class of autoregressive generative models for symbolic music that generalize the standard Transformer architecture to permit asynchronous control via a formally defined mechanism called "anticipation." This framework enables tractable, controllable generation of event sequences (such as musical notes) conditioned on arbitrary subsets of future events, a critical capability for musical infilling and accompaniment. AMT interleaves event and control tokens in a single causal stream using a precise stopping-time criterion, thus preserving both model tractability and causal generation. This approach is implemented via minor data-level modifications to a GPT-2 style Transformer, allowing standard architectures and optimization regimes to be used without core changes (Thickstun et al., 2023).

1. Anticipation and Temporal Point Process Modeling

AMT models two marked temporal point processes:

  • The event process $\{e_j = (t_j, m_j)\}_{j=1}^N$ comprises arrival times $0 \leq t_1 \leq \dotsc \leq t_N$ and marks $m_j$ (e.g., pitch, velocity) from a finite vocabulary $\mathcal{V}$.
  • The control process $\{u_k = (s_k, c_k)\}_{k=1}^K$ comprises control times $0 \leq s_1 \leq \dotsc \leq s_K$ and control marks $c_k \in \mathcal{U}$, often with $\mathcal{U} = \mathcal{V}$ for infilling.

Anticipation models the conditional distribution $p(e_{1:N} \mid u_{1:K})$ by interleaving the event and control sequences with the operator $\mathrm{interleave}_\delta(e_{1:N}, u_{1:K})$, producing a single sequence $a_{1:N+K}$. Each control $u_k$ at time $s_k$ is “anticipated” into the sequence at index

$\tau_{u_k} = k + \min \{ j : t_j \geq s_k - \delta \}$

and each event eje_j is inserted at

$\tau_{e_j} = j + \max \{ k : t_{j-1} \geq s_k - \delta \}.$

Here, $\delta > 0$ is the anticipation interval, a hyperparameter determining how far in advance controls appear in the causal stream. The interleaving preserves the stopping-time property: whether position $i$ of the sequence holds an event or an anticipated control is determined by the prefix $a_{1:i-1}$ alone, ensuring causal, tractable sampling.
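The interleaving rule can be sketched in a few lines. This is an illustrative implementation, not the paper's code: events and controls are (time, mark) pairs sorted by time, and each control is emitted into the stream right after the first event whose arrival time reaches $s_k - \delta$.

```python
def interleave(events, controls, delta):
    """Merge events (t, mark) and controls (s, mark) into one causal stream.

    Each control (s, c) is anticipated: it is emitted right after the first
    event whose arrival time t satisfies t >= s - delta, so the model sees
    the control roughly delta seconds before it takes effect.
    """
    out = []
    k = 0  # index into the control sequence
    for t, m in events:
        out.append(("event", t, m))
        # flush every control whose anticipated time (s - delta) has arrived
        while k < len(controls) and controls[k][0] - delta <= t:
            s, c = controls[k]
            out.append(("control", s, c))
            k += 1
    # any remaining controls trail the final event
    for s, c in controls[k:]:
        out.append(("control", s, c))
    return out
```

With $\delta = 1$, a control at time 1.5 surfaces right after the event at time 1.0, one second ahead of its realization.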

After interleaving, the joint is factorized autoregressively:

$p(a_{1:N+K}) = \prod_{i=1}^{N+K} p(a_i \mid a_{1:i-1}),$

and, for conditional generation, the model masks the loss on control indices during training (Thickstun et al., 2023).
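The masked objective can be sketched as follows; the function and its inputs are illustrative, not the paper's implementation:

```python
def masked_nll(logprobs, is_control):
    """Autoregressive negative log-likelihood with the loss masked at
    control indices: logprobs[i] is the model's log p(a_i | a_{1:i-1})
    for the realized token, and positions flagged as controls do not
    contribute to the training objective."""
    return -sum(lp for lp, ctrl in zip(logprobs, is_control) if not ctrl)
```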

2. Model Architecture and Token Sequence Construction

AMT employs a decoder-only (causally masked) Transformer architecture closely aligned with GPT-2, but introduces critical data-side modifications:

  • Arrival-time tokenization: Each event $e_j = (t_j, d_j, n_j)$ is quantized into three tokens: onset $\tilde t_j = \lfloor t_j / 0.01 \rfloor$, duration $d_j$, and note $n_j$. This encoding yields self-contained triplet tokens ($3N$ in total) that are permutation-compatible for interleaving.
  • Vocabulary doubling (control vs. event): Control and event tokens are assigned disjoint ID ranges, enabling the Transformer to distinguish “anticipated” (future control) tokens from historical event tokens of the same type. Every sequence is prepended with a single global control code $z \in \{\mathrm{AR}, \mathrm{AAR}\}$, indicating unconditional or anticipatory generation mode.
  • Position embeddings: Standard absolute positional embeddings over sequence positions $1, \dotsc, 1024$; no special embeddings for anticipation or stopping times are required.
  • Attention masking: Utilizes causal (decoder-only) masking; no need for bidirectional or custom sparse mechanisms, as asynchronous controls are already woven into the causal stream.
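A sketch of the triplet tokenization and vocabulary doubling described above. The 10 ms grid follows the arrival-time encoding; the ID offset here is a placeholder (the real vocabulary has 27,512 tokens with disjoint control/event ranges):

```python
import math

TIME_RES = 0.01          # 10 ms quantization grid for onsets and durations
CONTROL_OFFSET = 10_000  # placeholder: controls live in a disjoint ID range

def tokenize_event(t, d, n, is_control=False):
    """Encode one event (onset t, duration d, note n) as a token triplet.
    Control tokens reuse the same layout but are shifted into a disjoint
    ID range so the model can tell anticipated controls from history."""
    triplet = [math.floor(t / TIME_RES), round(d / TIME_RES), n]
    if is_control:
        triplet = [tok + CONTROL_OFFSET for tok in triplet]
    return triplet
```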

Other architecture components—multi-head self-attention, layer normalization, GeLU activations, and residual connections—are standard (Thickstun et al., 2023).

3. Maximum Likelihood Training and Evaluation Metrics

Training proceeds with a maximum-likelihood objective. Given a packed example $x_{1:M}$ (a sequence of interleaved tokens plus the control code),

$\max_{\theta} \sum_{i=1}^M \log p_{\theta}(x_i \mid z, x_{1:i-1})$

with per-token cross-entropy loss

$L_{\mathrm{token}} = - \sum_{i=1}^M \log p(x_i \mid x_{<i}).$

For event-wise analysis, tokens are grouped into triplets (onset, duration, note), and the event loss is:

$L_{\mathrm{event}} = -\sum_{j=1}^{N} \left[ \log p(\tilde t_j \mid \cdots) + \log p(d_j \mid \cdots) + \log p(n_j \mid \cdots) \right]$

with corresponding event, onset, duration, and note perplexities reported as $\mathrm{ppl}(e) = \exp(L_{\mathrm{event}}/N)$, alongside $\mathrm{ppl}(\tilde t)$, $\mathrm{ppl}(d)$, and $\mathrm{ppl}(n)$. No auxiliary losses or discriminators are used; plain autoregressive cross-entropy suffices (Thickstun et al., 2023).
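The event-level perplexity follows directly from the per-token losses by grouping triplets; a small illustrative helper:

```python
import math

def event_perplexity(token_nlls, n_events):
    """ppl(e) = exp(L_event / N): sum the per-token NLLs (three tokens
    per event) and exponentiate the per-event average."""
    assert len(token_nlls) == 3 * n_events
    return math.exp(sum(token_nlls) / n_events)
```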

4. Dataset Processing, Model Scaling, and Training Regimes

AMT is trained on the filtered Lakh MIDI dataset (Raffel et al., 2016): 164,747 MIDI sequences, comprising 7,827 h of training, 555 h of validation, and 561 h of test data. Preprocessing steps include:

  • Exclusion of sequences shorter than 10 s or longer than 1 h, and of those with more than 16 instruments.
  • Sparsity correction: REST tokens are inserted every 1 s between sparse events to prevent the anticipation interval ($\delta$ up to 5 s) from overshooting the target time.
  • Tokenization: Arrival-time encoding yields a 27,512-token vocabulary, including REST/SEP/control tokens.
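The sparsity correction can be pictured as follows (an illustrative sketch; the paper's exact placement rule may differ): whenever consecutive onsets are more than 1 s apart, REST markers are emitted at 1 s intervals so the stream never falls silent for longer than the anticipation interval can tolerate.

```python
def insert_rests(onsets, max_gap=1.0):
    """Sparsity correction: emit a REST marker each second inside any gap
    between consecutive onsets, so no stretch of the token stream is more
    than ~1 s from its predecessor."""
    out = []
    prev = 0.0
    for t in onsets:
        rest_t = prev + max_gap
        while rest_t < t:
            out.append(("REST", rest_t))
            rest_t += max_gap
        out.append(("onset", t))
        prev = t
    return out
```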

Model configurations are as follows:

Scale    Params   Layers (L)   Heads (H)   Dim (d)
Small    128 M    12           12          768
Medium   360 M    24           16          1024
Large    780 M    36           20          1280

Optimization uses AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), weight decay $0.1$, dropout $0.1$, and gradient clipping ($\|g\| \leq 1$), with a linear warmup followed by cosine decay of the learning rate ($l_{\max} = 6 \times 10^{-4}$, $3 \times 10^{-4}$, and $2 \times 10^{-4}$ for Small, Medium, and Large, respectively). Training uses 512 sequences per step (context length 1024) on TPU v3-32 slices. Throughput is approximately 0.7M tok/s (Small), 0.26M (Medium), and 0.14M (Large). Training steps vary by model and experiment; refer to Table 5.1 in (Thickstun et al., 2023).
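The learning-rate schedule (linear warmup, then cosine decay) can be written compactly; the warmup and total step counts below are placeholders, not values from the paper:

```python
import math

def lr_at(step, warmup_steps, total_steps, lr_max, lr_min=0.0):
    """Linear warmup from 0 to lr_max over warmup_steps, then cosine
    decay from lr_max down to lr_min over the remaining steps."""
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```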

5. Quantitative and Qualitative Evaluation

Quantitative evaluation on the test set without anticipation ($z = \mathrm{AR}$) uses bits per second (b/s) and event perplexity. Representative results:

Config                 Params   Encoding       AM?   b/s    ppl(e)   ppl(t̃)   ppl(d)   ppl(n)
Small (interarrival)   112 M    interarrival   –     85.9   –        –        –        –
Small arrival (100k)   128 M    arrival        AAR   80.7   15.0     1.58     3.98     2.40
Small arrival (800k)   128 M    arrival        AAR   75.0   12.4     1.52     3.64     2.24
Medium (800k)          360 M    arrival        AAR   69.7   10.4     1.49     3.29     2.12
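The b/s column is a compression rate: total log-loss converted to bits, divided by the audio duration of the evaluated data. A minimal conversion, assuming the model's loss is accumulated in nats:

```python
import math

def bits_per_second(total_nll_nats, total_seconds):
    """Convert a corpus negative log-likelihood (in nats) into the
    bits-per-second compression metric reported above."""
    return total_nll_nats / math.log(2.0) / total_seconds
```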

Findings:

  • Arrival-time encoding consistently outperforms interarrival encoding.
  • Anticipation (AAR) introduces negligible overhead on unconditional log-loss; overhead vanishes with longer training.
  • Scaling model size and training duration reduces perplexity.

Human evaluation via pairwise judgments was conducted on Amazon Mechanical Turk with 20 s synthesized audio clips (nucleus sampling, $p = 0.95$):

Prompt continuation experiments (3-bar prompt + 14 s continuation):

  • Human vs. Medium: humans slightly preferred ($p = 0.0027$).
  • Medium vs. Figaro baseline: Medium strongly preferred ($p < 10^{-8}$).
  • Small arrival vs. Small interarrival: Small arrival preferred ($p = 0.007$).

Accompaniment experiments (5 s melody prompt + 15 s accompaniment):

  • Human vs. AAR: no significant preference ($p = 0.19$).
  • Human vs. AR baseline: human overwhelmingly preferred ($p < 10^{-8}$).
  • AAR vs. AR baseline: AAR vastly preferred ($p < 10^{-7}$).
  • AAR vs. random retrieval: AAR strongly preferred ($p < 10^{-8}$).
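P-values of this kind typically come from an exact sign (binomial) test over the pairwise judgments; since the precise test is not stated here, the following is only an illustrative two-sided version:

```python
from math import comb

def binomial_two_sided_p(wins, trials, p=0.5):
    """Exact two-sided binomial test: sum the probabilities, under the
    null of no preference (p = 0.5), of all outcomes no more likely
    than the observed win count."""
    pmf = [comb(trials, k) * p**k * (1 - p)**(trials - k)
           for k in range(trials + 1)]
    observed = pmf[wins]
    return min(1.0, sum(q for q in pmf if q <= observed + 1e-12))
```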

These results indicate that AMT can produce accompaniments with musicality rivalling human-composed samples over 20-second timescales (Thickstun et al., 2023).

6. Functional Advantages and Significance

AMT’s key contribution is a principled method, anticipation, for $\delta$-second-ahead conditioning on asynchronous controls, achieved via data-side modifications alone. No structural or algorithmic changes to Transformer architectures are necessary, retaining compatibility with standard frameworks and optimization techniques. This construction enables tractable infilling and accompaniment in symbolic music generation, matching or exceeding standard autoregressive baselines in both unconditional and conditional scenarios. Empirical results demonstrate a negligible performance penalty for anticipation and strong human preference for AMT-generated accompaniments over baselines.

A plausible implication is that anticipation as a general mechanism may extend to controllable sequence modeling outside music, whenever asynchronous or infilling control is desired within causal autoregressive models (Thickstun et al., 2023).
