
Direct Multi-Token Decoding (DMTD)

Updated 19 February 2026
  • Direct Multi-Token Decoding (DMTD) is a paradigm that predicts multiple tokens in one inference cycle by exploiting internal model properties and self-distilled objectives.
  • It leverages strategies like layer reuse, multi-token output heads, and probabilistic circuits to balance generation speed and accuracy, achieving up to 2–5× throughput gains.
  • Empirical studies show that, with appropriate fine-tuning, DMTD maintains high fidelity while significantly reducing inference time in large language models.

Direct Multi-Token Decoding (DMTD) is an inference paradigm for accelerating autoregressive model generation—including LLMs and codec-based sequence models—by producing multiple output tokens in a single inference cycle, rather than one at a time. Unlike speculative decoding, which leverages external draft/verifier models, or traditional blockwise generation strategies that rely on auxiliary verification, DMTD fundamentally exploits internal model properties, novel self-distilled objectives, or circuit-based multi-token parameterizations. Empirical results indicate that DMTD, when appropriately tuned, can achieve up to 2–5× throughput over conventional single-token decoding with minimal accuracy loss, and in certain regimes, lossless decoding using verification-based admission criteria (Luo et al., 13 Oct 2025, Kirchenbauer et al., 5 Feb 2026, Qin et al., 2024, Grivas et al., 14 Nov 2025, Hu et al., 16 Feb 2025, Nguyen et al., 2024).

1. Problem Definition and Theoretical Rationale

Autoregressive sequence models predict a sequence of target tokens $\{x_1, \dots, x_N\}$ by factorizing the joint as $p(x_{1:N} \mid \text{context}) = \prod_{t=1}^{N} p(x_t \mid x_{<t}, \text{context})$. Standard decoding emits one token per forward pass. DMTD generalizes this by positing that, after sufficient processing of the current context, a model can predict a block of $k$ future tokens at each step, often by reusing only a subset of layers or leveraging explicit multi-token output heads.

For decoder-only Transformers, a key empirical observation is that early layers encode context, middle layers perform abstraction (“thinking”), and late layers map these representations to output tokens (“decoding”). DMTD exploits this structure by hypothesizing that, once context and abstraction are computed, the final layers have enough information to autoregressively predict several contiguous tokens by repeated reuse (Luo et al., 13 Oct 2025).

In formal blockwise decoding (MTJD), the block $(x_{t+1}, \dots, x_{t+k})$ is generated by maximizing the multi-step product $p(x_{t+1:t+k} \mid x_{\leq t}) = \prod_{j=1}^{k} p(x_{t+j} \mid x_{\leq t+j-1})$ (Qin et al., 2024). For practical efficiency, DMTD algorithms approximate or bypass this expensive search through parameter tying, draft/model distillation, or circuit-based factoring.
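The chain-rule factorization above can be made concrete with a toy sketch. Here `next_token_probs` is a hypothetical stand-in for a real model's next-token distribution, used only to show how a block's joint likelihood decomposes into per-step factors:

```python
import math

def next_token_probs(context):
    # Hypothetical stand-in for a model's next-token distribution:
    # returns p(x | context) over a 3-token toy vocabulary {0, 1, 2}.
    base = [0.6, 0.3, 0.1]
    shift = len(context) % 3
    return base[shift:] + base[:shift]

def block_log_prob(context, block):
    """log p(x_{t+1:t+k} | x_{<=t}) = sum_j log p(x_{t+j} | x_{<=t+j-1})."""
    logp = 0.0
    ctx = list(context)
    for token in block:
        probs = next_token_probs(ctx)
        logp += math.log(probs[token])
        ctx.append(token)  # condition the next step on this token
    return logp

# Joint likelihood of a 2-token block as a product of per-step factors.
lp = block_log_prob([0, 1], [2, 0])
```

Because each per-step distribution is normalized, the block probabilities sum to one over all possible blocks; MTJD's search is expensive precisely because this block space grows exponentially in $k$.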

2. DMTD Methods: Self-Contained and Hybrid Paradigms

Principal DMTD strategies include:

  • Layer-reuse DMTD (Luo et al., 13 Oct 2025): After one full processing cycle, only the late transformer layers are recycled for subsequent token steps in a fixed-length cycle, eliminating the need to traverse the entire stack per token. Training employs cyclic masking to unify multi-token targets into one forward pass. No new parameters or architectural changes are needed; only the training objective is modified.
  • Standalone Multi-Token Heads/Distilled Blocks (Kirchenbauer et al., 5 Feb 2026, Nguyen et al., 2024): The model is fine-tuned (often by self-distillation) to produce $k$-step-ahead predictions in a single pass. For LLMs, this is achieved by inputting a prefix and $k-1$ mask tokens, and training a shared multi-token output head under a joint loss against the chain-rule-computed likelihoods provided by a frozen teacher network.
  • Joint Distribution and Probabilistic Circuit Approach (Grivas et al., 14 Nov 2025): Expressive multi-token models replace factorized multi-heads with structured probabilistic circuits (e.g., mixture, hidden Markov, or tree-structured sum-product networks) parameterized atop backbone transformer hidden states, permitting tractable, exact sampling and normalization over token blocks.
  • Speculative-Verification Hybrids (MTAD, GRIFFIN, etc.) (Qin et al., 2024, Hu et al., 16 Feb 2025): DMTD is combined with a verifier-based protocol. Either a lightweight assistant/draft model proposes $k$ tokens, which are then accepted to the extent their joint likelihood matches the main model's, or architectural and training improvements (e.g., token-alignable drafts, loss masking on misaligned trajectories) maximize average acceptance lengths per speculative round.

3. Mathematical Formulations

Representative mathematical constructs for DMTD include:

Let $L = L_e + L_t + L_d$ (early, thinking, and decoding layers, respectively). For block length $\tau$:

$$\text{PLT} = \frac{L + (\tau-1)L_d}{\tau L} = \frac{1}{\tau} + \frac{\tau-1}{\tau}\cdot\frac{L_d}{L}$$

Lower PLT implies higher amortization of compute.
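A minimal numeric check of the PLT identity, using illustrative layer counts (the 36/9 split below is an assumption for the example, not a figure from the cited papers):

```python
def per_layer_traversal(L, L_d, tau):
    """PLT = (L + (tau-1)*L_d) / (tau*L): average fraction of the full
    layer stack traversed per emitted token under layer-reuse DMTD."""
    return (L + (tau - 1) * L_d) / (tau * L)

# Example: a 36-layer model whose last 9 layers are reused, tau = 4.
# One full pass plus three late-layer passes amortizes to (36 + 27)/144.
plt_value = per_layer_traversal(L=36, L_d=9, tau=4)
```

The two closed forms in the equation agree term by term: as $\tau$ grows, PLT approaches $L_d / L$, the cost of running only the decoding layers.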

The student model, given prefix $x_{1:i}$ and $k-1$ mask tokens, predicts logits $\ell_{i:i+k}$. The loss is

$$L_{\text{MTP}} = -\log P_{\theta_S}(y' \mid x_{1:i})$$

where $y' = \arg\max \ell_{i:i+k}$ (taken position-wise over the block), and $P_{\theta_S}(y' \mid x_{1:i})$ is the product of student marginals over the block.
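The functional form of this loss can be sketched in a few lines. This is only the loss computation under toy marginals, not the full self-distillation setup against a frozen teacher:

```python
import math

def mtp_loss(marginals):
    """Given per-position student marginals over the block (each a
    dict token -> prob), take y' as the position-wise argmax and
    return -log of the product of the student marginals at y'."""
    loss = 0.0
    for dist in marginals:
        y = max(dist, key=dist.get)   # greedy pick at this position
        loss -= math.log(dist[y])     # accumulate -log p(y'_j)
    return loss

# Two-position block: confident first token, less confident second.
marginals = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}]
loss = mtp_loss(marginals)  # -(log 0.9 + log 0.6)
```

In training, the targets come from the teacher's chain-rule likelihoods rather than from the student's own argmax, which is what drives the block distribution toward the autoregressive one.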

Joint block scores $c(x_{t+1:t+n}; \theta_t)$ are parameterized by a circuit; the normalized probability

$$q(x_{t+1:t+n} \mid x_{\leq t}) = Z_t^{-1}\, c(x_{t+1:t+n}; \theta_t)$$

enables tractable evaluation and sampling for various dependency structures (fully factorized, mixtures, HMMs, tree-structured, etc.).
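A minimal sketch of this normalization for a mixture-of-factorized circuits, with toy weights chosen for illustration. Real PCs compute $Z_t$ in a single pass over the circuit; here the block space is small enough to enumerate by brute force:

```python
import math
from itertools import product

VOCAB = [0, 1]

def circuit_score(block, weights, comps):
    """Unnormalized mixture-of-factorized score:
    c(x) = sum_m w_m * prod_j comps[m][j][x_j]."""
    return sum(
        w * math.prod(comp[j][x] for j, x in enumerate(block))
        for w, comp in zip(weights, comps)
    )

def normalized_block_prob(block, weights, comps, n):
    # Z_t by enumeration over the (tiny) block space; tractable circuits
    # obtain the same value without enumeration.
    Z = sum(circuit_score(b, weights, comps) for b in product(VOCAB, repeat=n))
    return circuit_score(block, weights, comps) / Z

# Two mixture components over blocks of length 2 (unnormalized tables).
weights = [0.7, 0.3]
comps = [
    [[2.0, 1.0], [1.0, 3.0]],   # component 0: position-wise scores
    [[1.0, 1.0], [2.0, 2.0]],   # component 1
]
q = normalized_block_prob((0, 1), weights, comps, n=2)
```

The mixture already captures cross-position dependence that a purely factorized multi-head cannot; HMM and tree-structured circuits extend this while keeping $Z_t$ computable.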

4. Implementation and Inference Procedures

  • Layer-Reuse DMTD (Luo et al., 13 Oct 2025): Each cycle generates one token with the full stack, then $\tau-1$ tokens using only late layers, followed by cyclical refilling of early/middle layer caches for the next cycle. See detailed pseudocode in the source; cyclical masking enables training a single model for variable $\tau$.
  • Multi-Token Head Inference (Kirchenbauer et al., 5 Feb 2026, Nguyen et al., 2024): At each decoding step, append $k$ mask tokens, perform a single forward pass, output $k$ token predictions, remove consumed mask tokens, and repeat. Adaptive block length (by confidence threshold) is supported.
  • Speculative/Multi-Token Assisted Decoding (Qin et al., 2024, Hu et al., 16 Feb 2025): An assistant model drafts a candidate block; the primary model (verifier) re-scores and accepts the longest matching prefix. GRIFFIN introduces token-alignable training (loss masking of off-trajectory tokens and self-conditioned drafts) to maximize acceptance rates.
  • Probabilistic Circuit Decoding (Grivas et al., 14 Nov 2025): Drafts token blocks by sampling through the PC head, verifies via AR head, updates the context according to acceptance.
  • Codec/Audio Applications (Nguyen et al., 2024): Multiple output heads predict future codec tokens in parallel per context. A Viterbi-style search restores first-order dependencies among predictions.
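The draft-then-verify loop shared by the speculative variants can be sketched as follows. The greedy-match acceptance rule and the toy `draft_fn`/`verify_fn` below are illustrative assumptions standing in for the papers' likelihood-based acceptance tests:

```python
def verify_block(draft_tokens, verifier_greedy):
    """Accept the longest prefix of the draft that matches the
    verifier's own choices (hypothetical greedy-match rule)."""
    accepted = []
    for d, v in zip(draft_tokens, verifier_greedy):
        if d != v:
            break
        accepted.append(d)
    return accepted

def assisted_decode_step(context, draft_fn, verify_fn, k):
    draft = draft_fn(context, k)        # assistant proposes k tokens
    greedy = verify_fn(context, draft)  # verifier rescoring pass
    accepted = verify_block(draft, greedy)
    # Always make progress: if nothing matched, take the verifier's token.
    if not accepted:
        accepted = greedy[:1]
    return context + accepted

def draft_fn(ctx, k):
    # Toy assistant: repeats the last context token k times.
    return [ctx[-1]] * k

def verify_fn(ctx, draft):
    # Toy verifier whose greedy choice alternates parity.
    return [(ctx[-1] + i + 1) % 2 for i in range(len(draft))]

new_ctx = assisted_decode_step([0, 1], draft_fn, verify_fn, k=3)
```

The guaranteed one-token fallback is what makes such schemes lossless in the greedy setting: output never diverges from what the verifier alone would have produced.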

5. Empirical Performance and Trade-Offs

A summary of key empirical outcomes from different DMTD regimes:

| Method & Setting | Throughput Gain | Accuracy/Quality Drop | Block Size / Speedup Limit | Comments |
|---|---|---|---|---|
| DMTD (layer-reuse, Qwen3-4B) (Luo et al., 13 Oct 2025) | Up to 2.15× ($\tau=4$) | ≤3.7% ($\tau=4$ vs. vanilla) | $\tau>4$: quality degrades | No auxiliary models needed |
| Multi-token head (distilled) (Kirchenbauer et al., 5 Feb 2026) | 2–5× (eff. $k\approx$ 3–3.5) | <5% (ConfAdapt) | High entropy → smaller $k$ | Masking, no verifier |
| MTJD (joint, intractable) (Qin et al., 2024) | Not practical | Best PPL | — | — |
| MTAD (assisted, Llama-2-13B) (Qin et al., 2024) | 2.2–2.8× | 20–30% lower PPL | $B=8$, $\tau=0.1$ | Outperforms speculative |
| Probabilistic Circuit (EvaByte) (Grivas et al., 14 Nov 2025) | 4.5–5.1× | No loss (verified) | $n$=8–16, $r$=32 | Best acceptance rates (LoRA) |
| GRIFFIN (7B–70B LLMs) (Hu et al., 16 Feb 2025) | 3.1–4.5× | Lossless | $\tau\approx$ 5–6 | Accept rates +13% vs. prior |

A core limitation is that as block (cycle) size increases, coherence and fidelity degrade on current-scale models or with limited fine-tuning; τ=3–4 is sustainable with <5% loss in most settings. Larger models exhibit better anticipatory capacity, extending τ at high quality (Luo et al., 13 Oct 2025, Hu et al., 16 Feb 2025). Self-distillation and alignment-centric approaches (e.g., GRIFFIN) improve acceptance and speed, especially in deep speculative protocols.

6. Model Architecture, Training, and Expressiveness

DMTD can be realized via several architectural and training variants:

  • Transformer Layer-Reuse: Partitioning layers and cyclically masking input positions allows fine-tuning any decoder-only model for DMTD without architectural changes. No extra parameters are introduced (Luo et al., 13 Oct 2025).
  • Multi-Token Output Heads: Simple instantiation via $k$ independent heads or a single blockwise head. In codec models, $K$ output projections predict $K$ future tokens in parallel with one hidden encoding step (Nguyen et al., 2024).
  • Probabilistic Circuits (PCs): PCs encompass product-form, mixture, HMM, and tree-structured models for blockwise joint prediction, parameterized as computational graphs over token sequences, and can be combined with LoRA adapters for parameter-efficient fine-tuning (Grivas et al., 14 Nov 2025).
  • Token-Alignable Draft/Training (GRIFFIN): Loss masking prohibits learning on highly misaligned tokens, and architectural modules (TGF, TEH) inject token information to reduce context drift, empirically increasing block acceptance length and speed (Hu et al., 16 Feb 2025).
  • Online Self-Distillation: Masks and batched blocks during fine-tuning train the model to recover the chain-rule block distribution under the original AR head, resulting in robust multi-token block emission (Kirchenbauer et al., 5 Feb 2026).
  • Audio Codecs: Multi-head outputs on a shared encoder, plus a lightweight Viterbi pass to correct dependencies, enable a 4–5× speed-up in speech decoding with no perceptual loss (Nguyen et al., 2024).
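The $K$-head parameterization can be sketched in a few lines. The weights and 2-dimensional hidden state below are toy values for illustration; a real head is a learned vocab-by-hidden projection per future step:

```python
def multi_head_predict(hidden, heads):
    """K independent output heads: each head is a (vocab x d) matrix;
    one hidden vector yields K future-token predictions in parallel."""
    preds = []
    for W in heads:  # one projection per future step t+1, ..., t+K
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in W]
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds

hidden = [1.0, -0.5]
heads = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],    # head for step t+1
    [[-1.0, 0.0], [0.0, -1.0], [0.2, 0.2]],  # head for step t+2
]
preds = multi_head_predict(hidden, heads)  # two tokens from one pass
```

Because the heads condition on the same hidden state and not on each other, their predictions are independent given the context, which is exactly the first-order dependence the Viterbi correction restores in the codec setting.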

7. Limitations, Variants, and Future Directions

Principal bottlenecks for DMTD include bounded context anticipation in current model scales (τ > 4–6 degrades rapidly without massive fine-tuning), increased engineering complexity from cache refilling and block masking, and—in speculative or verification-based variants—the additional training or memory budgets for draft/assistant models.

Looking forward, DMTD frameworks collectively suggest new architectural principles for LLMs: co-training for multi-token windows, native multi-block heads, and tractable blockwise circuits could yield significant gains in both throughput and single-token fidelity as LLMs scale further (Luo et al., 13 Oct 2025, Grivas et al., 14 Nov 2025, Hu et al., 16 Feb 2025).
