Leap Multi-Token Prediction (L-MTP)
- L-MTP is a family of sequence modeling techniques that predict multiple, potentially non-adjacent future tokens per step, broadening contextual coverage and mitigating autoregressive bottlenecks.
- It integrates leap heads, backward-filling decoders, and gated LoRA adapters to efficiently generate tokens and reduce error accumulation.
- Empirical results demonstrate significant speed-ups and improved performance on benchmarks in code, math, and speech, highlighting its practical advantages.
Leap Multi-Token Prediction (L-MTP) is a family of sequence modeling techniques that generalize next-token prediction and conventional multi-token prediction (MTP) by enabling models to predict multiple, potentially non-adjacent future tokens in each forward pass. L-MTP combines architectural modifications, alternate supervision objectives, and specialized decoding strategies to overcome the inherent sequential bottlenecks and short-horizon myopia of autoregressive LLMs. By strategically "leaping" over intermediate positions during both training and inference, L-MTP methods promote broader contextual coverage, accelerate generation, and can enhance long-range reasoning, algorithmic generalization, and creative planning capability.
1. Principle and Taxonomy
L-MTP extends the blockwise multi-token prediction paradigm by allowing each prediction head to target a distinct, non-adjacent token offset from the input context: for a shared trunk encoding $h_t$ of the input $x_{\le t}$, the $k$-th head outputs $p_\theta(x_{t+1+(k-1)s} \mid x_{\le t})$, where $s$ is the "leap stride" (Liu et al., 23 May 2025). Standard MTP corresponds to the special case $s = 1$. Compare next-token prediction (NTP), which is strictly sequential (one token per pass), and blockwise MTP, which generates $n$ adjacent tokens per step. L-MTP generalizes both to non-sequential positions, widening context exposure (see the sketch after the table below).
Table: Token targets for various paradigms
| Method | Token positions per step | Typical stride ($s$) |
|---|---|---|
| NTP | $x_{t+1}$ (1 token) | — |
| MTP | $x_{t+1}, \dots, x_{t+n}$ ($n$ adjacent tokens) | $s = 1$ |
| L-MTP | $x_{t+1}, x_{t+1+s}, \dots, x_{t+1+(n-1)s}$ ($n$ strided tokens) | $s > 1$ |
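To make the taxonomy concrete, here is a minimal Python sketch (ours, not code from the cited papers) that enumerates the positions each paradigm supervises in one step, using the offset convention $t + 1 + (k-1)s$ from above:

```python
def target_positions(t: int, n: int, s: int) -> list[int]:
    """Positions supervised from context x_{<=t}: t+1, t+1+s, ..., t+1+(n-1)s."""
    return [t + 1 + k * s for k in range(n)]

print(target_positions(t=0, n=1, s=1))  # NTP:   [1]
print(target_positions(t=0, n=4, s=1))  # MTP:   [1, 2, 3, 4]
print(target_positions(t=0, n=4, s=2))  # L-MTP: [1, 3, 5, 7]
```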
This leap-based mechanism offers both theoretical and empirical advantages for speeding up inference, uncovering longer-range dependencies, and mitigating error accumulation (Liu et al., 23 May 2025, Samragh et al., 16 Jul 2025, Gloeckle et al., 30 Apr 2024).
2. Architectural Variants
Core L-MTP instantiations employ a shared Transformer block ("trunk") with multiple output heads (a minimal sketch follows the list):
- Leap heads: Each head is configured to predict a non-adjacent token offset, determined by the stride $s$.
- Backward-filling cache: Non-leap positions are filled from cached previous predictions, using a backward-filling schedule (Liu et al., 23 May 2025).
- Masked token framing (for speculative decoding): Introduce learned mask embeddings appended to each sequence, enabling autoregressive LLMs to predict several future tokens in parallel (Samragh et al., 16 Jul 2025).
- Gated low-rank adapters (LoRA): During fine-tuning, gated LoRA adapters activate for masked ("future") positions only, preserving the backbone's NTP pathway (Samragh et al., 16 Jul 2025).
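The following PyTorch sketch illustrates the trunk-plus-leap-heads layout described above. All sizes, the encoder configuration, and the class name `LeapHeadModel` are illustrative assumptions rather than any cited paper's implementation:

```python
import torch
import torch.nn as nn

class LeapHeadModel(nn.Module):
    """Minimal sketch of an L-MTP architecture: a shared causal trunk
    feeds n output heads, where head k (0-indexed) targets offset
    1 + k*s. All names and sizes here are illustrative assumptions."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=4, stride=2):
        super().__init__()
        self.stride = stride
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One unembedding head per leap offset.
        self.leap_heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_heads)
        )

    def forward(self, input_ids: torch.Tensor) -> list[torch.Tensor]:
        x = self.embed(input_ids)
        # Causal mask keeps the trunk autoregressive over the prefix.
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.trunk(x, mask=causal)  # shared trunk encoding
        # logits[k][b, t] predicts the token at position t + 1 + k*stride.
        return [head(h) for head in self.leap_heads]
```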
In speech modeling, MTP (and by extension L-MTP) employs stacks of causal Transformer layers, each with its own hidden state and output head, preserving temporal dependencies between the predicted tokens (Wang et al., 5 Apr 2025).
3. Objective Functions and Training Schedules
L-MTP generally adopts a multi-token cross-entropy loss targeting the leap positions:

$$\mathcal{L}_{\text{L-MTP}} = -\sum_{t} \sum_{k=1}^{n} \log p_\theta\big(x_{t+1+(k-1)s} \mid x_{\le t}\big)$$
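A minimal sketch of this objective, assuming 0-indexed heads (so head $k$ in code targets the paper-style offset $1+(k-1)s$); the function and its signature are illustrative, not from the cited paper:

```python
import torch.nn.functional as F

def leap_mtp_loss(logits_per_head, input_ids, stride=2):
    """Leap cross-entropy sketch: head k (0-indexed) at position t is
    supervised by the token at t + 1 + k*stride; positions past the
    sequence end are dropped. `logits_per_head` is a list of
    [batch, seq, vocab] tensors; `input_ids` is [batch, seq]."""
    total, count = 0.0, 0
    seq_len = input_ids.size(1)
    for k, logits in enumerate(logits_per_head):
        offset = 1 + k * stride
        if offset >= seq_len:
            break
        # Align: prediction at position t <- target token at t + offset.
        pred = logits[:, : seq_len - offset]
        tgt = input_ids[:, offset:]
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), tgt.reshape(-1)
        )
        count += 1
    return total / max(count, 1)
```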
Recent implementations utilize two-stage training:
- Head warm-up: Freeze backbone, train new leap heads via self-distillation.
- Full model tuning: Jointly update the backbone and heads, balancing next-token and multi-leap losses via a weighting hyperparameter (Liu et al., 23 May 2025); a schematic of the two stages follows.
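A sketch of the two-stage schedule, assuming a model exposing `trunk` and `leap_heads` submodules as in the architecture sketch above; the loss-mixing term and parameter names are assumptions:

```python
import torch.nn as nn

def warmup_phase(model: nn.Module) -> list[nn.Parameter]:
    """Stage 1 (head warm-up): freeze the trunk; only the new leap heads
    train, typically against self-distilled targets from the backbone."""
    for p in model.trunk.parameters():
        p.requires_grad = False
    return [p for head in model.leap_heads for p in head.parameters()]

def full_tuning_phase(model: nn.Module) -> list[nn.Parameter]:
    """Stage 2 (full tuning): unfreeze everything; the training loss then
    mixes next-token and leap losses, e.g. ntp_loss + lam * leap_loss,
    where `lam` stands for the weighting hyperparameter (name assumed)."""
    for p in model.parameters():
        p.requires_grad = True
    return list(model.parameters())
```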
Mask-based L-MTP may also use auxiliary objectives:
- Latent Consistency Matching (LCM): Enforces alignment between leap-token hidden states and autoregressive references.
- Loss weighting: Decay the loss contribution for more distant leap targets, e.g., via exponentially decaying head weights (see the sketch below) (Wang et al., 5 Apr 2025).
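A tiny illustrative helper for such weighting; the decay factor $\gamma = 0.8$ and the exact form $w_k = \gamma^k$ are assumptions, as the cited paper's precise schedule may differ:

```python
import torch

def decayed_head_weights(n_heads: int, gamma: float = 0.8) -> torch.Tensor:
    """Hypothetical schedule: weight head k by gamma**k so more distant
    leap targets contribute less (gamma and the form are assumptions)."""
    return torch.tensor([gamma ** k for k in range(n_heads)])

print(decayed_head_weights(4))  # tensor([1.0000, 0.8000, 0.6400, 0.5120])
```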
Alternatives such as teacherless multi-token training (global objective over non-autoregressive prefixes) and discrete diffusion (reverse denoising over entire output sequence) amplify diversity and planning capabilities (Nagarajan et al., 21 Apr 2025).
4. Decoding Strategies and Inference Acceleration
L-MTP unlocks throughput gains by widening the prediction horizon:
- Backward-filling decoding: After generating leap tokens, the skipped positions are backfilled from cached predictions of earlier steps, reducing redundant forward passes (Liu et al., 23 May 2025); a toy simulation follows this list.
- Speculative blockwise decoding: Model outputs candidate tokens; each speculative token is verified via autoregressive comparison. Advanced variants deploy quadratic decoding trees to maintain high acceptance rates (Samragh et al., 16 Jul 2025).
- Streaming chunked attention: In speech settings, each MTP/L-MTP pass attends only to a bounded history window, enabling real-time generation (Wang et al., 5 Apr 2025).
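As a toy simulation of why strided heads can tile the skipped positions across consecutive steps: with stride $s=2$, one pass proposes odd offsets and the next pass (whose context is one token longer) proposes the interleaving even offsets. The exact caching and verification schedule of backward filling in (Liu et al., 23 May 2025) may differ from this simplification:

```python
def covered_offsets(n: int, s: int) -> list[int]:
    """Offsets (relative to position t) jointly covered by two consecutive
    L-MTP passes: pass A predicts 1, 1+s, ..., 1+(n-1)s, and pass B, whose
    context is one token longer, predicts the same pattern shifted by one."""
    pass_a = [1 + k * s for k in range(n)]
    pass_b = [o + 1 for o in pass_a]
    return sorted(set(pass_a) | set(pass_b))

print(covered_offsets(4, 2))  # [1, 2, 3, 4, 5, 6, 7, 8]
```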
Theoretical analysis shows that, under suitable attenuation and consistency assumptions, L-MTP's accepted token length per iteration dominates that of vanilla MTP as the prediction horizon grows (explicit theorem in (Liu et al., 23 May 2025)).
Empirical speed-up grows with the stride and the number of heads; for instance, $n = 4$ heads with stride $s = 2$ cover offsets $t+1$ through $t+7$, yielding $7$ tokens/step once gaps are backfilled (versus $1$ per step for NTP and $4$ for vanilla MTP) (Liu et al., 23 May 2025); mask-based heads likewise deliver multi-fold throughput gains over plain autoregressive decoding (Samragh et al., 16 Jul 2025).
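The per-step yield arithmetic behind that example, as a one-line helper (idealized: it assumes every speculative token is accepted):

```python
def tokens_per_step(n_heads: int, stride: int) -> int:
    """Idealized per-iteration yield once gaps are backfilled: one pass's
    leap targets span offsets 1 .. 1+(n-1)*s, a block of 1+(n-1)*s
    contiguous tokens."""
    return 1 + (n_heads - 1) * stride

print(tokens_per_step(4, 2))  # 7  -> the 7 tokens/step cited above
print(tokens_per_step(4, 1))  # 4  -> vanilla MTP block size
```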
5. Empirical Performance and Analysis
L-MTP consistently matches or outperforms NTP and adjacent-block MTP baselines on code, mathematics, general knowledge, and speech benchmarks (Liu et al., 23 May 2025, Samragh et al., 16 Jul 2025, Gloeckle et al., 30 Apr 2024, Wang et al., 5 Apr 2025). Representative results:
| Method | Benchmark | Accuracy/Pass@1 (Llama 3.2-3B) |
|---|---|---|
| NTP | GSM8K | 3.71 |
| MTP | GSM8K | 3.87 |
| L-MTP | GSM8K | 5.91 |
- In code/math tasks: L-MTP achieves decoding speed-ups of $1.5\times$ and above with no quality regression (Samragh et al., 16 Jul 2025, Gloeckle et al., 30 Apr 2024).
- Algorithmic generalization: Multi-token methods facilitate induction heads, improve in-context reasoning, and substantially enhance creative planning and global latent variable resolution in synthetic tasks (Gloeckle et al., 30 Apr 2024, Nagarajan et al., 21 Apr 2025).
- Speech generation: Three- to five-fold speed-up with negligible degradation in WER and MOS (Wang et al., 5 Apr 2025).
- Creativity and diversity: Teacherless or diffusion-style L-MTP plus seed-conditioning yields a marked boost in algorithmic novelty and reduces memorization compared to standard NTP (Nagarajan et al., 21 Apr 2025).
6. Practical Considerations and Limitations
- Model size and overhead: Adding leap/extra heads increases size slightly, requiring careful parameter balancing.
- Stride/head selection ($s$, $n$): A wider leap improves speed, but an excessive stride attenuates prediction confidence. Empirically, a moderate stride (e.g., the $s = 2$ used in the example above) balances accuracy and signal strength (Liu et al., 23 May 2025).
- Acceptance rate vs. horizon: Speculative acceptance drops for long leap horizons or unpredictable text; in such creative settings the method falls back to standard AR calls (Samragh et al., 16 Jul 2025).
- Training requirements: Head warm-up and sufficient self-distilled data are necessary for leap heads to function reliably.
- Streaming or real-time generation: Chunked attention masks support latency-sensitive tasks (Wang et al., 5 Apr 2025).
- Applicability: Benefits scale with model capacity and dataset size; gains are marginal on pure multiple-choice benchmarks.
7. Extensions, Future Directions, and Context
Research highlights several avenues for advancing L-MTP:
- Adaptive stride/head scheduling: Dynamically choose leap parameters based on prediction confidence or entropy (Liu et al., 23 May 2025).
- Integration with RLHF: Align leap decisions with end-task rewards.
- Compression: Combine with quantization/pruning/mixture-of-experts to mitigate head overhead.
- Pretraining with leap objectives: Models trained from scratch may internalize broader planning (Samragh et al., 16 Jul 2025).
- Seed-conditioning for creative diversity: Conditioning on a hashed input seed (injected input noise) enables coherent planning and high diversity, surpassing temperature sampling (Nagarajan et al., 21 Apr 2025).
- Non-autoregressive and diffusion hybrids: Global sequence or iterative denoising further break the AR bottleneck (Nagarajan et al., 21 Apr 2025).
L-MTP is situated among a broader ecosystem of alternatives to next-token prediction, including plan-then-generate, latent reasoning, continuous generation, and non-Transformer architectures (Wyatt et al., 29 Sep 2025). Collectively, these trends reflect an emerging consensus that sequential, strictly local generation is insufficient for scalable, efficient, and creative language modeling.
References:
- "L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for LLMs" (Liu et al., 23 May 2025)
- "Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential" (Samragh et al., 16 Jul 2025)
- "VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation" (Wang et al., 5 Apr 2025)
- "Better & Faster LLMs via Multi-token Prediction" (Gloeckle et al., 30 Apr 2024)
- "Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction" (Nagarajan et al., 21 Apr 2025)
- "Alternatives To Next Token Prediction In Text Generation - A Survey" (Wyatt et al., 29 Sep 2025)