Next-Token-Pair Prediction (NTPP)

Updated 25 January 2026

NTPP is a generative modeling paradigm that predicts token pairs concurrently, enhancing inference speed and parallelism over traditional autoregressive methods.
It utilizes masked input, gated LoRA adaptation, and specialized dual prediction heads to facilitate both text-based processing and dual-channel speech modeling with rigorous mathematical underpinnings.
Empirical results demonstrate significant throughput improvements and lower latencies, though challenges remain in matching the precision of marginalization-based approaches.

Next-Token-Pair Prediction (NTPP) is a generative modeling paradigm that predicts two tokens simultaneously from a given context, rather than the sequential single-token prediction of traditional autoregressive LLMs. NTPP is applied both to text-based LLMs and to dual-channel speech modeling, providing a pathway toward accelerated inference, improved parallelism, and richer, speaker-independent conversational dynamics. Its mathematical formulation, architecture, and training procedure are specified precisely, which allows rigorous evaluation and benchmarking in both domains (Samragh et al., 16 Jul 2025, Mehra et al., 13 Feb 2025, Wang et al., 1 Jun 2025).

1. Mathematical Foundation and Objective Formulation

NTPP formalizes the prediction of token pairs $(x_{t+1}, x_{t+2})$ given a context $\mathcal{X}_{\leq t}$ via the joint probability: $p(x_{t+1},x_{t+2} \mid \mathcal{X}_{\leq t};\theta) = p(x_{t+1} \mid \mathcal{X}_{\leq t};\theta)\;p(x_{t+2} \mid \mathcal{X}_{\leq t},x_{t+1};\theta)$. This generalizes to $K$ tokens by recursively marginalizing over all intermediate tokens, as in

$p(\mathcal{X}_{t:t+K} \mid \mathcal{X}_{\leq t}; \theta) = p(x_{t+1}\mid \mathcal{X}_{\leq t}) \prod_{k=2}^{K} \sum_{s_{1:k-1}\in \mathbb{V}^{k-1}} p\bigl(x_{t+k},s_{1:k-1}\mid \mathcal{X}_{\leq t}\bigr)$

(Mehra et al., 13 Feb 2025).
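The pair factorization above can be checked numerically on a toy example. The sketch below is illustrative only: `VOCAB` and `p_next` are hypothetical stand-ins for a real vocabulary and a real model's next-token distribution, and the toy rule (favor repeating the last token) is an assumption made purely so the numbers are easy to follow.

```python
# Toy vocabulary; a real model would have tens of thousands of tokens.
VOCAB = ["a", "b", "c"]

def p_next(context):
    """Hypothetical next-token distribution p(x | context) over VOCAB.

    Stand-in for a model forward pass: the last context token gets weight 3,
    every other token gets weight 1.
    """
    last = context[-1]
    weights = [3.0 if tok == last else 1.0 for tok in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def pair_prob(context, x1, x2):
    """Joint p(x1, x2 | context) = p(x1 | context) * p(x2 | context, x1)."""
    return p_next(context)[x1] * p_next(context + [x1])[x2]

context = ["a", "b"]
# Enumerate the joint distribution over all |V|^2 token pairs.
joint = {(x1, x2): pair_prob(context, x1, x2) for x1 in VOCAB for x2 in VOCAB}
total_mass = sum(joint.values())       # a valid joint sums to 1
best_pair = max(joint, key=joint.get)  # most likely pair under the toy model
```

Because each factor is a normalized distribution, the joint over all pairs sums to one, which is a quick sanity check for any pair-prediction head.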

For dual-channel speech language models (SLMs), NTPP is formally defined over paired token streams $(s_t^{(A)}, s_t^{(B)})$, factorizing as

$p(S^{(A)}, S^{(B)}) = \prod_{t=1}^T p(s_t^{(A)}, s_t^{(B)} \mid s_{<t}^{(A)}, s_{<t}^{(B)}; \theta)$

with conditional independence at each time step: $p(s_t^{(A)}, s_t^{(B)} \mid \text{context}) = p(s_t^{(A)} | \text{context}) \times p(s_t^{(B)} | \text{context})$ and the training objective

$L_{NTPP}(\theta) = -\sum_{t=1}^T \left[ \log p(s_t^{(A)} | \cdots) + \log p(s_t^{(B)} | \cdots) \right]$

(Wang et al., 1 Jun 2025).
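As a minimal sketch of the objective above, the following computes the NTPP loss from per-step logits for the two channels. The function names and the plain-list logit format are assumptions chosen for illustration; a real implementation would use batched tensors and a framework's cross-entropy routine.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def ntpp_loss(logits_a, logits_b, targets_a, targets_b):
    """Sum over time steps of -[log p(s_t^A | ctx) + log p(s_t^B | ctx)].

    logits_a[t] and logits_b[t] are the two heads' outputs at step t, both
    computed from the same shared context (conditional independence).
    """
    loss = 0.0
    for la, lb, ta, tb in zip(logits_a, logits_b, targets_a, targets_b):
        loss -= log_softmax(la)[ta] + log_softmax(lb)[tb]
    return loss

# Two time steps, vocabulary of 4, uniform logits for both channels.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
example_loss = ntpp_loss(uniform, uniform, [0, 1], [2, 3])
```

With uniform predictions every term contributes $\log 4$, so the example loss equals $4 \log 4$, one term per channel per time step.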

2. Architectural Strategies for Text-Based LLMs

In text-based LLMs, NTPP is instantiated via a masked-input formulation combined with specialized prediction heads and gated LoRA adaptation (Samragh et al., 16 Jul 2025). The procedure includes:

Masked Input: Augment the context $X = [x_1, \ldots, x_n]$ by appending $k$ mask tokens $[m_1, \ldots, m_k]$, producing $X_m = [x_1, \ldots, x_n, m_1, \ldots, m_k]$.
Frozen-Base Transformer: The base model's parameters are frozen during fine-tuning.
Gated LoRA Modification: Low-rank adapters $(A, B)$ are inserted in parallel to each linear layer, gated such that adaptation occurs only at mask positions. The output is

$y_t = W x_t + g_t\, B A x_t$

where $W$ is the frozen base weight and $g_t$ is $1$ if position $t$ is a mask, and $0$ otherwise.

Token Prediction Heads: Split into the base unembedding head (classic next-token prediction) and a sampler MLP head for the mask positions, which conditions on both the hidden representation at the mask position and the embedding of the previously sampled token.

Each head produces its own per-token output distribution and loss: a base loss for the unembedding head and a sampler loss for the sampler head, with auxiliary latent consistency matching (LCM) to align representations.
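The gating idea in this section (a low-rank update that is active only at mask positions, leaving the frozen base path untouched elsewhere) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the helper names, the plain-list matrices, and the absence of a LoRA scaling factor are all assumptions.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gated_lora_linear(W, A, B, x, is_mask):
    """Return W x + g * B (A x), with g = 1 at mask positions and 0 otherwise.

    W is the frozen base weight; A (r x d_in) and B (d_out x r) are the
    low-rank adapter factors. Ordinary tokens skip the adapter entirely, so
    their outputs match the original model exactly.
    """
    base = matvec(W, x)
    if not is_mask:
        return base
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity for clarity)
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[0.5], [-0.5]]           # rank-1 up-projection
x = [1.0, 2.0]

y_plain = gated_lora_linear(W, A, B, x, is_mask=False)
y_mask = gated_lora_linear(W, A, B, x, is_mask=True)
```

The point of the gate is that next-token behavior on regular positions is preserved by construction, since the adapter contributes nothing there.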

3. Marginalization and MTP Heads in Pretrained Models

Marginalization computes the exact distribution of a future token by summing over all intermediate token candidates: $p(x_{t+2} \mid \mathcal{X}_{\leq t};\theta) = \sum_{x_{t+1} \in \mathbb{V}} p(x_{t+1} \mid \mathcal{X}_{\leq t};\theta)\;p(x_{t+2} \mid \mathcal{X}_{\leq t},x_{t+1};\theta)$ and is the baseline for multi-token prediction quality (Mehra et al., 13 Feb 2025). For practical amortization, models append “MTP heads” (parallel transformer layers dedicated to future tokens) on top of the frozen backbone. Each head predicts $x_{t+k}$ via

$p(x_{t+k} \mid \mathcal{X}_{\leq t}) = \mathrm{softmax}\bigl(U\,H_k(z_t)\bigr)$

where $H_k$ is the $k$-th replicated final layer, $U$ the shared unembedding, and $z_t$ the backbone hidden state at position $t$.
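Marginalization over intermediate token candidates can be illustrated with a toy next-token model. In the sketch below, `p_next` is a hypothetical stand-in for a model forward pass (the repeat-the-last-token rule is an assumption for readability), and the `calls` counter is included only to make the cost visible.

```python
VOCAB = ["a", "b", "c"]
calls = {"forward": 0}  # counts simulated forward passes

def p_next(context):
    """Hypothetical next-token distribution; stands in for a model call."""
    calls["forward"] += 1
    last = context[-1]
    weights = [3.0 if tok == last else 1.0 for tok in VOCAB]
    total = sum(weights)
    return {tok: w / total for tok, w in zip(VOCAB, weights)}

def marginal_second(context):
    """p(x_{t+2} | ctx) = sum over x_{t+1} of p(x_{t+1} | ctx) p(x_{t+2} | ctx, x_{t+1})."""
    first = p_next(context)
    out = {tok: 0.0 for tok in VOCAB}
    for x1, p1 in first.items():
        # One extra forward pass per intermediate candidate.
        second = p_next(context + [x1])
        for x2, p2 in second.items():
            out[x2] += p1 * p2
    return out

marginal = marginal_second(["a", "b"])
```

The toy run needs $1 + |\mathbb{V}|$ forward passes for a single marginal, which is the cost that amortized MTP heads try to avoid.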

Empirical findings indicate that while such MTP heads increase throughput, performance lags behind marginalization, especially when the backbone is strongly specialized for NTP. Joint training (with LoRA and weighted hidden states) narrows the gap but does not close it (Mehra et al., 13 Feb 2025).

4. Training Workflow and Loss Schemes

Fine-tuning protocols for text-based NTPP introduce several coordinated loss terms (Samragh et al., 16 Jul 2025):

Base loss on all tokens (unembedding head).
Sampler loss on mask tokens (sampler MLP).
Latent Consistency Matching (LCM) loss, which aligns masked representations with their autoregressive equivalents.

The overall objective combines these three terms.

For MTP head-based approaches, the per-head losses are summed, and differential learning rates (heads trained at a different rate from the backbone) together with head warmup protocols are used to maintain stability and balance between prediction heads (Mehra et al., 13 Feb 2025).

5. Decoding Strategies and Inference Efficiency

Speculative decoding with quadratic expansion enables NTPP to maximize inference throughput (Samragh et al., 16 Jul 2025):

Beginning from verified history, two mask tokens are appended and processed to predict both verified and speculative tokens.
Quadratic verification interleaves new masks after each speculative token, preventing depletion of viable sequence branches.
Acceptance is high enough in practice to yield speedups of roughly 5× on code/math and 2.5× on chat for Tulu3-8B on standard benchmarks.

Marginalization-based approaches in next-token models, while exact, incur a significant computational overhead and scale linearly with vocabulary candidates, restricting practical acceleration unless batched with further speculative strategies (Mehra et al., 13 Feb 2025).
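The verification step behind speculative decoding can be sketched in a few lines. This is a simplified linear verifier under greedy decoding, not the quadratic expansion scheme described above; `base_next_token` is a hypothetical stand-in for one greedy step of the base model, and a real system would verify all drafted positions in a single batched forward pass.

```python
def verify_draft(context, draft, base_next_token):
    """Accept the longest prefix of `draft` that the base model would also emit.

    On the first mismatch, the base model's own token is kept instead and
    verification stops, so every call makes progress by at least one token
    whenever `draft` is non-empty.
    """
    accepted = []
    for tok in draft:
        expected = base_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted

def toy_base(context):
    """Toy 'model': the next token is the previous token plus one, modulo 10."""
    return (context[-1] + 1) % 10

partial = verify_draft([1, 2], [3, 4, 9, 6], toy_base)  # third draft token is wrong
full = verify_draft([1, 2], [3, 4], toy_base)           # whole draft is accepted
```

The number of tokens accepted per verification call is what drives the throughput gain: the more drafted tokens survive, the fewer base-model steps are needed per generated token.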

6. Extensions to Dual-Channel Speech Language Modeling

NTPP is also applied to dual-channel speech dialogue modeling (Wang et al., 1 Jun 2025):

Input: Two continuous speech streams (A, B) are quantized via VQ or RVQ into discrete token sequences.
Architecture: Decoder-only transformers process interleaved token pairs, using rotary positional and channel embeddings. In RVQ, cyclic depth embeddings are added.
Attention masking ensures each pair is processed independently at each time step, preventing cross-attention between speakers.
Training: A two-stage protocol—pretraining with next-token prediction, then fine-tuning with NTPP—on large-scale single-channel speech followed by paired conversational corpora (e.g., Fisher).
Inference: NTPP uses a single KVCache, maintaining sublinear inference latency and outperforming cascaded ASR→LM→TTS approaches and dual-cache models on turn-taking statistics (lower disruptions, more natural overlaps).
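The interleaving and pair-wise masking described in this list can be sketched as follows. The exact mask used by the authors is not given here, so the rule below is an assumption: each position attends to itself and to all tokens from earlier time steps on both channels, but not to the other channel's token from the same step. Function names are hypothetical.

```python
def interleave(stream_a, stream_b):
    """Build [a_1, b_1, a_2, b_2, ...] from two equal-length token streams."""
    if len(stream_a) != len(stream_b):
        raise ValueError("both channels must have the same number of steps")
    out = []
    for a, b in zip(stream_a, stream_b):
        out.extend([a, b])
    return out

def paired_causal_mask(num_steps):
    """mask[i][j] is True when position i may attend to position j.

    Positions 2t and 2t+1 hold the channel-A and channel-B tokens of step t.
    """
    n = 2 * num_steps
    return [[(j // 2 < i // 2) or (i == j) for j in range(n)] for i in range(n)]

tokens = interleave([11, 12], [21, 22])
mask = paired_causal_mask(2)
```

Because both channels share one interleaved sequence, a single key-value cache covers the whole dialogue, and the mask keeps the two predictions at each step independent given the shared history.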

7. Benchmark Results, Comparisons, and Practical Considerations

Empirical results on NTPP-via-masked-input for Tulu3-8B in text generation tasks (Samragh et al., 16 Jul 2025):

Task (k=8)	Acceptance Rate (Speedup)
Math	5.22×
Code	5.35×
Chat	2.52×
Knowledge	∼2.38×

Ablations demonstrate additive gains from linear decoding, quadratic expansion, sampler head, and LCM loss components. Even low-rank LoRA modules still deliver a speedup.

For speech, turn-taking and inference efficiency metrics (per minute; NTPP vs. baselines):

Model	#IPU	#Pause	dur_IPU	dur_Pause
NTPP (T=0.5)	1.5	1.9	2.9	3.0
dGSLM	1.6	3.4	4.6	3.6
LSLM	2.2	3.6	4.1	3.4
Cascaded	4.1	7.0	4.3	5.5

NTPP achieves lower pause/gap rates and maintains sub-220 ms latency across multi-turn dialogues.

Across implementations:

NTPP introduces minimal overhead (e.g., two mask tokens, sampler head, gated LoRA modules).
All code and hyperparameters are PyTorch/LoRA compatible.
Speech models benefit from VAD-free turn-taking and robust speaker independence, enabled by the paired causal mask and unified KVCache.

8. Limitations, Open Questions, and Future Directions

Pretrained next-token models are strongly specialized for autoregressive targets; adapting them for multi-token prediction (MTP/NTPP) incurs performance degradation compared to theoretically exact marginalization (Mehra et al., 13 Feb 2025). Even joint LoRA training with weighted hidden states leaves a gap, suggesting that fully multi-token pretraining or deeper architectural changes may be required for optimal pairwise prediction.

For text-based models, extension to $k > 2$ tokens is straightforward in notation and implementation, although quadratic speculative decoding trees will require careful resource management. In speech modeling, integrating NTPP into end-to-end SLMs promises further latency reductions and improved turn-taking, but challenges remain in scaling, multilingual compatibility, and long-context alignment.

A plausible implication is that NTPP will become central for efficient generative modeling in future multimodal dialog agents and real-time inference engines, contingent on further advances in backbone adaptation and parallel decoding techniques.

References (3)

Samragh et al. (16 Jul 2025). Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential.

Mehra et al. (13 Feb 2025). On multi-token prediction for efficient LLM inference.

Wang et al. (1 Jun 2025). NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction.
