- The paper demonstrates that TOP, used as an auxiliary objective alongside next-token prediction, improves language modeling by training the model to rank upcoming tokens by proximity, outperforming both plain next-token prediction (NTP) and multi-token prediction (MTP).
- The architecture change is minimal: a single additional unembedding layer, with a custom Triton kernel constructing TOP targets on the fly, keeping training efficient and scalable.
- Experimental results reveal higher benchmark scores and lower perplexity across datasets, indicating enhanced generalization and parameter efficiency.
Token Order Prediction (TOP): A Scalable Auxiliary Objective for Language Modeling
Introduction and Motivation
The paper introduces Token Order Prediction (TOP) as a novel auxiliary objective for LLM pretraining, addressing the limitations of Multi-Token Prediction (MTP). While MTP augments next-token prediction (NTP) with additional heads to predict future tokens, it suffers from scalability issues, inconsistent improvements on standard NLP benchmarks, and increased architectural complexity. The authors hypothesize that the difficulty of exact future token prediction in MTP impedes its effectiveness, especially for small models and large look-ahead windows. TOP is proposed as a more tractable alternative, leveraging a learning-to-rank loss to order upcoming tokens by proximity, thus relaxing the prediction task and reducing architectural overhead.
Figure 1: An overview of Token Order Prediction (TOP), illustrating the construction of the TOP target sequence and the dual unembedding heads for NTP and TOP.
Methodology
TOP constructs a target sequence for each input position, assigning a score to each vocabulary token based on the proximity of its first appearance within a fixed future window. The scoring function is implemented efficiently via a custom Triton kernel, enabling on-the-fly target generation with negligible overhead. The model architecture is minimally modified: a single additional linear unembedding layer (TOP head) is appended in parallel to the NTP head, both operating on the final transformer hidden state.
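The paper implements this scoring with a fused Triton kernel; as a rough illustration of the idea only, a plain PyTorch sketch is given below. The helper name `build_top_targets` and the exact scoring scheme (linear decay with distance) are assumptions for illustration, not the authors' implementation.

```python
import torch

def build_top_targets(tokens: torch.Tensor, vocab_size: int, window: int) -> torch.Tensor:
    """Toy construction of TOP proximity targets for a 1-D token sequence.

    For each position t, a vocabulary token whose first appearance is d steps
    ahead (1 <= d <= window) gets score window - d + 1; tokens that do not
    appear in the window get score 0. The paper's exact scoring function and
    its fused Triton kernel may differ from this reference loop.
    """
    seq_len = tokens.shape[0]
    targets = torch.zeros(seq_len, vocab_size)
    for t in range(seq_len):
        seen = set()
        for d in range(1, window + 1):
            if t + d >= seq_len:
                break
            tok = int(tokens[t + d])
            if tok not in seen:                   # only the first appearance counts
                seen.add(tok)
                targets[t, tok] = window - d + 1  # closer tokens receive higher scores
    return targets
```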
The TOP loss is defined using a listwise learning-to-rank objective, specifically the ListNet loss:
$$\mathcal{L}_{\mathrm{TOP}} = -\sum_{t=0}^{T} \mathrm{softmax}(y_t) \cdot \log\!\left(\mathrm{softmax}\!\left(u_{\mathrm{TOP}}(h_t^{L})\right)\right)$$

where $y_t$ is the proximity score vector for position $t$, $u_{\mathrm{TOP}}$ is the TOP unembedding layer, and $h_t^{L}$ is the final hidden state. The total training loss is the sum of the NTP and TOP losses:

$$\mathcal{L} = \mathcal{L}_{\mathrm{NTP}} + \mathcal{L}_{\mathrm{TOP}}$$
This approach avoids the need for additional transformer layers per future token, as required by MTP, and scales efficiently with window size.
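A minimal PyTorch sketch of this combined objective is shown below, assuming proximity targets like those constructed above; the head names (`ntp_head`, `top_head`) and the mean reduction over positions are illustrative choices, not the authors' code.

```python
import torch
import torch.nn.functional as F

def top_listnet_loss(hidden: torch.Tensor,       # (T, d) final-layer hidden states
                     top_head: torch.nn.Linear,  # d -> vocab TOP unembedding
                     top_targets: torch.Tensor   # (T, vocab) proximity scores
                     ) -> torch.Tensor:
    """ListNet-style loss: cross-entropy between the softmax of the proximity
    scores and the softmax of the TOP head's logits."""
    log_pred = F.log_softmax(top_head(hidden), dim=-1)
    target_dist = F.softmax(top_targets, dim=-1)  # turn raw scores into a distribution
    # The paper's formula sums over positions; a mean is used here for scale.
    return -(target_dist * log_pred).sum(dim=-1).mean()

def total_loss(hidden, ntp_head, top_head, next_tokens, top_targets):
    """Sum of the standard next-token cross-entropy and the TOP ranking loss."""
    ntp = F.cross_entropy(ntp_head(hidden), next_tokens)
    return ntp + top_listnet_loss(hidden, top_head, top_targets)
```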
Comparison to Multi-Token Prediction (MTP)
MTP introduces N parallel transformer heads, each predicting a specific future token offset, with the loss:
$$\mathcal{L}_{\mathrm{MTP}} = -\sum_{t=0}^{T} \sum_{n=1}^{N} \log P_{\theta}(x_{t+n} \mid x_{0:t})$$
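For comparison, a schematic version of this loss is sketched below; the function name and the flat list of per-head logits are simplifications, and the shared-trunk / per-head transformer-layer details of actual MTP implementations are omitted.

```python
import torch
import torch.nn.functional as F

def mtp_loss(head_logits: list, tokens: torch.Tensor) -> torch.Tensor:
    """Schematic MTP loss: head n (1-indexed) predicts the token n steps ahead.

    head_logits is a list of N tensors of shape (T, vocab), one per offset.
    """
    seq_len = tokens.shape[0]
    total = torch.zeros(())
    for n, logits in enumerate(head_logits, start=1):
        valid = seq_len - n  # positions that still have a target n steps ahead
        total = total + F.cross_entropy(logits[:valid], tokens[n:n + valid], reduction="sum")
    return total
```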
Empirical evidence shows that MTP's loss increases and converges more slowly for tokens further in the future, indicating the increased difficulty of the task.
Figure 2: Training loss of a small MTP transformer model with 16 MTP heads, demonstrating the increasing difficulty of predicting tokens at larger offsets.
TOP circumvents this by focusing on relative ordering rather than exact prediction, making the auxiliary objective more learnable and less sensitive to model size and window hyperparameters.
Experimental Results
Models of sizes 340M, 1.8B, and 7B were pretrained on the FineWeb-Edu dataset using NTP, MTP, and TOP objectives. Evaluation on eight standard NLP benchmarks (ARC Challenge, Lambada, PIQA, SciQ, Social IQa, TriviaQA, NaturalQuestions Open, HellaSwag) demonstrates that TOP consistently outperforms both NTP and MTP across most tasks and scales.
Key findings include:
- TOP achieves higher benchmark scores and lower perplexity than NTP and MTP, even as model size increases.
- MTP shows competitive results for small models but underperforms at 7B parameters on non-coding tasks.
- TOP exhibits higher NTP training loss but better generalization, suggesting a regularization effect.
- TOP's architectural simplicity (single unembedding layer) enables efficient scaling and minimal compute overhead.
These results indicate that TOP is a more effective and scalable auxiliary objective for general language modeling compared to MTP.
Practical and Theoretical Implications
The introduction of TOP has several practical advantages:
- Parameter Efficiency: Only one additional unembedding layer is required, regardless of window size, in contrast to MTP's multiple transformer layers.
- Scalability: TOP scales well with model size and window hyperparameters, making it suitable for large-scale LLM pretraining.
- Generalization: The regularization effect observed with TOP suggests improved generalization and reduced overfitting, particularly on limited datasets.
- Inference Compatibility: The TOP head is removed at inference, preserving the standard transformer architecture and generation capabilities.
Theoretically, TOP reframes auxiliary objectives in language modeling from exact prediction to ranking, aligning with advances in the learning-to-rank literature. This relaxation of the prediction task may facilitate the learning of richer internal representations, potentially benefiting downstream tasks that require contextual reasoning and look-ahead capabilities.
Future Directions
The paper outlines several avenues for further research:
- Comparative analysis with DeepSeek V3's sequential MTP variant.
- Evaluation of TOP on generative tasks (summarization, coding) and synthetic reasoning benchmarks (star graph problem).
- Investigation of self-speculative decoding using TOP.
- Analysis of the regularization effect and its impact on generalization.
These directions will clarify the scope and limitations of TOP and its applicability to broader LLM training regimes.
Conclusion
Token Order Prediction (TOP) is presented as a scalable, parameter-efficient auxiliary objective for LLM pretraining. By shifting from exact future token prediction to ranking upcoming tokens by proximity, TOP overcomes the limitations of MTP and demonstrates superior performance on standard NLP benchmarks across multiple model sizes. The method's architectural simplicity and empirical effectiveness suggest its potential for widespread adoption in LLM training pipelines. Future work will further elucidate its benefits and extend its evaluation to generative and synthetic reasoning tasks.