Multi-Token Prediction Needs Registers (2505.10518v1)

Published 15 May 2025 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Multi-token prediction has emerged as a promising objective for improving LLM pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained LLMs--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.

Summary

An In-Depth Examination of "Multi-Token Prediction Needs Registers"

In "Multi-Token Prediction Needs Registers," Gerontopoulos, Gidaris, and Komodakis propose MuToR, a method for enhancing autoregressive transformer models through multi-token prediction during both pretraining and fine-tuning. The work addresses the shortcomings of existing multi-token prediction techniques and offers a more efficient, adaptable alternative that integrates smoothly with current LLMs.

Key Innovations and Methodology

MuToR improves multi-token prediction by interleaving learnable register tokens into the LLM's input sequence. Unlike existing methods, which typically require additional transformer heads and substantial architectural changes, MuToR introduces only a negligible number of additional parameters and remains compatible with off-the-shelf pretrained models. This makes it particularly well-suited to fine-tuning without altering the core architecture.
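
To make the mechanism concrete, the following is a minimal Python/PyTorch sketch of how register tokens could be interleaved into a tokenized sequence. The function name interleave_registers, the dedicated register_id vocabulary entry, and the fixed placement stride are illustrative assumptions, not the paper's released implementation.

    import torch

    def interleave_registers(input_ids: torch.Tensor, register_id: int, stride: int = 4):
        """Insert a register token after every `stride` regular tokens.

        Returns the new sequence and a boolean mask marking register positions.
        Illustrative sketch only; MuToR's actual placement and offset sampling
        may differ.
        """
        out_ids, is_register = [], []
        for i, tok in enumerate(input_ids.tolist()):
            out_ids.append(tok)
            is_register.append(False)
            if (i + 1) % stride == 0:
                # The register id indexes a learnable embedding added to the vocabulary.
                out_ids.append(register_id)
                is_register.append(True)
        return torch.tensor(out_ids), torch.tensor(is_register)

    # Toy example: 8 regular token ids, hypothetical register id 50257.
    ids, reg_mask = interleave_registers(torch.arange(8), register_id=50257, stride=4)
    # ids      -> tensor([0, 1, 2, 3, 50257, 4, 5, 6, 7, 50257])
    # reg_mask -> tensor([False, False, False, False, True, False, False, False, False, True])

In a setup like this, the only new parameters are the register embeddings themselves, consistent with the negligible-overhead claim above.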

Register tokens in MuToR serve a distinct function: each one is tasked with predicting a token several steps ahead, with the horizon specified by a randomly sampled offset. This auxiliary objective encourages the model to look beyond the immediate next token, potentially fostering internal representations better suited to planning and long-range prediction. A critical aspect of MuToR's design is its attention masking strategy: register tokens may attend only to the preceding regular tokens, while regular tokens cannot attend to registers and registers cannot attend to one another. As a result, the normal inference speed and attention pattern of standard autoregressive models are preserved.
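
The masking rule can be expressed as a boolean attention mask. The helper below follows the description above (every token attends causally, and no token, regular or register, attends to a register other than itself), but its exact form is an assumption rather than the authors' code.

    import torch

    def mutor_attention_mask(is_register: torch.Tensor) -> torch.Tensor:
        """Boolean mask (True = query may attend to key) for a sequence with
        interleaved register tokens, following the masking rule paraphrased
        from the paper; illustrative sketch only."""
        n = is_register.numel()
        causal = torch.tril(torch.ones(n, n, dtype=torch.bool))   # attend only to earlier/current positions
        visible = ~is_register                                     # registers are hidden as keys
        mask = causal & visible.unsqueeze(0)                       # applied to every query row
        mask |= torch.eye(n, dtype=torch.bool)                     # each register may still see itself
        return mask

    # Registers at positions 4 and 9 (as in the previous sketch).
    reg = torch.zeros(10, dtype=torch.bool)
    reg[4], reg[9] = True, True
    m = mutor_attention_mask(reg)
    assert not m[6, 4]   # regular tokens never attend to registers
    assert not m[9, 4]   # registers never attend to each other
    assert m[4, 3]       # a register attends to preceding regular tokens

Because regular tokens never attend to registers, their predictions are unaffected by the registers' presence, which is consistent with the preserved inference behavior noted above.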

Empirical Validation

The effectiveness of MuToR is tested across multiple challenging generative tasks, such as mathematical reasoning, abstractive summarization, and autoregressive image generation. The empirical results consistently demonstrate that MuToR surpasses traditional fine-tuning baselines and other multi-token prediction methods. In particular, MuToR achieves superior performance without the extensive computational overhead associated with additional transformer heads. Using LLMs like Gemma 2B and Llama 3 8B as baselines, MuToR shows improved exact-match accuracy in tasks like GSM8K, MATH500, and AQUA-RAT, as well as better ROUGE scores in summarization benchmarks like SAMSum and DialogSum.

MuToR also delivers significant improvements in autoregressive image generation, adapting naturally to two-dimensional data. MuToR-2D, an extension that samples two-dimensional offsets, exploits the spatial dependencies inherent in image-token grids and yields notable gains in sample-quality metrics such as FID and IS.
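
As a rough illustration of how two-dimensional offsets differ from the one-dimensional case, the sketch below maps a sampled (row, column) offset to a target position in a raster-scan-flattened grid of image tokens. The offset ranges, the sampling distribution, and the handling of out-of-grid targets are assumptions for illustration, not the paper's exact design.

    import random

    def sample_2d_target(pos: int, width: int, height: int,
                         max_dr: int = 2, max_dc: int = 2):
        """For a register at flattened position `pos` in a raster-scan grid,
        sample a 2D offset (dr, dc) and return the flattened index of the
        future token it should predict, or None if the target is invalid.
        Illustrative sketch only."""
        r, c = divmod(pos, width)
        dr = random.randint(0, max_dr)            # rows only move forward
        dc = random.randint(-max_dc, max_dc)      # columns may move either way
        tr, tc = r + dr, c + dc
        if not (0 <= tr < height and 0 <= tc < width):
            return None
        target = tr * width + tc
        return target if target > pos else None   # keep the prediction strictly in the future

    # Example: a 16x16 token grid with a register at row 3, column 5.
    print(sample_2d_target(pos=3 * 16 + 5, width=16, height=16))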

A further compelling aspect of this work is its study of register-token sparsity. By adjusting the number and positions of registers, MuToR can be matched to different computational budgets, highlighting its adaptability and efficiency. The paper reports that even with fewer register tokens, MuToR maintains competitive performance, indicating that the auxiliary supervision it introduces is robust.

Theoretical and Practical Implications

From a theoretical standpoint, MuToR challenges existing paradigms in autoregressive model training by showing that auxiliary predictive objectives can strengthen learning without disrupting the models' standard operational dynamics. This opens up new discussions around how to propagate supervision signals efficiently within transformer architectures.

Practically, the adoption of MuToR can yield significant improvements in model performance across both linguistic and visual domains without the accompanying engineering complexity that typically hinders deployment. Its compatibility with existing infrastructure greatly reduces the barrier for integration, allowing for advancements in language and vision tasks with relative ease.

Future Directions

Building on MuToR's promising results, future research could investigate how register placement and embedding schemes might be optimized, for instance by learning these choices rather than fixing them by hand. Exploring MuToR in other domains or with other model architectures could also clarify its broader applicability. Overall, MuToR marks a notable advance in the pursuit of more efficient and powerful language and image models.
