Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 78 tok/s

Gemini 2.5 Pro 58 tok/s Pro

GPT-5 Medium 35 tok/s Pro

GPT-5 High 28 tok/s Pro

GPT-4o 78 tok/s Pro

Kimi K2 218 tok/s Pro

GPT OSS 120B 465 tok/s Pro

Claude Sonnet 4.5 35 tok/s Pro

2000 character limit reached

Pre-Training Curriculum for Multi-Token Prediction in Language Models (2505.22757v1)

Published 28 May 2025 in cs.CL and cs.AI

Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for LLMs. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller LLMs (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.