Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Published 31 Oct 2025 in cs.CL and cs.AI | (2511.00198v1)

Abstract: Optimizing the training of LLMs remains an essential challenge, particularly improving model performance without increasing computational cost. This work challenges the conventional approach of training LLMs with next-token prediction (NTP), arguing that predicting information-rich tokens during training is a more effective way to train LLMs. We investigate the impact of the proposed approach on three kinds of tasks: arithmetic, multi-label text classification, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and the theoretical understanding of target-token selection strategies.