Optimality of Next-Token Prediction for Capability Absorption

Determine whether the standard next-token prediction objective used in self-supervised pre-training on unstructured text corpora is the optimal and most efficient way for large language models to acquire task-performing capabilities via indirect supervision, i.e., from examples of tasks that appear within pre-training documents.

Background

LLMs are predominantly pre-trained with a self-supervised next-token prediction objective over massive unstructured corpora, and later adapted via instruction tuning. Prior work indicates that pre-training data can induce task capabilities through indirect supervision when tasks appear within the documents. This raises the question of whether the conventional next-token prediction objective is the best way to absorb such capabilities.
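The next-token prediction objective referred to above is the average cross-entropy of the model's predicted distribution against each actual next token in the corpus. A minimal sketch, with a hypothetical `probs` callable standing in for the model's conditional distribution (the function name and interface are illustrative, not from the paper):

```python
import math

def next_token_loss(token_ids, probs):
    """Average cross-entropy of predicting each next token in a document.

    token_ids: the document as a list of token ids.
    probs: callable(context) -> dict of {token id: probability};
           a stand-in for the language model (hypothetical interface).
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        # Probability the model assigns to the true next token,
        # given all preceding tokens; floor avoids log(0).
        p = probs(token_ids[:t]).get(token_ids[t], 1e-12)
        total += -math.log(p)
    return total / (len(token_ids) - 1)

# Toy "model": uniform over a 4-token vocabulary, so every
# prediction step contributes -ln(1/4) = ln(4) to the loss.
uniform = lambda ctx: {i: 0.25 for i in range(4)}
loss = next_token_loss([0, 1, 2, 3], uniform)  # → ln(4) ≈ 1.386
```

Note that every position in the document is supervised equally, whether or not it belongs to a task demonstration; this uniformity is part of what the question above puts in doubt.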

The FineInstructions pipeline proposed in the paper restructures pre-training data into synthetic instruction–answer pairs that align with downstream usage, on the hypothesis that this improves training efficiency and performance. The explicit uncertainty centers on whether next-token prediction is indeed the optimal or most efficient route for models to gain capabilities from pre-training texts, motivating investigation into alternative objectives and formats such as instruction-based supervised pre-training.
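One concrete way an instruction–answer format changes the objective, compared with plain next-token prediction over a whole document, is loss masking: the model conditions on the instruction but is trained to predict only the answer tokens. This is a common instruction-tuning convention offered here as an assumption for illustration, not a claim about what the FineInstructions pipeline does:

```python
def supervised_positions(instruction_ids, answer_ids):
    """Positions that receive loss when only the answer is supervised.

    A common instruction-tuning convention (an assumption, not a claim
    about FineInstructions): next-token loss is computed only where the
    target token lies in the answer span, so the instruction acts purely
    as conditioning context.
    """
    seq = instruction_ids + answer_ids
    start = len(instruction_ids)  # first position whose target is an answer token
    return [(t, seq[t]) for t in range(max(start, 1), len(seq))]

# Instruction tokens [5, 6, 7] receive no loss; answer tokens [8, 9] do.
print(supervised_positions([5, 6, 7], [8, 9]))  # → [(3, 8), (4, 9)]
```

Under plain next-token prediction, by contrast, every position of the concatenated sequence would be supervised, which is exactly the design choice the question interrogates.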

References

It is unclear, however, whether predicting the next token over pre-training documents is the most optimal or efficient way for models to absorb such capabilities.

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale (2601.22146 - Patel et al., 29 Jan 2026) in Section 1 (Introduction)