Optimality of Next-Token Prediction for Capability Absorption
Determine whether the standard next-token prediction objective used during self-supervised pre-training on unstructured text corpora is the optimal, most efficient way for large language models to acquire task-performing capabilities through indirect supervision, i.e., from examples of tasks that happen to appear in pre-training documents.
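For concreteness, the objective in question is ordinary causal language modeling: maximize the log-likelihood of each token given its preceding context. Below is a minimal PyTorch-style sketch of that loss; the function name and tensor shapes are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def next_token_prediction_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction (causal LM) loss.

    logits:    (batch, seq_len, vocab_size) model outputs
    input_ids: (batch, seq_len) token ids of a pre-training document
    """
    # Shift so that the prediction at position t is scored against token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    # Average cross-entropy over all next-token positions.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```

The open question is whether minimizing this loss over raw documents is the best route to task capabilities, compared with alternatives such as training on instruction-formatted data at pre-training scale.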
References
"It is unclear, however, whether predicting the next token over pre-training documents is the most optimal or efficient way for models to absorb such capabilities."
— FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale (arXiv:2601.22146, Patel et al., 29 Jan 2026), Section 1 (Introduction)