Characterizing the Relationship Between Training Data Mixtures and Emergent LLM Abilities
Characterize the complex, high-dimensional relationship between the composition of training datasets for large language models (specifically, the proportions and properties of sources such as web text, books, code, and scientific papers) and the emergent capabilities exhibited by models trained via next-token prediction, with the aim of guiding optimal data-mixture design under resource constraints.
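The notion of a data mixture can be made concrete with a minimal sampling sketch. The mixture weights below are purely illustrative (not taken from any actual training run): choosing the proportions determines the expected composition of each training batch, which is the input side of the relationship the problem asks to characterize.

```python
import random

# Hypothetical mixture weights over data sources (illustrative values only).
MIXTURE = {"web": 0.60, "books": 0.15, "code": 0.15, "papers": 0.10}

def sample_sources(mixture, n, seed=0):
    """Draw n source labels according to the mixture proportions."""
    rng = random.Random(seed)
    sources = list(mixture)
    weights = [mixture[s] for s in sources]
    return rng.choices(sources, weights=weights, k=n)

# The empirical composition of a large batch approximates the target mixture.
batch = sample_sources(MIXTURE, 100_000)
frac_code = batch.count("code") / len(batch)  # close to 0.15
```

Mapping such mixture vectors to downstream capability scores is the high-dimensional regression problem the excerpt below calls largely unknown.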
References
While heuristic understanding exists—for instance, that a higher proportion of code in the training data generally leads to stronger coding abilities—the complex, high-dimensional relationship between data mixture and emergent abilities is largely unknown.
— Do Large Language Models (Really) Need Statistical Foundations?
(2505.19145 - Su, 25 May 2025) in Section “The central role of data,” subsection “Data mixture and attribution”