Returns to data scale beyond the public internet “data wall”
Determine whether diminishing performance returns to dataset size persist beyond the roughly 15 trillion tokens of public internet text; characterize how specialized or high-quality datasets change that return behavior; and ascertain whether technical scaling laws translate into economic returns once quality effects, network dynamics, and contamination risks are taken into account.
References
We do not know whether diminishing returns hold beyond this boundary, whether specialized or high-quality data exhibits different characteristics, or whether technical scaling laws map to economic returns once quality effects, network dynamics, and contamination risks are factored in.
— The Economics of AI Training Data: A Research Agenda
(2510.24990 - Oderinwale et al., 28 Oct 2025) in Section 5 (Representing data in the production function)