Returns to data scale beyond the public internet “data wall”

Determine whether diminishing performance returns to increasing dataset size persist beyond approximately 15 trillion tokens of public internet text, characterize how specialized or high‑quality datasets alter return behavior, and ascertain whether technical scaling laws translate into economic returns when quality effects, network dynamics, and contamination risks are incorporated.

Background

The paper notes that empirical scaling laws show diminishing marginal performance returns as dataset size increases, but that these laws were estimated in regimes constrained by a “data wall” of around 15 trillion tokens of public text. It emphasizes that evidence beyond this boundary is sparse and that the relationship to economic returns has not been established.
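To make "diminishing marginal returns" concrete, here is a minimal sketch that assumes a power-law form for loss as a function of dataset size, loss(D) = E + B / D^beta, a shape commonly used in the scaling-law literature. The functional form and the constants `E`, `B`, and `beta` are illustrative assumptions, not values taken from the paper, and whether any such form continues to hold past the data wall is exactly the open question.

```python
# Illustrative power-law data scaling: loss(D) = E + B / D**beta.
# E, B, and beta are hypothetical constants chosen for illustration only.

def loss(tokens: float, E: float = 1.7, B: float = 400.0, beta: float = 0.3) -> float:
    """Assumed loss after training on `tokens` tokens of data."""
    return E + B / tokens ** beta

def marginal_gain(tokens: float, factor: float = 2.0) -> float:
    """Loss reduction obtained by multiplying the dataset size by `factor`."""
    return loss(tokens) - loss(tokens * factor)

# Gain from doubling the data at 1T, 4T, and 15T tokens.
gains = [marginal_gain(d) for d in (1e12, 4e12, 15e12)]
for d, g in zip((1e12, 4e12, 15e12), gains):
    print(f"doubling from {d:.0e} tokens reduces loss by {g:.5f}")
```

Under this assumed form, each doubling of the dataset buys a smaller loss reduction than the previous one. A different exponent for specialized or high-quality data would change how quickly the gains shrink, which is one way to frame the second part of the question.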

Resolving this uncertainty would clarify whether expanding or improving data supplies yields sustained productivity gains and how high-quality or specialized data modifies scaling behavior.

References

We do not know whether diminishing returns hold beyond this boundary, whether specialized or high-quality data exhibits different characteristics, or whether technical scaling laws map to economic returns once quality effects, network dynamics, and contamination risks are factored in.

The Economics of AI Training Data: A Research Agenda (2510.24990 - Oderinwale et al., 28 Oct 2025) in Section 5 (Representing data in the production function)