Data-Constrained Scaling Laws

Updated 28 July 2025

Data-constrained scaling laws are power-law relationships that predict test error as functions of limited training data, model size, and compute resources.
They incorporate the impact of data quality, intrinsic geometry, and repetition strategies to mitigate the diminishing returns from scarce or redundant data.
These laws offer actionable guidelines for balancing model parameters and data augmentation to optimize performance in constrained environments.

Data-constrained scaling laws describe how model performance evolves under restrictions on training data availability, model size, or other resources such as compute or storage. Unlike unconstrained scaling, which presumes potentially limitless access to fresh, high-quality data, data-constrained law characterizes statistical, computational, and even physical bottlenecks common in real-world scenarios. These laws inform model and system design across domains including machine learning, signal processing, network theory, and physical modeling, establishing predictive formulas for test error or other metrics as a function of limited data, model parameters, and sometimes additional quantities such as number of training iterations, data reuse, or sample quality.

1. Mathematical Foundation and Canonical Forms

The prevailing mathematical structure of data-constrained scaling laws is a power-law (algebraic) relationship between achievable error (or loss) and the principal scaling parameters: dataset size ( $N$ ), model size ( $M$ ), and often additional variables. For a broad class of models and data, the test error $\mathcal{E}$ admits the decomposition: $\mathcal{E}(M, N) \asymp \Theta(M^{-\gamma} + N^{-\alpha}) + \text{(irreducible error)}$ where exponents $\gamma, \alpha > 0$ depend on properties of the model architecture and—critically—the statistical structure of the dataset and target function (Droppo et al., 2021, Lin et al., 12 Jun 2024, Maloney et al., 2022).

In linear regression settings, if the data covariance spectrum decays as a power law with exponent $a > 1$ , and the parameter prior has an aligned decay of degree $b > 1$ , then with stochastic gradient descent the reducible part of test error obeys

$\mathcal{E} \simeq \Theta(M^{1-b} + N^{(1-b)/a})$

(Lin et al., 12 Jun 2024).

Nonlinear models and neural scaling settings generalize this form, but the essential algebraic dependence remains, often tuned by intrinsic data properties such as the intrinsic dimension $d$ for data lying on low-dimensional manifolds (Havrilla et al., 11 Nov 2024), or by the spectrum of the data kernel (Maloney et al., 2022, Brill, 10 Dec 2024).

2. Data Structure and Regimes of Optimality

The validity and exponents of data-constrained scaling laws are governed not simply by dimensionality or sample size, but by the spectral structure and geometry of the data distribution. Key findings include:

In natural datasets with eigenvalue spectrum $\lambda_i \sim i^{-(1+\alpha)}$ , each additional datum contributes diminishing marginal explanatory power, setting up a regime where further scaling yields power-law decay in error (Maloney et al., 2022, Brill, 10 Dec 2024).
Data with low intrinsic dimension $d$ lying on a manifold yields scaling exponents for estimation and approximation as $n^{-2\beta/(2\beta+d)}$ and $N^{-2\beta/d}$ respectively (for $\beta$ -Hölder targets), implying that more highly structured data can be modeled more efficiently with fewer samples and shallow architectures (Havrilla et al., 11 Nov 2024).
The percolation-theoretic perspective interprets real-world data as partitioned into clusters (quanta) following a power law in their size distribution; the presence or absence of a dominant manifold cluster fundamentally changes the attainable scaling law (Brill, 10 Dec 2024).

Theoretical frameworks thus distinguish:

Subcritical (multi-quanta): Learning is dominated by gradually acquiring discrete subtasks, each with a marginal effect governed by the quantal size distribution.
Supercritical (manifold-dominated): Learning is governed by approximating a continuous manifold, reverting to classical approximation theory exponents (Brill, 10 Dec 2024).

3. Strategies for Data-Constrained Training and Data Reuse

In regimes where fresh data are limited, strategies to exploit available data more fully become essential:

Data Repetition and Reuse: In online or SGD settings, reusing data across multiple SGD passes (multi-pass SGD) improves the scaling law for test error. For linear regression with feature covariance exponent $a$ and target prior $b$ , one-pass SGD yields error $\Theta(N^{(1-b)/a})$ while multi-pass allows $L \sim N^{a/b}$ steps, improving error to $\Theta(N^{(1-b)/b})$ when $a > b$ (Lin et al., 10 Jun 2025).
Scaling with Augmented or Synthetic Data: Works in language modeling and visual transfer learning demonstrate that data augmentation (e.g., synthetic code, knowledge distillation from teacher models) can shift the effective scaling regime, and careful repetition or mixture can extend the benefit of limited corpora (Muennighoff et al., 2023, Yang et al., 17 Apr 2025, Chang et al., 4 Oct 2024).
Computational Resource Allocation: Optimization under compute or storage constraints can be formalized with storage-aware scaling laws, e.g.,

$\mathrm{Err}(n, L) \simeq \mathrm{Err}^* + A n^{-\alpha} + B L^{-\beta}$

subject to $nL = s$ (total storage), with the harmonic mean of the exponents controlling test error reduction (Mentzer et al., 25 Jul 2024).

These strategies imply a shift in design: in data-constrained regimes, it is optimal to carefully balance model size and data usage rather than naively scaling one dimension, and to invest in data reuse mechanisms and quality or diversity optimization.

4. Role of Data Quality and Effective Sample Size

Data quality imposes additional structure on scaling laws for parameter-constrained models. The “effective training tokens” construct combines text diversity (measured via compression-based metrics) and syntheticity (measured with a teacher model's perplexity) into a scaling factor $Q(D)$ , adjusting for redundancy and informativeness: $D_q = D \cdot \exp(c_1 \cdot Dr(D) + c_2 \cdot S(D))$ This correction can significantly increase the predictive power of scaling law fits, as evidenced by a +0.83 Pearson correlation with downstream task accuracy when replacing raw token count $D$ with $D_q$ (Chang et al., 4 Oct 2024).

A plausible implication is that in the presence of limited but high-quality data, small models can match or exceed the predictive accuracy of larger parameter-scaling efforts on noisier or redundant datasets.

5. Empirical Regimes and Generalizations Across Domains

Observed scaling exponents and their applicability vary by domain, model, and constraints:

In discriminative rescoring for ASR, normalized WER follows a power law in data size, with pretraining introducing an effective data transfer term scaling as $k D^\alpha N^\beta$ , so that larger models benefit more from pretraining (Gu et al., 2023).
In visual transfer learning, the “distillation boundary theory” identifies inflection points where knowledge distillation outperforms direct pretraining in data-scarce regimes, but is overtaken when sufficient data is available (Yang et al., 17 Apr 2025).
In scientific emulation (e.g., stellar spectra), optimal resource allocation requires balanced scaling of data and model, following empirically fitted exponents (e.g., tenfold compute increase should be apportioned as approximately $2.5\times$ in data and $3.8\times$ in model size for a reduction in mean squared error of $7\times$ ) (Różański et al., 24 Mar 2025).
In domains such as power systems, foundation models show power-law data scaling but quickly saturating gains from parameter scaling, suggesting resource expenditure should focus more on acquiring diverse demonstrations and scenarios rather than extreme model scaling (Liu et al., 25 Mar 2025).

6. Limitations, Breakdown, and Practical Constraints

Several limitations are highlighted in empirical and theoretical analyses:

Scaling laws generally break down at extreme ends: in ultra-low data regimes (performance saturates at random guessing) or ultra-high data regimes (a floor due to irreducible noise or Bayes error) (Gu et al., 2023).
Not all performance improvements translate to all subpopulations; aggregate scaling laws may mask disparities between communities or tasks (Diaz et al., 2023).
Scaling exponents may change or reach plateaus when underlying assumptions about data structure (e.g., the extent of the power-law spectrum or the emergence of new regimes in percolation-based models) are violated (Maloney et al., 2022, Brill, 10 Dec 2024).
Optimizing under joint constraints (e.g., storage, annotation cost, compute, and inference) may produce non-intuitive trade-offs, such as recommending increased compression combined with larger sample size for fixed storage, or smaller models for inference-limited scenarios (Mentzer et al., 25 Jul 2024, Fang et al., 27 Mar 2024).

7. Broader Implications and Design Recommendations

The theory and empirical evidence for data-constrained scaling laws underscore several design principles for researchers and engineers:

Resource-constrained regimes require joint optimization across data, model, and compute, often necessitating multi-pass or enhanced data reuse strategies that are formally grounded in scaling law analysis (Lin et al., 10 Jun 2025).
Data quality can be as critical as data quantity; filtering, deduplication, diversity maximization, and mixture with synthetic data all play a crucial role in determining the effective sample size (Chang et al., 4 Oct 2024).
Algebraic (power-law) scaling, rather than logarithmic or exponential laws, is the universal signature of complex systems with nontrivial constraints—be they geometric, statistical, or application-specific.
Participatory evaluation and pluralistic metric design are required to capture the full impact and diversity of performance, avoiding the pitfalls of single-metric or aggregate scaling law analyses (Diaz et al., 2023).

Data-constrained scaling laws now inform best practices for developing, deploying, and evaluating learning systems in resource-limited and high-stakes environments. They serve both as predictive tools—allowing extrapolation and optimization—and as cautionary guides, illuminating how system performance can be plateaued, redirected, or achieved more efficiently under real-world limitations.