Data Scaling Laws in ML
- Data scaling laws are systematic principles that quantify how model performance improves with increased data, compute resources, and capacity.
- They describe performance using power-law relationships that separate reducible error from an irreducible loss, informing resource allocation.
- These laws offer actionable insights into data complexity, redundancy, and the balance between model size and dataset scale across different modalities.
Data scaling laws in machine learning describe systematic, often power-law, relationships between task performance (e.g., test loss) and the quantity of available data, model capacity, or compute budget. These principles, originally observed empirically in large-scale autoregressive generative modeling, have now been theoretically and empirically grounded across modalities such as language, images, audio, multimodal tasks, and mathematical problem solving. Scaling laws encode how performance improves as a resource is increased, the form of irreducible error, and the diminishing returns of data and capacity, and are deeply connected to the statistical structure and intrinsic redundancy of natural data distributions.
1. Universal Power-Law Plus Constant Scaling
Across generative image modeling, language, video, multimodal models, and problem solving, observed performance curves are well described by a “power-law plus constant” of the form

$$L(x) = L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x},$$

where $L(x)$ is the cross-entropy loss on held-out data, $x$ is a scaling variable (such as dataset size, model size, or compute), $L_\infty$ is the irreducible loss, interpreted as the entropy of the true data distribution, and the power-law term $(x_0/x)^{\alpha_x}$ is the reducible loss (empirically aligning with the KL divergence between the true and model distributions) (Henighan et al., 2020, Droppo et al., 2021).
This scaling describes a regime in which performance improves smoothly and predictably with scale, regardless of which resource (model, data, or compute) is the bottleneck. Even once the reducible loss is a small fraction of $L_\infty$, the power-law trend persists.
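To make the functional form concrete, the following minimal sketch (not taken from the cited papers; all constants are illustrative assumptions) fits the power-law-plus-constant curve to synthetic loss measurements with `scipy.optimize.curve_fit`:

```python
# Minimal sketch: fit L(x) = L_inf + (x0 / x)**alpha to synthetic loss data.
# All numbers below are illustrative assumptions, not values from the papers.
import numpy as np
from scipy.optimize import curve_fit

def power_law_plus_constant(x, L_inf, x0, alpha):
    """Irreducible floor plus reducible power-law term."""
    return L_inf + (x0 / x) ** alpha

rng = np.random.default_rng(0)
x = np.logspace(5, 9, 20)                        # e.g. dataset sizes (tokens)
clean = power_law_plus_constant(x, L_inf=2.1, x0=1e6, alpha=0.30)
observed = clean * (1.0 + 0.01 * rng.standard_normal(x.size))

p0 = (1.0, 1e6, 0.5)                             # rough initial guess
(L_inf_hat, x0_hat, alpha_hat), _ = curve_fit(
    power_law_plus_constant, x, observed, p0=p0, maxfev=20000)
print(f"L_inf ≈ {L_inf_hat:.2f}, x0 ≈ {x0_hat:.2e}, alpha ≈ {alpha_hat:.2f}")
```

Fits of this kind, performed over log-spaced scales, yield the modality-specific exponents and floors reported in the table below.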
Table: Example Scaling Law Parameters (modality-specific, from (Henighan et al., 2020))
| Modality | Exponent $\alpha$ | Irreducible loss $L_\infty$ (nats/image) |
|---|---|---|
| Image (8×8) | ~0.30 | ≈ 2.1 |
| Image (32×32) | ~0.18 | ≈ 2.8 |
| Video (lowest resolution) | ~0.21 | Varies |
The scaling exponents and constants are domain- and task-dependent but remain stable across scale. This form is robustly validated, including in audio (Droppo et al., 2021), jet physics (Batson et al., 2023), and transfer learning (Yang et al., 17 Apr 2025).
2. Compute-Optimal Model and Data Allocation
A central consequence is compute–optimality: for a fixed computational budget $C$, the optimal model size grows as a power of compute,

$$N_{\mathrm{opt}} \propto C^{\beta},$$

with $\beta$ empirically close to $0.7$ in diverse modalities (Henighan et al., 2020). Because total compute scales roughly as $C \propto N D$, this implies sublinear growth in the data required as models scale (dataset size $D \propto C^{1-\beta} \approx C^{0.3}$ for $\beta \approx 0.7$), indicating that data needs grow much more slowly than parameter count to maintain scaling efficiency.
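As a rough illustration (a sketch under the stated assumptions, with a hypothetical calibration point rather than published constants), compute-optimal sizes can be read off from this power law together with the common approximation $C \approx 6ND$:

```python
# Sketch of compute-optimal allocation: N_opt grows as C**beta (beta ≈ 0.7),
# and, using the common approximation C ≈ 6*N*D FLOPs, tokens grow as C**(1-beta).
# The calibration point (C_ref, N_ref) is a hypothetical assumption.
def optimal_allocation(C, beta=0.7, C_ref=1e21, N_ref=1e9):
    """Return (parameters, tokens) for a total FLOP budget C."""
    N_opt = N_ref * (C / C_ref) ** beta
    D_opt = C / (6.0 * N_opt)
    return N_opt, D_opt

for C in (1e21, 1e22, 1e23):
    N, D = optimal_allocation(C)
    print(f"C = {C:.0e} FLOPs -> N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```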
Similar relationships appear in acoustic models (Droppo et al., 2021), where doubling model size requires only a 1.77× increase in data, and in theoretical models predicting linear or equiparameterized scaling between the number of effective parameters and the dataset size as optimal for minimizing error (Jeon et al., 28 Jun 2024, Maloney et al., 2022).
Scaling Law for Joint Data and Model Limitations:

$$L(N, D) = \left[\left(\frac{N_0}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_0}{D}\right]^{\alpha_D} + L_\infty,$$

with all irreducible (floor) contributions factored explicitly (Droppo et al., 2021).
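A small sketch of the joint law as written above shows how it reduces to a pure power law in whichever resource is the bottleneck; the parameter values are illustrative placeholders, not fitted constants from the cited work:

```python
# Evaluate the joint model/data law L(N, D) written above.
# Parameter values are illustrative placeholders, not fitted constants.
def joint_loss(N, D, N0=8.8e13, D0=5.4e13, alpha_N=0.076, alpha_D=0.095, L_inf=0.0):
    return ((N0 / N) ** (alpha_N / alpha_D) + D0 / D) ** alpha_D + L_inf

print(joint_loss(N=1e8,  D=1e12))   # model-limited: dominated by the N term
print(joint_loss(N=1e11, D=1e9))    # data-limited: dominated by the D term
```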
3. Information-Theoretic Interpretation and Statistical Structure
The cross-entropy loss is rigorously decomposed as

$$L = H(p_{\mathrm{true}}) + D_{\mathrm{KL}}\!\left(p_{\mathrm{true}} \,\|\, p_{\mathrm{model}}\right),$$

where $H(p_{\mathrm{true}})$ is the entropy of the data distribution and $D_{\mathrm{KL}}(p_{\mathrm{true}} \,\|\, p_{\mathrm{model}})$ is the KL divergence between the true and learned distributions (Henighan et al., 2020, Hoffmann et al., 2022).
The reducible loss strictly tracks $D_{\mathrm{KL}}(p_{\mathrm{true}} \,\|\, p_{\mathrm{model}})$, and as model and data scale increase, the model distribution approaches the true distribution monotonically.
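A quick numeric check of this decomposition on a toy discrete distribution (illustrative values only):

```python
# Verify L = H(p) + KL(p || q) for a small discrete example.
import numpy as np

p = np.array([0.5, 0.3, 0.2])          # "true" data distribution
q = np.array([0.4, 0.4, 0.2])          # model distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))        # irreducible part
kl = np.sum(p * np.log(p / q))          # reducible part

assert np.isclose(cross_entropy, entropy + kl)
print(cross_entropy, entropy, kl)
```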
Statistical analysis of natural datasets reveals covariance matrices with power-law spectral decay (eigenvalues $\lambda_k \propto k^{-(1+\alpha)}$ in rank $k$), ensuring no sharp cutoff between informative and noise directions (Maloney et al., 2022, Bi et al., 25 Sep 2025). Nonlinear feature maps in neural networks extend this regime, so test-loss scaling inherits exponents set by this spectral structure, with the ultimate scaling determined by the data manifold’s intrinsic dimensionality (Havrilla et al., 11 Nov 2024, Brill, 10 Dec 2024).
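The spectral claim can be probed directly. The sketch below uses synthetic Gaussian features with an assumed decay exponent standing in for real data, and estimates the covariance eigenvalue decay by a log-log fit over the spectrum's tail:

```python
# Estimate the power-law decay exponent of a data covariance spectrum.
# Synthetic features with a known decay stand in for real data; with natural
# data, X would simply be your (samples x features) matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 512
alpha_true = 0.3
scales = np.arange(1, d + 1, dtype=float) ** (-(1 + alpha_true) / 2.0)
X = rng.standard_normal((n, d)) * scales        # column k has variance k**-(1+alpha)

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]   # descending order
k = np.arange(1, d + 1)
tail = slice(10, 300)                           # fit away from the top modes and the edge
slope, _ = np.polyfit(np.log(k[tail]), np.log(eigvals[tail]), 1)
print(f"estimated decay exponent ≈ {-slope:.2f} (target {1 + alpha_true:.2f})")
```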
4. Domain-Specific and Data-Dependent Extensions
Multilingual Scaling: In multilingual LMs, scaling laws generalize by modeling the test loss of each language family as a function of its sampling ratio in the mixture, the model size, and the data size, $L_i(N, D, p_i)$, where $p_i$ is the sampling weight for family $i$ (He et al., 15 Oct 2024). This decouples the analysis from individual languages and allows optimal mixture selection to minimize aggregate loss.
Data Complexity: Scaling-law constants and exponents depend systematically on quantifiable data-complexity measures such as gzip-compressibility (Pandey, 26 May 2024). For less compressible data, the compute–optimal frontier tilts toward prioritizing increased dataset size rather than model size. All constants in the classic law are modulated by this data-dependent factor, e.g. $L(N, D, H) = \frac{A(H)}{N^{\alpha(H)}} + \frac{B(H)}{D^{\beta(H)}} + E(H)$, where $H$ is the gzip-normalized compressibility, linking scaling behavior directly to intrinsic data structure.
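A minimal proxy for such a complexity measure (the exact normalization in the cited work may differ; this ratio is an assumption) is the compressed-to-raw size ratio:

```python
# Gzip-style compressibility proxy: compressed size divided by raw size.
# Lower values indicate more redundant, "easier" data under this proxy.
import random
import string
import zlib

def gzip_compressibility(text: str) -> float:
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

random.seed(0)
noisy = "".join(random.choices(string.ascii_letters, k=4000))
print(gzip_compressibility("the cat sat on the mat " * 200))  # highly redundant text
print(gzip_compressibility(noisy))                            # near-incompressible text
```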
Data Mixtures: For large-scale pretraining, the optimal domain mixture can be determined via scaling-law formulations (either additive or joint) that predict the loss on any target domain or set of domains as a function of the data mixture, model size, and training tokens, $L_t(N, D, \mathbf{h})$, with $\mathbf{h}$ a simplex vector of weights over source domains. This enables principled, compute-budget-aware mixture design (Shukor et al., 12 Jul 2025).
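As an illustration of compute-budget-aware mixture design, the sketch below minimizes a hypothetical target-loss model over the domain-weight simplex; the diminishing-returns loss form and its coefficients are assumptions, not the cited papers' fitted laws. The same machinery applies to the multilingual sampling-ratio laws above:

```python
# Choose domain mixture weights h (on the simplex) that minimize a predicted
# target loss. The loss model below (diminishing returns per domain) is a
# hypothetical stand-in for a fitted mixture scaling law.
import numpy as np
from scipy.optimize import minimize

coef = np.array([0.20, 0.05, 0.10, 0.12])   # assumed per-domain sensitivities
base_loss = 2.5                              # assumed loss offset

def predicted_loss(h):
    return base_loss - np.sum(coef * np.log(h + 1e-3))

n = coef.size
res = minimize(
    predicted_loss,
    x0=np.full(n, 1.0 / n),
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda h: h.sum() - 1.0}],
    method="SLSQP",
)
print("optimal mixture:", np.round(res.x, 3))   # weights roughly proportional to coef
```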
5. Redundancy Laws and Spectral Foundations
The scaling exponent arises from the spectral tail of the data covariance: if the spectrum decays polynomially, $\lambda_k \propto k^{-1/\rho}$, the redundancy index $\rho$ dictates the learning-curve slope (Bi et al., 25 Sep 2025). In kernel regression,

$$L(n) \propto n^{-\frac{2s}{2s+\rho}},$$

where $s$ is the target function’s smoothness. Lower redundancy (smaller $\rho$) steepens scaling, accelerating returns to data and model scale, a phenomenon robust to data representation, mixtures, finite-width approximations, and even deep architectures (including Transformers in both NTK and feature-learning regimes).
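A tiny numeric illustration of this rate, assuming the capacity-style parameterization written above (the smoothness and redundancy values are arbitrary):

```python
# Learning-curve exponent 2s / (2s + rho) under the rate quoted above:
# lower redundancy rho gives a steeper learning curve.
def learning_curve_exponent(s: float, rho: float) -> float:
    return 2.0 * s / (2.0 * s + rho)

for rho in (0.25, 0.5, 1.0):
    print(f"rho = {rho}: test loss ~ n^-{learning_curve_exponent(1.0, rho):.2f}")
```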
Empirically, reducing redundancy—by learning better representations or data preprocessing—can make the observed scaling exponent larger, improving data efficiency. This provides a rigorous, unifying explanation for the power-law behavior empirically observed in deep models.
6. Limitations, Transitions, and Practical Implications
Criticality and Phase Transitions: Finite latent dimension or data “task complexity” can cause scaling to plateau once the number of model parameters or data points approaches this intrinsic limit (Maloney et al., 2022, Brill, 10 Dec 2024). At the percolation threshold of data connectivity, scaling exponents are determined by the power-law size distribution of functional “quanta”; above this threshold, a dominant data manifold controls the scaling exponent, consistent with results from manifold approximation theory (Brill, 10 Dec 2024).
Irreducible Loss: The irreducible loss $L_\infty$ represents the entropy floor of the data distribution. As models and data grow, returns diminish as this limit is approached, and further resource investment becomes exponentially less effective (Droppo et al., 2021, Henighan et al., 2020).
Uncertainty Scaling: Predictive epistemic uncertainties contract with increasing data size, typically at the classic $n^{-1/2}$ parametric rate; however, even in over-parameterized neural networks, residual epistemic uncertainty remains non-negligible, following a power-law decay but rarely vanishing at practical scales (Rosso et al., 11 Jun 2025). This underscores the continuing need for uncertainty quantification in large-scale models, even with massive datasets.
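As a baseline intuition for the parametric rate (a textbook Gaussian-mean example, not the cited paper's setup), posterior uncertainty contracts roughly as $\sigma/\sqrt{n}$:

```python
# Posterior standard deviation of a Gaussian mean with known noise sigma and a
# flat prior contracts as sigma / sqrt(n): the classic parametric rate.
import math

sigma = 1.0
for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5d}: posterior std ≈ {sigma / math.sqrt(n):.4f}")
```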
Individual Data Contributions: The value of an individual data point to a specific model shrinks with increasing dataset size in a power-law (log-log linear) manner, but with significant heterogeneity across examples. Data points with slowly decaying exponents remain highly valuable even in large-data regimes (Covert et al., 30 May 2024).
Data Reuse: In data-constrained settings, reusing samples (multi-pass SGD) improves the effective scaling law; the test error becomes dependent on the total number of effective iterations, amplifying gains over strict one-pass bounds, provided the number of passes does not exceed a certain threshold (Lin et al., 10 Jun 2025).
7. Domain-Specific Scaling and Methodological Recommendations
- Classifier Performance: In high-dimensional nearest-neighbor classification, scaling transitions between fast (polynomial) and slow (exponential) rates depending on whether the geometry is favorable (signal alignment) or the distribution has exploitable structure (Yang et al., 2023).
- Transfer Learning: Scaling laws in visual transfer learning display boundaries where knowledge distillation outperforms standard training only up to a critical data threshold, after which further pretraining data favors direct training (Yang et al., 17 Apr 2025).
- Jet Physics and High-Energy Domains: Classifier loss in jet physics follows empirical power-law scaling, but exponents differ by method; model selection for a fixed data regime can be misleading, as “fast-scaling” methods will surpass “high baseline, slow-scaling” ones with sufficient data (Batson et al., 2023).
Practical Guidance:
- Optimal resource allocation demands matching model and data scaling according to empirical or theoretical exponents, often favoring larger models with proportionally less training per parameter at scale (Henighan et al., 2020, Jeon et al., 28 Jun 2024).
- Empirical estimation of constants and exponents should be performed with smaller models and data slices; these exponents generalize to larger settings when the experimental setup (context length, tokenization, etc.) is matched (Su et al., 11 Mar 2024) (see the extrapolation sketch after this list).
- Data mixture optimization should be formulated within the scaling law framework; small-scale piloting can predict the optimal mixture for any target tasks and compute regime, avoiding expensive trial-and-error (Shukor et al., 12 Jul 2025, He et al., 15 Oct 2024).
- Data structure complexity (estimated, for example, by gzip-compressibility) should guide budget allocation; “hard” data may necessitate larger effective dataset sizes for the same improvement (Pandey, 26 May 2024).
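The small-scale fitting-and-extrapolation workflow recommended above can be sketched as follows; this is a hedged illustration reusing the power-law-plus-constant form, with synthetic "measurements" standing in for real small-model runs:

```python
# Fit scaling-law constants on small-scale runs, then extrapolate to a larger
# scale. Synthetic losses stand in for real small-model measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, L_inf, x0, alpha):
    return L_inf + (x0 / x) ** alpha

small_x = np.logspace(6, 8, 8)                       # small-scale runs only
small_losses = scaling_law(small_x, 2.0, 3e6, 0.25)  # pretend measurements

params, _ = curve_fit(scaling_law, small_x, small_losses,
                      p0=(1.0, 1e6, 0.5), maxfev=20000)
target_x = 1e10                                      # 100x beyond the fit range
print(f"predicted loss at x = {target_x:.0e}: {scaling_law(target_x, *params):.3f}")
```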
Data scaling laws thus provide not only phenomenological descriptions and predictive formulas for error curves, but also a theoretical, information-theoretic, and spectral unification of performance scaling in high-capacity models. They offer actionable guidelines for resource allocation, provide a lens for interpreting empirical performance in novel domains, and reveal fundamental limits set by data complexity, redundancy, and intrinsic task structure.