
Neural Scaling Laws & Data Manifold Dimension

Updated 31 January 2026
  • The paper demonstrates that intrinsic data manifold dimension, rather than the ambient dimension, drives the neural scaling exponents, with error decaying as N^(-2β/d).
  • It rigorously derives scaling laws across architectures including shallow transformers, ReLU networks, and kernel models, linking dataset size and model capacity to performance improvements.
  • The research highlights practical implications for geometry-aware network design, suggesting that optimizing data preprocessing and architectural alignment can significantly boost generalization efficiency.

Neural scaling laws characterize how the generalization error of neural networks decays as a function of dataset size and model capacity, often following precise power-law relationships. When the data distribution is concentrated on a low-dimensional manifold embedded in higher-dimensional space, the intrinsic dimension of the manifold, rather than the ambient dimension, fundamentally determines the exponents of these scaling laws. This relationship between data manifold dimension and scaling exponents has been investigated and rigorously derived in multiple theoretical and empirical frameworks, spanning shallow and deep architectures, kernel regimes, and generative models of data complexity.

1. Theoretical Foundations: Manifold Hypothesis and Scaling Exponents

The manifold hypothesis posits that real-world data, while represented in high-dimensional ambient spaces (e.g., token embeddings, image pixels), primarily concentrates on a compact, smooth, low-dimensional Riemannian manifold $\mathcal M \subset \mathbb R^D$ of dimension $d \ll D$. Given a target function $f: \mathcal M \to \mathbb R$ with suitable smoothness (specifically, $\beta$-Hölder continuity), the neural network's capacity to approximate $f$, as well as the sample complexity required to learn $f$, is controlled by $d$, not $D$.

Key theoretical results show that for architectures respecting the data geometry (e.g., shallow transformers with block depth $O(\log d)$), the minimal $L^\infty$ error achievable with $N$ parameters scales as

$$\|T - f\|_{L^\infty(\mathcal M)}^2 \lesssim N^{-2\beta/d}$$

for target smoothness parameter $\beta$ (with $\beta = 1$ for Lipschitz functions). When learning from $n$ i.i.d. samples, the generalization error bound obeys

$$\mathbb E_{x \sim Q} \big|\hat T_n(x) - f(x)\big|^2 \leq C\, D d^2\, n^{-2\beta/(2\beta + d) + o(1)},$$

where $C$ is a constant depending on data and manifold parameters. These scaling laws demonstrate the strong sensitivity of approximation and generalization rates to the intrinsic dimension $d$ (Havrilla et al., 2024).

2. Derivation of Data and Model Scaling Laws from Manifold Geometry

Combining the bias (approximation error) and variance (estimation error) components, the mean squared prediction error $E(n, N)$ in the finite-data, finite-model regime can be expressed as a two-term law:

$$E(n, N) \approx C_1 n^{-\alpha(d)} + C_2 N^{-\beta(d)},$$

with exponents determined by $d$ and the target smoothness $\beta$ (note that the model-scaling exponent $\beta(d)$ is distinct from the smoothness parameter $\beta$):

$$\alpha(d) = \frac{2\beta}{2\beta + d}, \qquad \beta(d) = \frac{2\beta}{d}.$$

For the practical and common case $\beta = 1$, this yields $\alpha(d) = 2/(2 + d)$ and $\beta(d) = 2/d$. These exponents quantify the rate at which increasing dataset size or parameter count improves test error, tightly linking scaling behavior to the data manifold dimension (Havrilla et al., 2024; Bahri et al., 2021).
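
The exponent formulas above can be evaluated directly. The following is a minimal sketch (function names are illustrative, not from the cited papers) computing both exponents for a Lipschitz target:

```python
# Exponents of the two-term law E(n, N) ~ C1 * n**(-alpha) + C2 * N**(-beta),
# for a target with Hölder smoothness beta_s on a d-dimensional data manifold.
def alpha(d: float, beta_s: float = 1.0) -> float:
    """Dataset-size exponent: alpha(d) = 2*beta_s / (2*beta_s + d)."""
    return 2.0 * beta_s / (2.0 * beta_s + d)

def beta(d: float, beta_s: float = 1.0) -> float:
    """Model-size exponent: beta(d) = 2*beta_s / d."""
    return 2.0 * beta_s / d

# Lipschitz targets (beta_s = 1) on manifolds of growing intrinsic dimension:
for d in (2, 8, 32):
    print(f"d={d:2d}  alpha={alpha(d):.3f}  beta={beta(d):.3f}")
```

The rapid decay of both exponents as $d$ grows makes concrete why high-dimensional data manifolds require dramatically more data or parameters per unit of improvement.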

Alternative derivations, such as piecewise-linear function approximation or kernel spectral analysis, yield consistent results; for example, a ReLU network trained with MSE or cross-entropy loss has parameter-scaling exponent $\alpha \approx 4/d$ (Sharma et al., 2020), while kernel spectral decay implies scaling exponents $\alpha = t/d$, where $t$ is the leading non-vanishing Taylor order of the target function (Bahri et al., 2021).

| Scaling law regime | Scaling law formula | Exponent dependence on $d$ |
|---|---|---|
| Approximation error | $L(N) \propto N^{-2\beta/d}$ | $\beta(d) = 2\beta/d$ |
| Model-size (ReLU, MSE) | $L(N) \propto N^{-4/d}$ | $\alpha \approx 4/d$ |
| Dataset-size | $L(n) \propto n^{-2\beta/(2\beta+d)}$ | $\alpha(d) = 2\beta/(2\beta + d)$ |
| Kernel spectrum | $L(N) \sim N^{-t/d}$, $L(n) \sim n^{-t/d}$ | $\alpha(d) = t/d$ (for smooth/analytic $f$) |

3. Empirical Evidence Across Model Classes and Data Modalities

Extensive empirical evaluations validate the theory across model classes and data types. In teacher–student frameworks, varying the intrinsic dimension $d$ by manipulating input features confirms the expected scaling exponent: the measured $\alpha$ agrees with $4/d$ for ReLU networks trained with MSE or cross-entropy loss (Sharma et al., 2020). For convolutional networks on image data, e.g., CIFAR-10, the measured scaling exponent ($\alpha \approx 0.52$) and intrinsic dimension ($d \approx 7.6$ via TwoNN) directly match the theoretical prediction $4/d$.

On LLMs, for instance GPT-2 "small" (117M parameters), observed exponents align with $d \gtrsim 90$ and $\alpha \approx 0.076$, as measured in the final-layer representations (Sharma et al., 2020). Data-scaling exponents for small GPT-style models trained on natural corpora (OpenWebText, TinyStories, SQL extracts) yield observed $\hat\alpha \approx 0.10$–$0.15$ and $d$ in the range $14$–$20$, closely agreeing with the theoretical prediction $\alpha = 2/(2+d)$ (Havrilla et al., 2024).

Ablation studies (on embedding dimension, network width/depth, and sequence length) show that the estimated $d$ remains consistent within $\pm 2$ under architectural variations, supporting the centrality of the data manifold dimension in determining scaling exponents.
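
The TwoNN estimator used in these measurements can be sketched in a few lines of NumPy. This is a simplified illustration (brute-force distances and the maximum-likelihood form of the estimator, without the ratio-trimming step of the full method), not the exact procedure of the cited papers:

```python
import numpy as np

def twonn_dimension(X: np.ndarray) -> float:
    """MLE form of the TwoNN estimate: d = n / sum(log mu), where mu is the
    ratio of each point's second- to first-nearest-neighbor distance."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude zero self-distances
    nearest = np.sort(dists, axis=1)[:, :2]  # r1, r2 for every point
    mu = nearest[:, 1] / nearest[:, 0]
    return len(X) / float(np.sum(np.log(mu)))

# Sanity check: a 2-D uniform sample embedded in R^10 should give an estimate
# near the intrinsic dimension 2, not the ambient dimension 10.
rng = np.random.default_rng(0)
Z = rng.uniform(size=(1000, 2))
X = np.hstack([Z, np.zeros((1000, 8))])
print(f"estimated d ≈ {twonn_dimension(X):.2f}")
```

Because the estimate depends only on nearest-neighbor distance ratios, it is insensitive to the embedding, which is why it recovers the intrinsic rather than the ambient dimension.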

4. Statistical and Approximation-Theoretic Mechanisms

Resolution-limited scaling arises when the model has sufficient capacity to interpolate the training data, so accuracy is dictated by how finely the $d$-dimensional data manifold is sampled or approximated. Approximating a smooth $f$ by partitioning $\mathcal M$ into $N$ regions yields a typical region size of $N^{-1/d}$, and local Taylor expansion of $f$ leads to MSE loss scaling as $N^{-4/d}$ for piecewise-linear (ReLU) networks (Sharma et al., 2020). Similarly, kernel methods in the infinite-width limit exhibit power-law decay of the generalization error, with the exponent derived from the roughness of $f$ and the spectrum of the kernel on $\mathcal M$ (Bahri et al., 2021).
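
This mechanism can be demonstrated numerically in the simplest case $d = 1$: piecewise-linear interpolation of a smooth function on $N$ knots should have MSE falling as $N^{-4/d} = N^{-4}$. A minimal sketch (an illustrative toy, not an experiment from the cited papers):

```python
import numpy as np

# Piecewise-linear fits leave local errors of size O(h^2) on regions of width
# h ~ 1/N; squaring and averaging gives MSE ~ N^{-4} in d = 1.
def piecewise_linear_mse(n_knots: int) -> float:
    knots = np.linspace(0.0, 1.0, n_knots)
    xs = np.linspace(0.0, 1.0, 20000)
    approx = np.interp(xs, knots, np.sin(2 * np.pi * knots))
    return float(np.mean((approx - np.sin(2 * np.pi * xs)) ** 2))

ns = np.array([16, 32, 64, 128, 256])
mses = np.array([piecewise_linear_mse(n) for n in ns])
slope, _ = np.polyfit(np.log(ns), np.log(mses), 1)
print(f"fitted scaling exponent ≈ {slope:.2f}")  # theory predicts -4
```

Repeating the experiment with data on a $d$-dimensional grid would flatten the fitted exponent toward $-4/d$, which is the geometric content of the resolution-limited regime.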

The architecture is also critical. Transformers with token-interaction capacity achieve the optimal scaling with only $O(\log d)$ depth, leveraging efficient implementation of pairwise interactions and partition-of-unity decompositions, thus removing logarithmic factors present in standard feed-forward networks (Havrilla et al., 2024).

Variance-limited regimes, in which either dataset size or parameter count is the bottleneck, show a universal scaling exponent of $1$ (i.e., $L \propto 1/n$ or $L \propto 1/N$), independent of data geometry, reflecting statistical concentration rather than geometric resolution (Bahri et al., 2021).
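
The geometry-independence of the variance-limited exponent can be illustrated with the simplest possible estimator, the sample mean, whose excess risk is exactly $\mathrm{Var} = 1/n$ for unit-variance data (a toy illustration, not from the cited papers):

```python
import numpy as np

# Variance-limited toy: the excess risk of the sample mean as a constant
# predictor of a zero-mean, unit-variance target is Var(mean) = 1/n, so the
# fitted loss exponent is 1 regardless of any data geometry.
rng = np.random.default_rng(1)

def excess_risk(n: int, trials: int = 4000) -> float:
    samples = rng.normal(size=(trials, n))
    return float(np.mean(samples.mean(axis=1) ** 2))

ns = np.array([50, 100, 200, 400])
risks = np.array([excess_risk(n) for n in ns])
slope, _ = np.polyfit(np.log(ns), np.log(risks), 1)
print(f"fitted scaling exponent ≈ {slope:.2f}")  # theory predicts -1
```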

5. Alternative Models: Percolation Theory and Criticality

Beyond manifold approximation, percolation-theoretic models posit that clusters of similar subtasks (connected via a generative process on a $d$-dimensional lattice) define the effective "quanta" of learning. At criticality, the distribution of subtask sizes follows a power law, and the scaling exponent $\alpha$ is determined by the percolation Fisher exponent $\tau(d)$ through

$$\alpha(d) = \frac{3 - \tau}{\tau - 2}, \qquad E(N) \propto N^{-\alpha(d)}.$$

For high $d$ ($d \geq 6$, Bethe lattice), $\tau = 5/2$ and hence $\alpha = 1$. In supercritical regimes, the manifold-approximation scaling is recovered, i.e., $E(N) \propto N^{-c/D}$, with $c$ an approximation constant. This framework unifies and quantitatively grounds previous theories, and empirical analysis on toy percolation datasets confirms a sharp transition between the discrete-subtask (quanta) and smooth-manifold regimes (Brill, 2024).
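
The mean-field value follows directly from the Fisher-exponent formula; a one-line check using exact rational arithmetic (an illustrative sketch, with a hypothetical function name):

```python
from fractions import Fraction

def percolation_alpha(tau: Fraction) -> Fraction:
    """Scaling exponent from the cluster-size Fisher exponent:
    alpha = (3 - tau) / (tau - 2)."""
    return (3 - tau) / (tau - 2)

# Mean-field / Bethe-lattice value (valid for d >= 6): tau = 5/2 gives alpha = 1.
print(percolation_alpha(Fraction(5, 2)))  # → 1
```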

6. Unified Perspective and Practical Implications

The evidence across diverse experimental paradigms and theoretical constructions consistently supports the central role of the data manifold dimension $d$ in governing neural scaling laws. Whether the underlying model is a shallow transformer, a deep convolutional net, or an infinite-width kernel machine, the key scaling exponents for both model and data scaling decrease inversely with $d$.

Empirically, the ability to estimate $d$ robustly via maximum-likelihood, nearest-neighbor-based estimators (e.g., TwoNN) enables practitioners to predict scaling exponents prior to extensive experimentation. There is a strong duality between the data-scaling and model-scaling exponents, concretely expressible as $\alpha = \beta/(\beta + 1)$, facilitating inference of one exponent from measurements of the other (Havrilla et al., 2024).
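
The duality is a direct algebraic consequence of the two exponent formulas and can be checked numerically; the sketch below assumes smoothness $\beta = 1$ and uses illustrative function names:

```python
# With smoothness beta_s, the model exponent is beta(d) = 2*beta_s/d and the
# data exponent is alpha(d) = 2*beta_s/(2*beta_s + d); dividing numerator and
# denominator of alpha by d gives alpha = beta/(beta + 1).
def model_exponent(d: float, beta_s: float = 1.0) -> float:
    return 2.0 * beta_s / d

def data_exponent(d: float, beta_s: float = 1.0) -> float:
    return 2.0 * beta_s / (2.0 * beta_s + d)

for d in (2, 5, 17, 90):
    b = model_exponent(d)
    assert abs(data_exponent(d) - b / (b + 1.0)) < 1e-12
print("duality alpha = beta/(beta + 1) verified")
```

In practice, this means a measured model-scaling exponent immediately predicts the data-scaling exponent (and vice versa) without a second scaling sweep.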

These observations suggest that data geometry is a primary governing factor in resource allocation for large-scale neural network training. The strong sensitivity of scaling exponents to $d$ highlights that improvements in data preprocessing, representation, or architectural alignment with the latent manifold structure may yield disproportionately large gains in achievable generalization efficiency compared to indiscriminate increases in model or dataset size.

7. Limitations, Edge Cases, and Open Directions

Scaling laws derived from manifold dimension require several key assumptions: the manifold must be sufficiently regular (positive reach), the target function adequately smooth (Hölder or twice differentiable), and the model or data regime must lie in the resolution-limited phase. Deviations from these conditions, such as noisy data, non-smooth targets, or breakdown beyond $N_{\max}$ (the parameter regime past which the loss saturates at an intrinsic entropy floor), can alter scaling behavior (Sharma et al., 2020). For non-ReLU activations, special target structure, or high intrinsic noise, exponents may deviate from $4/d$ or related formulas, and empirical finite-size or overfitting effects may further attenuate or truncate the power-law scaling window.

Kernel-spectrum-based scaling results rely on the spectrum of the population kernel exhibiting the decay induced by a smooth manifold, and percolation-theoretic results presuppose context-independent allocation of model capacity across clusters or quantized tasks (Brill, 2024). For realistic data, estimation of $d$ remains reliable primarily for $d \lesssim 20$, and empirically identifying criticality regimes for percolation models is nontrivial.

A plausible implication is that further research into data-driven, geometry-aware architectures and spectral regularization strategies will be required to optimally exploit the scaling economy predicted by intrinsic dimension, as well as to unify the percolation-criticality and smooth-manifold paradigms under a broader theory of data complexity and representation learning.


References:

  • Havrilla et al., 2024. "Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data."
  • Bahri et al., 2021. "Explaining Neural Scaling Laws."
  • Sharma et al., 2020. "A Neural Scaling Law from the Dimension of the Data Manifold."
  • Brill, 2024. "Neural Scaling Laws Rooted in the Data Distribution."
