
Scaling Law for World Models

Updated 22 October 2025
  • The paper introduces a data scaling law that quantifies prediction error reduction in world models as a power-law function of model parameters and dataset size, with error decaying proportionally to N^(-4/d).
  • World models are neural systems that learn latent environment representations for tasks like reinforcement learning, robotics, and autonomous driving by optimizing reconstruction and predictive simulations.
  • Empirical studies validate that scaling regimes adapt to domain complexity, where intrinsic dimension and data redundancy govern the effective rate of error improvement across diverse modalities.

World models are neural systems designed to learn latent representations capturing the dynamics of an environment, enabling predictive simulation, planning, or generative tasks in domains such as reinforcement learning, robotics, and autonomous driving. The data scaling law for world models refers to the predictable power-law improvement in error metrics (prediction error, reconstruction error, or task success rate) as the number of model parameters, the amount of training data, or the computational budget increases. Recent works have formalized this law, anchoring its exponent to the intrinsic dimension of the environment manifold, data redundancy, and algorithmic properties, and have validated its universality across image, language, multimodal, and embodied-agent modeling paradigms.

1. Theoretical Foundations: Neural Scaling Law and Intrinsic Dimension

The principal scaling law states that, in data-rich regimes where the learning task is regression over a $d$-dimensional data manifold, the minimum loss $L$ achievable by a neural network of $N$ parameters satisfies

L(N) \propto N^{-\alpha}

with the scaling exponent, for piecewise-linear (e.g., ReLU) networks and cross-entropy or MSE losses, predicted as

\alpha \approx \frac{4}{d}

where $d$ is the intrinsic dimension of the data manifold (Sharma et al., 2020). The derivation follows a geometric argument: partitioning the manifold into $N$ regions, the loss per region scales as $s^4$ (where $s$ is the region side length), and the total number of regions scales as $N \sim s^{-d}$, yielding

L(s) \propto s^4 \propto N^{-4/d}

The relationship is validated by measuring the intrinsic dimension via methods such as TwoNN and MLE applied to neural activations, and by directly fitting loss-versus-$N$ curves in controlled teacher–student experiments.
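Such a direct fit is straightforward in log-log space: the slope of $\log L$ versus $\log N$ is $-\alpha$, and $d$ follows from $\alpha = 4/d$. A minimal sketch on synthetic data with a known intrinsic dimension ($d = 8$, so $\alpha = 0.5$; all values are illustrative):

```python
import numpy as np

def fit_scaling_exponent(N, L):
    """Fit L(N) = c * N^(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
    return -slope  # alpha

# Synthetic loss curve for a manifold of intrinsic dimension d = 8,
# so the predicted exponent is alpha = 4/d = 0.5.
d = 8
N = np.logspace(4, 8, 20)
L = 3.0 * N ** (-4.0 / d)

alpha = fit_scaling_exponent(N, L)
d_est = 4.0 / alpha  # invert alpha = 4/d to recover the intrinsic dimension
print(alpha, d_est)  # ≈ 0.5, ≈ 8.0
```

On real activations the measured dimension (via TwoNN or MLE) would replace the known `d`, and the fitted exponent is compared against $4/d$.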

This dimension-dependent scaling law generalizes to world models: the effective prediction error in trajectory forecasting, reconstruction, or generative modeling tasks is dictated by the model's ability to resolve the latent manifold structure of the world. As a corollary,

  • Domains with low $d$ (e.g., structured physics or constrained agent environments) yield faster scaling.
  • For high-dimensional worlds (e.g., unconstrained language modeling), improvements per parameter scale slowly, requiring substantial model capacity to halve the error.
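This corollary can be made concrete: under $L(N) \propto N^{-4/d}$, halving the error requires multiplying $N$ by $2^{d/4}$. A minimal sketch (the $d$ values are illustrative; $d \approx 53$ matches the LLM estimate cited in Section 4):

```python
def params_factor_to_halve_error(d):
    """Multiplicative increase in N needed to halve the loss, assuming
    L(N) ∝ N^(-4/d): solve (k*N)^(-4/d) = 0.5 * N^(-4/d) for k."""
    return 2.0 ** (d / 4.0)

# Low-dimensional, physics-like world vs. a text-like world.
print(params_factor_to_halve_error(4))   # 2.0: doubling N halves the error
print(params_factor_to_halve_error(53))  # ≈ 9742: vastly more capacity needed
```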

2. Spectral Origin and Taxonomy of Scaling Regimes

Complementary analysis via kernel regression shows that power-law scaling relates to the spectral decay of the data covariance or kernel matrix. If the kernel eigenvalues decay as $\lambda_i \sim i^{-1/\beta}$, the excess risk converges as

\mathbb{E}[\mathcal{E}(f_n)] \sim n^{-\alpha}

with

\alpha = \frac{2s}{2s + 1/\beta}

where $s$ is the source smoothness and $\beta$ quantifies the spectral tail (an inverse redundancy index) (Bi et al., 25 Sep 2025). The scaling exponent is not universal; it is determined by the amount of redundancy in the data. Steeper spectral decay (higher $\beta$) implies less redundancy and a faster exponent, a phenomenon shown to be invariant under boundedly invertible transformations and universal across architecture choices, mixtures, and regimes (e.g., NTK, feature learning).

Scaling regimes are classified as either

  • Variance-limited ($\alpha \approx 1$): when data or model width is sufficiently large, loss decays as $1/D$ or $1/N$,
  • Resolution-limited ($\alpha < 1$, data-dependent): where a finite model or dataset limits the model's ability to resolve the manifold, with the exponent controlled by dimension or spectral properties (Bahri et al., 2021).

Empirical studies confirm that model-size scaling and data-size scaling are dual, governed by the same spectral tail.
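A minimal sketch of the spectral exponent formula, showing how lower redundancy (larger $\beta$) accelerates convergence (parameter values are illustrative):

```python
def spectral_scaling_exponent(s, beta):
    """Excess-risk exponent alpha = 2s / (2s + 1/beta) for kernel regression
    with eigenvalue decay lambda_i ~ i^(-1/beta) and source smoothness s."""
    return 2.0 * s / (2.0 * s + 1.0 / beta)

# Less redundant data (steeper spectral decay, larger beta) scales faster.
for beta in (0.5, 1.0, 2.0):
    print(beta, spectral_scaling_exponent(1.0, beta))  # 0.5, 2/3, 0.8
```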

3. Practical Formulations and Predictive/Optimization Strategies

Empirical scaling laws for world models often use composite formulas combining model and data size:

L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E

where $A, B, \alpha, \beta$ are fitted coefficients and $E$ is the irreducible error (Sengupta et al., 17 Feb 2025). Key practical formulas enable pre-training optimization, resource allocation, and extrapolation:

  • Minimal possible test loss prediction for given $N, D$.
  • Compute-optimal splits: given fixed FLOPs $C$, optimal $N^*(C) \propto C^a$ and $D^*(C) \propto C^b$ with $a + b \approx 1$, where the split varies with tokenizer, architecture, supervision density, and domain (Pearce et al., 7 Nov 2024).
  • Critical batch size for optimal throughput: $B_{\mathrm{crit}}(L) = B_*/L^{1/\alpha_B}$, allowing a structured time/compute trade-off (Su et al., 11 Mar 2024).
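The compute-optimal split follows from minimizing $L(N, D)$ under a compute constraint. The sketch below assumes the common approximation $C \approx 6ND$ and illustrative coefficients (not fitted values from any cited paper); solving the constrained minimization in closed form gives $a = \beta/(\alpha + \beta)$:

```python
def compute_optimal_split(C, A, B, alpha, beta, k=6.0):
    """Minimize A/N^alpha + B/D^beta subject to C = k*N*D.
    Setting the derivative in N to zero (with D = C/(k*N)) gives the
    closed form N* = G * (C/k)^a, where a = beta/(alpha + beta)."""
    a = beta / (alpha + beta)
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_opt = G * (C / k) ** a
    D_opt = (C / k) / N_opt
    return N_opt, D_opt

# Illustrative coefficients (hypothetical, not from real training runs).
N_opt, D_opt = compute_optimal_split(C=1e21, A=400.0, B=410.0, alpha=0.34, beta=0.28)
```

Because the split exponent depends only on $\alpha$ and $\beta$, changes in supervision density or tokenization that alter those exponents shift the optimal parameter/data balance.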

Recent results demonstrate that with dense, self-supervised world modeling supervision (e.g., future image prediction in autonomous driving), the scaling exponent is amplified—performance gains accelerate rather than saturate as dataset size increases (Li et al., 14 Oct 2025).

4. Empirical Validation and Domain-Specific Extensions

Extensive empirical tests confirm scaling law predictions for world models in

  • CNN image classifiers: measured $\alpha$ matches $4/d$ via hidden-layer activation dimension (Sharma et al., 2020).
  • Transformer LLMs: $\alpha \approx 0.076$, effective $d \approx 53$, consistent with the slow scaling observed due to the high complexity of text worlds.
  • Autoregressive and diffusion world models for autonomous driving: inclusion of predictive scene modeling shifts data scaling from plateau (action-only) to accelerating gains (dense visual+action signals) (Li et al., 14 Oct 2025).
  • Embodied agent pre-training: optimal resource splits between $N$ and $D$ depend heavily on tokenization scheme, supervision density, and model architecture, with exponents $a$ ranging from $0.32$ (sparse action cloning) to $0.62$ (high token granularity) (Pearce et al., 7 Nov 2024).
  • Multimodal world models: scaling behavior is linear in $\sum_i \log(T_i / C_i) + \log P$, where $T_i$ is raw data size, $C_i$ is a per-modality compression factor, and $P$ is parameter count, allowing efficient deployment trade-off strategies on resource-constrained hardware (Sun et al., 10 Sep 2024).
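The multimodal form in the last bullet reduces model comparison to ordinary linear regression on a single scalar feature. A minimal sketch with hypothetical run data (the sizes, compression factors, and losses below are invented for illustration):

```python
import numpy as np

def multimodal_feature(T, C, P):
    """Scalar predictor sum_i log(T_i / C_i) + log(P) from the linear
    scaling form for multimodal world models."""
    return float(np.sum(np.log(np.asarray(T) / np.asarray(C))) + np.log(P))

# Hypothetical runs: ((per-modality raw sizes T, compression factors C), params P, loss).
runs = [
    (([1e9, 5e8], [4.0, 8.0]), 1e8, 3.1),
    (([4e9, 2e9], [4.0, 8.0]), 4e8, 2.6),
    (([1e10, 8e9], [4.0, 8.0]), 1e9, 2.2),
]
X = np.array([multimodal_feature(T, C, P) for (T, C), P, _ in runs])
y = np.array([loss for _, _, loss in runs])
slope, intercept = np.polyfit(X, y, 1)  # fitted linear scaling coefficients
```

The fitted slope is negative (loss falls as effective data and parameters grow), and the per-modality compression factors let the same line rank deployment configurations on constrained hardware.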

Empirical scaling curves are further refined through automated discovery frameworks such as EvoSLD, which co-evolve symbolic expressions and optimization routines to achieve parsimonious, universally accurate functional forms, even surpassing human-designed laws in challenging scenarios (Lin et al., 27 Jul 2025).

5. Limitations, Breaks, and Adaptive Laws

The classic power-law scaling law does not always generalize across settings. Key exceptions include:

  • Sub-scaling regimes: high data density and overtraining cause diminishing returns, deviating from power-law predictions. Logistic decay factors must be added to scaling models to better represent real-world losses in large models and datasets (Chen et al., 13 Jul 2025).
  • Broken neural scaling laws (BNSL): scaling exponents may change at critical data or model size thresholds, sometimes causing performance to worsen before improvements resume. Piecewise or composite scaling forms are necessary.
  • Domain adaptation: optimal performance in world models often requires varying mixture ratios of domain-specific and general data, captured by adaptive laws such as the D-CPT formula (Sengupta et al., 17 Feb 2025).
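The broken-scaling case can be illustrated with a simplified two-regime curve that is continuous at the break; this is a pedagogical stand-in, not the full BNSL functional form:

```python
import numpy as np

def broken_power_law(N, c, a1, a2, N_break):
    """Illustrative two-regime scaling curve: exponent a1 below N_break,
    a2 above, with the prefactor of the upper branch chosen so the two
    branches meet continuously at N_break."""
    N = np.asarray(N, dtype=float)
    low = c * N ** (-a1)
    high = c * N_break ** (a2 - a1) * N ** (-a2)
    return np.where(N < N_break, low, high)

# Fast early scaling (a1 = 0.5) that slows after the break (a2 = 0.2).
curve = broken_power_law(np.logspace(4, 8, 9), c=10.0, a1=0.5, a2=0.2, N_break=1e6)
```

Fitting such a form to empirical curves amounts to estimating the break location alongside the two exponents, rather than forcing a single power law through both regimes.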

Compression schemes (quantization, sparsity, mixed formats) require unified scaling laws that incorporate an intrinsic representation capacity $\rho(R)$, derived from the ability to fit random Gaussian data. The effective parameter count scales as $N' = N \cdot \rho(R)$, allowing direct prediction of trade-offs under constrained resources (Panferov et al., 2 Jun 2025).
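A minimal sketch of the substitution $N' = N \cdot \rho(R)$ into the model-size term of the scaling law (the numeric value of $\rho$ here is illustrative, not a measured capacity):

```python
def effective_params(N, rho_R):
    """Effective parameter count under a compressed representation:
    N' = N * rho(R), where rho(R) in (0, 1] is the intrinsic capacity
    of the format (its ability to fit random Gaussian data)."""
    return N * rho_R

def compressed_loss(N, rho_R, A, alpha, E=0.0):
    """Model-size term of the scaling law evaluated at N' = N * rho(R)."""
    return A / effective_params(N, rho_R) ** alpha + E

# Illustrative: with rho(R) = 0.8, a 7B model behaves like a ~5.6B model,
# so its predicted loss sits above the uncompressed baseline.
print(effective_params(7e9, 0.8))
```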

6. Implications and Applications in World Modeling

Data scaling laws provide a quantitative foundation for:

  • Predicting loss or error trajectory versus model/data/computation resources prior to training, enabling principled design choices and compute allocation for world models.
  • Optimizing pre-training strategies, e.g., selecting optimal $N$, $D$, or batch sizes for simulation/planning models in robotics and autonomous navigation.
  • Informing choice of tokenization schemes, supervision density, and architecture granularity to achieve compute-optimal performance for embodied agent modeling.
  • Enabling compute-optimal test-time scaling, where additional inference passes at deployment allow smaller foundation models to match or even outperform larger baselines (demonstrated with beam search and best-of-N selection strategies) (Cong et al., 31 Mar 2025).
  • Establishing evaluation benchmarks in closed-loop environments: success is best measured via embodied utility (task success) rather than isolated perceptual metrics—controllability and planning depth are more predictive of deployed utility (Zhang et al., 20 Oct 2025).

In sum, the data scaling law for world models consolidates geometric, spectral, empirical, and optimization insights, establishing a general framework for error prediction and resource allocation across modalities, tasks, and architectures. Understanding and leveraging the dimension-dependent scaling exponent, redundancy structure, and adaptive extensions is essential for designing world models that realize both theoretical and practical limits of performance.
