Not-Just-Scaling: Beyond Power Laws

Updated 12 July 2025
  • Not-Just-Scaling Laws are an evolving framework that challenges simple power-law scaling by incorporating context, design choices, and non-monotonic behaviors in complex systems.
  • They employ advanced models, such as smoothly broken power laws, to capture emergent phase transitions and reduce extrapolation error across fields like deep learning and robotics.
  • This approach informs practical model design by integrating architectural features, data composition, and uncertainty quantification while addressing fairness and societal impacts.

Not-Just-Scaling Laws refer to a growing body of research that interrogates, refines, and generalizes the classic scaling law paradigm: the empirical observation that increasing a system’s scale (e.g., data, parameters, or compute) yields predictable improvements in performance, often following simple power-law relationships. Rather than treating such laws as universal or sufficient for understanding complex systems, “Not-Just-Scaling Laws” encompass lines of inquiry that uncover limits to universality, reveal critical influences of design or context, introduce richer functional forms capable of modeling emergent phenomena and non-monotonic behaviors, and highlight the importance of structural choices, data composition, uncertainty, and societal context. This field traverses deep learning, physics, robotics, data science, and the social implications of machine learning.

1. Rethinking Universality: Scaling Law Limits and Contextuality

Classic scaling laws claim that increasing data and model size monotonically and predictably improves performance, often described by $y = a x^b + c$, where $y$ is error or loss, $x$ is a resource (data, model size, or compute), and $b < 0$. However, multiple studies challenge the universality of these laws. In urban science, for example, a systematic sensitivity analysis across thousands of city definitions demonstrates that most urban indicators are consistent with linear scaling (exponent $\beta \approx 1$) rather than the widely cited sublinear or superlinear regimes; the observed scaling exponent itself fluctuates considerably with the delineation of boundaries and indicator selection (1301.1674). This implies that simple models of scaling as a function of a single proxy (e.g., city population) are not robust or universally applicable.
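As a concrete illustration of the classic form, the sketch below fits $y = a x^b + c$ to synthetic loss-versus-scale measurements with SciPy and extrapolates one order of magnitude beyond the fitted range; the data and parameter values are invented for demonstration and are not drawn from any of the cited studies.

```python
# Minimal sketch: fitting the classic power law y = a * x**b + c (b < 0)
# to synthetic "loss vs. scale" measurements. All values are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    return a * np.power(x, b) + c

# Synthetic measurements: x in millions of training tokens; loss decays
# with scale toward an irreducible floor c.
x = np.array([1.0, 3.0, 10.0, 30.0, 100.0, 300.0])
y = 2.5 * x ** -0.25 + 0.8 + np.random.normal(0, 0.01, x.size)

(a, b, c), _ = curve_fit(power_law, x, y, p0=[1.0, -0.3, 0.5], maxfev=10_000)
print(f"fitted: a={a:.2f}, b={b:.2f} (expect b < 0), c={c:.2f}")

# Extrapolate one order of magnitude beyond the fitted range.
print("predicted loss at x=3000 (i.e., 3B tokens):", power_law(3000.0, a, b, c))
```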

A parallel in machine learning arises in critiques from the social sciences, which argue that performance metrics driving scaling law analysis are merely proxies for “latent” constructs of model output quality. As datasets scale, so does the diversity of user groups, and aggregate metrics may fail to capture value pluralism—some communities may experience degraded or misaligned performance as models grow larger, even if the overall metric improves (Diaz et al., 2023). Consequently, claims of universal benefit from scaling risk privileging majority or high-resource populations to the detriment of marginalized groups.

2. Beyond Power Laws: Broken and Non-Standard Functional Forms

Many phenomena display departures from simple monotonic scaling. The “Broken Neural Scaling Laws” (BNSL) framework generalizes traditional power-law models by introducing smoothly broken power-law functional forms that connect segments with different slopes (on a log–log plot) at one or more “break” points (Caballero et al., 2022). This construction enables accurate modeling and extrapolation of abrupt inflection points (“emergent phase transitions”), non-monotonic transitions (such as double descent in test error), and sudden changes in scaling, which standard power-law models cannot represent.

BNSL is empirically validated across a wide set of tasks and architectures, including modern vision models and LLMs, diffusion models, coding tasks, reinforcement learning, and robustness benchmarks. It demonstrates significantly lower extrapolation error compared to monotonic power-law models, especially for tasks where sharp capability transitions or non-monotonic behaviors are observed.

A concrete mathematical form is

$$ y = a + b\, x^{-c_0} \prod_{i=1}^{n} \left[ 1 + \left( \frac{x}{d_i} \right)^{1/f_i} \right]^{-c_i f_i} $$

where the parameters $(c_i, d_i, f_i)$ control slope shifts and the sharpness and location of each break, and $a$ and $b$ determine the scaling baseline and offset (Caballero et al., 2022).
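A direct transcription of this functional form (here with a single break, $n = 1$) is sketched below; the parameter values are illustrative placeholders rather than fits reported by Caballero et al. (2022).

```python
# Minimal sketch of the smoothly broken power law (BNSL). Parameter values
# below are illustrative placeholders, not published fits.
import numpy as np

def bnsl(x, a, b, c0, breaks):
    """y = a + b * x**(-c0) * prod_i [1 + (x / d_i)**(1 / f_i)]**(-c_i * f_i)

    `breaks` is a list of (c_i, d_i, f_i) tuples: each adds a slope change
    of size c_i near x = d_i, with f_i controlling the sharpness of the break.
    """
    y = b * np.power(x, -c0)
    for c_i, d_i, f_i in breaks:
        y *= np.power(1.0 + np.power(x / d_i, 1.0 / f_i), -c_i * f_i)
    return a + y

x = np.logspace(3, 9, 200)                        # scale axis (e.g., parameters)
y = bnsl(x, a=0.1, b=5.0, c0=0.2, breaks=[(0.4, 1e6, 0.3)])
# On a log-log plot the slope is about -0.2 before the break at d_1 = 1e6
# and about -0.6 after it; a second tuple in `breaks` would add another break.
```

Fitting such a form to empirical measurements (for example with scipy.optimize.curve_fit) is what allows extrapolation across break points that a single power law cannot capture.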

3. Extending the Scaling Law Framework: Structural, Architectural, and Data Composition Effects

Scaling laws based solely on model and data size often obscure the substantial impact of data composition and architectural choices. An extensive meta-analysis of 92 LLMs shows that incorporating features such as the fraction of code in the pretraining corpus and the positional embedding mechanism (rotary embeddings outperform learned embeddings) boosts predictive power for downstream performance by 3–28% relative to using scale alone (Liu et al., 5 Mar 2025).

A critical insight concerns the code/natural language trade-off: with 15–25% of the pretraining data as code, a model achieves balanced capabilities; larger code proportions boost performance on coding tasks but harm natural language understanding. These findings challenge the efficiency of merely scaling parameters and tokens and motivate frameworks that systematize the inclusion of architectural and data mixture variables, thereby generalizing traditional scaling laws via functions of the form:

$$ \hat{s}_T(\mathcal{M}) = f_\theta\!\left( [\mathcal{A}_\mathcal{M};\, \mathcal{D}_\mathcal{M}] \right) $$

where $\mathcal{A}_\mathcal{M}$ and $\mathcal{D}_\mathcal{M}$ are vectors of architectural and data features, respectively (Liu et al., 5 Mar 2025).
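A minimal sketch of this style of predictor follows, assuming a gradient-boosted regressor over concatenated scale, architecture, and data-mixture features; the feature names, regressor choice, and all numeric values are illustrative assumptions rather than the exact setup of Liu et al. (5 Mar 2025).

```python
# Minimal sketch: predicting a downstream score from scale plus architecture
# and data-mixture features, rather than from scale alone. Feature names,
# the regressor choice, and all values are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 92  # e.g., one row per documented LLM

params   = rng.uniform(0.1, 70, n)      # parameters (billions)
tokens   = rng.uniform(0.1, 3.0, n)     # pretraining tokens (trillions)
code_pct = rng.uniform(0.0, 0.5, n)     # fraction of code in the corpus
rotary   = rng.integers(0, 2, n)        # rotary (1) vs. learned (0) embeddings

# Synthetic downstream score: scale helps, a moderate code fraction helps,
# rotary embeddings give a small bump (purely invented relationship).
score = (0.35 + 0.04 * np.log10(params) + 0.05 * np.log10(tokens + 0.1)
         + 0.3 * code_pct * (1 - code_pct) + 0.03 * rotary
         + rng.normal(0, 0.01, n))

X_scale = np.column_stack([params, tokens])
X_full  = np.column_stack([params, tokens, code_pct, rotary])

r2_scale = cross_val_score(GradientBoostingRegressor(), X_scale, score, cv=5).mean()
r2_full  = cross_val_score(GradientBoostingRegressor(), X_full,  score, cv=5).mean()
print(f"scale-only R^2: {r2_scale:.3f}   scale+arch+data R^2: {r2_full:.3f}")
```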

4. New Domains, Modalities, and Phenomena: Graphs, Robotics, and Uncertainty

The neural scaling paradigm has been extended and nuanced within emerging application domains:

  • Graphs: In graph learning, traditional measures such as the “number of graphs” poorly estimate effective data scale due to irregular graph topology. Reformulating data scaling in terms of the total number of edges (or nodes) produces a robust law for both node/edge/graph classification tasks. Moreover, model depth, often unimportant in NLP scaling, plays a distinct, task- and architecture-dependent role in graph neural network scaling, and excessive scaling can result in “model collapse” due to overfitting (Liu et al., 3 Feb 2024).
  • Robotics: In embodied AI, robot foundation models and language–robotics agents display scaling exponents for data and model size that are substantially more negative (e.g., $\beta \approx -0.38$) than those in language modeling (e.g., $\beta \approx -0.07$), suggesting comparatively rapid improvement with scale (Sartor et al., 22 May 2024). Scaling law behavior on "unseen" (generalization) tasks is notably shallower than on "seen" tasks, revealing gaps in data diversity and model adaptability that are domain-specific.
  • Deep Learning Uncertainty: Predictive uncertainty (total, aleatoric, epistemic) exhibits its own scaling laws. In classical (identifiable) models, Bayesian posterior variance shrinks as $O(1/N)$, but in overparameterized deep networks epistemic uncertainty decays with a power law and does not vanish even for very large $N$; Bayesian and ensemble methods therefore remain necessary for robust uncertainty quantification, refuting the argument that enough data obviates uncertainty (Rosso et al., 11 Jun 2025). A minimal sketch of such a saturating power law follows this list.
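The sketch below illustrates the uncertainty point: fitting a saturating power law $u(N) = a N^{-b} + c$ to synthetic uncertainty estimates recovers both the decay exponent and the non-zero floor, whereas a naive log–log fit understates the residual uncertainty. All values are synthetic and purely illustrative.

```python
# Minimal sketch: epistemic uncertainty that decays as a power law toward a
# non-zero floor, u(N) = a * N**(-b) + c. All values are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def saturating_power_law(N, a, b, c):
    return a * np.power(N, -b) + c

N = np.logspace(3, 7, 12)                                   # dataset sizes
u = 4.0 * N ** -0.35 + 0.02 + np.random.normal(0, 1e-3, N.size)

(a, b, c), _ = curve_fit(saturating_power_law, N, u, p0=[1.0, 0.3, 0.0])
print(f"decay exponent b={b:.2f}, irreducible floor c={c:.4f}")

# A naive log-log linear fit assumes c = 0; the floor biases its slope toward
# zero and leads to over-optimistic extrapolation of uncertainty at large N.
slope = np.polyfit(np.log(N), np.log(u), 1)[0]
print(f"naive log-log slope: {slope:.2f} (vs. true decay exponent -0.35)")
```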

5. Theoretical Foundations: Data, Spectrum, and Criticality

Recent theories clarify why scaling laws exhibit their characteristic universality and power-law exponents:

  • Linear Regression and SGD: In high-dimensional linear regression where the data covariance spectrum follows a power law, the reducible test error scales as $\Theta\!\left( M^{-(a-1)} + N^{-(a-1)/a} \right)$, where $M$ is model size, $N$ is data size, and $a$ describes the spectrum decay. The variance error vanishes due to the implicit regularization of SGD, aligning theoretical bias–variance decompositions with empirical scaling law behavior (Lin et al., 12 Jun 2024). A small numerical reading of this bound follows the list.
  • Percolation Theory and Data Criticality: Modeling data distributions through percolation theory, two scaling regimes emerge. In the “quantized” critical regime, power-law distributions of discrete subtasks (clusters) yield a scaling exponent; in the supercritical regime, a dominant manifold governs scaling as in classical nonparametric regression. This theoretical synthesis unifies Zipfian “Q Sequence” and manifold-based interpretations, linking the structure of empirical data to observed scaling law universality (Brill, 10 Dec 2024).
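The following sketch gives a numerical reading of the linear-regression bound, evaluating the model-size and data-size terms for hypothetical values of $M$, $N$, and the spectrum decay $a$ to show which resource limits the reducible error; the numbers are illustrative only.

```python
# Minimal sketch: which term dominates the reducible error
# Theta(M**(-(a-1)) + N**(-(a-1)/a)) for a power-law covariance spectrum with
# decay exponent a > 1. All numbers are hypothetical and only illustrate the
# asymptotic reading of the bound.
a = 1.5  # spectrum decay exponent

for M, N in [(1e5, 1e9), (1e8, 1e9), (1e8, 1e6)]:
    model_term = M ** -(a - 1)          # error floor from finite model size
    data_term = N ** (-(a - 1) / a)     # error from finite data under one-pass SGD
    limiter = "model-limited" if model_term > data_term else "data-limited"
    print(f"M={M:.0e}, N={N:.0e}: {model_term:.1e} + {data_term:.1e} -> {limiter}")
```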

6. Beyond Performance: Societal, Physical, and Emergent Constraints

Analyses of scaling laws also reveal broader constraints and implications:

  • Physical Limits: Optimized protocols in dense wireless networks show that linear scaling is unattainable due to finite spatial degrees of freedom and electromagnetic propagation limits—sum rates can be improved via clever clustering and scheduling but remain sub-linear at any practical scale (1404.6320).
  • Macroscopic Systems: In nonequilibrium physics, non-standard scaling laws characterize fluctuations in coupled oscillator systems; for example, the diffusion constant of order-parameter fluctuations can decay as $D \sim 1/N^a$ (with $a > 0$), revealing new universality classes not present in incoherent regimes (1304.3990).
  • Societal Values and Quality: Scaling law metrics can obscure differential impacts across subgroups, as larger datasets may introduce values and preferences at odds with chosen evaluation proxies, rendering quantitative improvement claims partial or even misleading at scale (Diaz et al., 2023).

7. Methodological Innovations and Future Directions

Not-Just-Scaling Laws stimulate several methodological advances and open directions:

  • Fitting and Extrapolation: Methodologies now employ change-point, broken power law, and multi-feature regression models (including tree-based learners and ensemble approaches) for more accurate performance prediction and understanding of emergent transitions (Caballero et al., 2022, Liu et al., 5 Mar 2025).
  • Extrapolation across Modalities: The field demonstrates that performance on larger, more complex problems (e.g., board-game scale or robotics tasks) can be predicted from carefully designed small-scale experiments, provided the scaling models accurately reflect model, data, and problem complexity (Jones, 2021, Sartor et al., 22 May 2024).
  • Integration of Architecture, Data Mix, and Problem Structure: Modern frameworks increasingly incorporate architectural features, domain composition, and outputs into predictive models to move beyond two-dimensional scaling plots (Liu et al., 5 Mar 2025).
  • Uncertainty Quantification: Scaling analysis now encompasses predictive uncertainties and provides operational guidance for deployment, especially in domains where safety and calibrated uncertainty are required (Rosso et al., 11 Jun 2025).
  • Critical Engagement: Future work considers richer, participatory approaches to evaluation metrics and data curation, especially in light of conflicting community values and the failure of universality (Diaz et al., 2023).

Conclusion

Not-Just-Scaling Laws denote both a research agenda and an emerging toolkit for understanding complex systems where classic scaling relationships, while powerful, are neither universal nor sufficient. This body of work integrates sensitivity analyses, broken functional forms, theory rooted in data and spectral properties, and domain-specific phenomena. It stresses the necessity of incorporating context, design, and broader value implications. Predicting and extrapolating model behavior therefore require embracing richer, high-dimensional, and sometimes non-monotonic models of system scaling. This evolving paradigm not only enhances predictive performance but also equips researchers and practitioners to design, evaluate, and deploy complex systems in a manner attuned to their inherent structure, dynamical regimes, and societal context.
