On the origin of neural scaling laws: from random graphs to natural language
Abstract: Scaling laws have played a major role in the modern AI revolution, providing practitioners with predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power-law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws, even in the absence of power-law structure in the data correlations. We further dial down the complexity of natural language systematically by training on sequences sampled from increasingly simplified generative LLMs, from 4-, 2-, and 1-layer transformer LLMs down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdös-Renyi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling: we demonstrate that several essential results can be reproduced using 2-layer transformers with a context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter-efficient than standard parameterization.
Explain it Like I'm 14
Plain-English Guide to “On the origin of neural scaling laws: from random graphs to natural language”
Overview (what the paper is about)
This paper tries to understand why bigger and better-trained language models (LLMs) get more accurate in such a predictable way. In many AI tasks, the error (how wrong the model is) drops following a simple pattern called a “power law” as you increase:
- the size of the model (number of parameters),
- the amount of data it sees,
- and the total compute (roughly, how much calculation you spend).
People have guessed that these neat patterns come from patterns already inside the data (like Zipf’s law in word frequencies). This paper tests that idea by building very simple, controllable “toy worlds” and showing that the same scaling patterns appear even when the data itself doesn’t have those power-law patterns. Then they gradually dial the data’s complexity up, from simple graph walks to real language, and watch how the scaling changes.
Key questions (what the authors wanted to find out)
- Do neural scaling laws appear even when the training data has no obvious power-law structure?
- How do the scaling rules change as we make the data more or less complex?
- What is the best way to measure and fit these scaling laws so we can plan training efficiently?
- Can simple models (like small, 2-layer transformers) already show the main scaling effects seen in big LLMs?
- Are there training setups (like a specific parameterization called “μP”) that use parameters more efficiently?
How they studied it (methods in everyday language)
The authors built a step-by-step ladder from very simple to more realistic data:
- Graph walks as simple “language”: Think of a graph as cities (nodes) connected by roads (edges). A “random walk” is like a traveler who starts in some city and repeatedly chooses a random road to take next. If you write down the sequence of visited cities, it looks like a sequence of tokens (like words). Predicting the “next city” from the current one is just like predicting the next word from the previous word. In language modeling, predicting based on the last one word is called a “bigram” model.
- Different graph types:
- Erdős–Rényi graphs: random connections, no power-law patterns in node connections.
- Barabási–Albert graphs: “scale-free” graphs where a few nodes have many connections (a power-law pattern).
- They also tried “biased” walks where some paths are more likely, to gently introduce power-law-like structure.
- Dialing complexity toward real language:
- Language bigrams (pairs of tokens) learned from a real dataset.
- Text generated by tiny transformers trained on real data (1-, 2-, and 4-layer generators), which captures more structure than bigrams but is still simpler than true language.
- Real natural language.
- Measuring learning curves: they track how the test loss falls as
- you increase model size (N),
- you increase data size (D),
- or you optimize the mix of both for a fixed compute budget.
A simple, widely used fit for these curves is L(N, D) ≈ E + A/N^α + B/D^β, where:
- L is the loss (error).
- E is the “irreducible loss” (the part you can’t beat because the task has some inherent uncertainty).
- A and B are fitted constants.
- The exponents α and β tell you how quickly things improve with more parameters or more data. Bigger exponents mean faster improvement. (A minimal fitting sketch in code appears right after this list.)
- Fitting the curves carefully:
- They show it’s crucial to include the “irreducible loss” E when fitting. If you drop it, you can badly underestimate how fast models improve.
- They compare different fitting methods and argue that a common 2D formula (“Chinchilla fit”) is less accurate than using standard ML regression (like a small neural network) to map (N, D) → loss. This matters if you want reliable “compute-optimal” plans—how best to spend a fixed budget across model size and data.
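To make the fit above concrete, here is a minimal Python sketch (not code from the paper) that fits L(N, D) = E + A/N^α + B/D^β to a small grid of (N, D, loss) points with SciPy; the grid and the “true” constants are made up for illustration.

```python
# Minimal sketch (not the paper's code): fit L(N, D) = E + A/N**alpha + B/D**beta,
# keeping the irreducible-loss term E, to a small grid of (N, D, loss) measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Hypothetical measurements from a small training grid (replace with real runs).
rng = np.random.default_rng(0)
N = np.tile([1e6, 3e6, 1e7, 3e7, 1e8], 4)
D = np.repeat([1e8, 3e8, 1e9, 3e9], 5)
loss = scaling_law((N, D), E=1.7, A=400.0, alpha=0.34, B=1e3, beta=0.28)
loss = loss + rng.normal(scale=0.01, size=loss.shape)

p0 = [1.0, 100.0, 0.3, 500.0, 0.3]  # rough initial guess for (E, A, alpha, B, beta)
popt, _ = curve_fit(scaling_law, (N, D), loss, p0=p0, maxfev=20000)
E_fit, A_fit, alpha_fit, B_fit, beta_fit = popt
print(f"E={E_fit:.2f}  alpha={alpha_fit:.3f}  beta={beta_fit:.3f}")
# Refitting with E forced to 0 typically drags alpha and beta down, which is the
# "underestimate how fast models improve" failure mode described above.
```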
Main findings (what they discovered and why it matters)
Here are the essential takeaways:
- Scaling laws show up even with “no-power-law” data
- Training transformers to predict random walks on simple random graphs (Erdős–Rényi), where the data has no built-in power-law patterns, still produced clean power-law learning curves.
- Why this matters: It suggests that the neat scaling behavior is not only a property of the data; it also comes from the learning setup (architecture + objective + optimization). That’s a big clue about the deeper “why” behind scaling laws.
- As data gets more structured, scaling exponents change in a smooth, sensible way
- They “dial up” data complexity: language bigrams → text from tiny transformers (T1L, T2L, T4L) → real language.
- One of the key exponents (roughly, the one tied to data size, β) stays fairly stable (around one-half), while the exponent tied to model size or data complexity changes in a monotonic way. In plain terms: the “shape” of improvement changes gradually and predictably as the data becomes richer.
- Simple models can reproduce big, famous results
- Using tiny 2-layer transformers with short context windows (like 100 tokens), they recover several well-known scaling behaviors from earlier large-scale studies. Even differences between famous results (like OpenAI’s Kaplan et al. vs. DeepMind’s Chinchilla rules) can be explained partly by whether you count “embedding” parameters (the big tables that map tokens to vectors) in the model size. This means you don’t always need huge, expensive experiments to study scaling—small, careful ones can teach a lot.
- Better fitting = better planning
- They argue that:
- You must include the irreducible loss E in fits, or you’ll underestimate how quickly models can improve.
- A popular 2D formula (the “Chinchilla” fit) often fits the data worse than standard ML regression. Using a small neural net to fit loss as a function of (N, D) can give more accurate predictions for “compute-optimal” training plans (how to split your budget between model size and data size). A small numerical sketch of this planning step appears right after this list.
- Why this matters: Better fits mean better decisions when spending time and money on training.
- A training setup called μP may be more parameter-efficient
- “Maximal Update Parameterization” (μP) is a way to set a model’s internal scaling so different sizes train more consistently. The authors see early signs that μP might get more out of each parameter than standard setups, which could shift the “best” way to allocate compute.
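To illustrate the compute-optimal planning idea from the “Better fitting = better planning” item, here is a minimal sketch assuming you already have fitted constants for L(N, D) = E + A/N^α + B/D^β and the usual compute approximation C ≈ 6ND; the constants below are hypothetical, not the paper’s fits.

```python
# Minimal sketch: given fitted constants for L(N, D) = E + A/N**alpha + B/D**beta and
# the usual compute approximation C ≈ 6*N*D, scan model sizes to find the best split
# of a fixed FLOP budget. The constants are hypothetical, not the paper's fits.
import numpy as np

E, A, alpha, B, beta = 1.7, 400.0, 0.34, 1e3, 0.28

def optimal_split(C, n_grid=None):
    n_grid = np.logspace(5, 11, 2000) if n_grid is None else n_grid
    D = C / (6.0 * n_grid)                        # tokens implied by the budget
    loss = E + A * n_grid ** (-alpha) + B * D ** (-beta)
    i = int(np.argmin(loss))
    return n_grid[i], D[i], loss[i]

for C in [1e17, 1e19, 1e21]:                      # FLOP budgets
    N_opt, D_opt, L_opt = optimal_split(C)
    print(f"C={C:.0e}  N_opt={N_opt:.2e}  D_opt={D_opt:.2e}  loss={L_opt:.3f}")
# Sweeping many budgets traces out a compute-optimal curve; for this functional form
# the slopes satisfy N_opt ~ C**(beta/(alpha+beta)) and D_opt ~ C**(alpha/(alpha+beta)).
```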
Why this is important (implications and impact)
- Understanding the roots of scaling laws helps us predict the future of AI progress. If we know why and how error drops with more data and compute, we can plan smarter training runs and save huge amounts of resources.
- The fact that scaling laws appear even without power-law patterns in the data means these laws are more universal than we thought. They likely come from how transformers learn sequences with the cross-entropy objective, not only from quirks in natural data.
- The “dial up the complexity” approach gives a controlled playground to test theories. Researchers can run cheap experiments on synthetic data (like graph walks) and still learn lessons that apply to real language.
- Better fitting tools (including keeping the irreducible loss term and using standard ML regression) can make compute planning more reliable across labs and projects.
- If μP is more parameter-efficient, future models could be trained more economically, reaching the same accuracy with fewer parameters.
In short: The paper shows that neural scaling laws are robust, not just a coincidence of natural data. It offers clearer measurement tools, cheaper testing grounds, and hints for more efficient training—all of which can speed up and stabilize progress in building smarter LLMs.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The paper leaves several concrete questions and limitations that future work could address:
- Formal mechanism: No theoretical explanation is provided for why transformers trained with cross-entropy on Markov sequences exhibit power-law loss scaling even when the data lacks power-law correlations; what optimization dynamics, architectural properties, or capacity/variance trade-offs set the exponents?
- Exponent attribution: The study does not map scaling exponents to measurable dataset or graph properties (e.g., spectral gap, mixing time, degree heterogeneity, stationary entropy, rank–frequency slope); can α, β, γ be predicted from such statistics?
- Limited graph ensembles: Experiments focus on ER and BA graphs with mostly unbiased, stationary walks; directed graphs, weighted graphs with structured community/hierarchy, hypergraphs, and higher-order n-grams (n>2) are unexplored.
- Incomplete sweep of correlation strength: The biased-walk parameter κ is only tested at 0 and 1; a continuous sweep of κ is needed to quantify how exponents and fit quality evolve with the heaviness of the transition probability tail.
- Scale limitations: Results are predominantly with 2-layer transformers and short contexts (50–100) at modest N and D; do the same conclusions hold at scales typical of state-of-the-art LLM pretraining (longer context windows, deeper models, larger parameter counts and datasets)?
- TnL dataset fidelity: The “Transformer-generated n-layer (TnL)” synthetic datasets are assumed to interpolate complexity toward natural language, but their closeness to real language (distributional properties, long-range dependencies) is not validated; how do TnL statistics (entropy, mutual information, memory length) compare to real corpora?
- Complexity measure: Dataset “complexity” is proxied by the irreducible loss E from 1d fits; this is a loose lower bound and may be biased by limited N/D; develop robust complexity metrics (e.g., conditional entropy at varying orders, mutual information decay, long-range correlation measures) and relate them quantitatively to the scaling exponents.
- Long-range dependencies: Random walks (bigram-equivalents) have short memory and exponential autocorrelation; how do scaling laws change in datasets with controlled long-range dependencies (e.g., higher-order n-grams, hierarchical grammars, synthetic processes with power-law memory)?
- Bigram truncation and smoothing: The language bigram model removes rare bigrams (≤5 occurrences), altering tail behavior; assess how rare-event truncation and smoothing choices (e.g., Kneser–Ney, additive smoothing) affect scaling exponents and fit quality.
- Architecture breadth: Only decoder-only GPT-style models are studied; head count, width/depth trade-offs, positional encodings, normalization schemes, residual scaling, and attention variants are not ablated; how do these architectural choices shift scaling exponents and compute-optimal policies?
- Loss/objective scope: Experiments use cross-entropy next-token loss; do scaling laws and exponents persist under other objectives (MSE, contrastive pretraining, sequence-level losses), finetuning regimes, or RLHF?
- Training regime: One-epoch training and the compute approximation C≈6ND are assumed; early stopping, multi-epoch data reuse, curriculum schedules, and optimizer settings (and their interaction with compute accounting) are not explored; how do these change the optimal a, b, γ?
- Embedding parameter inclusion: The reconciliation of Kaplan vs. Chinchilla through embedding inclusion is shown only in shallow models; quantify systematically across depths/context lengths whether embedding accounting is the dominant factor and how other factors (learning-rate schedules, optimizer, weight decay) contribute.
- μP parameterization: Evidence that maximal-update parameterization (μP) is more parameter-efficient is preliminary; a comprehensive, multi-scale comparison (varying depth, width, context, datasets) is needed to establish tokens-per-parameter compute-optimal laws under μP and their robustness.
- Fit methodology and uncertainty: While 1d power-law fits outperform exponentials, other plausible forms (stretched exponentials, broken power laws, log-polynomial corrections) are not evaluated; provide principled model selection, uncertainty quantification for exponents, and assess potential regime changes (piecewise scaling).
- 2d regression extrapolation risk: Compute-optimal curves derived via neural-network regression lack interpretability and theoretical guarantees; evaluate extrapolation sensitivity, compare with Gaussian process or spline-based nonparametric regressors, and propagate uncertainty into a, b, γ estimates.
- Spectral links: The relationship between the transition matrix spectrum (eigenvalue gap, tail distribution) and learned scaling exponents is hypothesized but not tested; design graph ensembles with controlled spectra to directly test exponent dependence on spectral properties (a minimal spectral-gap computation is sketched right after this list).
- Domain generality: Only language data and graph walks are studied; test whether “scaling without data power laws” generalizes to other domains (vision, audio, code, mathematics) and whether exponents differ systematically by modality.
- Capability-level validation: The study focuses on token-level loss; examine whether scaling exponents translate consistently into downstream capability scaling (QA, reasoning, long-context tasks), especially in settings with short memory versus long dependencies.
- Hyperparameter robustness: Exponent estimates are based on limited hyperparameter sweeps and a fixed set of schedules; quantify sensitivity to learning rate, batch size, weight decay, optimizer, and initialization across datasets and architectures.
- Reproducibility and seeds: The stability of exponents across random seeds and runs is not reported; provide multi-seed variance and confidence intervals for fitted scaling exponents and compute-optimal parameters.
- Compute accounting: The FLOPs model C≈6ND ignores overheads and hardware/precision effects; calibrate actual compute and training time across settings and assess sensitivity of compute-optimal conclusions to realistic cost models.
- Directed vs. undirected walks: Language bigrams correspond to directed transitions; many graph experiments use undirected unbiased walks; systematically compare scaling exponents between directed and undirected settings to align with language modeling more closely.
- Data diversity: Fineweb-edu and GPT-2 tokenization are used; investigate sensitivity of exponents to corpus composition (web vs. books), tokenization schemes (BPE vs. unigram), and data quality filtering.
- Depth discrimination: T4L sequences produce nearly identical exponents to T2L for a 2-layer learner; test whether deeper learners can distinguish T2L vs. T4L and whether the scaling exponents reflect the generator’s depth or compositional hierarchy.
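As a concrete starting point for the “Spectral links” item above, the following sketch (assuming numpy and networkx are available) computes the random-walk spectral gap for an Erdős–Rényi graph and a Barabási–Albert graph; the graph sizes and parameters are arbitrary illustrative choices.

```python
# Minimal sketch (assumes numpy and networkx): random-walk spectral gap of an
# Erdős–Rényi graph vs. a Barabási–Albert graph, the quantity the "Spectral links"
# item proposes relating to scaling exponents. Sizes/parameters are arbitrary.
import numpy as np
import networkx as nx

def random_walk_spectral_gap(G):
    A = nx.to_numpy_array(G)                          # adjacency matrix
    P = A / A.sum(axis=1, keepdims=True)              # unbiased-walk transition matrix (row-stochastic)
    eigs = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eigs[1]                              # 1 minus the second-largest eigenvalue modulus

er = nx.erdos_renyi_graph(500, p=0.05, seed=0)
ba = nx.barabasi_albert_graph(500, m=5, seed=0)
print("ER spectral gap:", round(random_walk_spectral_gap(er), 3))
print("BA spectral gap:", round(random_walk_spectral_gap(ba), 3))
# A larger gap means faster mixing, i.e. shorter-range correlations in the walk sequences.
```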
Practical Applications
Immediate Applications
The following are actionable use cases that can be deployed now, organized by sector. Each item notes key assumptions or dependencies that affect feasibility.
Industry
- Compute-optimal training planning for LLMs
- Use one-dimensional loss fits with irreducible loss terms and neural-network regression to predict compute-optimal curves and choose N and D for a fixed budget.
- Potential tools/products: “Compute-Optimal Planner” (a small internal service to fit L(N,D) and output N_opt, D_opt), “Scaling Dashboard” (MLOps integration to track loss curves and exponents).
- Workflows: run a small grid of training runs (vary N, D), fit using 1D power laws and NN regression, derive N_opt/D_opt, decide whether to include embeddings in N for reporting, then launch full-scale training.
- Assumptions/dependencies: consistency of optimizer and training schedule across the grid; reliable estimation of irreducible loss E; data/architecture held fixed during extrapolation.
- Low-compute A/B evaluation of optimizers and architectures
- Reproduce key language-model scaling-law results with 2-layer transformers and short context lengths (≈100) to compare algorithmic changes before committing to high compute.
- Potential tools/products: “Shallow-LLM Eval Harness” that standardizes 2-layer tests with fixed context length and batch sizes.
- Assumptions/dependencies: correlation between shallow-model scaling behavior and deeper models; same loss metric (cross-entropy); consistent tokenization.
- Tunable-complexity synthetic benchmarks for sequence modeling
- Use random walks on Erdős–Rényi and Barabási–Albert graphs and transformer-generated language (TnL) sequences to benchmark scaling behavior and training stability without expensive real corpora (a minimal walk-generator sketch follows the Industry list below).
- Potential tools/products: “GraphWalkBench” (dataset generator + reference tasks), “ComplexityDial” (pipeline to generate T1L/T2L/T4L and bigram datasets).
- Assumptions/dependencies: synthetic tasks capture salient scaling dynamics of real pretraining; graph parameters and walk biases are controllable and well documented.
- Parameter-efficiency gains via maximal update parameterization (μP)
- Adopt μP to potentially reduce parameters for a given performance envelope and revisit token-per-parameter heuristics.
- Workflows: compare μP vs. standard parameterization on small grids, then deploy μP broadly if confirmed.
- Assumptions/dependencies: preliminary evidence holds across tasks/data; optimizer and LR schedules are compatible; adequate internal validation.
- Reporting and comparability standards for scaling experiments
- Include irreducible loss (E) in all fits; disclose whether embedding parameters are counted in N; avoid defaulting to the 2D Chinchilla fit when NN/kernel regression fits yield lower MSE.
- Potential tools/products: “ScalingFit” library (1D fits with E, NN/kernel regression, bootstrap CI).
- Assumptions/dependencies: organizational alignment on metrics; access to fitting tooling and CI pipelines.
- Energy and cost forecasting for training runs
- Use improved compute-optimal predictions (C ≈ 6ND) to forecast energy draw and cost for different N/D mixes and training durations.
- Assumptions/dependencies: stable FLOP/machine efficiency estimates; known data throughput; consistent epoch policy (often 1-epoch in the paper).
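As referenced in the “Tunable-complexity synthetic benchmarks” item above, here is a minimal sketch of a “GraphWalkBench”-style generator (the name is the hypothetical tool from that item, not released code): it samples unbiased random walks on an Erdős–Rényi graph and treats node ids as token ids for next-token prediction.

```python
# Minimal sketch of a "GraphWalkBench"-style generator (hypothetical tool name from the
# list above): sample unbiased random walks on a random graph and emit node ids as token
# sequences for next-token prediction. Graph size and walk length are arbitrary choices.
import numpy as np
import networkx as nx

def sample_walks(G, num_walks=1000, length=50, seed=0):
    rng = np.random.default_rng(seed)
    nodes = list(G.nodes())
    walks = []
    for _ in range(num_walks):
        v = int(rng.choice(nodes))                      # start node (uniform; could use the stationary dist.)
        seq = [v]
        for _ in range(length - 1):
            v = int(rng.choice(list(G.neighbors(v))))   # unbiased step: uniform over neighbors
            seq.append(v)
        walks.append(seq)
    return walks                                        # each walk is a list of "token" ids

G = nx.erdos_renyi_graph(2048, p=0.01, seed=0)          # ~2048-token "vocabulary" of graph nodes
walks = sample_walks(G, num_walks=10, length=50)
print(walks[0][:10])
```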
Academia
- Experimental platforms to study origins of scaling laws
- Leverage tunable-complexity sequences (graph walks, TnL, bigrams) to bridge theory (linear/kernel models with MSE) and practice (auto-regressive models with cross-entropy).
- Assumptions/dependencies: shared datasets and code; reproducible fitting procedures (including offsets and bootstrap CIs).
- Teaching modules on scaling phenomena
- Course labs using random walks and shallow transformers to demonstrate how scaling laws emerge even without power-law data correlations.
- Assumptions/dependencies: modest compute resources (10^13–10^16 FLOPs); accessible open-source tokenizers/transformers.
- Methodological improvements for fitting loss curves
- Replace 2D parametric formulas with neural/kernel regression to fit L(N,D) and use 1D fits with irreducible loss for interpretable exponents (a minimal regression sketch follows this subsection).
- Assumptions/dependencies: agreed-upon evaluation splits and error metrics; careful handling of outliers and hyperparameter sweeps.
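As referenced above, here is a minimal sketch of fitting L(N, D) with off-the-shelf regression instead of a 2D parametric formula, using scikit-learn’s MLPRegressor on (log N, log D) inputs; the training points are synthetic stand-ins for measured losses.

```python
# Minimal sketch: fit L(N, D) with off-the-shelf regression (scikit-learn MLPRegressor on
# (log N, log D) inputs) instead of a 2d parametric formula. The training points below are
# synthetic stand-ins for measured losses from a real size/data grid.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N = 10 ** rng.uniform(6, 9, size=200)              # model sizes
D = 10 ** rng.uniform(8, 11, size=200)             # dataset sizes in tokens
loss = 1.7 + 400 * N ** -0.34 + 1e3 * D ** -0.28 + rng.normal(scale=0.01, size=200)

X = np.column_stack([np.log10(N), np.log10(D)])
reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
reg.fit(X, loss)

query = np.array([[np.log10(3e8), np.log10(6e9)]])
print("predicted loss at N=3e8, D=6e9:", float(reg.predict(query)[0]))
# The fitted surface can then be scanned under C ≈ 6*N*D to estimate compute-optimal splits,
# keeping in mind that extrapolation far beyond the training grid is unreliable.
```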
Policy
- Evidence-based compute governance and transparency
- Require reporting of compute-optimal planning, irreducible loss fits, and whether embeddings are included in parameter counts to standardize disclosures and comparability.
- Assumptions/dependencies: community consensus on reporting templates; regulator familiarity with scaling-law terminology.
- Practical reproducibility with low compute
- Promote shallow-model baselines (2-layer, short context) for independent verification of scaling behavior before large-scale pretraining investments.
- Assumptions/dependencies: availability of open datasets (e.g., Fineweb-edu subsets) and reference code.
Daily Life
- Cost-effective open-source model development
- Community projects can use shallow transformers and synthetic benchmarks to validate methods before scaling, reducing compute costs and environmental impact.
- Assumptions/dependencies: accessible tooling; coordination on shared baselines and tokenizers.
- Education and skill-building
- Hobbyist-friendly labs that explore sequence modeling with graphs and bigrams to understand LLM training fundamentals and scaling effects.
- Assumptions/dependencies: curated tutorials; modest GPU availability.
Long-Term Applications
These use cases likely require further research, scaling, or development before broad deployment.
Industry
- Automated curriculum learning that dials dataset complexity
- Dynamic training pipelines that start with synthetic graph walks or bigrams, then progress to TnL and natural language to optimize scaling exponents and training efficiency.
- Potential tools/products: “CurriculumDialer” (complexity-aware data scheduler), “ExponentTracker” (online estimation of α, β, γ).
- Assumptions/dependencies: validated links between dataset entropy/structure and exponents; robust online fitting; stable model behavior under curriculum shifts.
- Architecture and optimizer design targeting exponent improvements
- Systematically search for changes that increase α and β, improving asymptotic efficiency of deep learning methods.
- Assumptions/dependencies: reliable small-scale proxies for large-scale exponents; automated search infrastructure.
- Enterprise compute marketplaces informed by scaling predictions
- Procurement platforms that allocate compute budgets based on predicted returns from L(N,D) fits, including parameterization choice (μP vs. standard).
- Assumptions/dependencies: integration with cloud/cluster schedulers; contractual energy/cost models.
Academia
- Unified theory connecting sequence modeling with cross-entropy to observed power-law loss scaling
- Formal frameworks that explain exponents without relying on power-law data correlations, using graph/hypergraph sequence processes.
- Assumptions/dependencies: new mathematical tools; collaboration across theory and empirical groups.
- Reasoning benchmarks via graph walks and hypergraphs
- Stepwise inference and chain-of-thought tasks grounded in controllable graphical structures to study scaling of reasoning capabilities.
- Assumptions/dependencies: benchmark standardization; agreement on evaluation metrics beyond perplexity.
Policy
- Standards and audits for compute-optimal training
- Policies that require ex-ante scaling-law projections (with irreducible loss) and track deviation against actual results to manage energy use and environmental impact.
- Assumptions/dependencies: accepted audit methodologies; data access for regulators.
- Funding and risk assessment frameworks
- Public or philanthropic funding models that evaluate proposals using low-compute scaling proxies to estimate likely gains from large-scale pretraining.
- Assumptions/dependencies: refined proxy-to-large-scale correlation; robust peer review processes.
Daily Life
- Consumer-facing AI features trained with complexity-aware curricula
- Applications (assistants, educational tutors) trained with staged data complexity to lower costs while preserving performance quality.
- Assumptions/dependencies: transferability from synthetic pretraining to end-user tasks; productization pipelines.
- Education platforms for scalable AI literacy
- Interactive tools where learners can tune graph parameters and observe how scaling exponents change, building intuition about model and data design.
- Assumptions/dependencies: funded platform development; accessible compute.
Notes on Assumptions and Dependencies
- Most fits assume 1-epoch training, cross-entropy loss, and consistent tokenization; deviations can change observed exponents.
- Including vs. excluding embedding parameters in N materially affects compute-optimal N/D splits and comparability across studies.
- μP appears promising for parameter efficiency, but evidence is preliminary; confirm across diverse tasks and scales before broad adoption.
- Synthetic benchmarks (graph walks, TnL) should be validated as proxies for real-language scaling when used to guide expensive training decisions.
- Reliable estimates of irreducible loss (E) are critical; omitting E can severely understate exponents and mislead planning.
- Outlier handling, bootstrap confidence intervals, and reporting of fit quality (MSE, Huber loss) are necessary for credible extrapolation.
Glossary
The following is an alphabetical list of advanced domain-specific terms from the paper, each with a brief definition and a verbatim usage example from the text.
- Adjacency matrix: A square matrix representing which nodes in a graph are connected by edges. Example: "The spectrum of the adjacency matrix obeys a semicircle law, which is typical of random symmetric matrices."
- Auto-regressive sequence modeling: A modeling setup where the next element in a sequence is predicted based on previous elements. Example: "The above theories based on linear models with mean square error (MSE) loss are rather far from the setting of auto-regressive sequence modeling with cross-entropy loss."
- Barabási-Albert ensemble: A family of graphs generated via preferential attachment that produces scale-free degree distributions. Example: "Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdös-Renyi and scale-free Barabási-Albert ensembles."
- Barabási-Albert preferential attachment model: A graph-generation mechanism where new nodes attach preferentially to high-degree nodes, yielding a power-law degree distribution. Example: "One way to construct a scale-free graph is via the Barabási-Albert preferential attachment model \citep{barabasi1999emergence}, which gives $p(k) \propto k^{-3}$."
- Bias-corrected and accelerated bootstrap method: A statistical resampling technique providing improved confidence interval accuracy (BCa bootstrap). Example: "Brackets indicate 95\% confidence intervals obtained from bias-corrected and accelerated bootstrap method (see Appendix \ref{app:fitting} for details)."
- Bigram graph: A graph representation where nodes are tokens and directed edge weights encode bigram probabilities. Example: "In the case of a bigram model ($n=2$), this defines a directed, weighted graph, which we refer to as the bigram graph."
- Bigram model: An n-gram LLM with n=2 that conditions each token on the previous one. Example: "In the case of a bigram model ($n=2$), this defines a directed, weighted graph, which we refer to as the bigram graph."
- Broken power law: A distribution that follows different power-law exponents over different ranges. Example: "Language results are from Fineweb-edu using GPT-2 tokenizer; unigram distribution is a standard Zipf plot, while bigram distribution shows broken power law."
- Chain of thought reasoning: Multi-step reasoning where intermediate steps are explicit, akin to traversing a path in a graph. Example: "walks on graphs can be used to model stepwise inference and chain of thought reasoning \citep{khona2024understandingstepwiseinferencetransformers,Besta_2024}."
- Chinchilla formula (2d): A parametric formula used to fit test loss as a joint function of model size and data size. Example: "which we refer to as the 2d Chinchilla formula."
- Compute optimal curves: Curves describing the best achievable loss as a function of fixed computational budget by optimally choosing model and data sizes. Example: "demonstrate an alternative method for obtaining compute optimal curves as compared with current practice in published literature"
- Compute optimal scaling laws: Relationships describing how optimal loss and optimal choices of parameters/data scale with compute. Example: "demonstrate a simple neural network regression method to obtaining compute optimal scaling laws, which is different from methods presented in the literature to date"
- Cross-entropy loss: A loss function measuring the divergence between predicted and true categorical distributions; standard in language modeling. Example: "auto-regressive sequence modeling with cross-entropy loss."
- Decoder-only GPT-style transformer: A transformer architecture that predicts the next token using only a causal (decoder) stack. Example: "We use a decoder-only GPT-style transformer to perform next-token prediction."
- Directed, weighted hypergraph: A generalization of graphs where edges (hyperedges) can connect more than two nodes and have direction and weights. Example: "In that case, the $n$-gram model defines a directed, weighted hypergraph."
- Erdös-Renyi graph: A random graph model where each possible edge is included independently with a fixed probability. Example: "The simplest class of random graphs are Erdös-Renyi graphs."
- FLOPs (floating point operations): A measure of computational cost counting arithmetic operations. Example: "The compute in FLOPs (floating point operations) can be expressed exactly in terms of , , and number of training steps"
- Huber loss: A robust loss function less sensitive to outliers, combining quadratic and linear behaviors. Example: "fit using a Huber loss as in \citep{hoffmann2022training}"
- Hyperparameter sweep: Systematic exploration over hyperparameter configurations to find optimal settings. Example: "L(N,D) is taken to be the optimal loss over a hyperparameter sweep:"
- Irreducible loss: The asymptotic floor of the loss due to intrinsic uncertainty/entropy in the target distribution. Example: "Dropping the constants (referred to as the irreducible loss) in the 1d fits"
- Kernel regression: A nonparametric method for function approximation using kernel-weighted averages. Example: "Many theoretical works have shown that in linear or kernel regression, power laws in the test loss do in fact originate from power laws in the data"
- Markov random walk: A sequence generation process moving between nodes with transition probabilities depending only on the current state. Example: "The simplest non-trivial generative model of sequences is a Markov random walk on a graph."
- Maximal update parameterization (μP): A model parameterization scheme intended to stabilize training dynamics across scales. Example: "We also show preliminary results that maximal update parameterization \citep{yang2021tensor} might be more parameter-efficient than standard parameterization"
- Mean square error (MSE) loss: A regression loss equal to the average squared difference between predictions and targets. Example: "The above theories based on linear models with mean square error (MSE) loss are rather far from the setting of auto-regressive sequence modeling with cross-entropy loss."
- Preferential attachment: A mechanism where new nodes are more likely to connect to high-degree nodes, producing power-law degree distributions. Example: "One way to construct a scale-free graph is via the Barabási-Albert preferential attachment model \citep{barabasi1999emergence}, which gives ."
- Random walk, unbiased: A walk on a graph where transition probabilities depend only on uniform edge choices (or degree-normalized for undirected graphs). Example: "An unbiased random walk on an undirected graph has the transition probability $p(u|v) = \frac{A_{uv}}{\text{deg}(v)}$"
- Scale-free graph: A graph whose degree distribution follows a power law, meaning many low-degree nodes and a few hubs. Example: "A scale-free graph \citep{barabasi1999emergence} is defined such that a randomly chosen node has degree $k$ with probability $\propto k^{-\gamma}$ for large $k$."
- Semicircle law: A result from random matrix theory where eigenvalue distributions follow a semicircle density. Example: "The spectrum of the adjacency matrix obeys a semicircle law, which is typical of random symmetric matrices."
- Small-world graph: A graph with high clustering and short average path lengths compared to random graphs. Example: "Co-occurrence matrices from natural language corpora in particular have been shown to determine a scale-free, small-world graph"
- Spectral gap: The difference between the largest and second-largest eigenvalues of a transition matrix; controls mixing rates. Example: "Recall that random walks have an exponentially decaying auto-correlation, with the time-scale for the exponential decay set by the spectral gap of the transition matrix."
- Stationary distribution: A probability distribution over states that remains unchanged by the transition dynamics of a Markov chain. Example: "The stationary distribution is $\pi(v) \propto \mathrm{deg}(v)$."
- Stationary joint distribution: The stationary probability distribution over tuples of consecutive states for higher-order Markov processes. Example: "The initial nodes can be chosen from the stationary joint distribution on nodes."
- Transition matrix: A matrix where entry (i,j) gives the probability of transitioning from state j to i; governs Markov dynamics. Example: "Recall that random walks have an exponentially decaying auto-correlation, with the time-scale for the exponential decay set by the spectral gap of the transition matrix."
- Transformer-generated n-layer (TnL) datasets: Synthetic text datasets produced by sampling from trained transformers with n layers. Example: "we refer to these datasets as Transformer-generated $n$-layer (TnL) datasets."
- Tunable complexity: A property of datasets or tasks where structural difficulty can be systematically increased or decreased. Example: "would be to study sequence modeling with datasets of \textit{tunable complexity},"
- Unigram distribution: The marginal distribution over individual tokens (single-word frequencies). Example: "Sampling from a bigram model corresponds to sampling a random walk on the bigram graph, with transition probabilities $p(w|v)$, where $\pi(v)$ is the stationary distribution on nodes, and equivalently the unigram distribution of the sequence."
- Variance-limited regime: A scaling regime where performance improvements are limited by estimation variance rather than bias. Example: "Note that this can be thought of as the “variance-limited” regime for power law scaling of data in the language of \citep{bahri2021explaining}."
- Zipf plot: A rank-frequency plot used to visualize Zipf-like distributions. Example: "Language results are from Fineweb-edu using GPT-2 tokenizer; unigram distribution is a standard Zipf plot, while bigram distribution shows broken power law."
- Zipf's law: An empirical law stating that word frequencies in natural language are inversely proportional to their rank. Example: "it is well-known that the frequency of words in a corpus of text follows Zipf's law"
- Two-dimensional parametric fits: Functional forms that model loss as a joint function of model and data sizes with a small set of parameters. Example: "It is common to consider two-dimensional parametric fits of the form \citep{hoffmann2022training,bhagia2025establishingtaskscalinglaws, muennighoff2023scaling,gadre2024overtraining,zhang2024map,kang2025demystifyingsyntheticdatallm}"