- The paper introduces NEUNEU, a neural sequence model that extrapolates scaling laws for downstream task performance directly from token-level loss trajectories, cutting mean absolute error by 38% relative to logistic scaling-law baselines.
- The paper employs a time-series extrapolation approach featuring a loss encoder, transformer backbone, and quantile regression to capture non-monotonic scaling behaviors.
- The paper's findings offer actionable insights for dynamic resource allocation and point toward foundation models for understanding and optimizing neural network training dynamics.
Background and Motivation
This paper addresses the challenge of predicting downstream task performance of LLMs as a function of scale (parameters, compute, data). Classical neural scaling laws posit power-law relationships between training/validation loss and scale, enabling forecasts and budget allocation. However, while aggregate losses (e.g., perplexity) scale smoothly, downstream task metrics (e.g., classification accuracy) exhibit heterogeneous behaviors, including monotonic improvement, plateaus, or even degradation (inverse scaling), undermining the reliability of simple parametric scaling laws for task-level extrapolation.
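For concreteness, the classical forms being contrasted can be written as follows (an illustrative parameterization; the paper's exact baselines may differ): a power law relating loss to compute $C$, and a logistic curve relating downstream accuracy to $\log C$:

$$ L(C) = a\,C^{-\alpha} + L_{\infty}, \qquad \operatorname{acc}(C) = \frac{u}{1 + e^{-k(\log C - c)}} + b. $$

The logistic form is monotone in $C$ by construction, which is exactly why it cannot represent degradation (inverse scaling) or other non-monotonic task-level behaviors.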
Two main deficiencies are identified: (1) averaging token-level losses erases distributional information critical for generalization, and (2) no fixed parametric hypothesis class accommodates the full diversity of task scaling behaviors. The authors thus propose a data-driven approach using expressive neural sequence models that can directly learn from the raw training trajectories and token-level signal.
Methodology
NEUNEU Model
NEUNEU (Neural Neural Scaling Laws) reframes scaling law prediction as time-series extrapolation, leveraging temporal context and granular validation losses as input features. The architecture consists of three main modules (a minimal sketch follows the list):
- Loss Encoder: Employs a stack of strided 1D convolutional layers to compress token-level validation probabilities (derived from per-token cross-entropy losses) into a hierarchical embedding.
- Transformer Backbone: Integrates encoded loss distribution with historical downstream accuracies and compute intervals (gaps) for contextual modeling. Inputs are abstracted as relative compute gaps to ensure invariance to absolute training scale.
- Quantile Regression Prediction Head: Projects the [CLS] token embedding to a fixed set of quantile predictions for future accuracy, allowing calibrated confidence intervals.
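A PyTorch sketch of these three modules, with all dimensions, layer counts, and feature choices assumed for illustration rather than taken from the paper's reported configuration:

```python
import torch
import torch.nn as nn

class NeuNeuSketch(nn.Module):
    """Illustrative NEUNEU-style architecture; every hyperparameter is an assumption."""

    def __init__(self, d_model=128, n_quantiles=9, conv_channels=(32, 64, 128)):
        super().__init__()
        # Loss encoder: strided 1D convolutions over per-token probabilities.
        layers, in_ch = [], 1
        for out_ch in conv_channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=5, stride=2, padding=2), nn.GELU()]
            in_ch = out_ch
        self.loss_encoder = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)                  # collapse the token axis
        self.proj = nn.Linear(conv_channels[-1], d_model)
        # Per-checkpoint scalars: historical accuracy and relative compute gap.
        self.feat = nn.Linear(2, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # [CLS]-style summary token
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Quantile regression head over the [CLS] embedding.
        self.head = nn.Linear(d_model, n_quantiles)

    def forward(self, token_probs, accs, gaps):
        # token_probs: (B, T, N) token-level probabilities per checkpoint
        # accs, gaps:  (B, T) past accuracies and relative compute gaps
        B, T, N = token_probs.shape
        h = self.loss_encoder(token_probs.reshape(B * T, 1, N))
        h = self.proj(self.pool(h).squeeze(-1)).reshape(B, T, -1)
        h = h + self.feat(torch.stack([accs, gaps], dim=-1))
        h = torch.cat([self.cls.expand(B, -1, -1), h], dim=1)
        return self.head(self.backbone(h)[:, 0])            # (B, n_quantiles)
```

Feeding the backbone relative compute gaps rather than absolute step counts is what makes the model invariant to the absolute training scale of a new trajectory.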
The model is trained on a large corpus of open-source training trajectories from HuggingFace, covering several model families (the DataDecide suite, Pythia) and 66 OLMES downstream tasks.
Ablation baselines isolate the effect of retaining or discarding distributional information (a sketch of these representations follows the list):
- Token probabilities (NEUNEU) are directly passed to the loss encoder.
- Average probabilities (AVERAGE) collapse the distribution, similar to logistic scaling laws.
- Histogram binning (DIFFHIST) forms a binned representation, testing the utility of coarse distributional features.
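A sketch of how the three representations could be derived from per-token cross-entropy losses (names and binning are illustrative; in particular, the paper's DIFFHIST binning may differ):

```python
import numpy as np

def input_representations(token_losses, n_bins=64):
    """Build the three ablation inputs from per-token cross-entropy losses."""
    probs = np.exp(-token_losses)      # per-token target probability, p = exp(-CE)
    full = probs                       # NEUNEU: full token-level distribution
    avg = probs.mean(keepdims=True)    # AVERAGE: one scalar, as logistic laws use
    hist, _ = np.histogram(probs, bins=n_bins, range=(0.0, 1.0), density=True)
    return full, avg, hist             # DIFFHIST-style: coarse binned distribution
```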
Quantile regression (pinball loss) is applied to capture uncertainty, and randomized subsequence masking is employed during training to simulate prediction under partial observation.
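The pinball loss and one simple instantiation of subsequence masking, shown here as random prefix retention, can be sketched as follows (the quantile grid and minimum context length are assumptions):

```python
import torch

def pinball_loss(pred, target, quantiles):
    """Quantile (pinball) loss averaged over batch and quantile grid.

    pred: (B, Q) predicted quantiles; target: (B,) true future accuracies;
    quantiles: (Q,), e.g. torch.linspace(0.1, 0.9, 9).
    """
    err = target.unsqueeze(-1) - pred                        # (B, Q)
    return torch.maximum(quantiles * err, (quantiles - 1.0) * err).mean()

def random_prefix_mask(batch_size, seq_len, min_keep=2):
    """Boolean (B, T) mask keeping a random-length prefix of each trajectory,
    simulating prediction from partial observation of a training run."""
    keep = torch.randint(min_keep, seq_len + 1, (batch_size,))
    return torch.arange(seq_len).unsqueeze(0) < keep.unsqueeze(1)
```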
Empirical Results
The reported results show consistent accuracy gains for NEUNEU:
- Overall mean absolute error (MAE) is 2.04%, a 38% reduction compared to logistic scaling laws (3.29% MAE).
- NEUNEU generalizes robustly zero-shot to unseen downstream tasks, held-out seeds, new pretraining data distributions, and novel model families and architectures.
- Calibration studies show the predicted 10–90% interquantile range captures approximately 75% of true values, close to the nominal 80% coverage.
- Pairwise ranking accuracy for final model selection is 75.6% for NEUNEU vs. 63.3% for the logistic baseline (see the sketch of this metric after the list).
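Pairwise ranking accuracy can be read as the fraction of model pairs whose predicted final accuracies are ordered the same way as their true final accuracies; a minimal sketch (the paper's exact selection protocol may differ):

```python
from itertools import combinations

def pairwise_ranking_accuracy(pred, true):
    """Fraction of model pairs ranked consistently by predicted vs. true accuracy."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0 for i, j in pairs)
    return agree / len(pairs)

# e.g. pairwise_ranking_accuracy([0.61, 0.55, 0.70], [0.60, 0.62, 0.72]) == 2/3
```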
On a per-task basis in OLMES, NEUNEU almost universally outperforms parametric and neural alternatives, with distributional representations (token probabilities, histograms) contributing substantial predictive power that averaging discards.
Implications and Theoretical Significance
By eliminating the bottleneck of parametric functional forms and leveraging a pure neural approach, NEUNEU reveals that limitations of logistic scaling laws stem from hypothesis class restrictions rather than inherent difficulties of the scaling prediction task. The model's flexibility enables accurate task-level extrapolation even for non-monotonic or anomalous scaling phenomena, such as inverse scaling and emergent behaviors.
This recasts scaling law estimation as a meta-learning problem, with NEUNEU operating as a "foundation model" for training dynamics, analogous to world models in reinforcement learning. The practical benefits include informed resource allocation, dynamic hyperparameter or data mixture selection, and reduced empirical cost—especially as model training budgets and environmental impact escalate.
Theoretical advances are suggested by NEUNEU's capacity to extract interpretable features from loss distributions and accuracy trajectories, potentially informing new analytic or semi-parametric scaling theory.
Limitations and Future Directions
The current implementation assumes homogeneous validation sets; adapting it to token-level variation or to generative downstream metrics remains open. Generative tasks (e.g., text completion, open-ended question answering) may display scaling curves distinct from those of classification. Additionally, interpretability analysis of the CNN-extracted features may illuminate more principled parametric predictors or guide hybrid analytic-neural scaling-law frameworks.
Conclusion
"Neural Neural Scaling Laws" demonstrates that direct neural sequence models, incorporating granular token-level information and temporal context, are markedly superior to classical parametric scaling law formulations for forecasting downstream accuracy across diverse model scales and task types. This approach offers both operational advantages in model selection and training protocol design, and a conceptual advance toward foundation models for understanding, simulating, and optimizing neural network training dynamics. Future research in AI is likely to rely increasingly on data-driven scaling law estimation, leveraging the growing corpus of public training trajectories.