- The paper introduces NEUNEU, a neural sequence model that extrapolates scaling laws for downstream task performance directly from token-level loss trajectories, cutting mean absolute error by 38% relative to logistic scaling-law baselines.
- The paper employs a time-series extrapolation approach featuring a loss encoder, transformer backbone, and quantile regression to capture non-monotonic scaling behaviors.
- The paper's findings offer actionable insights for dynamic resource allocation and point toward foundation models for understanding and optimizing neural network training dynamics.
Background and Motivation
This paper addresses the challenge of predicting downstream task performance of LLMs as a function of scale (parameters, compute, data). Classical neural scaling laws posit power-law relationships between training/validation loss and scale, enabling forecasts and budget allocation. However, while aggregate losses (e.g., perplexity) scale smoothly, downstream task metrics (e.g., classification accuracy) exhibit heterogeneous behaviors, including monotonic improvement, plateaus, or even degradation (inverse scaling), undermining the reliability of simple parametric scaling laws for task-level extrapolation.
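For concreteness, the classical forms being contrasted can be written as follows (an illustrative parameterization; the paper's exact baselines may differ): a power law relating loss to compute $C$, and a logistic curve relating downstream accuracy to $\log C$:

$$ L(C) = a\,C^{-\alpha} + L_{\infty}, \qquad \operatorname{acc}(C) = \frac{u}{1 + e^{-k(\log C - c)}} + b. $$

The logistic form is monotone in $C$ by construction, which is exactly why it cannot represent degradation (inverse scaling) or other non-monotonic task-level behaviors.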
Two main deficiencies are identified: (1) averaging token-level losses erases distributional information critical for generalization, and (2) no fixed parametric hypothesis class accommodates the full diversity of task scaling behaviors. The authors thus propose a data-driven approach using expressive neural sequence models that can directly learn from the raw training trajectories and token-level signal.
Methodology
NEUNEU Model
NEUNEU (Neural Neural Scaling Laws) reframes scaling law prediction as time-series extrapolation, leveraging temporal context and granular validation losses as input features. The architecture consists of three main modules (a minimal sketch follows the list):
- Loss Encoder: Employs a stack of strided 1D convolutional layers to compress token-level validation probabilities (derived from per-token cross-entropy losses) into a hierarchical embedding.
- Transformer Backbone: Integrates encoded loss distribution with historical downstream accuracies and compute intervals (gaps) for contextual modeling. Inputs are abstracted as relative compute gaps to ensure invariance to absolute training scale.
- Quantile Regression Prediction Head: Projects the [CLS] token embedding to a fixed set of quantile predictions for future accuracy, allowing calibrated confidence intervals.
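A PyTorch sketch of these three modules, with all dimensions, layer counts, and feature choices assumed for illustration rather than taken from the paper's reported configuration:

```python
import torch
import torch.nn as nn

class NeuNeuSketch(nn.Module):
    """Illustrative NEUNEU-style architecture; every hyperparameter is an assumption."""

    def __init__(self, d_model=128, n_quantiles=9, conv_channels=(32, 64, 128)):
        super().__init__()
        # Loss encoder: strided 1D convolutions over per-token probabilities.
        layers, in_ch = [], 1
        for out_ch in conv_channels:
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=5, stride=2, padding=2), nn.GELU()]
            in_ch = out_ch
        self.loss_encoder = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool1d(1)                  # collapse the token axis
        self.proj = nn.Linear(conv_channels[-1], d_model)
        # Per-checkpoint scalars: historical accuracy and relative compute gap.
        self.feat = nn.Linear(2, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # [CLS]-style summary token
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Quantile regression head over the [CLS] embedding.
        self.head = nn.Linear(d_model, n_quantiles)

    def forward(self, token_probs, accs, gaps):
        # token_probs: (B, T, N) token-level probabilities per checkpoint
        # accs, gaps:  (B, T) past accuracies and relative compute gaps
        B, T, N = token_probs.shape
        h = self.loss_encoder(token_probs.reshape(B * T, 1, N))
        h = self.proj(self.pool(h).squeeze(-1)).reshape(B, T, -1)
        h = h + self.feat(torch.stack([accs, gaps], dim=-1))
        h = torch.cat([self.cls.expand(B, -1, -1), h], dim=1)
        return self.head(self.backbone(h)[:, 0])            # (B, n_quantiles)
```

Feeding the backbone relative compute gaps rather than absolute step counts is what makes the model invariant to the absolute training scale of a new trajectory.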
The model is trained on a large corpus of open-source training trajectories from HuggingFace, covering several model families (the DataDecide suite, Pythia) and 66 OLMES downstream tasks.
Ablation baselines isolate the effect of retaining or discarding distributional information (a sketch of these representations follows the list):
- Token probabilities (NEUNEU) are directly passed to the loss encoder.
- Average probabilities (AVERAGE) collapse the distribution, similar to logistic scaling laws.
- Histogram binning (DIFFHIST) forms a binned representation, testing the utility of coarse distributional features.
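A sketch of how the three representations could be derived from per-token cross-entropy losses (names and binning are illustrative; in particular, the paper's DIFFHIST binning may differ):

```python
import numpy as np

def input_representations(token_losses, n_bins=64):
    """Build the three ablation inputs from per-token cross-entropy losses."""
    probs = np.exp(-token_losses)      # per-token target probability, p = exp(-CE)
    full = probs                       # NEUNEU: full token-level distribution
    avg = probs.mean(keepdims=True)    # AVERAGE: one scalar, as logistic laws use
    hist, _ = np.histogram(probs, bins=n_bins, range=(0.0, 1.0), density=True)
    return full, avg, hist             # DIFFHIST-style: coarse binned distribution
```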
Quantile regression (pinball loss) is applied to capture uncertainty, and randomized subsequence masking is employed during training to simulate prediction under partial observation.
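The pinball loss and one simple instantiation of subsequence masking, shown here as random prefix retention, can be sketched as follows (the quantile grid and minimum context length are assumptions):

```python
import torch

def pinball_loss(pred, target, quantiles):
    """Quantile (pinball) loss averaged over batch and quantile grid.

    pred: (B, Q) predicted quantiles; target: (B,) true future accuracies;
    quantiles: (Q,), e.g. torch.linspace(0.1, 0.9, 9).
    """
    err = target.unsqueeze(-1) - pred                        # (B, Q)
    return torch.maximum(quantiles * err, (quantiles - 1.0) * err).mean()

def random_prefix_mask(batch_size, seq_len, min_keep=2):
    """Boolean (B, T) mask keeping a random-length prefix of each trajectory,
    simulating prediction from partial observation of a training run."""
    keep = torch.randint(min_keep, seq_len + 1, (batch_size,))
    return torch.arange(seq_len).unsqueeze(0) < keep.unsqueeze(1)
```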
Empirical Results
The reported results show consistent accuracy gains for NEUNEU:
- Overall mean absolute error (MAE) is 2.04%, a 38% reduction compared to logistic scaling laws (3.29% MAE).
- NEUNEU generalizes robustly zero-shot to unseen downstream tasks, held-out seeds, new pretraining data distributions, and novel model families and architectures.
- Calibration studies show the predicted 10–90% interquantile range captures approximately 75% of true values, close to the nominal 80% coverage.
- Pairwise ranking accuracy for final model selection is 75.6% for NEUNEU vs. 63.3% for the logistic baseline (see the sketch of this metric after the list).
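Pairwise ranking accuracy can be read as the fraction of model pairs whose predicted final accuracies are ordered the same way as their true final accuracies; a minimal sketch (the paper's exact selection protocol may differ):

```python
from itertools import combinations

def pairwise_ranking_accuracy(pred, true):
    """Fraction of model pairs ranked consistently by predicted vs. true accuracy."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0 for i, j in pairs)
    return agree / len(pairs)

# e.g. pairwise_ranking_accuracy([0.61, 0.55, 0.70], [0.60, 0.62, 0.72]) == 2/3
```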
On a per-task basis in OLMES, NEUNEU almost universally outperforms parametric and neural alternatives, with distributional representations (token probabilities, histograms) contributing substantial predictive power that averaging discards.
Implications and Theoretical Significance
By eliminating the bottleneck of parametric functional forms and leveraging a pure neural approach, NEUNEU reveals that limitations of logistic scaling laws stem from hypothesis class restrictions rather than inherent difficulties of the scaling prediction task. The model's flexibility enables accurate task-level extrapolation even for non-monotonic or anomalous scaling phenomena, such as inverse scaling and emergent behaviors.
This recasts scaling law estimation as a meta-learning problem, with NEUNEU operating as a "foundation model" for training dynamics, analogous to world models in reinforcement learning. The practical benefits include informed resource allocation, dynamic hyperparameter or data mixture selection, and reduced empirical cost—especially as model training budgets and environmental impact escalate.
Theoretical advances are suggested by NEUNEU's capacity to extract interpretable features from loss distributions and accuracy trajectories, potentially informing new analytic or semi-parametric scaling theory.
Limitations and Future Directions
The current implementation assumes homogeneous validation sets; adapting it to token-level variation or to generative downstream metrics remains open. Generative tasks (e.g., text completion, open-ended question answering) may display scaling curves distinct from those of classification. Additionally, interpretability analysis of the CNN-extracted features may illuminate more principled parametric predictors or guide hybrid analytic-neural scaling-law frameworks.
Conclusion
"Neural Neural Scaling Laws" demonstrates that direct neural sequence models, incorporating granular token-level information and temporal context, are markedly superior to classical parametric scaling law formulations for forecasting downstream accuracy across diverse model scales and task types. This approach offers both operational advantages in model selection and training protocol design, and a conceptual advance toward foundation models for understanding, simulating, and optimizing neural network training dynamics. Future research in AI is likely to rely increasingly on data-driven scaling law estimation, leveraging the growing corpus of public training trajectories.