Mechanistic account for direct scaling laws of downstream accuracy

Develop a mechanistic account that explains why the empirical functional forms used in this paper accurately model downstream performance, specifically: (i) the direct compute-to-accuracy law −log Q = A C^{−α} (with Q as benchmark accuracy and C as training FLOPs at fixed token-to-parameter ratio), (ii) the parameters–tokens law −log Q = A N^{−α} + B D^{−β} (with N model parameters and D pretraining tokens), and (iii) the pass@k relation log(−log Q(C, k)) = log A + α log C + β log k + δ log C log k for code benchmarks; and determine how these forms arise from item-difficulty mixture models or error-decay processes.
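A minimal numerical sketch of one candidate mechanism (not the paper's model): assume each benchmark item i has a difficulty scale c_i and a per-item error-decay form p_i(C) = exp(-(C / c_i)^{−α}), aggregate to Q(C), and check how close −log Q comes to a power law in C. All names, constants, and distributions below are illustrative assumptions.

```python
import numpy as np

# Hypothetical item-difficulty mixture: each item i has a difficulty scale c_i,
# and its solve probability decays as p_i(C) = exp(-(C / c_i)**(-alpha)),
# an assumed per-item error-decay process (not taken from the paper).
# Aggregate accuracy Q(C) is the mean over items.
rng = np.random.default_rng(0)
alpha = 0.5                                    # assumed per-item decay exponent
c_i = rng.lognormal(mean=np.log(1e21), sigma=2.0, size=2000)  # item difficulties

compute = np.logspace(19, 24, 12)              # training FLOPs C
Q = np.array([np.exp(-(c / c_i) ** (-alpha)).mean() for c in compute])

# Under the direct law -log Q = A * C^{-alpha_eff}, the regression of
# log(-log Q) on log C is exactly linear; deviations measure how far the
# mixture strays from that form.
slope, intercept = np.polyfit(np.log(compute), np.log(-np.log(Q)), 1)
print(f"alpha_eff ≈ {-slope:.3f}, A ≈ {np.exp(intercept):.3g}")
```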

Background

The paper proposes simple, direct scaling laws that accurately predict downstream benchmark accuracy from training compute. At fixed token-to-parameter ratio, the authors find that negative log accuracy follows a power law in training FLOPs (−log Q = A C^{−α}). They extend this to a joint dependence on parameters and tokens (−log Q = A N^{−α} + B D^{−β}) and derive a functional form to capture pass@k behavior for code evaluation that incorporates both training compute and inference sampling (log(−log Q(C, k)) = log A + α log C + β log k + δ log C log k).
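After taking log(−log Q), the direct law and the pass@k law are linear in their transformed coordinates, so they can be fit by ordinary least squares. The sketch below illustrates this for the pass@k form; the (C, k) grid and the coefficients used to generate Q are synthetic placeholders, not values reported in the paper.

```python
import numpy as np

# Least-squares fit of the pass@k form
#   log(-log Q(C, k)) = log A + alpha*log C + beta*log k + delta*log C*log k.
# Synthetic data: with real measurements, Q would be pass@k on a code benchmark.
C = np.repeat(np.logspace(18, 24, 4), 3)       # training FLOPs
k = np.tile([1.0, 8.0, 64.0], 4)               # samples per problem
logC, logk = np.log(C), np.log(k)
Q = np.exp(-np.exp(8.0 - 0.20 * logC - 0.10 * logk + 0.001 * logC * logk))

# Design matrix for ordinary least squares on log(-log Q).
X = np.column_stack([np.ones_like(logC), logC, logk, logC * logk])
coef, *_ = np.linalg.lstsq(X, np.log(-np.log(Q)), rcond=None)
logA, alpha, beta, delta = coef
print(f"A={np.exp(logA):.3g}, alpha={alpha:.3f}, beta={beta:.3f}, delta={delta:.4f}")
```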

While these forms provide strong empirical fits across multiple benchmarks and settings, the paper emphasizes that a theoretical, mechanistic explanation is missing. In particular, connecting these empirical laws to principled models such as item-difficulty mixtures or error-decay processes would ground the observed S-shaped and cross-metric behaviors and clarify when and why the laws should hold.
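One way to probe that connection is to extend the same hypothetical item-difficulty mixture to pass@k: if item i is solved per sample with probability p_i(C) and the k samples are independent, then pass@k for that item is 1 − (1 − p_i(C))^k. The sketch below (same assumed per-item form as above, not the paper's model) aggregates over items and prints log(−log Q) on a small (C, k) grid, which can be compared against the bilinear form in log C and log k.

```python
import numpy as np

# Extending the assumed item-difficulty mixture to pass@k:
# per-sample solve probability p_i(C) = exp(-(C / c_i)**(-alpha)),
# pass@k_i = 1 - (1 - p_i)**k under independent sampling.
rng = np.random.default_rng(1)
alpha = 0.5
c_i = rng.lognormal(mean=np.log(1e21), sigma=2.0, size=2000)  # item difficulties

for c in (1e20, 1e22, 1e24):                   # training FLOPs
    p = np.exp(-(c / c_i) ** (-alpha))         # per-sample solve probability
    for k in (1, 8, 64):                       # samples per item
        Q = (1.0 - (1.0 - p) ** k).mean()      # aggregate pass@k accuracy
        print(f"C={c:.0e}  k={k:>2}  log(-log Q) = {np.log(-np.log(Q)):+.3f}")
```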

References

We model downstream accuracy as a simple function of training compute (\Cref{eq:scaling_law_basic}) and extend to parameters–tokens (\Cref{eq:scaling_law_basic_ND}) and pass@k (\Cref{eq:pass_k_formula}), but we do not yet offer a mechanistic account of these forms; connecting them to item‑difficulty mixtures or error‑decay processes remains open.

Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training (2512.08894 - Krajewski et al., 9 Dec 2025) in Section 4.2 (Limitations and Future Work)