Mechanistic account for direct scaling laws of downstream accuracy
Develop a mechanistic account that explains why the empirical functional forms used in this paper accurately model downstream performance, specifically: (i) the direct compute-to-accuracy law −log Q = A C^{−α} (with Q the benchmark accuracy and C the training FLOPs at a fixed token-to-parameter ratio); (ii) the parameters–tokens law −log Q = A N^{−α} + B D^{−β} (with N the model parameters and D the pretraining tokens); and (iii) the pass@k relation log(−log Q(C, k)) = log A + α log C + β log k + δ log C log k for code benchmarks. Determine how these forms arise from item-difficulty mixture models or error-decay processes.
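As a concrete reading of form (i), here is a minimal sketch (our own, not taken from the paper) that fits the direct law by linear regression in log-log space, since −log Q = A C^{−α} implies log(−log Q) = log A − α log C; the compute/accuracy pairs below are hypothetical.

```python
# Minimal sketch (not from the paper): fit the direct law -log Q = A * C**(-alpha)
# to hypothetical (compute, accuracy) pairs by linear regression in log-log space.
import numpy as np

# Hypothetical benchmark accuracies Q at several training-compute budgets C (FLOPs).
C = np.array([1e21, 3e21, 1e22, 3e22, 1e23])
Q = np.array([0.42, 0.51, 0.60, 0.68, 0.75])

x = np.log(C)
y = np.log(-np.log(Q))            # transform so the law is linear: y = log A - alpha * x
slope, intercept = np.polyfit(x, y, 1)
alpha, A = -slope, np.exp(intercept)
print(f"fitted A = {A:.3g}, alpha = {alpha:.3g}")

# Extrapolate: predicted accuracy at a larger compute budget.
C_new = 1e24
Q_pred = np.exp(-A * C_new ** (-alpha))
print(f"predicted accuracy at C = {C_new:.0e} FLOPs: {Q_pred:.3f}")
```

After the same log(−log Q) transform, form (iii) is linear in (log C, log k, log C·log k) and can be fit by ordinary multivariate regression, whereas form (ii) sums two power-law terms and would require a nonlinear least-squares fit over N and D.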
References
We model downstream accuracy as a simple function of training compute (\Cref{eq:scaling_law_basic}) and extend to parameters–tokens (\Cref{eq:scaling_law_basic_ND}) and pass@k (\Cref{eq:pass_k_formula}), but we do not yet offer a mechanistic account of these forms; connecting them to item-difficulty mixtures or error-decay processes remains open.
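One way to probe the open connection is a toy simulation: if each benchmark item i has an error that decays as exp(−d_i C^{−γ}) with heterogeneous difficulties d_i, does the aggregate −log Q still behave like a single power law in C? The sketch below runs this experiment; the per-item error model, the lognormal difficulty distribution, and every constant are assumptions made purely for illustration, not results from the paper.

```python
# Toy item-difficulty mixture (illustrative only; the distributional assumptions are ours).
# Each item i has difficulty d_i; a model trained with compute C solves it with probability
# p_i(C) = exp(-d_i * C**(-gamma)). Aggregate accuracy Q(C) is the mean of p_i(C), and we
# check how closely -log Q follows a single power law A * C**(-alpha).
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.3                                                     # assumed per-item decay exponent
difficulties = rng.lognormal(mean=15.0, sigma=2.0, size=10_000)  # assumed difficulty spread

compute = np.logspace(21, 25, 9)                                # training-compute grid (FLOPs)
Q = np.array([np.mean(np.exp(-difficulties * c ** (-gamma))) for c in compute])

# Fit -log Q = A * C**(-alpha) in log-log space and report the effective aggregate exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(-np.log(Q)), 1)
print(f"effective alpha = {-slope:.3f}  (per-item gamma = {gamma})")
for c, q in zip(compute, Q):
    print(f"C = {c:.0e}  Q = {q:.3f}")
```

In this toy setup the aggregate exponent printed by the fit differs from the per-item γ because of the difficulty heterogeneity, which is exactly the kind of mixture-induced effect the requested mechanistic account would need to characterize.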