Parameterisation for scaling beyond model size
Develop a parameterisation of training hyperparameters that transfers reliably across batch size and the training-token horizon, not only across width and depth, so that training dynamics and hyperparameter optimality are preserved as these scaling axes change.
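To make the goal concrete, the sketch below shows what such a parameterisation could look like: a rule that maps hyperparameters tuned at a small reference scale to a target (width, batch, token-horizon) triple. The specific rules used here, a muP-style 1/width learning-rate scaling, square-root batch scaling, and a horizon-proportional step count, are common heuristics assumed for illustration only; they are not the cited paper's method, whose aim is precisely to find rules that transfer across all of these axes.

```python
# Hypothetical sketch: map hyperparameters tuned at a reference scale to a
# new (width, batch, tokens) configuration. The scaling rules are assumed
# heuristics, not results from the cited paper.
import math
from dataclasses import dataclass

@dataclass
class BaseConfig:
    lr: float    # learning rate tuned at the reference scale
    width: int   # reference model width
    batch: int   # reference batch size (sequences per step)
    tokens: int  # reference training-token horizon

def transfer(base: BaseConfig, width: int, batch: int, tokens: int) -> dict:
    """Re-parameterise hyperparameters for a new scale."""
    lr = base.lr
    lr *= base.width / width             # muP-style: lr ~ 1/width (assumed)
    lr *= math.sqrt(batch / base.batch)  # sqrt batch-size heuristic (assumed)
    steps = tokens // batch              # schedule length follows the horizon
    return {"lr": lr, "steps": steps, "warmup": int(0.01 * steps)}

base = BaseConfig(lr=3e-3, width=256, batch=32, tokens=1_000_000)
cfg = transfer(base, width=1024, batch=128, tokens=16_000_000)
```

The point of the sketch is the shape of the interface, not the particular exponents: whatever parameterisation the research produces should play the role of `transfer`, keeping the optimum found at the reference scale optimal at the target scale.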
References
However, two questions remain open: How should we parameterise training for scaling beyond just model size?
— Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
(2512.22382 - Mlodozeniec et al., 26 Dec 2025) in Section 1 (Introduction)