Parameterisation for scaling beyond model size

Develop a parameterisation of training hyperparameters that transfers reliably when scaling beyond model size: specifically, across batch size and training token horizon rather than only width and depth, so that training dynamics and hyperparameter optimality are preserved as these axes change.

Background

Existing parameterisations such as μP and Depth-μP primarily address hyperparameter transfer across model width and depth. The paper highlights that large-scale training also involves scaling batch size and data (token horizon), axes for which hyperparameter transfer is not guaranteed by those parameterisations.
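For concreteness, a minimal sketch of the kind of width-scaling rules μP prescribes for hidden weight matrices trained with Adam. The function name and structure are hypothetical, and the exact multipliers depend on the μP variant; this is illustrative, not the cited paper's formulation.

```python
import math

def mup_hidden_layer_rules(fan_in: int, base_fan_in: int, base_lr: float):
    """Illustrative muP-style rules for a hidden weight matrix under Adam.

    Hedged assumptions (not taken from the cited paper): initialisation std
    scales as 1/sqrt(fan_in), and the per-layer Adam learning rate scales as
    1/width relative to a small base model where base_lr was tuned.
    """
    width_mult = fan_in / base_fan_in      # how much wider than the tuned base model
    init_std = 1.0 / math.sqrt(fan_in)     # variance ~ 1/fan_in keeps activations O(1)
    lr = base_lr / width_mult              # hidden-layer Adam LR shrinks as 1/width
    return init_std, lr

# Tune base_lr once at width 256, then reuse it at width 4096:
print(mup_hidden_layer_rules(fan_in=4096, base_fan_in=256, base_lr=1e-3))
```

The point of such rules is that a learning rate tuned on the small base model remains near-optimal at larger widths; the open question above asks for the analogue along batch size and token horizon.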

The authors therefore pose the open question of how to parameterise training to achieve transfer along these additional scaling axes, motivating their subsequent SDE-inspired reparameterisation rules for AdamW and weight decay.
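To make the target concrete, below is a hedged sketch of what SDE-inspired batch-size rescaling can look like for AdamW. It assumes the square-root learning-rate rule suggested by SDE analyses of adaptive optimisers, a matching weight-decay rescaling that keeps cumulative decay over a fixed token budget constant, and linear rescaling of the (1 - beta) terms; the rules actually derived in the cited paper may differ, and the function name is hypothetical.

```python
import math

def rescale_adamw_for_batch(base: dict, k: float) -> dict:
    """Rescale AdamW hyperparameters when the batch size is multiplied by k.

    Hedged assumptions (illustrative, not the paper's exact rules):
      - lr scales as sqrt(k), the square-root rule from SDE analyses of
        adaptive optimisers;
      - weight_decay scales as sqrt(k), so cumulative lr * weight_decay over
        a fixed token budget (k-fold fewer steps) stays constant;
      - (1 - beta) terms scale linearly in k, keeping the EMA averaging
        horizon fixed when measured in tokens (valid only for moderate k).
    """
    beta1 = 1.0 - k * (1.0 - base["beta1"])
    beta2 = 1.0 - k * (1.0 - base["beta2"])
    assert 0.0 < beta1 < 1.0 and 0.0 < beta2 < 1.0, "k too large for linear beta rule"
    return {
        "batch_size": int(base["batch_size"] * k),
        "lr": base["lr"] * math.sqrt(k),
        "weight_decay": base["weight_decay"] * math.sqrt(k),
        "beta1": beta1,
        "beta2": beta2,
    }

base = {"batch_size": 256, "lr": 3e-4, "weight_decay": 0.1, "beta1": 0.9, "beta2": 0.999}
print(rescale_adamw_for_batch(base, k=4.0))
```

Under these assumptions, a hyperparameter configuration tuned at a small batch size would be mechanically mapped to a larger one; whether such a mapping also preserves optimality across token horizons is exactly what the open question asks.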

References

"However, two questions remain open: How should we parameterise training for scaling beyond just model size?"

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration (Mlodozeniec et al., arXiv:2512.22382, 26 Dec 2025), Section 1 (Introduction).