Reliable hyperparameter transfer across scales

Develop reliable methods to transfer optimal training hyperparameters, such as the learning rate and initialization scale, from small-scale proxy models to full-scale large language models (LLMs) while guaranteeing stable training dynamics.

Background

Given the prohibitive cost of hyperparameter search at LLM scale, the paper discusses frameworks such as maximal update parametrization (μP) and emerging hyperparameter scaling laws, which aim to enable zero-shot transfer of hyperparameters from small proxy models. It notes promising empirical validations, but also architectural sensitivities and other limitations.
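As a rough illustration of what μP-style transfer prescribes, the sketch below rescales proxy-tuned hyperparameters for a wider target model. It assumes an Adam-style optimizer and width-only scaling, and the function name and constants are illustrative rather than the paper's implementation.

```python
# Minimal sketch of muP-style hyperparameter scaling (assumption: Adam-like
# optimizer, width-only scaling; this is an illustration, not the paper's code).

def mup_scaled_hparams(base_lr: float, base_init_std: float,
                       base_width: int, target_width: int) -> dict:
    """Rescale hyperparameters tuned on a narrow proxy for a wider target model.

    Under muP with an Adam-style optimizer, hidden-layer learning rates shrink
    roughly like 1/width, initialization std like 1/sqrt(fan_in), and the
    output logits are damped by a 1/width multiplier, while embedding
    learning rates stay width-independent.
    """
    m = target_width / base_width  # width multiplier between proxy and target
    return {
        "hidden_lr": base_lr / m,                   # LR ~ 1/width for hidden matrices
        "hidden_init_std": base_init_std / m ** 0.5,  # std ~ 1/sqrt(fan_in)
        "output_logit_mult": 1.0 / m,               # damp output layer as width grows
        "embedding_lr": base_lr,                    # unchanged for embeddings
    }


if __name__ == "__main__":
    # Hyperparameters swept on a width-256 proxy, transferred to a width-4096 model.
    print(mup_scaled_hparams(base_lr=3e-3, base_init_std=0.02,
                             base_width=256, target_width=4096))
```

In such a scheme, the optimum found by sweeping the learning rate on the small proxy is reused directly, and only the per-layer multipliers change with width; this is the property that makes zero-shot transfer plausible, and its failure modes are what the open question targets.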

The authors frame the central unresolved issue as ensuring that such transfers remain reliable across architectures, data regimes, and optimization dynamics, which motivates principled methods and theoretical guarantees for cross-scale hyperparameter selection.

References

Consequently, a critical open question is how to reliably transfer optimal hyperparameters (e.g., learning rate, initialization) found on small-scale proxy models to large-scale target models.

Beyond the Black Box: Theory and Mechanism of Large Language Models (2601.02907, Gan et al., 6 Jan 2026), in Subsubsection Hyperparameter Transfer, Section 4: Training Stage (Advanced Topics and Open Questions).