Mechanistic origin of the diminishing WSD–SWA gap at large batch sizes

Investigate the mechanism behind the empirical finding that, in the large-batch regime, the performance gap between a constant learning rate with weight averaging and the Warmup–Stable–Decay (WSD) schedule shrinks as the dataset size N grows. Rigorously test the hypothesis that the deterministic edge-of-stability phenomenon, in which gradient descent fails to learn features in the top subspace, accounts for this behavior, and characterize the conditions under which it arises.
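For concreteness, here is a minimal sketch of the two protocols under comparison: a WSD schedule evaluated at its final (decayed) iterate, versus a constant learning rate evaluated at the running uniform average of the weights. The schedule shape follows the standard WSD definition; the optimizer (plain SGD), the uniform averaging rule, and hyperparameters such as warmup_steps and decay_frac are illustrative assumptions, not the paper's exact setup.

```python
import copy
import torch

def wsd_lr(step, total_steps, peak_lr, warmup_steps=100, decay_frac=0.2):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to zero.
    The shape is the standard WSD form; the hyperparameters here are illustrative."""
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)

def train(model, loader, total_steps, peak_lr, schedule="wsd"):
    """Train with either protocol; return the weights each protocol would evaluate."""
    opt = torch.optim.SGD(model.parameters(), lr=peak_lr)
    avg_model = copy.deepcopy(model)   # running uniform average of the iterates
    n_avg, step = 0, 0
    for x, y in loader:
        lr = wsd_lr(step, total_steps, peak_lr) if schedule == "wsd" else peak_lr
        for group in opt.param_groups:
            group["lr"] = lr
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()
        if schedule == "constant":      # online update: p_avg += (p - p_avg) / n
            n_avg += 1
            with torch.no_grad():
                for p_avg, p in zip(avg_model.parameters(), model.parameters()):
                    p_avg += (p - p_avg) / n_avg
        step += 1
        if step >= total_steps:
            break
    # WSD is evaluated at its final iterate; the constant-LR run is
    # evaluated at the uniform average of its weights.
    return model if schedule == "wsd" else avg_model
```

The gap under study is then the difference in evaluation loss between the two returned models, measured as a function of the dataset size N.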

Background

In the large-batch ablation, the authors observe that a constant learning rate with weight averaging increasingly closes the gap to WSD as N grows. They hypothesize that this may stem from deterministic edge-of-stability dynamics that limit learning of top-subspace features, but they do not provide a definitive explanation.

They explicitly defer a precise study of this phenomenon to future work, leaving open a targeted analysis to confirm the mechanism and delineate when it arises.
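To make the hypothesized mechanism concrete, here is a hedged numerical caricature in the quadratic approximation: on a separable quadratic, a coordinate whose curvature sits at the stability threshold (eta * lambda = 2) oscillates forever and its loss never decreases, while better-conditioned coordinates converge. The eigenvalues and learning rate below are arbitrary; the deterministic edge-of-stability dynamics of Damian et al. (2023) are a nonlinear, higher-order effect that this linear sketch only caricatures.

```python
import numpy as np

# Separable quadratic L(x) = 0.5 * sum_i lam[i] * x[i]**2.
# Gradient descent decouples per coordinate: x[i] <- (1 - eta * lam[i]) * x[i].
lam = np.array([200.0, 20.0, 2.0])  # one "top" eigenvalue plus two bulk directions
eta = 0.01                           # eta * lam = [2.0, 0.2, 0.02]
x = np.ones_like(lam)                # initial point
avg = np.zeros_like(lam)             # running uniform average of the iterates

for t in range(1, 1001):
    x = (1.0 - eta * lam) * x        # exact GD step on the quadratic
    avg += (x - avg) / t             # online uniform average

print("final iterate:       ", x)    # top coordinate still oscillates at |x| = 1
print("iterate average:     ", avg)  # the +1/-1 oscillation averages out to ~0
print("loss per coord (GD): ", 0.5 * lam * x**2)
print("loss per coord (avg):", 0.5 * lam * avg**2)
```

In this caricature the uniform average cancels the oscillation in the threshold direction even though the raw iterate makes no progress there, which is one intuition for how averaging a constant-learning-rate run might approach WSD-like behavior; whether anything like this survives the nonlinear self-stabilization dynamics is precisely the open question deferred by the authors.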

References

Note that the difference between constant with averaging and WSD decreases with N. We think this is due to a higher order effect caused by the deterministic edge of stability as studied in \citet{damian2023selfstabilization}, where gradient descent is unable to learn features in the top subspace. However, we leave the precise study of this phenomenon to future work.

Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging (2602.03702, Meterez et al., 3 Feb 2026), in Section 3, Empirical Findings, Large Batch Setting