Mechanistic origin of the diminishing WSD–SWA gap at large batch sizes
Investigate the mechanism underlying the empirical finding that, in the large-batch regime, the performance gap between a constant learning rate with weight averaging and the Warmup–Stable–Decay (WSD) schedule shrinks as the dataset size N grows. Rigorously test the hypothesis that the deterministic edge-of-stability phenomenon, in which gradient descent fails to learn features in the top Hessian subspace, accounts for this behavior, and characterize the conditions under which it occurs.
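As a concrete reference point, the sketch below contrasts the two schedules under comparison. The phase fractions, the helper names (`wsd_lr`, `train_with_tail_averaging`), and the choice to start averaging where WSD would begin decaying are illustrative assumptions, not details taken from the sources above.

```python
import numpy as np

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay: linear warmup to peak_lr, a constant plateau,
    then linear decay to zero. Phase fractions are illustrative choices."""
    warmup_end = max(int(warmup_frac * total_steps), 1)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_end:
        return peak_lr * step / warmup_end
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)

def train_with_tail_averaging(grad_fn, w0, lr, total_steps, avg_frac=0.2):
    """Constant learning rate with weight averaging: no decay phase; instead,
    return the running mean of the iterates over the tail of training."""
    w = np.asarray(w0, dtype=float).copy()
    avg, n_avg = np.zeros_like(w), 0
    avg_start = int((1 - avg_frac) * total_steps)
    for step in range(total_steps):
        w -= lr * grad_fn(w)
        if step >= avg_start:          # average where WSD would instead decay
            n_avg += 1
            avg += (w - avg) / n_avg   # incremental running mean of the iterates
    return avg
```

Both mechanisms act to suppress fluctuations in the final iterate, which is one motivation for comparing them; the question posed above is why the residual gap between them shrinks as N grows.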
References
Note that the gap between the constant-learning-rate-with-averaging schedule and WSD decreases with N. We believe this is due to a higher-order effect caused by the deterministic edge of stability, as studied in \citet{damian2023selfstabilization}, where gradient descent is unable to learn features in the top subspace. However, we leave a precise study of this phenomenon to future work.
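A linear caricature can make the hypothesized mechanism concrete. The sketch below is not the nonlinear self-stabilization analysis of \citet{damian2023selfstabilization}; it only illustrates the underlying stability constraint. On a quadratic, the gradient-descent error along each Hessian eigendirection contracts by a factor of 1 - eta*lambda per step, so a direction whose curvature sits near the stability threshold 2/eta oscillates in sign while making almost no progress. The eigenvalues and step size below are arbitrary illustrative choices.

```python
import numpy as np

eta = 0.1
threshold = 2.0 / eta                 # GD diverges in directions with lam > 20
lams = np.array([19.9, 5.0, 1.0])     # top eigenvalue sits at the edge of stability
x = np.ones_like(lams)                # initial error along each eigendirection

for _ in range(200):
    x = x - eta * lams * x            # exact GD on L(x) = 0.5 * sum(lams * x**2)

for lam, err in zip(lams, x):
    print(f"lam = {lam:5.1f} | factor/step = {1 - eta * lam:+.3f} | x_200 = {err:+.3e}")
# The lam = 19.9 direction flips sign every step (factor -0.99) and is barely
# learned after 200 steps, despite carrying the largest gradient.
```

In the nonlinear setting this regime arises dynamically rather than by construction: sharpness rises toward roughly 2/eta and then hovers there, pinning the top subspace in the slow-progress regime sketched above, which is the effect the proposed study would test against the shrinking WSD-averaging gap.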