Robustness of infinite learning-rate schedules under larger token budgets and distribution shifts
Determine whether the performance recovery afforded by the annealing phase in the proposed infinite learning-rate schedules (a one-time warm-up, a cooldown to a constant learning rate that is then held across tasks, and a final annealing phase) persists when training over substantially larger token budgets, and when successive datasets exhibit distribution shifts, rather than only in the IID split setting where this effect was observed.
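To make the schedule under study concrete, the following is a minimal sketch of an "infinite" learning-rate schedule of the kind described above: a one-time linear warm-up, a cosine cooldown to a constant learning rate that is held indefinitely across tasks, and a final annealing phase toward a minimum rate. All function names, phase lengths, and rate values here are illustrative assumptions, not the paper's exact hyperparameters.

```python
import math

def infinite_lr(step,
                warmup_steps=1_000,      # assumed phase length
                cooldown_steps=10_000,   # assumed phase length
                anneal_start=None,       # step at which final annealing begins
                anneal_steps=5_000,      # assumed annealing length
                max_lr=3e-4,             # illustrative peak rate
                const_lr=1e-4,           # illustrative constant rate
                min_lr=3e-5):            # illustrative final rate
    """Return the learning rate at `step` (illustrative parameters)."""
    if step < warmup_steps:
        # Phase 1: one-time linear warm-up from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Phase 2: cosine cooldown from max_lr to const_lr.
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is not None and step >= anneal_start:
        # Phase 4: final linear annealing from const_lr toward min_lr.
        t = min((step - anneal_start) / anneal_steps, 1.0)
        return const_lr + (min_lr - const_lr) * t
    # Phase 3: constant learning rate, held across all subsequent tasks.
    return const_lr
```

The open question above concerns Phase 4: whether triggering the annealing branch still recovers constant-rate suboptimality when Phase 3 spans far more tokens, or when the tasks it spans are drawn from shifted distributions.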
References
While Fig.~\ref{fig:inf-lr-100bx3} showed that the annealing phase helps recover from this suboptimality in the case of IID splits of the same dataset, it is unclear if this would hold over more tokens, or in the case where the different datasets have distribution shifts.
— Simple and Scalable Strategies to Continually Pre-train Large Language Models
(2403.08763 - Ibrahim et al., 13 Mar 2024) in Section 7, Limitations