Robustness of infinite learning-rate schedules under larger token budgets and distribution shifts
Determine whether the performance recovery afforded by the annealing phase in the proposed infinite learning-rate schedules (a one-time warm-up, a cooldown to a constant learning rate that is then held across tasks, and a final annealing phase) persists when training over substantially larger token budgets, and when successive datasets exhibit distribution shifts, rather than only in the IID split setting where this effect was observed.
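To make the schedule under study concrete, the following is a minimal sketch of an "infinite" learning-rate schedule of the kind described above: a one-time linear warm-up, a cosine cooldown to a constant learning rate that is held indefinitely across tasks, and a final annealing phase toward a minimum rate. All function names, phase lengths, and rate values here are illustrative assumptions, not the paper's exact hyperparameters.

```python
import math

def infinite_lr(step,
                warmup_steps=1_000,      # assumed phase length
                cooldown_steps=10_000,   # assumed phase length
                anneal_start=None,       # step at which final annealing begins
                anneal_steps=5_000,      # assumed annealing length
                max_lr=3e-4,             # illustrative peak rate
                const_lr=1e-4,           # illustrative constant rate
                min_lr=3e-5):            # illustrative final rate
    """Return the learning rate at `step` (illustrative parameters)."""
    if step < warmup_steps:
        # Phase 1: one-time linear warm-up from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Phase 2: cosine cooldown from max_lr to const_lr.
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is not None and step >= anneal_start:
        # Phase 4: final linear annealing from const_lr toward min_lr.
        t = min((step - anneal_start) / anneal_steps, 1.0)
        return const_lr + (min_lr - const_lr) * t
    # Phase 3: constant learning rate, held across all subsequent tasks.
    return const_lr
```

The open question above concerns Phase 4: whether triggering the annealing branch still recovers constant-rate suboptimality when Phase 3 spans far more tokens, or when the tasks it spans are drawn from shifted distributions.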
References
While Fig.~\ref{fig:inf-lr-100bx3} showed that the annealing phase helps recover from this suboptimality in the case of IID splits of the same dataset, it is unclear if this would hold over more tokens, or in the case where the different datasets have distribution shifts.
— Simple and Scalable Strategies to Continually Pre-train Large Language Models
(2403.08763 - Ibrahim et al., 13 Mar 2024) in Section 7, Limitations