Efficacy of LR re-warm/re-decay and replay strategies under broader continual pretraining conditions

Verify the efficacy of combining learning-rate re-warm/re-decay and replay for continual pretraining under larger distribution shifts, larger model and dataset scales, infinite learning-rate schedules, growing model architectures, and tokenizer adaptation for handling larger changes in data distribution.

Background

The referenced study showed that re-warming/re-decaying learning rates with modest replay can match full retraining baselines in two-task settings.
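As a rough illustration of the approach, the sketch below re-warms the learning rate at the start of a new task, re-decays it with a cosine schedule, and mixes a small fraction of previous-task data into each draw (replay). The warmup length, peak/minimum learning rates, and replay fraction are illustrative assumptions, not the study's exact settings.

```python
import math
import random

def rewarmed_cosine_lr(step, total_steps, warmup_steps=1000,
                       peak_lr=3e-4, min_lr=3e-5):
    """Re-warm the LR from min_lr to peak_lr, then cosine re-decay to min_lr.
    Applied from step 0 of each new continual-pretraining stage, regardless
    of where the previous stage's schedule ended."""
    if step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def sample_with_replay(new_data, old_data, replay_fraction=0.05):
    """Draw a training example, replaying old-task data a small fraction of the time."""
    source = old_data if random.random() < replay_fraction else new_data
    return random.choice(source)
```

In the two-task setting of the referenced study, this combination of re-warming/re-decaying and modest replay was reported to match the full-retraining baseline.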

The review emphasizes that broader conditions, such as larger distribution shifts, larger model and dataset scales, infinite learning-rate schedules, growing architectures, and tokenizer adaptation, remain untested.
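Among these, an "infinite" learning-rate schedule is one intended to extend across an arbitrary number of future tasks, for instance by holding a constant rate that is only briefly annealed when a deployable checkpoint is needed. The sketch below shows one possible shape under that assumption; the phase structure and all constants are illustrative and not taken from the paper.

```python
def infinite_lr(step, warmup_steps=1000, cooldown_steps=4000,
                peak_lr=3e-4, const_lr=1e-4, min_lr=3e-5,
                anneal_start=None, anneal_steps=2000):
    """One possible 'infinite' schedule: warmup -> cooldown to a constant LR
    that can be held indefinitely across tasks -> optional short annealing
    when a checkpoint is needed. All constants here are illustrative."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        t = (step - warmup_steps) / cooldown_steps
        return peak_lr + t * (const_lr - peak_lr)
    if anneal_start is not None and step >= anneal_start:
        t = min(1.0, (step - anneal_start) / anneal_steps)
        return const_lr + t * (min_lr - const_lr)
    return const_lr
```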

References

While the experiments updated the model on two subsequent tasks, the approach's efficacy in settings involving larger distribution shifts, model and dataset scales, infinite LR schedules, growing models, and tokenizer adaptation for handling larger changes in data distribution remains to be verified.

Towards Incremental Learning in Large Language Models: A Critical Review (2404.18311 - Jovanovic et al., 28 Apr 2024) in Section 2.1 (Continual Learning) – Simple and Scalable Strategies to Continually Pre-train LLMs