Simple and Scalable Strategies to Continually Pre-train Large Language Models (2403.08763v4)

Published 13 Mar 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English→English) and a stronger distribution shift (English→German) at the 405M parameter model scale with large dataset sizes (hundreds of billions of tokens). Selecting the weak but realistic shift for larger-scale experiments, we also find that our continual learning strategies match the re-training baseline for a 10B parameter LLM. Our results demonstrate that LLMs can be successfully updated via simple and scalable continual learning strategies, matching the re-training baseline using only a fraction of the compute. Finally, inspired by previous work, we propose alternatives to the cosine learning rate schedule that help circumvent forgetting induced by LR re-warming and that are not bound to a fixed token budget.

Simple and Scalable Strategies to Continually Pre-train LLMs

Introduction

Pre-training LLMs is extremely computationally expensive, especially when new data becomes available and the model needs an update. Traditionally, updating requires re-training from scratch on the combined old and new datasets, a practice that is both inefficient and unsustainable. Given the dynamic nature of data and the continuous improvement in data quality, there is a pressing need for more efficient strategies to update these models. Our work develops and empirically evaluates simple and scalable continual learning techniques for LLMs, targeting the efficient integration of new data into existing models.

The Need for Efficient Update Strategies

Updating LLMs with new data introduces two main challenges: adaptation and forgetting. Adaptation means effectively learning from the new data, while forgetting refers to the loss of previously learned information. We hypothesize that both challenges can be addressed by adjusting the learning rate schedule and replaying a portion of the previous data.

Learning Rate Adjustments

A common theme in open-source LLMs is the use of a cosine decay schedule for the learning rate. Our analysis shows that simply resuming training from the final checkpoint, at the small learning rate the cosine schedule ends on, leads to suboptimal adaptation to new data. We therefore propose "re-warming" the learning rate back to a higher value and then "re-decaying" it along a cosine schedule fit to the token budget of the new dataset. This approach improves adaptation to the new data.
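
For concreteness, the sketch below shows one way such a re-warmed, re-decayed cosine schedule could be implemented. The function name and the particular warmup fraction, maximum, and minimum learning rates are illustrative assumptions, not the paper's exact hyperparameters.

```python
import math

def rewarmed_cosine_lr(step, total_steps, warmup_steps, max_lr, min_lr):
    """Cosine schedule restarted for the new dataset: linearly re-warm from
    min_lr back to max_lr, then cosine-decay to min_lr over the remaining
    token budget. (Illustrative sketch, not the paper's exact settings.)"""
    if step < warmup_steps:
        # Re-warming phase: ramp the learning rate back up linearly.
        return min_lr + (max_lr - min_lr) * step / warmup_steps
    # Re-decaying phase: cosine anneal over the new dataset's remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice such a function would be attached to the optimizer as a per-step multiplier (e.g. via a LambdaLR-style wrapper), with total_steps set by the new dataset's token budget.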

The Role of Replay

While re-warming and re-decaying the learning rate improve adaptation, they also increase the risk of forgetting previously learned information. To counteract this, we experimented with various percentages of replay, i.e., mixing a fraction of the old data into the training batches built from the new data. Our results show that a well-chosen replay percentage effectively mitigates forgetting without significantly hampering adaptation to the new data.
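
As a minimal illustration of how replay could enter batch construction (the 5% default replay fraction, the in-memory lists of examples, and the function name below are simplifying assumptions, not the paper's actual data pipeline):

```python
import random

def sample_mixed_batch(new_data, old_data, batch_size, replay_frac=0.05):
    """Draw a batch in which a fixed fraction of examples is replayed from the
    old dataset and the rest come from the new dataset. (Sketch only: real
    pre-training pipelines stream tokenized shards rather than sampling lists.)"""
    n_replay = int(round(batch_size * replay_frac))
    replayed = random.sample(old_data, n_replay)             # replayed old-data examples
    fresh = random.sample(new_data, batch_size - n_replay)   # new-data examples
    batch = replayed + fresh
    random.shuffle(batch)  # avoid a fixed old/new ordering within the batch
    return batch
```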

Empirical Evaluation and Results

Our experiments pre-trained models of 405M and 10B parameters on datasets of varying domain similarity and size (up to 300B tokens). The findings show that combining learning rate re-warming, re-decaying, and replay achieves performance on par with models trained from scratch on the combined old and new datasets, while drastically reducing computational cost.

Learning Rate Schedules for Continual Pre-training

Analyzing further, we identified that re-warming the learning rate can itself induce unnecessary forgetting when transitioning from one dataset to another. As an alternative, we explored "infinite learning rate schedules," which avoid re-warming by maintaining a constant learning rate across dataset transitions. In our initial experiments, these schedules performed competitively with traditional cosine decay schedules without the forgetting induced by re-warming during transitions.
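
The sketch below illustrates the general shape such an "infinite" schedule can take: a single warm-up, a cooldown to a constant plateau that is held across dataset transitions, and an optional annealing phase run only when a checkpoint is to be finalized. The phase lengths, plateau value, and cosine-shaped cooldown are assumptions for illustration; the paper discusses several concrete variants.

```python
import math

def infinite_lr(step, warmup_steps, cooldown_steps, max_lr, const_lr,
                anneal_start=None, anneal_steps=1, min_lr=0.0):
    """'Infinite' schedule sketch: warm up once, cool down to a constant rate,
    hold it indefinitely across dataset transitions, and only anneal toward
    min_lr when finalizing a checkpoint. (Illustrative, not the paper's exact
    parameterization.)"""
    if step < warmup_steps:
        # One-time warm-up at the start of pre-training.
        return max_lr * step / max(1, warmup_steps)
    if step < warmup_steps + cooldown_steps:
        # Cooldown from max_lr to the constant plateau (cosine here; other
        # decay shapes are possible).
        progress = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * progress))
    if anneal_start is not None and step >= anneal_start:
        # Optional final annealing before releasing a checkpoint.
        progress = min(1.0, (step - anneal_start) / max(1, anneal_steps))
        return const_lr + (min_lr - const_lr) * progress
    # Constant phase: held across dataset boundaries, so no re-warming is needed.
    return const_lr
```

Because the constant phase is not tied to a fixed token budget, new datasets can be appended without re-warming the learning rate.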

Conclusion

Our work substantiates the practicality of continually pre-training LLMs through simple and scalable strategies. By judiciously adjusting the learning rate and incorporating replay mechanisms, we can efficiently update LLMs with new data, maintaining or even surpassing the performance of models re-trained from scratch. Infinite learning rate schedules further present a promising direction for seamless model updates across multiple datasets. This research not only contributes to the efficiency of LLM pre-training but also opens avenues for maintaining up-to-date models in a more sustainable manner.

Authors (8)
  1. Adam Ibrahim
  2. Benjamin Thérien
  3. Kshitij Gupta
  4. Mats L. Richter
  5. Quentin Anthony
  6. Timothée Lesort
  7. Eugene Belilovsky
  8. Irina Rish