
Replaying Pre-Training Data

Updated 7 March 2026
  • Replaying pre-training data is the practice of reintroducing original training samples during later training phases to prevent catastrophic forgetting.
  • It employs methods such as proportional mixing, curriculum schedules, and buffer-based streaming to balance retention and adaptation across domains.
  • Optimized replay strategies enhance both target performance and overall generalization, benefiting applications in language modeling, computer vision, and reinforcement learning.

Replaying pre-training data refers to the explicit reuse or reintroduction of data from a model’s original (or prior) pre-training distribution during later stages of training, such as continual pre-training, fine-tuning, or self-supervised representation learning. Its purpose is either to mitigate catastrophic forgetting or, more recently, to enhance adaptation and data efficiency on target domains. The technique spans neural network paradigms from language modeling and computer vision to reinforcement learning, and is motivated by the observation that continually updating a model without revisiting its foundational data often degrades general capabilities or transfers poorly to new tasks. Replay may be implemented with streaming, reservoir sampling, buffer-based strategies, or proportional mixing in multi-objective losses. How much data to replay, and when, depends on theoretical, empirical, and domain-specific considerations, and affects both target and generalization performance.

1. Theoretical Foundations and Objectives

The core rationale for replaying pre-training data is to navigate the stability–plasticity dilemma intrinsic to continual learning systems: protecting previously acquired knowledge while simultaneously enabling plastic adaptation to new distributions. Theoretical treatments in continual supervised and self-supervised learning regimes formalize this as minimizing a risk (or loss) on target data while controlling for drift in performance on the original pre-training distribution. In fine-tuning, for example, combining the target empirical risk $F_n(\theta)$ with a weighted pre-training risk $H_m(\theta)$ yields an aggregate objective of the form

$$\min_{\theta} \, \alpha F_n(\theta) + (1-\alpha) H_m(\theta)$$

where $\alpha \in (0,1]$ controls emphasis on the target task. The excess risk bound on the target task can be tightened by replaying pre-training samples that are similar (in gradient or embedding space) to the target domain (Liu et al., 2021). In continual pre-training (CPT) for LLMs, data replay is essential for preventing catastrophic forgetting of general capabilities, conceptually formalized as minimizing

$$L_{\mathrm{CPT}}(\theta) = (1-r) L_{\mathrm{tgt}}(\theta) + r L_{\mathrm{src}}(\theta)$$

with $r$ as the replay ratio (Zheng et al., 2024, Gu et al., 2024). In self-supervised streaming, replay or parameter regularization similarly bridges the gap to joint training (Hu et al., 2021).
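As a minimal sketch of the weighted objective above, the following Python function combines per-example losses from a target batch and a replayed source batch using a replay ratio r. The function name and the use of plain lists of per-example losses are illustrative assumptions, not part of any cited method. Setting α = 1 − r gives the fine-tuning form of the same objective.

```python
def mixed_replay_loss(target_losses, source_losses, r):
    """Return (1 - r) * mean(target_losses) + r * mean(source_losses).

    Hypothetical helper: `target_losses` and `source_losses` are lists of
    per-example losses; `r` is the replay ratio in [0, 1].
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("replay ratio r must lie in [0, 1]")
    if not target_losses or not source_losses:
        raise ValueError("both loss lists must be non-empty")
    l_tgt = sum(target_losses) / len(target_losses)  # L_tgt: mean target loss
    l_src = sum(source_losses) / len(source_losses)  # L_src: mean replayed-source loss
    return (1.0 - r) * l_tgt + r * l_src


# Example: mean target loss 3.0, mean source loss 1.0, replay ratio 0.25.
loss = mixed_replay_loss([2.0, 4.0], [1.0], r=0.25)
```

In a real training loop the two means would come from the model's forward passes on the two batches, and the weighted sum would be the quantity that is differentiated.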

2. Replay Mechanisms and Scheduling Strategies

Multiple replay mechanisms are employed, varying by domain and training context:

  • Proportional Mixing: Interleaving batches from pre-training and new datasets with fixed or adaptive ratios. For CPT, replay ratios are typically set empirically between 0.1 and 0.3 for general tasks, balancing retention and adaptation. In fine-tuning, values as high as 0.5–0.6 are reported as optimal when target domain data is exceedingly scarce (Kotha et al., 5 Mar 2026, Zheng et al., 2024).
  • Curriculum Schedules and Replay Timing: Replay may be held constant throughout training or scheduled for specific epochs or steps. In continued LLM pretraining, the data distribution is switched when the learning rate reaches a fraction of its maximum, moving from a "general blend" to a "QA blend" to optimize transfer and retention (Parmar et al., 2024).
  • Buffer-Based Streaming: Streaming regimes implement replay using reservoir buffers, FIFO sliding windows, or prioritized sampling, maintaining a subset of old data for efficiency. Sizes are typically capped at 5–10% of the cumulative stream under practical memory constraints (Hu et al., 2021).
  • Similarity-Driven or Task-Selective Replay: In vision, optimal transport and embedding-based selection isolate pre-training subsets most aligned with the target to be replayed, rather than indiscriminately sampling from the whole source corpus (Liu et al., 2021).
  • Regularized Replay: In parameter-efficient fine-tuning (e.g., LoRA adapters), KL-regularization toward the base model can be paired with data replay, often of approximate web-text proxies, further stabilizing representations while allowing efficient instruction adaptation (Riemer et al., 26 Dec 2025).
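Proportional mixing, the first mechanism in the list above, can be sketched as a batch generator that fills a fixed fraction of every batch with replayed pre-training examples. This is an illustrative sketch under stated assumptions: the function name, the tagging of each example with its origin, and sampling with replacement are choices made here, not details taken from the cited papers.

```python
import random


def mixed_batches(target_data, replay_data, batch_size, replay_ratio,
                  num_batches, seed=0):
    """Yield batches in which about `replay_ratio` of examples are replayed.

    Hypothetical helper: each yielded batch is a shuffled list of
    ("target", x) and ("replay", x) pairs.
    """
    if not 0.0 <= replay_ratio <= 1.0:
        raise ValueError("replay_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    # Per-batch count of replayed examples (rounded to the nearest integer).
    n_replay = round(batch_size * replay_ratio)
    n_target = batch_size - n_replay
    for _ in range(num_batches):
        # Sample with replacement so small datasets never run out.
        batch = [("target", x) for x in rng.choices(target_data, k=n_target)]
        batch += [("replay", x) for x in rng.choices(replay_data, k=n_replay)]
        rng.shuffle(batch)  # avoid a fixed position for replayed examples
        yield batch


# Example: batches of 8 with a replay ratio of 0.25 (2 replayed examples each).
batches = list(mixed_batches(list(range(100)), list(range(100, 200)),
                             batch_size=8, replay_ratio=0.25, num_batches=5))
```

An adaptive schedule could be approximated by calling the generator in phases with different `replay_ratio` values.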

3. Data-Efficiency and Target-Task Performance Gains

Empirical studies consistently report that replaying pre-training data not only prevents forgetting but also markedly improves data efficiency for the target task, even in less-related or low-overlap scenarios:

| Study | Data-Efficiency Gain | Replay Ratio ρ | Regime | Scale |
|---|---|---|---|---|
| (Kotha et al., 5 Mar 2026) | up to 1.87× | 0.5–0.6 | Fine-tune, 150M | 4M target / 4B total |
| (Kotha et al., 5 Mar 2026) | up to 2.06× | 0.3–0.8 | Mid-training | 4M / 4B |
| (Zheng et al., 2024) | Target accuracy ≈1–2× | 0.1–0.3 | CPT, cross-lingual | 1.4B–5B |
| (Parmar et al., 2024) | +9% rel. acc. | Hybrid (phased) | Continued pretrain, LLM | 15B |
| (Liu et al., 2021) | +1%–3% top-1 | UOT-based | Fine-tuning, vision | ≲100M |
| (Riemer et al., 26 Dec 2025) | ~3–7× less forgetting | 1–3 | LoRA SFT, replay+KL | 1.5–14B |

In large-scale LLM adaptation, replaying pre-training data increased web navigation success by 4.5% and Basque QA accuracy by 2% (Kotha et al., 5 Mar 2026).

Replay is maximally beneficial when target domain data is rare or underrepresented in pre-training. The marginal utility diminishes as more target data is incorporated early in training, or as the domain gap narrows.

4. Trade-offs, Mixture Laws, and Predictive Scaling

A central challenge is optimizing the mixture ratio between replayed and new data to simultaneously (a) preserve base model competence (measured on general benchmarks) and (b) maximize target domain adaptation. The CMR ("Critical Mixture Ratio") scaling law models this trade-off explicitly, fitting power laws to general and domain losses as functions of tokens and mixture ratio, then solving for the largest domain-data fraction (equivalently, the smallest replay fraction) that keeps general loss within a user-specified budget (Gu et al., 2024).

Let $L_{\mathrm{gen}}(R, T)$ and $L_{\mathrm{dom}}(R, T)$ be the general and domain losses after $T$ tokens at mixture ratio $R$. The CMR is computed by fitting the loss changes as

$$\Delta L_{\mathrm{dom}}(R, T) = \alpha_1(R) T^{s_1} + \beta_1(R)$$

$$\Delta L_{\mathrm{gen}}(R, T) = \alpha_2(R) T^{s_2} + \alpha_3(R) T^{s_3} + \beta_2(R)$$

then solving for $R = R^*$ that maintains general loss below threshold after $T$ tokens.

For example, with $T = 20\text{B}$ tokens, CMRs between 30% and 48% domain data are optimal for various LLM scales (Gu et al., 2024). This law transforms scheduling from heuristic to analytically predictable, reducing resource wastage.
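Once the power-law fits are available, the final "solve for the critical ratio" step can be approximated by a grid search. The sketch below assumes that R denotes the domain-data fraction (so the replay fraction is 1 − R) and that T is measured in billions of tokens. The coefficient functions and exponents in `toy_delta_l_gen` are invented purely for illustration and are not fitted values from Gu et al. (2024).

```python
def toy_delta_l_gen(R, T):
    """Hypothetical general-loss change with the fitted functional form
    alpha_2(R) * T**s_2 + alpha_3(R) * T**s_3 + beta_2(R).
    All coefficients below are made up for this example."""
    alpha_2, s_2 = 0.4 * R, 0.10
    alpha_3, s_3 = -0.1 * (1.0 - R), 0.05
    beta_2 = 0.0
    return alpha_2 * T ** s_2 + alpha_3 * T ** s_3 + beta_2


def critical_mixture_ratio(delta_l_gen, T, budget, steps=100):
    """Return the largest grid value of R with delta_l_gen(R, T) <= budget,
    or None if no grid value satisfies the budget.

    The whole grid is scanned; if the fitted curve is not monotone in R,
    smaller values of R may still violate the budget.
    """
    feasible = [i / steps for i in range(steps + 1)
                if delta_l_gen(i / steps, T) <= budget]
    return max(feasible) if feasible else None


# Example: tolerate a general-loss increase of at most 0.05 after T = 20 (billion tokens).
r_star = critical_mixture_ratio(toy_delta_l_gen, T=20.0, budget=0.05)
replay_fraction = None if r_star is None else 1.0 - r_star
```

In practice the coefficient functions would be fitted to small-scale runs at several mixture ratios before being extrapolated to the target token budget.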

5. Risks, Pathologies, and Controversies

Although replay is widely effective, it is not universally benign. Recent theoretical work demonstrates that in over-parameterized linear and nonlinear settings, small or adversarially-chosen replay buffers can increase—rather than decrease—forgetting, depending critically on the principal angle between subspaces of tasks (Mahdaviyeh et al., 4 Jun 2025). If task subspaces are neither orthogonal nor fully overlapping, there exists a danger zone (principal angle ≈ π/4) where replay can exacerbate parameter drift, especially for small buffer sizes or poorly-chosen samples. Empirical results confirm non-monotonic and sometimes negative effects of replay in both linear regression and neural network tasks. These results indicate that replay’s efficacy depends on both the relationship between tasks and the adequacy of replay buffer size and diversity.

Empirical recommendations include:

  • Avoiding overly small or randomly-selected replay subsets when domain/task geometry is uncertain;
  • Curating replay buffers to ensure coverage of high-leverage directions or task modes;
  • Monitoring validation curves for non-monotonicity with respect to buffer size (Mahdaviyeh et al., 4 Jun 2025).
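The monitoring recommendation above can be sketched as a small check over a sweep of buffer sizes. The helper below is a hypothetical illustration: it assumes one validation loss on the earlier task has been recorded per buffer size, and it flags every interval in which a larger buffer produced a worse loss.

```python
def non_monotonic_intervals(buffer_sizes, val_losses, tol=0.0):
    """Return (smaller_size, larger_size) pairs where the old-task validation
    loss increased by more than `tol` as the replay buffer grew.

    Hypothetical helper; `tol` absorbs run-to-run noise.
    """
    if len(buffer_sizes) != len(val_losses):
        raise ValueError("buffer_sizes and val_losses must have equal length")
    pairs = sorted(zip(buffer_sizes, val_losses))  # order by buffer size
    flagged = []
    for (size_a, loss_a), (size_b, loss_b) in zip(pairs, pairs[1:]):
        if loss_b > loss_a + tol:
            flagged.append((size_a, size_b))
    return flagged


# Example sweep: a small buffer (100 samples) is worse than no replay at all.
flags = non_monotonic_intervals([0, 100, 500, 1000], [1.0, 1.2, 0.9, 0.8])
```

A non-empty result suggests the sweep has entered a regime where replay hurts, so the buffer size or sample selection should be revisited.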

6. Application Domains and Practical Guidelines

Replay of pre-training data is deployed across major modalities:

  • Language Modeling: Replaying general corpus tokens preserves out-of-domain abilities during domain specialization. Practical ratios are 10–30%, and CMR scaling allows domain-specific adaptation within generality constraints (Zheng et al., 2024, Gu et al., 2024, Kotha et al., 5 Mar 2026).
  • Computer Vision: Fine-tuning on small or fine-grained datasets benefits from replaying selected pre-training classes or clusters that maximize representation similarity, especially with information-theoretic subset selection (e.g., unbalanced optimal transport) (Liu et al., 2021).
  • Self-Supervised Learning (SSL): Streaming pre-training with replay buffers closes nearly all the performance gap to joint training, with typical buffer sizes at 5–10% of the cumulative stream. For distribution shifts between chunks, replay plus regularization (e.g., MAS, EWC) is robust (Hu et al., 2021).
  • Parameter-Efficient Adaptation: LoRA and similar adapters used in LLMs are susceptible to forgetting, even for small instruction-tuning runs. Combining KL-regularization with replay from approximate pre-training distributions nearly eliminates forgetting at modest compute overhead (Riemer et al., 26 Dec 2025).
  • Data Recycling: When high-quality organic data is scarce, synthetic replay-like generation (e.g., RL-rewarded rephrasers) can multiply data efficiency by 2–3× (Yu et al., 12 Oct 2025).
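For the buffer-based streaming setups mentioned in the list above, a reservoir buffer keeps a uniform random subset of everything seen so far in fixed memory. The class below is a minimal sketch using standard reservoir sampling (Algorithm R). The class name and interface are assumptions made for illustration, not an implementation from Hu et al. (2021).

```python
import random


class ReservoirReplayBuffer:
    """Fixed-capacity buffer holding a uniform sample of a data stream."""

    def __init__(self, capacity, seed=0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of stream items observed so far
        self._rng = random.Random(seed)

    def add(self, item):
        """Observe one stream item; keep it with probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self._rng.randrange(self.seen)  # uniform in [0, seen)
            if j < self.capacity:
                self.items[j] = item  # overwrite a uniformly chosen slot

    def sample(self, k):
        """Draw up to k distinct stored items for replay."""
        return self._rng.sample(self.items, min(k, len(self.items)))


# Example: a buffer capped at 5% of a 1,000-item stream.
buffer = ReservoirReplayBuffer(capacity=50)
for example in range(1000):
    buffer.add(example)
replay_batch = buffer.sample(8)
```

A FIFO sliding window or prioritized sampler would change only the replacement rule in `add`.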

General guidelines include:

  • Always retain (or synthesize) access to representative slices of base data for replay.
  • Set replay ratios empirically within 10–50%, adjusting for domain disparity and data scarcity.
  • Prefer subset or representation-based selection over uniform random sampling when feasible.
  • Combine replay with regularization or KL-penalties in parameter-restricted fine-tuning.
  • Use scaling laws to optimize mixture and efficiently allocate limited compute (Gu et al., 2024).
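The guideline on combining replay with KL penalties can be sketched as an extra term added to the data loss. The code below is an assumption-laden illustration: the direction of the divergence (base model relative to tuned model), the weight `beta`, and the representation of predictions as lists of per-position probability vectors are all choices made here, not details confirmed by Riemer et al. (26 Dec 2025).

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def kl_regularized_loss(data_loss, base_probs, tuned_probs, beta):
    """Return data_loss + beta * mean_position KL(base || tuned).

    Hypothetical helper: `data_loss` is the (possibly replay-mixed) training
    loss; `base_probs` and `tuned_probs` are lists of next-token probability
    vectors from the frozen base model and the model being tuned.
    """
    if len(base_probs) != len(tuned_probs) or not base_probs:
        raise ValueError("need matching, non-empty probability lists")
    kl = sum(kl_divergence(p, q)
             for p, q in zip(base_probs, tuned_probs)) / len(base_probs)
    return data_loss + beta * kl


# Example: the tuned model has drifted from the base model at one position.
loss = kl_regularized_loss(
    data_loss=1.75,
    base_probs=[[0.5, 0.5], [0.9, 0.1]],
    tuned_probs=[[0.5, 0.5], [0.5, 0.5]],
    beta=0.1,
)
```

When the tuned model matches the base model exactly, the penalty vanishes and the loss reduces to `data_loss`.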

7. Extensions, Open Problems, and Future Directions

Emerging directions include dynamic or context-dependent curriculum replay—placing high-quality data at minima of the training re-evaluation curve (TREC) to boost LLM transfer (Bergsma et al., 29 Sep 2025)—and reward-driven data rephrasers for corpus augmentation (Yu et al., 12 Oct 2025). Data-recycling pipelines are being extended to new modalities and to hybrid approaches that interleave programmatic and generative replay. In continual self-supervised settings, synthetic replay (i.e., model-generated proxies) is being developed for privacy-constrained medical imaging domains (Luo et al., 22 Dec 2025), though full method details remain underreported.

A persistent open question is the principled adaptation of replay strategies to mixture of experts, sparse networks, and extreme low-data or ultra-high-disparity regimes. Additionally, the theory of harmful replay and mechanisms to detect or remedy it remain active research areas. The optimal scheduling and buffer management in highly nonstationary online settings, and the translation of scaling-law approaches to multi-task or multi-modal environments, will be important for generalizing and automating replay schedules.

