Continual Pretraining in Foundation Models

Updated 14 January 2026
  • Continual Pretraining is a paradigm that sequentially adapts pretrained models using unsupervised learning on new, often domain-specific, data streams.
  • It employs techniques like data replay and parameter-efficient updates to mitigate catastrophic forgetting while balancing plasticity and stability.
  • This approach enhances generalization and compute efficiency across diverse tasks, from multilingual language models to vision applications.

Continual Pretraining (CP) is a paradigm in machine learning and foundation model development in which a large-scale model, initially pretrained on a broad, domain-general corpus, is further adapted through sequential, often domain-specific, unsupervised learning on new data streams. Unlike traditional fine-tuning, which typically focuses on supervised adaptation to downstream tasks and exposes models to small, labeled datasets, CP extends or reopens the pretraining phase itself, leveraging unlabeled or lightly processed data in a manner that preserves or improves global representation quality, domain adaptability, and knowledge retention across an expanding suite of application areas in vision, language, and multimodal settings. CP is central to efficient adaptation, knowledge refresh, and domain transfer for large-scale LLMs, vision transformers, and cross-modal architectures, and underpins many of the latest advances in multilingual, low-resource, and domain-specialized AI systems.

1. Formalization and Core Objectives

Continual Pretraining is formally defined as the following process. Starting with pretrained model parameters $\theta_0$ (trained on an initial distribution $\mathcal{D}_0$), the model is updated by further unsupervised learning on a sequence of new data distributions $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_T$, with the aim of improving or maintaining performance on both prior and new domains/tasks without catastrophic forgetting. The training loss is the same unsupervised objective (e.g., masked language modeling or next-token prediction) used in initial pretraining:

$$\mathcal{L}_{\text{CP}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{CP}}}\Bigl[-\sum_i \log p_\theta(x_i \mid x_{<i})\Bigr]$$

where $\mathcal{D}_{\text{CP}}$ denotes the (possibly mixed) continual pretraining corpus (Chen et al., 2024, Sun et al., 2023, Yan et al., 2022).
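To make the objective concrete, the sketch below computes this next-token loss for a causal language model in PyTorch; the Hugging Face-style forward pass returning `.logits` is an assumed interface, not a detail taken from the cited papers.

```python
# Illustrative computation of the CP objective (next-token prediction) in PyTorch,
# assuming a Hugging Face-style causal LM whose forward pass returns .logits.
import torch.nn.functional as F

def continual_pretraining_loss(model, input_ids, attention_mask=None):
    """Mean negative log-likelihood of each token given its prefix,
    i.e. a Monte Carlo estimate of -E[ sum_i log p_theta(x_i | x_<i) ]."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits  # (B, T, V)
    shift_logits = logits[:, :-1, :].contiguous()   # position t predicts token t+1
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```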

Key objectives of CP include:

  • Efficient Domain/Language Adaptation: Injecting large-scale, broad, in-domain knowledge (terminology, style, facts) far beyond what supervised fine-tuning can furnish (Ueda et al., 4 Nov 2025, Chen et al., 2024).
  • Retention of Prior Knowledge: Mitigating catastrophic forgetting so that previously acquired representation quality, including general-domain knowledge, is preserved, even as the model specializes towards new domains or languages (Elhady et al., 30 May 2025, Cossu et al., 2022).
  • Compute and Data Efficiency: Delivering adaptation at a fraction of the FLOP/data costs of training a new model from scratch, as formalized by extended scaling laws (Zheng et al., 2024).
  • Representation Plasticity-Stability Tradeoff: Balancing adaptation speed (plasticity) and retention (stability), especially under severe distribution shifts (Guo et al., 2024, Elhady et al., 30 May 2025).
  • Generalization to Sequential and Cross-Domain Tasks: Enabling expansion of models' scope to new languages, reasoning skills, or sequential-decision domains without repeated full-model retraining (Ueda et al., 4 Nov 2025, Ma et al., 11 Apr 2025).

2. Methodological Variants and Technical Strategies

Multiple CP methodologies have emerged:

2.1 Naïve Full-Model CP

All model parameters are updated via unsupervised loss on the new corpus (e.g., language modeling, MIM), using standard optimizers and schedules. This approach is susceptible to catastrophic forgetting and requires careful scheduling and monitoring (Gupta et al., 2023, Cossu et al., 2022).
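A minimal full-model CP loop in this style might look like the following sketch; the checkpoint name, learning rate, and sequence length are illustrative placeholders rather than settings reported in the cited studies.

```python
# Naive full-model CP sketch with Hugging Face Transformers (illustrative settings only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")        # pretrained checkpoint (theta_0)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token                   # gpt2 defines no pad token
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR limits parameter drift

def cp_step(batch_texts):
    """One full-model update on new-domain text with the original LM objective."""
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True, max_length=512)
    labels = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)  # ignore padding
    loss = model(input_ids=enc["input_ids"],
                 attention_mask=enc["attention_mask"],
                 labels=labels).loss                        # shifted NLL, as in Section 1
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```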

2.2 Parameter-Efficient CP (PEFT, Adapters, LoRA)

To minimize forgetting and computational cost, CP can restrict weight updates to a (layer-wise) subset of parameters:

  • Adapters: Inserted modules such as the AF-Adapter (an attention and FFN extension in each layer that updates only 17% of parameters), which significantly reduce forgetting while improving in-domain task scores (Yan et al., 2022).
  • Low-Rank Adaptation (LoRA): Only the low-rank decomposition parameters added to attention or FFN layers are updated, reducing memory and compute requirements (Chen et al., 2024, Ma et al., 11 Apr 2025); a minimal sketch follows this list.
  • ELO (Efficient Layer-Specific Optimization): Only the first and last layers are detached, retrained on target data, then realigned with a brief full-model fine-tuning. This yields 5–6.5× speedups and preserves source language skills with minimal degradation (Yoo et al., 7 Jan 2026).
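To make the LoRA variant concrete, the sketch below wraps a frozen linear layer with a trainable low-rank update; the rank, scaling factor, and choice of which projections to wrap are illustrative assumptions rather than the cited recipes.

```python
# LoRA-style sketch: a frozen linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x ; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Example (hypothetical attribute path): layer.attn.q_proj = LoRALinear(layer.attn.q_proj)
```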

2.3 Data Mixing, Replay, and Curriculum

  • Domain and Language Replay: Mixing a fraction of original data (e.g., 10–30%) with the target data during CP prevents catastrophic forgetting and supports emergent abilities in new languages (Elhady et al., 30 May 2025, Zheng et al., 2024, Guo et al., 2024); a small sampling sketch follows this list.
  • Quality-Based Subsetting: Selecting high-perplexity (low-quality) samples for removal or constructing a high-quality core corpus accelerates domain adaptation and reduces training cost (Guo et al., 2024, Vo et al., 2024).
  • Controlled Introduction of Synthetic Data: Generating instruction-following or scientific QA pairs, integrated gradually into the token stream, leads to improved reasoning and language skills, as well as more robust generalization across domains (Chen et al., 2024, Ishibashi et al., 15 May 2025).
  • Perplexity-Ordered Curriculum: Schedules that introduce "easier" data first and ramp up difficulty mitigate the stability gap at initialization (Chen et al., 2024).
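As a minimal illustration of the replay-based mixing above, the sampler below draws a fixed fraction of each batch from the source corpus; the 20% default is one point in the 10–30% range reported in the literature, not a recommendation from any single paper.

```python
# Replay-based batch mixing: a fixed fraction of each CP batch is drawn from the
# original (source) pretraining corpus; the rest comes from the new target corpus.
import random

def mixed_batch(target_corpus, source_corpus, batch_size=32, replay_ratio=0.2):
    """Sample a CP batch with `replay_ratio` of its examples replayed from the source."""
    n_replay = int(round(batch_size * replay_ratio))
    batch = random.choices(source_corpus, k=n_replay) + \
            random.choices(target_corpus, k=batch_size - n_replay)
    random.shuffle(batch)
    return batch
```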

2.4 Loss Schedules and Regularization

  • LP-FT (Linear-Probe then Fine-Tune): A two-stage recipe in which the head is adapted first and the backbone is fine-tuned afterwards. This schedule reduces destructive drift, preserves past representations, and attains state-of-the-art open-domain performance with no explicit regularizer (Sun et al., 2023); a schematic sketch follows this list.
  • Contrastive Regularization, Pseudo-Feature Replay: Explicitly regularizing new-task feature distances with past class centroids or replayed statistics, often combined with prompt-based PEFT (Wang et al., 2023).
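A schematic of the LP-FT schedule from the first bullet is given below; the `model.backbone`/`model.head` attribute names, the `train_step` callback, step counts, and learning rates are assumed placeholders, not the cited configuration.

```python
# Schematic LP-FT (linear probe, then fine-tune) schedule.
import torch

def lp_ft(model, train_step, probe_steps=1000, ft_steps=10000):
    # Stage 1: linear probing -- backbone frozen, only the head adapts.
    for p in model.backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
    for _ in range(probe_steps):
        train_step(model, opt)

    # Stage 2: full fine-tuning from the probed solution, at a lower LR to limit drift.
    for p in model.backbone.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for _ in range(ft_steps):
        train_step(model, opt)
```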

3. Empirical Findings and Scaling Laws

Continual Pretraining yields consistent, empirical benefits across modalities, tasks, and languages:

  • Language and Multimodal: CP narrows the performance gap for low- and mid-resource languages, amplifies scientific and reasoning skills, and enhances in-context learning emergence with proper data replay (Elhady et al., 30 May 2025, Chen et al., 2024, Chen et al., 2024).
  • Vision and Geospatial: Multi-objective continual pretraining on compact-yet-diverse datasets with loss components (e.g., feature-map distillation) achieves superior accuracy on change detection, scene classification, segmentation, and super-resolution vs. scratch or domain-finetuned baselines with 8–10× lower resource/carbon cost (Mendieta et al., 2023).
  • Downstream Task Impact: Empirical benchmarks in biomedical NLP (Yan et al., 2022), multilingual adaptation (Li et al., 5 Apr 2025), recommendation (Ma et al., 11 Apr 2025), and finance (Ueda et al., 4 Nov 2025) confirm that CP can increase downstream metric scores by 0.6–12 absolute points, frequently at reduced compute cost and parameter updates.
  • Scaling Law Extension: The extended Chinchilla law for CP introduces an $N^{-\gamma}$ term capturing how the transfer benefit scales with model size ($\gamma > 0$), so that CP models reach the same validation loss with 25–50% fewer FLOPs than training from scratch (Zheng et al., 2024); an illustrative form appears after this list.
  • Stability Gap Phenomenon: Under severe distribution shift, models exhibit an initial “V-shaped” dip and recovery in accuracy, with parameter drift magnitude tightly linked to retention of prior (especially in-context) abilities (Guo et al., 2024, Elhady et al., 30 May 2025).
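For illustration only, a Chinchilla-style parameterization augmented with such a transfer term could take the following shape (the exact functional form and fitted constants in Zheng et al., 2024 may differ):

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} - \frac{C}{N^{\gamma}}, \qquad \gamma > 0,$$

where $N$ is the parameter count, $D$ the number of continual-pretraining tokens, and the final term represents the transfer contribution inherited from the original pretraining run.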

4. Catastrophic Forgetting and Mitigation Mechanisms

A central challenge in CP is catastrophic forgetting, i.e., the loss of previously acquired knowledge on tasks or domains not currently observed during CP:

  • Replay and Data Mixing: Injecting 5–30% tokens from the pretraining (source) distribution during CP (either as full examples or via scheduled mixing in each batch) essentially eliminates catastrophic forgetting and preserves emergent abilities (e.g., ICL, multilingual generalization) at negligible compute overhead (Elhady et al., 30 May 2025, Zheng et al., 2024, Guo et al., 2024).
  • Parameter Update Restriction: Limiting updates to adapters, LoRA layers, or only a subset of model layers drastically reduces the drift of parameters most responsible for global representation, limiting forgetting while still allowing plasticity to new domains (Yan et al., 2022, Yoo et al., 7 Jan 2026, Ma et al., 11 Apr 2025).
  • EMA and Smoothing: An exponential moving average of model weights (e.g., $\beta = 0.92$) stabilizes the rapid parameter trajectory in the early “critical period,” preserving in-context learning and downstream accuracy without requiring any pretraining-data replay (Elhady et al., 30 May 2025); a short implementation sketch follows this list.
  • Instruction Templating: For chat or RLHF-aligned LLMs, wrapping new-language tokens in original instruction templates (e.g., putting all target-language output in the “assistant” slot) prevents conversational collapse and preserves safety/alignment without architectural changes (Chen et al., 2024).
  • Multi-Epoch High-Quality Sub-Corpus: Iterative training on a small, high-quality subset for several epochs accelerates recovery from the stability gap and improves final performance with reduced compute and data requirements (Guo et al., 2024, Vo et al., 2024).
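The EMA mechanism from the third bullet above is simple to implement; in the sketch below, beta = 0.92 follows the value quoted above, while the update frequency and the policy of evaluating the smoothed copy are illustrative choices.

```python
# Exponential moving average (EMA) of model weights during CP.
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, beta=0.92):
    """ema <- beta * ema + (1 - beta) * current weights; call after each optimizer step."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(beta).add_(p, alpha=1.0 - beta)

# Usage sketch:
# ema_model = copy.deepcopy(model).eval()   # smoothed copy used for evaluation
# ...
# optimizer.step(); update_ema(ema_model, model)
```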

5. Domain-Specific and Emerging Applications

Continual Pretraining is foundational for a range of key applications:

| Application | CP Approach | Empirical Impact |
| --- | --- | --- |
| Multilingual LLM Adaptation | Replay/EMA, code-mixed data, curriculum | +6.2% (qual.), +9.1% (classif. gains, low-resource) |
| Scientific/Reasoning LLMs | Synthetic QA pairs, multi-topic mixture | +12% on MATH, +4.13% on SciEval, adaptive CoT depth |
| Domain-Specific LMs/BioMed | Adapter-based CP, task-specific masking | +2.0% on CBLUE, 11% rel. reduction in forgetting |
| Recommendation Systems | All-domain CP, phase-wise mixing, prompt-tuning | +8.0% HR@1 over prior LLM baselines |
| Vision (Geospatial, Gen.) | Multi-objective (MIM, feature distillation) | Up to 14% F1 gain (OSCD), 3–10× less compute |

CP methodologies generalize to any context where domain knowledge, style, or behavior must be acquired efficiently and robustly, including recommendation, QA, translation, biomedical text, and even uncertainty quantification for self-evolving LLMs (Zhou et al., 27 Oct 2025).

6. Practical Considerations and Best Practices

Empirical evidence and theoretical analysis yield several best practices:

  • Mixture Ratios and Replay: Empirically, replaying 10–30% source data stabilizes CP, preserves general capabilities, and efficiently supports cross-lingual and cross-domain transfer (Zheng et al., 2024, Guo et al., 2024).
  • Adapter or Layer Selection: Limiting updates to adapters or to first/last transformer layers (ELO) ensures fast, memory-efficient CP and minimal forgetting; especially warranted for resource-constrained or large-scale settings (Yoo et al., 7 Jan 2026, Yan et al., 2022).
  • Curriculum and Synthetic Data Scheduling: Scheduling easy-to-hard progression in new-domain data, and carefully balancing high-quality synthetic QA, outperforms random or discipline-split scheduling (Chen et al., 2024).
  • Learning Rate Rewarming and Scheduling: Warm-up length is not critical (typically ≤1% total tokens), but appropriate selection of maximum LR and its decay is essential for controlling the plasticity-stability tradeoff (Gupta et al., 2023).
  • Generalization Assessment: Forgetting metrics, such as the drop in validation accuracy on the original domain or ICL/MCQA emergence, should be used to track CP stability; both kNN/layerwise probes and full fine-tuning should be evaluated for robust analysis (Sun et al., 2023, Cossu et al., 2022).
  • Merging Multiple CP Experts: When building multi-skill LLMs, start by merging each CP expert with the base, then merge cross-domain experts via sparse and sign-consistent methods (TIES, TA), and always monitor model similarity to prevent divergence (Ueda et al., 4 Nov 2025).
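As a rough sketch of the sign-consistent merging step, the function below follows the general trim / elect-sign / disjoint-merge recipe popularized by TIES, operating on flattened parameter vectors; the trim fraction and scaling coefficient are illustrative, not the configuration used in the cited work.

```python
# Simplified TIES-style merge of CP experts against a shared base model,
# on flattened 1-D parameter tensors of identical shape.
import torch

def ties_merge(base, experts, keep=0.2, lam=1.0):
    task_vectors = torch.stack([e - base for e in experts])              # (K, P)

    # Trim: keep only the largest-magnitude `keep` fraction of each task vector.
    num_params = task_vectors.shape[1]
    k = max(1, int(keep * num_params))
    thresh = task_vectors.abs().kthvalue(num_params - k + 1, dim=1).values
    trimmed = torch.where(task_vectors.abs() >= thresh.unsqueeze(1),
                          task_vectors, torch.zeros_like(task_vectors))

    # Elect a sign per parameter, then average only entries that agree with it.
    elected = torch.sign(trimmed.sum(dim=0))
    agree = (torch.sign(trimmed) == elected.unsqueeze(0)) & (trimmed != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    merged_task_vector = (trimmed * agree).sum(dim=0) / counts

    return base + lam * merged_task_vector
```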

7. Limitations and Open Problems

Open challenges and emerging areas include:

  • Scaling to Larger Architectures: Most sophisticated ablations (e.g., hierarchical objective decompositions, algorithmic merging) have yet to be fully validated on the latest trillion-parameter models (Wang et al., 2023, Ueda et al., 4 Nov 2025).
  • Catastrophic Forgetting Under Severe Shift: Full theoretical characterization of parameter trajectory limits, especially under adversarial or highly divergent target data, remains open (Elhady et al., 30 May 2025, Guo et al., 2024).
  • Emergent Generalization and Stability/Plasticity Control: Tradeoffs between over-constraining and under-constraining parameter drift are currently tuned empirically; adaptive schedules and theoretical guarantees are subjects of ongoing research (Elhady et al., 30 May 2025, Ueda et al., 4 Nov 2025).
  • Integration with Downstream Continual Learning: Jointly optimizing upstream CP and downstream continual learning (e.g., class-incremental scenarios) could yield further improvements but is rarely explored (Wang et al., 2023).

Practitioners are encouraged to combine CP with complementary strategies, such as self-supervised objectives, data curriculum, adapter-based PEFT, and principled data mixing, while actively monitoring drift and downstream performance metrics on both original and novel domains to ensure robust, efficient, and scalable foundation model evolution.
