Continual Pretraining (CPT) Overview

Updated 10 December 2025
  • Continual Pretraining (CPT) is a paradigm where pretrained foundation models are continually updated via self-supervised objectives on new data, integrating domain-specific knowledge while preventing catastrophic forgetting.
  • By interleaving unsupervised adaptation with downstream specialization, CPT leverages techniques like replay buffers and distillation to efficiently balance retaining general capabilities with learning new tasks.
  • Empirical advances show that CPT improves domain adaptation, language expansion, and scaling efficiency, reducing compute costs by up to 60% compared to full model retraining.

Continual Pretraining (CPT) is a paradigm wherein the parameters of a pretrained foundation model—across vision, language, and multimodal domains—are continually updated via self-supervised (or composite) objectives on new data distributions, domains, or modalities. In contrast to frozen-feature transfer or isolated fine-tuning, CPT interleaves ongoing unsupervised or weakly supervised adaptation before or alongside downstream specialization. This approach aims to integrate new skills or knowledge while mitigating catastrophic forgetting, enabling efficient domain adaptation, language expansion, and multi-task model evolution at a fraction of the cost and environmental impact associated with full retraining.

1. Formal Definition and Core Objectives

CPT reuses parameters $\theta_0$ from a pretrained model—such as an LLM, vision transformer, or speech model—and continues optimizing a self-supervised loss over a new corpus $D_{\text{CPT}}$. The canonical CPT loss in language is

$$L_{\text{CPT}}(\theta) = -\mathbb{E}_{x \sim D_{\text{CPT}}}\, \sum_{t=1}^{|x|} \log P_\theta(x_t \mid x_{<t})$$

for next-token prediction, or its analogs for masked-token or masked-image objectives. In foundation models, CPT is typically applied to incrementally encode new distributions (e.g., domain-specific text, satellite images, classroom audio, underrepresented languages) while seeking to preserve general capabilities from the original pretraining (Mendieta et al., 2023, Wang et al., 5 Oct 2024, Wu et al., 2 Feb 2024, Elhady et al., 30 May 2025, Zheng et al., 2 Jul 2024, Shi et al., 24 Feb 2025, Attia et al., 15 May 2024).
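
A minimal sketch of a single CPT step under this objective, assuming a HuggingFace causal language model; the checkpoint name, learning rate, and example text are illustrative placeholders rather than settings from any cited paper:

```python
# Minimal continual-pretraining step: resume next-token prediction on new-domain text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any pretrained checkpoint providing theta_0
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative learning rate

# Placeholder stand-in for a document drawn from the new corpus D_CPT.
batch = tokenizer(["Example text sampled from the new domain corpus."], return_tensors="pt")

# Passing labels=input_ids makes the model compute the shifted next-token loss,
# i.e. the mean of -log P_theta(x_t | x_{<t}) over the sequence.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```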

CPT is the first and central stage of the modern continual learning framework for large models, typically preceding (i) continual instruction tuning (supervised adaptation) and (ii) continual alignment (preference/behavioral reinforcement), as formalized in (Wu et al., 2 Feb 2024).

2. Canonical Methodologies and Loss Structures

CPT encompasses a spectrum of formulations, often combining multiple objectives:

  • Pure self-supervised continuation: Resume the original masked modeling or autoregressive objective on domain/target data (e.g., masked language modeling for LLMs, contrastive/quantization objectives for SSL speech models).
  • Multi-objective CPT: Incorporate auxiliary loss terms for distillation or feature matching, as in geospatial foundation modeling, where a feature-distillation loss (matching partial-context student to teacher features) is combined with masked image modeling (Mendieta et al., 2023).
  • Hybrid architectures: For cross-modal adaptation (e.g., text-to-speech LLMs), CPT may interleave modalities (joint next-token objectives over both original tokens and new representations, such as codec-based audio tokens), with explicit data-mix or multi-task sampling (Shi et al., 24 Feb 2025).
  • Regularization and anti-forgetting: To prevent the erosion of general capacities, CPT methods may use one or more of: (a) replay buffers (generative or sampled) from older distributions; (b) distillation-based regularizers enforcing similarity of logits or intermediate activations to pre-CPT models; (c) parameter-isolation or masking (adapters/LoRA/layer-freeze) (Wu et al., 2 Feb 2024, Jiang et al., 15 Jul 2024, Attia et al., 15 May 2024, Nag et al., 13 Dec 2024).
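
As a concrete instance of the replay strategy in the last item, the sketch below mixes samples replayed from the original pretraining distribution into each new-domain batch at a fixed ratio; the corpora, batch size, and ratio are illustrative placeholders:

```python
import random

def mixed_cpt_batches(new_corpus, replay_buffer, batch_size=8, replay_ratio=0.25, seed=0):
    """Yield CPT batches that interleave new-domain documents with samples replayed
    from the original pretraining distribution (sizes and ratio are illustrative)."""
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_ratio)   # e.g. 2 of 8 documents come from replay
    n_new = batch_size - n_replay
    new_docs = list(new_corpus)
    rng.shuffle(new_docs)
    for start in range(0, len(new_docs), n_new):
        batch = new_docs[start:start + n_new]
        batch += rng.sample(replay_buffer, min(n_replay, len(replay_buffer)))
        rng.shuffle(batch)
        yield batch

# Usage: every yielded batch is fed to the same self-supervised CPT loss as usual.
domain_corpus = [f"domain document {i}" for i in range(100)]     # placeholder D_CPT
general_replay = [f"general document {i}" for i in range(1000)]  # placeholder replay buffer
first_batch = next(mixed_cpt_batches(domain_corpus, general_replay))
```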

The general joint CPT objective can be written as

$$L_{\text{CPT-total}}(\theta) = L_{\text{new}}(\theta; D_{\text{CPT}}) + \lambda \cdot R(\theta, \theta_{\text{old}})$$

where $L_{\text{new}}$ is the task objective on new data, $R$ is a replay or regularization term (Kullback-Leibler, quadratic, or feature-based), and $\lambda$ is a tradeoff coefficient (Wu et al., 2 Feb 2024, Jiang et al., 15 Jul 2024).
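
A hedged PyTorch sketch of this joint objective, instantiating $R$ as a Kullback-Leibler distillation term toward a frozen copy of the pre-CPT model; the checkpoint, $\lambda$, and the batch are placeholders:

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # placeholder pretrained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)   # theta, updated by CPT
teacher = copy.deepcopy(student).eval()                      # frozen theta_old
for p in teacher.parameters():
    p.requires_grad_(False)

lam = 0.5                                 # illustrative tradeoff coefficient lambda
batch = tokenizer(["New-domain text for CPT."], return_tensors="pt")

out = student(**batch, labels=batch["input_ids"])
l_new = out.loss                          # L_new: next-token loss on D_CPT

with torch.no_grad():
    teacher_logits = teacher(**batch).logits
# R: KL(P_teacher || P_student) over the vocabulary at every position.
r_term = F.kl_div(
    F.log_softmax(out.logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)
loss = l_new + lam * r_term               # L_CPT-total = L_new + lambda * R
loss.backward()
```

In practice the same combined loss is accumulated over many new-domain batches, with $\lambda$ tuned against held-out general-domain performance.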

3. Empirical Advances: Efficiency, Transfer, and Domain Adaptation

CPT provides substantial gains in efficiency and adaptability relative to full retraining.

4. Advanced Procedures and Recent Innovations

Recent literature explores sophisticated CPT extensions:

  • Multi-objective and staged CPT: The geospatial GFM paradigm employs feature-distillation alongside domain adaptation; Mix-CPT in LLMs uses logit-swap self-distillation and mixed-data objectives to couple knowledge acquisition with utilization while decoupling format alignment (Mendieta et al., 2023, Jiang et al., 15 Jul 2024).
  • Curricula and data selection: PPL-based curricula, synthetic example generation (QA, code, chain-of-thought), and corpus ranking/fragmentation analyses allow for targeted, efficient CPT, especially for low-resource languages and scientific/technical domains (Chen et al., 26 Jul 2024, Nag et al., 13 Dec 2024, Ishibashi et al., 15 May 2025); a perplexity-scoring sketch follows this list.
  • Path-switching and version-updating: Hierarchical CPT schedules, where a mainline is trained with a high learning rate to produce a robust “flat” initialization and per-update branches decay toward convergence, optimize both performance and training time in sequential LLM versioning (Wang et al., 5 Oct 2024).
  • Scaling law control: D-CPT Law and its cross-domain generalization provide closed-form predictions of general and domain loss as functions of model size, data size, and mixture ratio, supporting rapid hyperparameter selection across new domains with minimal pilot data (Que et al., 3 Jun 2024).
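
The perplexity-based selection mentioned above can be sketched as follows: score candidate documents with the base model's per-token loss and keep those inside a chosen perplexity band. The band limits, model, and corpus below are illustrative assumptions, not values from the cited papers:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Per-document perplexity under the base model (exp of mean token loss)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def select_for_cpt(candidates, low=10.0, high=200.0):
    """Keep documents whose perplexity falls inside an illustrative band:
    very low PPL adds little new signal, very high PPL is often noise."""
    scored = [(doc, perplexity(doc)) for doc in candidates]
    return [doc for doc, ppl in scored if low <= ppl <= high]

corpus = ["A candidate domain document.", "Another candidate document."]
selected = select_for_cpt(corpus)
```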

5. Mitigating Catastrophic Forgetting

CPT fundamentally seeks to mitigate catastrophic forgetting, the degradation of previously acquired general capabilities when models adapt to new data.
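
As one instance, the quadratic variant of the regularizer $R$ from Section 2 can be sketched as an L2 penalty anchoring the adapted weights to the pre-CPT parameters; the penalty strength is an illustrative placeholder, and the EWC-style importance weighting is only noted in a comment:

```python
import torch

def quadratic_anchor(model: torch.nn.Module, old_params: dict, strength: float = 1e-3):
    """Quadratic anti-forgetting regularizer: strength * sum_i (theta_i - theta_old_i)^2.
    An EWC-style variant would additionally weight each squared difference by a
    per-parameter importance estimate such as a diagonal Fisher approximation."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + ((param - old_params[name]) ** 2).sum()
    return strength * penalty

# Usage: snapshot theta_old before CPT starts, then add the penalty to the new-data loss.
model = torch.nn.Linear(16, 16)  # stand-in for a pretrained foundation model
theta_old = {name: p.detach().clone() for name, p in model.named_parameters()}
# ... run CPT updates on new-domain data here ...
regularizer = quadratic_anchor(model, theta_old)  # add to L_new before calling backward()
```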

6. Applications and Empirical Benchmarks

CPT has demonstrated gains in diverse application domains.

7. Analytic Frameworks and Future Directions

Scaling laws and analytic tools for CPT are now foundational:

  • CPT scaling laws: Closed-form models for loss trajectories as a function of model size, data size, learning rate, and training steps enable precise planning of CPT duration, mixtures, and replay (with $R^2 > 0.97$ on held-out tasks and setups) (Que et al., 3 Jun 2024, Wang et al., 12 May 2025, Zheng et al., 2 Jul 2024); an illustrative curve-fitting sketch follows this list.
  • Cross-domain prediction: “Cross-Domain D-CPT Law” accurately extrapolates the optimal mixture/loss behavior for a new target domain using minimal pilot runs (Que et al., 3 Jun 2024).
  • Efficient adaptation: Analytical criteria and grid/tuning strategies enable minimal-cost, optimal-overlap CPT for both general and domain-specific goals.
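
For illustration only, the sketch below fits a simple power-law-plus-constant loss curve to hypothetical pilot-run measurements with scipy; this toy functional form and the data points are assumptions for demonstration, not the published D-CPT Law parameterization:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_curve(d, e0, a, alpha):
    """Toy scaling form: irreducible loss e0 plus a power-law term in CPT data size d.
    An illustrative stand-in, not the parameterization of the published D-CPT Law."""
    return e0 + a * d ** (-alpha)

# Hypothetical pilot-run measurements: (CPT tokens in billions, validation loss).
tokens_b = np.array([0.1, 0.3, 1.0, 3.0, 10.0])
val_loss = np.array([2.90, 2.74, 2.61, 2.52, 2.46])

params, _ = curve_fit(loss_curve, tokens_b, val_loss, p0=[2.0, 0.5, 0.3])
e0, a, alpha = params
predicted = loss_curve(30.0, e0, a, alpha)  # extrapolate to a larger CPT budget
print(f"fitted e0={e0:.2f}, a={a:.2f}, alpha={alpha:.2f}; predicted loss at 30B tokens ~ {predicted:.2f}")
```

Fits of this kind are what make it possible to choose CPT budgets and mixture ratios from a handful of small pilot runs instead of full-scale sweeps.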

Open research directions include: development of computation-efficient (modular, sparse) architectures, dynamic detection of data distribution shift and autonomous CPT scheduling, explicit unlearning and provenance-aware adaptation, and exploration of CPT dynamics for multimodal, multi-lingual, and multi-domain fusions.


References:

Mendieta et al., 2023; Wang et al., 5 Oct 2024; Liu et al., 2021; Chen et al., 26 Jul 2024; Shi et al., 24 Feb 2025; Wu et al., 2 Feb 2024; Que et al., 3 Jun 2024; Attia et al., 15 May 2024; Nag et al., 13 Dec 2024; Zheng et al., 2 Jul 2024; Vo et al., 21 Aug 2024; Attia et al., 13 Sep 2024; Jiang et al., 15 Jul 2024; Cossu et al., 2022; Sun et al., 2023; Li et al., 5 Apr 2025; Elhady et al., 30 May 2025; Ishibashi et al., 15 May 2025; Wang et al., 12 May 2025; Ueda et al., 4 Nov 2025.
