Continual Pre-training (CPT)
- Continual Pre-training (CPT) is a method that extends pre-trained models with additional self-supervised training to boost domain adaptability while retaining learned capabilities.
- It employs techniques like experience replay, gradient alignment, and curriculum learning to mitigate catastrophic forgetting and optimize data mixtures.
- Empirical scaling laws and hyperparameter tuning guide CPT’s effective integration across multiple architectures, enhancing transfer performance and resource efficiency.
Continual Pre-training (CPT) is a paradigm for adapting LLMs and related neural models to new domains, languages, or tasks via incremental, data-efficient re-training on curated corpora. Unlike re-training from randomly initialized parameters, CPT extends a model’s pre-existing architecture and weights by further optimizing its standard self-supervised objectives on new unlabeled or weakly-labeled data. CPT is distinguished from fine-tuning by its scale (often billions of tokens), preservation of base model capacities, and general-purpose applicability across languages, modalities, and use cases. Various CPT frameworks have emerged to address domain adaptation, catastrophic forgetting, resource efficiency, and emergent capability stability.
1. Foundations and General Principles
CPT formally consists of initializing model parameters from a pre-trained checkpoint (θ₀) and continuing training with additional data (Dₜ), typically using the same self-supervised objective as for the original model. For causal LLMs, the loss remains the standard next-token objective $\mathcal{L}(\theta) = -\sum_{t}\log p_{\theta}(x_t \mid x_{<t})$, where $x_t$ denotes the current token and $p_{\theta}(x_t \mid x_{<t})$ the model’s predictive distribution (Elhady et al., 30 May 2025). This procedure leverages the encoded knowledge and linguistic features of θ₀, facilitating rapid convergence and reducing compute compared to training from scratch (Zheng et al., 2 Jul 2024).
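A minimal sketch of this procedure, assuming a HuggingFace-style causal LM; the checkpoint name, corpus, and hyperparameters are illustrative placeholders rather than settings from the cited works:

```python
# Continual pre-training sketch: resume from a pre-trained checkpoint and keep
# optimizing the standard causal-LM (next-token) loss on a new-domain corpus.
# Checkpoint name, corpus, and hyperparameters are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "gpt2"                                # stands in for any pre-trained theta_0
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.train()

new_domain_texts = ["..."]                         # curated CPT corpus D_t (unlabeled text)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def batches(texts, batch_size=8):
    for i in range(0, len(texts), batch_size):
        yield tokenizer(texts[i:i + batch_size], return_tensors="pt",
                        padding=True, truncation=True, max_length=1024)

for batch in batches(new_domain_texts):
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100    # ignore padding positions in the loss
    # With labels supplied, the model computes -sum_t log p_theta(x_t | x_<t) internally.
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```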
CPT is broadly applicable to a variety of architectures, including dense Transformers, Mixture-of-Experts (MoE) models (Thérien et al., 6 Mar 2025), LoRA-adapted SLMs (Chih et al., 2 Oct 2025), and even speech models (Shi et al., 24 Feb 2025, Attia et al., 15 May 2024). Empirical scaling laws confirm CPT’s superiority in compute efficiency and transfer performance across model sizes (from 40M to 5B parameters) (Zheng et al., 2 Jul 2024), with joint data–parameter scaling terms characterizing cross-domain or cross-lingual effectiveness.
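For orientation only: such laws typically extend a generic data–parameter form of the kind below with additional joint transfer or distribution-shift terms; the symbols and functional form here are an illustrative assumption, not the fitted laws of the cited papers.

```latex
% Generic data-parameter scaling form (illustrative assumption, not the fitted
% CPT laws of the cited papers). N = parameter count, D = CPT token count.
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
% CPT-specific variants add joint terms, e.g. replacing D with D + D_eff(N)
% to credit data "transferred" from the base model's original pre-training.
```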
2. Data Mixture Design and Catastrophic Forgetting
A hallmark challenge in CPT is the stability–plasticity dilemma: learning new knowledge while retaining previously acquired skills. Catastrophic forgetting—where adaptation to new distributions erases latent capabilities such as in-context learning (ICL)—has been extensively documented.
Mitigation approaches include:
- Experience replay: Maintain and sample a buffer of past examples during CPT; mixing a fraction α of old data (often 25–40%) into each batch drastically reduces forgetting with minimal compute overhead (Abbes et al., 3 Aug 2025, Zheng et al., 2 Jul 2024, Thérien et al., 6 Mar 2025); a minimal mixing sketch follows this list.
- Gradient alignment/meta-experience replay (MER): Encourage positive dot-products between old and new data gradients, implemented with Reptile-style outer-loop updates (Abbes et al., 3 Aug 2025).
- Curriculum learning: In domain or cross-lingual CPT, introduce general corpus (e.g. English) early in training before phasing in new data, stabilizing parameter shifts that would otherwise erase emergent abilities (Elhady et al., 30 May 2025).
- Exponential Moving Average (EMA): Regularize CPT by applying EMA to model weights, controlling the magnitude and rate of parameter drift (Elhady et al., 30 May 2025).
- Self-distillation: Preserve the original model’s distribution via logit swap self-distillation, balancing adaptation and knowledge retention (Jiang et al., 15 Jul 2024).
- Data-centric regularization: For speech tasks, explicitly include a fraction of text-only data in each batch to stabilize representations (Shi et al., 24 Feb 2025).
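As noted in the experience-replay item above, a minimal sketch of batch-level replay mixing, assuming an in-memory buffer of original pre-training examples; the 30% rate and all names are illustrative:

```python
# Batch-level experience replay: each CPT batch mixes a fraction `alpha` of
# examples drawn from a buffer of original pre-training data with new-domain
# examples. The 0.3 rate and all names are illustrative.
import random

def replay_batches(new_data, replay_buffer, batch_size=32, alpha=0.3, seed=0):
    """Yield batches in which roughly `alpha` of the examples come from the replay buffer."""
    rng = random.Random(seed)
    n_replay = int(round(alpha * batch_size))
    n_new = batch_size - n_replay
    new_iter = iter(new_data)
    while True:
        try:
            fresh = [next(new_iter) for _ in range(n_new)]
        except StopIteration:
            break                                   # stop when the new-domain stream is exhausted
        old = rng.sample(replay_buffer, n_replay)   # buffer must hold >= n_replay examples
        batch = fresh + old
        rng.shuffle(batch)
        yield batch

# Usage: for batch in replay_batches(domain_corpus, pretrain_slice): train_step(batch)
```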
Catastrophic forgetting is empirically quantified by increased validation loss on the original corpus, reduced downstream accuracy on tasks measuring generalization, and disruption of emergent abilities such as ICL—even when target-language perplexity continues improving (Elhady et al., 30 May 2025).
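A minimal sketch of the first of these measurements, assuming HuggingFace-style models; the helper name and evaluation protocol are illustrative, not a metric defined in the cited papers:

```python
# Quantify forgetting as the increase in held-out loss on the ORIGINAL corpus
# after CPT. `base_model` and `cpt_model` are the checkpoints before and after CPT.
import torch

@torch.no_grad()
def mean_nll(model, tokenizer, texts, max_length=1024):
    """Average next-token loss of `model` over a list of held-out texts."""
    model.eval()
    total, n = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        total += model(**enc, labels=enc["input_ids"]).loss.item()
        n += 1
    return total / max(n, 1)

# forgetting > 0 means the CPT model has drifted on the original distribution:
#   nll_base  = mean_nll(base_model, tokenizer, original_heldout_texts)
#   nll_cpt   = mean_nll(cpt_model, tokenizer, original_heldout_texts)
#   forgetting = nll_cpt - nll_base      # exp(forgetting) is the perplexity ratio
```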
3. Scaling Laws and Hyperparameter Optimization
Performance during CPT is governed by a small set of scaling laws that unify pre-training, domain adaptation, and transfer regimes. Notable formulations (Wang et al., 12 May 2025, Que et al., 3 Jun 2024, Gu et al., 24 Jul 2024):
- CPT transfer loss curve: the transfer loss during CPT is modeled as a function of the learning-rate schedule (its cumulative annealing area) and the magnitude of the distribution shift between the original and CPT data distributions.
- Domain-specific mixture laws: Predict held-out loss as a parametric function L(N, D, r) of model size N, dataset size D, and domain/general mixture ratio r, with coefficients fitted from a small number of short runs. This enables efficient selection of optimal mixture ratios for domain adaptation without exhaustive grid search (Que et al., 3 Jun 2024); an illustrative fitting sketch follows this list.
- Critical Mixture Ratio (CMR) scaling: For a fixed compute budget and model size, the optimal ratio of domain data (the CMR) follows a power law in the token budget and model parameter count (Gu et al., 24 Jul 2024). For LLMs in the 0.5–3.1B parameter range, training at the fitted CMR maximizes domain transfer without excessive loss of general ability.
- Replay vs. Model Size: Small replay rates (25–30%) are more effective than doubling model size, especially for models under 1B parameters (Abbes et al., 3 Aug 2025). Excessive replay (>50%) incurs diminishing returns.
- Learning Rate Schedules: Starting checkpoints should retain high loss potential (i.e., be taken at a high LR, preserving “plasticity”) for flexible adaptation, while a complete LR decay curve during CPT itself is critical for optimal convergence. “Path switching” paradigms maintain branched updates for efficient version management (Wang et al., 5 Oct 2024).
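As referenced in the mixture-law item above, a hedged sketch of the fit-then-select workflow: measure validation losses from a few short pilot runs at different domain ratios, fit simple parametric curves, and pick the ratio minimizing a weighted objective. The power-law form, the synthetic loss values, and the 0.7/0.3 weighting are illustrative assumptions, not the exact laws of the cited papers:

```python
# Fit simple parametric curves to validation losses measured at a few domain
# mixture ratios, then select the ratio that trades off domain vs. general loss.
# Functional form, loss values, and the 0.7/0.3 weighting are illustrative.
import numpy as np
from scipy.optimize import curve_fit

ratios = np.array([0.1, 0.25, 0.5, 0.75, 0.9])           # domain-data fractions in pilot runs
loss_domain = np.array([2.41, 2.18, 2.02, 1.95, 1.93])   # validation loss on domain data
loss_general = np.array([1.71, 1.76, 1.88, 2.05, 2.21])  # validation loss on general data

def power_law(x, k, a, c):
    # L(x) ~ k * x^(-a) + c : loss decays with the fraction of the relevant data
    return k * np.power(x, -a) + c

p_dom, _ = curve_fit(power_law, ratios, loss_domain, p0=(0.5, 0.5, 1.5), maxfev=10000)
p_gen, _ = curve_fit(power_law, 1.0 - ratios, loss_general, p0=(0.5, 0.5, 1.5), maxfev=10000)

grid = np.linspace(0.05, 0.95, 181)
objective = 0.7 * power_law(grid, *p_dom) + 0.3 * power_law(1.0 - grid, *p_gen)
best_ratio = grid[np.argmin(objective)]
print(f"predicted best domain mixture ratio ≈ {best_ratio:.2f}")
```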
4. Specialized CPT Regimens and Practical Pipelines
Cross-lingual CPT
In cross-lingual adaptation, CPT outperforms training from scratch, with scaling laws predicting optimal data–parameter allocation. Mixing a small buffer (5–30%) of source language data mitigates catastrophic loss of base language capabilities without compute penalty (Zheng et al., 2 Jul 2024, Elhady et al., 30 May 2025).
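A minimal sketch of such a curriculum-style mixing schedule (cf. the curriculum-learning item in Section 2): the source-language (general) fraction starts high and is annealed toward a small floor as training progresses; the linear schedule and its endpoints are illustrative assumptions:

```python
# Curriculum mixing for cross-lingual CPT: start with a high fraction of
# source-language (general) data and anneal it toward a small floor, so early
# updates stay close to the base distribution. Endpoints are illustrative.
import random

def general_fraction(step, total_steps, start=0.5, floor=0.1, warm_frac=0.4):
    """Linearly anneal the general-data fraction over the first `warm_frac`
    of training, then hold it at `floor`."""
    warm_steps = max(int(warm_frac * total_steps), 1)
    if step >= warm_steps:
        return floor
    progress = step / warm_steps
    return start + (floor - start) * progress

def pick_corpus(step, total_steps, rng):
    """Decide, per example, whether to sample from the general or target corpus."""
    return "general" if rng.random() < general_fraction(step, total_steps) else "target"

# Usage: rng = random.Random(0); corpus = pick_corpus(step, total_steps, rng)
```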
Domain Adaptation
Domain adaptation regimens incorporate both domain and general data, governed by predictive scaling laws (Que et al., 3 Jun 2024). Mix-CPT (knowledge mixture CPT + format alignment) decouples knowledge memorization from instruction alignment, improving performance on both domain and general tasks through staged training and self-distillation (Jiang et al., 15 Jul 2024).
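A hedged sketch of the self-distillation ingredient: a frozen copy of the base model acts as teacher, and a KL term toward its output distribution on general data is added to the domain LM loss. This is a generic self-distillation regularizer for illustration, not the specific logit-swap formulation of Mix-CPT:

```python
# Generic self-distillation regularizer for CPT: keep the adapted model's
# predictive distribution on general-domain text close to the frozen base
# model's, while optimizing the LM loss on domain text. The beta weight is
# an illustrative choice.
import torch
import torch.nn.functional as F

def cpt_loss_with_self_distillation(student, frozen_teacher,
                                    domain_batch, general_batch, beta=0.5):
    # Standard causal-LM loss on the new-domain batch.
    lm_loss = student(**domain_batch, labels=domain_batch["input_ids"]).loss

    # KL(teacher || student) on general-domain text to limit distributional drift.
    with torch.no_grad():
        teacher_logits = frozen_teacher(**general_batch).logits
    student_logits = student(**general_batch).logits
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(teacher_logits, dim=-1),
                  reduction="batchmean", log_target=True)

    return lm_loss + beta * kl
```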
Multimodal CPT
Speech LLMs and vision-LLMs benefit from CPT by mixing text-only and modality-specific samples, stabilizing their latent linguistic reasoning while enabling new modality synthesis or recognition (Shi et al., 24 Feb 2025, Cossu et al., 2022).
Efficient Adaptation with Resource Constraints
Efficient CPT for low-resource languages employs heuristic or statistically scored subset selection of corpus data and judicious vocabulary augmentation to achieve near full-CPT gains with orders-of-magnitude less compute (Nag et al., 13 Dec 2024, Chih et al., 2 Oct 2025). LoRA-style adapter-only optimization and aggressive batching further enable CPT on commodity hardware.
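A minimal sketch of adapter-only CPT with the peft library, where only low-rank adapter weights are trained; the checkpoint name, target modules, and rank are illustrative placeholders, not the configurations of the cited papers:

```python
# Adapter-only CPT: wrap a pre-trained causal LM with LoRA adapters so that
# only a small set of low-rank parameters is updated during CPT.
# Checkpoint name, target modules, and rank are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # placeholder checkpoint

lora_cfg = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,                          # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # typically well under 1% of total parameters

# The CPT loop itself is unchanged: feed new-domain batches with labels=input_ids
# and optimize only the adapter parameters; the base weights remain frozen.
```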
Specialized Tasks
Agentic CPT synthesizes tool use and planning trajectories for agentic LLMs, yielding strong performance in multi-step benchmarks when agentic data are available before post-training (Su et al., 16 Sep 2025). All-domain CPT for recommendation aligns LLM predictions with user behavior by mixing domain-specific and all-domain behavioral sequences and scheduling the transition through tailored learning rate curves (Ma et al., 11 Apr 2025).
5. Evaluation, Robustness, and Ablations
Empirical evaluation of CPT pipelines encompasses:
- Perplexity on domain and general held-out sets
- Downstream accuracy on language comprehension, reasoning, recommendation, or generation tasks
- In-context learning benchmarks (e.g., Copain), measuring emergent abilities and generalization (Elhady et al., 30 May 2025)
- Modality-specific metrics (WER for ASR, BLEU for MT, HR@k for recommendation)
- Catastrophic forgetting quantification through retention metrics, average forgetting, linear-probe and fine-tuning adaptation, and CKA analysis of representation drift (Cossu et al., 2022, Sun et al., 2023)
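For the representation-drift analysis in the last item, a minimal sketch of the standard linear-CKA computation on hidden states extracted from the base and CPT checkpoints for the same probe inputs (feature extraction is assumed to be done separately):

```python
# Linear CKA between two representation matrices X and Y (same n probe inputs,
# hidden dimensions may differ), used to quantify representation drift between
# the base checkpoint and the CPT checkpoint.
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) hidden states for the same n inputs."""
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

# Values near 1 indicate that CPT preserved a layer's representation geometry;
# a sharp drop at some layer localizes where drift (and potential forgetting) occurs.
```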
Ablation experiments confirm that early introduction of general corpus, replay, EMA, self-distillation, and curriculum scheduling reduce catastrophic forgetting and enhance overall robustness. In contrast, vanilla CPT (domain-only) often leads to catastrophic loss of generalization and emergent abilities (Elhady et al., 30 May 2025, Zheng et al., 2 Jul 2024).
6. Implementation Recommendations and Limitations
Extensive multi-paper synthesis yields the following guidelines for CPT pipelines:
- For cross-lingual or domain adaptation, mix 20–50% general/base data early in CPT; phase out once stability is established (Elhady et al., 30 May 2025, Zheng et al., 2 Jul 2024).
- Always include replay or EMA to regularize parameter drift, especially to preserve emergent abilities; a minimal EMA sketch follows this list.
- Profile and fit scaling laws from short CPT runs under target learning rate schedules; leverage predictive formulas to select optimal data mixture, replay, LR, and token budgets (Wang et al., 12 May 2025, Que et al., 3 Jun 2024, Gu et al., 24 Jul 2024).
- Monitor retention on pre-training tasks throughout CPT to quantify drift and adjust data mix, LR schedule, or sampling strategy.
- Adapter-only CPT and efficient subset selection enable rapid adaptation with limited hardware or data (Chih et al., 2 Oct 2025, Nag et al., 13 Dec 2024).
- For agentic and multi-domain tasks, synthesize high-quality demonstration trajectories or behavioral sequences and jointly optimize for both memorization and utilization signals (Su et al., 16 Sep 2025, Ma et al., 11 Apr 2025).
- Overly long CPT, excessive replay, or corrupted synthetic data degrade performance; balance the quality, difficulty, and proportion of curated datasets.
- Model architecture, base checkpoint selection, and learning rate schedule shape the attainable loss potential and adaptability for future CPT (Wang et al., 5 Oct 2024, Cossu et al., 2022).
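As referenced in the replay/EMA guideline above, a minimal sketch of weight EMA during CPT: a shadow copy of the parameters is updated after each optimizer step and can serve as the evaluation or release checkpoint; the decay value is an illustrative choice:

```python
# Exponential moving average (EMA) of model weights during CPT: maintain a
# shadow copy that tracks the parameters with decay `d`, limiting the rate of
# effective parameter drift. The 0.999 decay is an illustrative choice.
import copy
import torch

class WeightEMA:
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage inside the CPT loop:
#   ema = WeightEMA(model)
#   ... optimizer.step(); ema.update(model)
# Evaluate or release `ema.shadow` to obtain the drift-regularized weights.
```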
Limitations remain in scaling CPT protocols to 100B+ parameters or highly specialized domains, in tuning dynamic mixture schedules, in automating curriculum design, and in developing formal theory relating mixture ratio, model size, and training duration. Empirical guidelines are validated primarily in the 0.5–30B parameter, 10–100B token regime.
7. Outlook and Research Directions
Current CPT research focuses on:
- Generalizing scaling laws and mixture models to larger and more diverse architectures and domains
- Automating mixture ratio, replay, and curriculum scheduling for efficient, robust continual model updating
- Extending CPT frameworks to multimodal, agentic, and sparse-gated architectures (MoE) with formal guarantees of sample efficiency and routing stability
- Quantitative study of catastrophic forgetting and retention of emergent abilities during CPT in low-data and cross-lingual scenarios
- Development of adapters, pruning, and tokenization expansion strategies for resource-constrained adaptation
CPT remains a foundational framework for LLM adaptation, robust LLM updating, and the emergence and preservation of specialized and generalist capabilities. Empirical advances in curriculum design, scaling law fitting, self-distillation, and modular adaptation continue to shape best practices for research and production pipelines (Elhady et al., 30 May 2025, Zheng et al., 2 Jul 2024, Abbes et al., 3 Aug 2025, Que et al., 3 Jun 2024).