
Continual Pre-Training in Deep Learning

Updated 24 October 2025
  • Continual pre-training extends the initial deep learning training cycle, using self-supervised techniques to update models with new data.
  • Self-supervised approaches such as MoCo‑V2, Barlow Twins, and SwAV enable robust feature generalization and improved accuracy in low-label regimes.
  • Integration with continual learning algorithms such as REMIND and online softmax with replay enables effective adaptation and mitigates catastrophic forgetting across evolving data streams.

A continual pre-training phase extends the initial pre-training cycle of a deep learning model, typically invoked when new classes, domains, or data become available after the initial model has been trained offline. In continual and online learning scenarios, this phase—sometimes called "continued pre-training" or "further pre-training"—is essential to foster generalizable representations, mitigate catastrophic forgetting, and optimize knowledge transfer when adapting to evolving data streams or non-stationary distributions.

1. Self-Supervised Approaches for Continual Pre-Training

The continual pre-training phase often leverages self-supervised learning (SSL) algorithms, replacing or augmenting the standard supervised stage. The core motivation is that SSL can yield more transferable features—especially when pre-training data is limited. Prominent evaluated approaches include:

  • MoCo-V2: Performs contrastive learning via a momentum encoder and a dynamic dictionary of keys. The loss pulls augmented "query-key" pairs from the same image together while pushing the query away from all other keys in the dictionary. MoCo-V2 adds a projection head and more aggressive augmentations (such as blurring) over the original MoCo.
  • Barlow Twins: Processes paired augmentations through siamese networks, optimizing a cross-correlation objective toward the identity matrix. The loss,

$$\text{Loss} = \sum_i (1 - C_{i,i})^2 + \lambda \sum_{i} \sum_{j \neq i} (C_{i,j})^2,$$

simultaneously enforces invariance along diagonal terms and redundancy reduction across off-diagonals.

  • SwAV: Eschews pairwise contrastive mechanisms by assigning cluster labels (“prototypes”) to augmented views and leveraging a swapped prediction mechanism. Clustering-based SSL methods like SwAV sidestep the need for massive negative sample queues.

These strategies are shown to be pivotal in extracting representations that generalize beyond the initial labeled subset.
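As a concrete illustration, the Barlow Twins objective above can be sketched in a few lines of numpy. This is a simplified sketch of the loss alone, assuming batch-standardized embeddings; the actual method trains siamese ConvNets with a projector head via backpropagation.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Barlow Twins objective on two batches of embeddings.

    z_a, z_b: (N, D) embeddings of two augmented views of the same images.
    The cross-correlation matrix C of the standardized embeddings is pushed
    toward the identity: diagonal terms -> 1 (invariance), off-diagonal
    terms -> 0 (redundancy reduction), matching the equation above.
    """
    n, _ = z_a.shape
    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = (z_a.T @ z_b) / n                                # (D, D) cross-correlation
    on_diag = np.sum((1.0 - np.diag(c)) ** 2)            # invariance term
    off_diag = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)  # redundancy term
    return on_diag + lam * off_diag
```

Identical views yield a cross-correlation close to the identity, so the loss is far smaller than for unrelated views—which is exactly the signal the objective trains on.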

2. Comparison to Supervised Pre-Training

Comprehensive experiments demonstrate that self-supervised continual pre-training can produce representations with stronger generalization, particularly when the initial labeled subset is small. In offline linear evaluation, SwAV, Barlow Twins, and MoCo-V2 repeatedly outperform supervised features on novel ImageNet classes. The gains are most pronounced when only a small portion of classes (e.g., 10 out of 1000) are used for initial representation learning, with relative top-1 accuracy improvements up to 20%; even with moderate sampling (e.g., 10–15 classes), substantial benefits persist.

As the pre-training set expands (e.g., 75–100 classes), the gap narrows and may eventually invert as the data size increases—suggesting that the main advantage of SSL occurs when supervision is scarce.
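The offline linear-evaluation protocol behind these comparisons can be sketched as follows: the backbone is frozen, its features are treated as fixed inputs, and only a linear softmax head is trained. This is a minimal numpy sketch with full-batch gradient descent; real evaluations use the full ImageNet feature set and an optimizer such as SGD with momentum.

```python
import numpy as np

def linear_probe(train_x, train_y, test_x, test_y, n_classes,
                 lr=0.1, epochs=200):
    """Offline linear evaluation: fit a softmax classifier on frozen
    features and report test accuracy. The backbone never changes;
    only the linear head (w, b) is trained."""
    d = train_x.shape[1]
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[train_y]
    for _ in range(epochs):
        logits = train_x @ w + b
        logits -= logits.max(1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(1, keepdims=True)
        grad = (p - onehot) / len(train_x)       # cross-entropy gradient
        w -= lr * train_x.T @ grad
        b -= lr * grad.sum(0)
    pred = (test_x @ w + b).argmax(1)
    return float((pred == test_y).mean())
```

Because only the linear head is fit, the resulting accuracy directly measures how linearly separable the frozen representation is—the quantity the SSL-vs-supervised comparisons report.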

3. Integration with Downstream Continual Learning Algorithms

Continued pre-training is not used in isolation but rather as a feature extractor or initialization module for online continual learning policies. The study integrates the resulting features (from supervised or SSL pre-training) into several continual learning algorithms:

| Algorithm | Features Used | Adaptation | Replay |
| --- | --- | --- | --- |
| Deep Streaming LDA (SLDA) | Frozen | LDA statistics | N/A |
| Online Softmax with Replay | Frozen | Softmax fine-tuning | Feature buffer |
| REMIND | Lower layers frozen | Higher layers plastic | PQ feature replay |
  • Deep SLDA: Features are frozen; classifiers are updated incrementally via streaming means and shared covariance.
  • Online Softmax with Replay: Maintains a replay buffer and performs softmax adaptation with sampled batches.
  • REMIND: Retains frozen lower layers; higher layers are updated using replay of compressed mid-level features (“product quantization”), allowing plasticity at higher levels while anchoring foundational representations.

This integration enables continual adaptation without revisiting the full labeled data.
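To make the Deep SLDA row concrete, the sketch below maintains per-class running means plus a single shared covariance matrix over frozen features, updated one sample at a time. This is a simplified illustration (the published method also includes a base-initialization phase and uses shrinkage regularization, approximated here by a single `shrinkage` parameter).

```python
import numpy as np

class StreamingLDA:
    """Minimal sketch of Deep Streaming LDA over frozen features:
    streaming class means and one shared covariance, updated per
    sample with no replay and no backpropagation."""

    def __init__(self, dim, n_classes, shrinkage=1e-2):
        self.mu = np.zeros((n_classes, dim))   # per-class running means
        self.counts = np.zeros(n_classes)
        self.sigma = np.eye(dim)               # shared covariance
        self.n = 0
        self.shrinkage = shrinkage

    def fit_one(self, x, y):
        if self.n > 0:
            # Rank-1 covariance update against the pre-update class mean.
            delta = x - self.mu[y]
            self.sigma = (self.n * self.sigma +
                          np.outer(delta, delta)) / (self.n + 1)
        # Streaming update of the class mean.
        self.counts[y] += 1
        self.mu[y] += (x - self.mu[y]) / self.counts[y]
        self.n += 1

    def predict(self, x):
        prec = np.linalg.inv((1 - self.shrinkage) * self.sigma +
                             self.shrinkage * np.eye(len(x)))
        w = self.mu @ prec                       # LDA weight vectors
        bias = -0.5 * np.sum(self.mu * w, axis=1)
        return int(np.argmax(w @ x + bias))
```

Because each update is a constant-time running-statistics step, new classes can be absorbed from a stream without revisiting old samples—the property that makes SLDA attractive as a downstream consumer of pre-trained features.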

4. Impact of Sample Size and Data Regimes

A critical observation is the scaling advantage of SSL with limited data. Self-supervised pre-training is most advantageous in low-shot or compressed-class pre-training regimes. Offline and online evaluations repeatedly show that with minimal supervision, SSL substantially outpaces standard supervised learning. The advantage decreases with abundant pre-training data, at which point supervised learning may achieve parity.

5. Generalization and Transfer

Self-supervised features were found not only to generalize better within the same dataset but also across datasets. For example, features pre-trained on ImageNet with SSL transferred more effectively to Places-365 than those from supervised pre-training, highlighting their robustness for transfer learning.

A notable empirical result was a 14.95% relative increase in top-1 accuracy on class-incremental ImageNet, setting a new state of the art for online continual learning. These improvements were robust across three distinct continual learning protocols.

6. Practical and Theoretical Implications

  • Data Efficiency: SSL reduces dependence on large labeled datasets, an important consideration for continual adaptation.
  • Catastrophic Forgetting: Integration of SSL features with replay or plasticity-stabilizing approaches like REMIND further reduces catastrophic forgetting.
  • Execution Platform: All SSL variants are architecture-agnostic and are deployable in large-scale settings.
  • Limitations: As pre-training data increases, supervised learning can eventually catch up or surpass SSL, so benefits are regime-dependent.

7. Future Research Directions

Several open problems and avenues are outlined:

  • Investigating wider and deeper CNNs in SSL for continual learning, as scaling these architectures has led to improvements elsewhere.
  • Developing semi-supervised pre-training methods that combine the strengths of SSL and conventional supervision.
  • Designing online SSL (or semi-supervised) update rules that support continual adaptation alongside incremental task exposure.
  • Extending SSL-pre-trained representations to non-classification continual learning tasks such as detection, segmentation, or robotics.
  • Studying mechanisms for open-world learning—including detection of unknown or out-of-distribution samples—given the robustness of SSL features.

Conclusion

The continual pre-training phase—especially with self-supervised approaches such as MoCo‑V2, Barlow Twins, and SwAV—demonstrates superior efficacy over supervised pre-training in online continual learning, particularly with limited initial labels. These findings establish SSL as a robust foundation for future continual learning systems, capable of incrementally adapting to new classes or tasks while maintaining strong generalization and mitigating catastrophic forgetting. The integration of SSL representations within scalable continual learning pipelines (e.g., REMIND, SLDA, Online Softmax with Replay) constitutes a significant advance and raises compelling directions for extending and applying these methods across domains and learning settings (Gallardo et al., 2021).
