Memory-Efficient Continual Learning with CLIP Models

Published 5 May 2026 in cs.LG | (2605.03866v1)

Abstract: Contrastive Language-Image Pretraining (CLIP) models excel at understanding image-text relationships but struggle with adapting to new data without forgetting prior knowledge. To address this, models are typically fine-tuned using both new task data and a memory buffer of past tasks. However, CLIP's contrastive loss suffers when the memory buffer is small, leading to performance degradation on previous tasks. We propose a memory-efficient, distributionally robust method that dynamically reweights losses per class during training. Our approach, tested on class incremental settings (CIFAR-100, ImageNet-1k) and a domain incremental setting (DomainNet), adapts CLIP models quickly while minimizing catastrophic forgetting, even with minimal memory usage.



Summary

The paper introduces two methods, Global Contrastive Loss (GCL) and Group Distributionally Robust Optimization (GDRO), to adapt CLIP for continual learning under limited memory.
It demonstrates that GDRO consistently outperforms baselines, achieving higher retention on CIFAR-100, ImageNet-1k, and DomainNet benchmarks even with reduced replay buffers.
Experiments validate that dynamically reweighting training objectives for difficult classes is key to mitigating catastrophic forgetting in memory-constrained environments.

Memory-Efficient Continual Learning with CLIP Models: An Expert Analysis

Introduction and Motivation

This work addresses the integration of Continual Learning (CL) into CLIP, a vision-language foundation model trained with a bimodal contrastive objective. While CLIP exhibits strong zero-shot and transfer abilities, fine-tuning it on sequential tasks causes catastrophic forgetting, especially when the memory available for rehearsal is restricted. The core research aim is to determine how CLIP can be adapted for highly memory-efficient CL while maintaining stability across both class-incremental (CIL) and domain-incremental (DIL) settings.

Methodology

The authors propose two primary methods: Global Contrastive Loss (GCL) and Group Distributionally Robust Optimization (GDRO). Both strategies leverage CLIP’s inherent joint embedding of images and label-text to mitigate forgetting and facilitate adaptation.
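To make the joint embedding idea concrete, the following minimal sketch classifies an image by comparing its embedding with the embeddings of the label texts. The toy 3-dimensional vectors and the names `normalize` and `classify` are illustrative assumptions standing in for real encoder outputs; they are not taken from the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify(image_emb, text_embs):
    """Return (index of the most similar label text, all cosine similarities)."""
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims

# Hypothetical embeddings: one row per label prompt, e.g. "a photo of a cat".
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred, sims = classify([0.2, 0.9, 0.1], text_embs)

# A new class only needs one more text embedding, not a new classifier head.
text_embs_extended = text_embs + [[0.0, 0.0, 1.0]]
pred_new, _ = classify([0.1, 0.1, 0.95], text_embs_extended)
```

Because classes are represented by text embeddings rather than rows of a fixed linear layer, extending the label set between tasks does not change the model's output dimensionality.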

Bimodal Contrastive Continual Learning (GCL)

GCL extends the CLIP contrastive learning objective into the CL context. At every stage, both new task samples and a fixed-size memory buffer of previous data are used. The contrastive loss is evaluated over these combined samples, with labels encoded by the text encoder. Because the partition function of the contrastive loss ranges over the whole dataset, computing it exactly is a bottleneck on large datasets. The approach therefore maintains moving average estimators of the required terms, which enables per-batch gradient estimation and lets information from previous tasks propagate across batches.

Group Distributionally Robust Optimization (GDRO)

GDRO targets the imbalanced data distributions that arise from finite memory buffers. After each task, the number of examples stored per class is reduced to make room for new classes, which severely skews the class distribution; typical CLIP fine-tuning does not address this. The authors implement a distributionally robust objective that explicitly increases the loss weight of classes that currently incur higher losses. This formalizes a group DRO framework over classes, employing moving average estimators and stochastic compositional optimization for scalable training. A KL divergence penalty on the class weights over the simplex yields a closed-form solution for the robust weights and adds stability.
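A minimal sketch of the reweighting step, under stated assumptions: if the robust objective maximizes the weighted class loss minus `lam` times the KL divergence from the uniform distribution, the maximizing weights over the simplex are a softmax of the per-class losses divided by `lam`. The reference distribution (uniform), the names `update_class_losses`, `robust_weights`, and `robust_loss`, and the parameters `beta` and `lam` are illustrative choices, not details confirmed by the paper.

```python
import math

def update_class_losses(running, batch_losses, beta=0.9):
    """Moving average of per-class losses; classes absent from the batch keep their value."""
    for c, loss in batch_losses.items():
        running[c] = loss if c not in running else beta * running[c] + (1 - beta) * loss
    return running

def robust_weights(class_losses, lam=1.0):
    """Maximizer of sum_c p_c * L_c - lam * KL(p || uniform) over the simplex:
    p_c is proportional to exp(L_c / lam)."""
    m = max(class_losses.values())  # subtract the max for numerical stability
    exps = {c: math.exp((l - m) / lam) for c, l in class_losses.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def robust_loss(class_losses, lam=1.0):
    """Class losses averaged under the robust weights."""
    w = robust_weights(class_losses, lam)
    return sum(w[c] * class_losses[c] for c in class_losses)

# Hypothetical losses: an under-represented old class vs. a well-covered new class.
losses = update_class_losses({}, {"old_class": 2.0, "new_class": 0.5})
weights = robust_weights(losses, lam=1.0)
```

A small `lam` concentrates weight on the worst class, while a large `lam` pushes the weights back toward uniform, which is how the KL penalty stabilizes training.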

Experimental Design

Experiments employ a CLIP model with a ViT-B/16 vision backbone, evaluated on two CIL datasets (CIFAR-100, ImageNet-1k) and the DIL benchmark DomainNet. The following baselines were compared under identical pretrained initialization and memory conditions: EWC, DER, iCaRL, Co2L, FOSTER, and state-of-the-art supervised as well as self-supervised contrastive replay methods.

Performance is primarily assessed by accuracy over all encountered classes or domains after each task increment, thus directly measuring forgetting and forward transfer. Both the mean final accuracy and per-stage performance trajectories are reported, with particular attention to performance under varying memory buffer sizes.
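The evaluation protocol described above can be sketched as follows: after each stage, accuracy is computed over the test samples of every task seen so far. The `results` layout and the counts in it are made-up illustrative values, not numbers from the paper.

```python
def stage_accuracy(pairs):
    """Accuracy over all test samples from the tasks seen so far."""
    correct = sum(c for c, _ in pairs)
    total = sum(n for _, n in pairs)
    return correct / total

def trajectory(results):
    """results[t] holds (correct, total) pairs for tasks 0..t after stage t."""
    return [stage_accuracy(r) for r in results]

# Hypothetical counts for three stages of 100 test samples per task.
results = [
    [(90, 100)],
    [(80, 100), (85, 100)],
    [(70, 100), (80, 100), (90, 100)],
]
traj = trajectory(results)   # per-stage accuracy over all seen tasks
final_acc = traj[-1]         # accuracy after the last stage
first_task_drop = results[0][0][0] / 100 - results[-1][0][0] / 100
```

Tracking the accuracy of the first task across stages, as in `first_task_drop`, gives a simple view of how much earlier knowledge is lost.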

Results and Empirical Findings

The empirical results show that GCL and GDRO both consistently outperform all baselines across memory sizes in CIL settings, most notably on ImageNet-1k and CIFAR-100. The unimodal contrastive baseline Co2L degrades sharply as memory size shrinks, since it cannot bridge the supervision gap with limited stored samples. When no buffer is available, GCL performance also deteriorates, whereas GDRO remains stable and maintains high retention even under extreme memory constraints, a critical property for real-world deployments with privacy or storage limitations.

On the DomainNet benchmark (DIL), both GCL and GDRO methods surpass prior zero-shot and supervised baselines, with GDRO again proving superior in situations with imbalanced or minimal replay memory. The results strongly suggest that dynamically adapting the training objective to focus on difficult classes is essential for mitigating catastrophic forgetting when per-class memory is severely limited.

Theoretical and Practical Implications

This work exemplifies the advantage of leveraging contrastive, bimodal pretraining within CL frameworks. Embedding labels via natural language further helps protect the model against drift, in contrast with classical linear probing strategies. The proposed GDRO loss extends recent advances in distributionally robust optimization to the CLIP setting, offering principled mitigation for replay memory imbalances, an aspect typically neglected in prior rehearsal-based or replay-based CL research.

Practically, these methods enable the deployment of CLIP-derived systems in streaming, privacy-constrained, or edge scenarios where only a small memory buffer is feasible. Theoretically, the compositional optimization formulation and the adaptation to bimodal contrastive losses both broaden the continual learning landscape by improving scalability and robustness for multimodal foundation models.

Future Research Directions

Further exploration could probe the incorporation of generative memory (rather than strict buffers) into the GDRO framework, dynamic buffer management schemes responsive to observed class drift, and extensions to non-vision modalities within CLIP. Scaling to larger, real-world non-stationary distributions and characterizing the interplay with prompt engineering or adapter-based CLIP updates are also promising directions. Additionally, formal convergence and optimality analyses for the compositional optimization procedures used would offer further insight.

Conclusion

This paper presents two complementary, memory-efficient methods for continual learning with CLIP models, effectively reducing catastrophic forgetting and maintaining adaptation performance in both class-incremental and domain-incremental streams. Joint label-image contrastive embeddings combined with distributionally robust loss reweighting yield state-of-the-art results, particularly in stringent memory scenarios, and establish robust baselines for future multimodal continual learning research.
