Continual Learning in Vision-Language Models

Updated 13 December 2025

The VLM-based continual learning framework is a system that incrementally updates vision-language models with parameter-efficient adaptations to assimilate new visual and textual tasks.
It employs dynamic rank-selective LoRA techniques to balance rapid adaptation (plasticity) and effective knowledge retention (stability) while mitigating catastrophic forgetting.
Empirical evaluations on benchmarks like X-TAIL and MTIL demonstrate improved zero-shot performance and overall accuracy without increasing inference complexity.

A vision-language model (VLM)-based continual learning framework is any system in which a pre-trained VLM (such as CLIP or a similar architecture) is adapted incrementally to a series of new tasks or domains without catastrophic forgetting or collapse of its original zero-shot generalization abilities. These frameworks address the challenge of updating large cross-modal models, often transformer-based, so that they accrue new visual and textual knowledge from sequential input streams while retaining their prior capabilities. Recent advances have focused on parameter-efficient techniques and explicit balancing of plasticity and stability, with dedicated benchmarks that assess both adaptation and retention (Lu et al., 2024).

1. Continual Learning in Vision-Language Models

Continual learning for VLMs is characterized by the sequential arrival of tasks or domains, $\mathcal{D}^1,\dots,\mathcal{D}^T$, each potentially differing in visual or textual distribution. The principal trade-off is between:

Plasticity: rapid adaptation to recent tasks/domain shifts, typically requiring large parameter updates.
Stability: retention of knowledge from prior tasks, particularly zero-shot capabilities; excessive adaptation risks catastrophic overwriting.

In VLMs, this trade-off is exacerbated by the scale and entanglement of multimodal representations: full fine-tuning overfits to new tasks and overwrites prior cross-modal alignments, while freezing the backbone hinders adaptation. This drives research toward structural or algorithmic mechanisms that localize updates, impose regularization, or exploit modularity (Lu et al., 2024; Liu et al., 2025).

2. Adaptive Parameter-Efficient Adaptation: Dynamic Rank-Selective LoRA

CoDyRA (COntinual learning with DYnamic RAnk-selective LoRA) advances continual VLM training by attaching dynamic, rank-selective Low-Rank Adaptation (LoRA) modules to each weight matrix $W_0^m$ of the VLM's transformer layers. Rather than rigid parameterization or task-isolated adapters, CoDyRA performs the following:

Decomposes each LoRA update into $r$ rank-one directions, each with an adaptive importance scalar $w_i^{t,m}$.
Introduces an $\ell_1$ sparsity penalty on $w^{t,m}$, promoting selection of only task-relevant ranks.
Employs a proximal (soft-thresholding) update that prunes rank directions with $|w_i^{t,m}|\le\kappa$, where the threshold $\kappa$ is gradually raised during training.
After each task, the surviving LoRA updates are merged directly into the pre-trained weights $W_0^{m}$ and all ephemeral adapter weights are discarded, so there is no added inference cost or architecture change.
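
The four steps in the list above can be written compactly. The following is a reconstruction from the description here, not a formula quoted from Lu et al.; the factor symbols $B^{t,m}$ and $A^{t,m}$, the task loss $\mathcal{L}^t$, and the penalty strength $\lambda$ are notation assumed for illustration.

$$
\Delta W^{t,m} = \sum_{i=1}^{r} w_i^{t,m}\, B^{t,m}_{:,i}\, A^{t,m}_{i,:},
\qquad
\min_{A,\,B,\,w}\; \mathcal{L}^t\!\left(\{W_0^{m} + \Delta W^{t,m}\}_m\right) + \lambda \sum_{m} \lVert w^{t,m} \rVert_1
$$

$$
w_i^{t,m} \leftarrow \operatorname{sign}\!\left(w_i^{t,m}\right)\max\!\left(\lvert w_i^{t,m}\rvert - \kappa,\; 0\right),
\qquad
W_0^{m} \leftarrow W_0^{m} + \sum_{i\,:\,w_i^{t,m}\neq 0} w_i^{t,m}\, B^{t,m}_{:,i}\, A^{t,m}_{i,:}
$$

Because the merged update has the same shape as $W_0^{m}$, a forward pass through the merged weight, $(W_0^{m} + \Delta W^{t,m})x$, equals $W_0^{m}x + \Delta W^{t,m}x$, which is why discarding the adapter after merging leaves inference cost unchanged.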

No explicit domain, task, or external memory labeling is used. This rank-adaptive scheme automatically balances plasticity (via task-informed rank expansion) and stability (via module-wise structured sparsity and pruning), preventing both overfitting and catastrophic forgetting (Lu et al., 2024).

3. Algorithmic Workflow and Optimization

The CoDyRA framework proceeds as follows:

Initialization: For each new task $t$, initialize LoRA $B^{t,m},A^{t,m}$ and importance weights $w^{t,m}$ for all modules $m$.
Warmup Phase: Perform dense updates on $w^{t,m}$, accumulating evidence of their impact on adaptation.
Sparse Training: After warmup, iteratively apply soft-thresholding to $w^{t,m}$, incrementally raising $\kappa$ to promote sparsity.
Post-hoc Merging: For each $m$, sum the nonzero rank-1 LoRA updates and fold them into $W_0^{m}$; all task-specific parameters are discarded.
No Replay/Reference: The only knowledge retained is encoded in the backbone's updated weights; no example replay or explicit regularization across tasks is used.

This pipeline has low computational/memory overhead and introduces no latency or parameter overhead at deployment.
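
The workflow above can be sketched for a single weight matrix in plain Python. This is a deliberately simplified toy, not the authors' implementation: the rank-one factors are held fixed and only the importance weights are trained, the task loss is replaced by a squared distance to a hypothetical `target` matrix, and all function names, the linear threshold ramp, and the hyperparameter values are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]

def outer(b: List[float], a: List[float]) -> Matrix:
    """Rank-one matrix b a^T."""
    return [[bi * aj for aj in a] for bi in b]

def soft_threshold(w: List[float], kappa: float) -> List[float]:
    """Shrink each weight toward zero by kappa; |w_i| <= kappa becomes exactly 0."""
    return [(abs(x) - kappa) * (1.0 if x > 0 else -1.0) if abs(x) > kappa else 0.0
            for x in w]

def adapted(W0: Matrix, rank_ones: List[Matrix], w: List[float]) -> Matrix:
    """W0 plus the weighted rank-one directions, skipping pruned (zero) weights."""
    W = [row[:] for row in W0]
    for wi, R in zip(w, rank_ones):
        if wi != 0.0:
            for o, row in enumerate(R):
                for k, val in enumerate(row):
                    W[o][k] += wi * val
    return W

def train_task(W0: Matrix, target: Matrix,
               B_cols: List[List[float]], A_rows: List[List[float]],
               steps: int = 200, warmup: int = 50,
               lr: float = 0.1, kappa_max: float = 0.02) -> Tuple[Matrix, List[float]]:
    """Warmup, then sparse training with a rising threshold, then merging."""
    rank_ones = [outer(b, a) for b, a in zip(B_cols, A_rows)]
    w = [0.0] * len(rank_ones)              # initialization of importance weights
    for step in range(steps):
        W = adapted(W0, rank_ones, w)
        # Gradient of 0.5 * ||W - target||_F^2 with respect to each w_i.
        grads = [sum(R[o][k] * (W[o][k] - target[o][k])
                     for o in range(len(W)) for k in range(len(W[0])))
                 for R in rank_ones]
        w = [wi - lr * g for wi, g in zip(w, grads)]
        if step >= warmup:                  # dense updates only during warmup
            kappa = kappa_max * (step - warmup + 1) / (steps - warmup)
            w = soft_threshold(w, kappa)    # prune low-importance ranks
    # Post-hoc merge: fold surviving ranks into the backbone weight.
    return adapted(W0, rank_ones, w), w

basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W0 = [[0.0] * 3 for _ in range(3)]
target = [[1.0, 0.0, 0.0], [0.0, 0.05, 0.0], [0.0, 0.0, -0.8]]
merged, w = train_task(W0, target, basis, basis)
print("importance weights:", w)
```

In this toy run the direction that explains only a small part of the target is pruned to exactly zero, while the two strong directions survive in shrunken form and are the only ones merged. After the call, only `merged` needs to be kept; the factors and importance weights can be dropped, mirroring the discard step.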

4. Quantitative Evaluation and Empirical Results

CoDyRA is benchmarked under the MTIL (multi-domain task-incremental learning) and X-TAIL (cross-domain task-agnostic incremental learning) protocols, with three primary metrics:

Transfer: Zero-shot accuracy on held-out (unseen) domains post-CL.
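Last: Accuracy on each task after the entire task sequence has been learned, which reflects retention.
Average: Accuracy averaged over all tasks and all training steps, summarizing transfer and retention in a single number.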
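
As a concrete reading of these metrics, the sketch below computes them from an accuracy matrix under one common MTIL-style convention. That convention is an assumption here (definitions vary between papers and are not spelled out in this article), and the function name `cl_metrics` and the example numbers are invented for illustration, not taken from any reported result.

```python
from typing import Dict, List

def cl_metrics(acc: List[List[float]]) -> Dict[str, float]:
    """acc[t][j] is the accuracy on task j after training on tasks 0..t.

    Assumes a square matrix with at least two tasks.
    """
    T = len(acc)
    # Transfer: for each task j >= 1, mean accuracy measured before task j
    # has been trained on (entries above the diagonal), averaged over tasks.
    per_task = [sum(acc[t][j] for t in range(j)) / j for j in range(1, T)]
    transfer = sum(per_task) / len(per_task)
    # Last: mean accuracy over all tasks after the final training step.
    last = sum(acc[T - 1]) / T
    # Average: mean over every (step, task) entry of the matrix.
    average = sum(sum(row) for row in acc) / (T * T)
    return {"transfer": transfer, "last": last, "average": average}

example = [
    [90.0, 60.0, 50.0],
    [88.0, 85.0, 52.0],
    [86.0, 83.0, 80.0],
]
print(cl_metrics(example))
```

For this made-up matrix, Transfer is 55.5 and Last is 83.0; a method that forgets less raises the lower-left entries and therefore Last, while a method that preserves zero-shot ability raises the upper-right entries and therefore Transfer.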

On X-TAIL, CoDyRA achieved an Average of 72.1% (vs 70.7% for the strongest prior), and Last of 80.9% (vs 79.1%). On MTIL (5-shot), it delivered an Average of 74.3% and Last of 80.8%, surpassing earlier LoRA/tuned or adapter-based schemes. Notably, zero-shot accuracy on completely unseen benchmarks (e.g., ImageNet-1k, CIFAR100) improved after continual training, indicating enhancement rather than degradation of generalization (Lu et al., 2024).

Ablations demonstrate:

Dynamic rank pruning is crucial for recovering zero-shot performance.
The joint update of entire transformer stacks (vision + text) is superior to limited adaptation (e.g., only attention).
Sparsity and pruning schedule hyperparameters directly control the transfer/forgetting trade-off.
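
A toy illustration of the last point, using invented importance weights rather than values from the ablation study: counting how many rank directions exceed a given threshold shows how a larger $\kappa$ leaves fewer directions to merge (favoring stability), while a smaller one keeps more of the task update (favoring plasticity).

```python
from typing import List

def surviving_ranks(w: List[float], kappa: float) -> int:
    """Number of rank directions whose importance magnitude exceeds kappa."""
    return sum(1 for x in w if abs(x) > kappa)

# Hypothetical importance weights for one module after training.
weights = [0.9, -0.4, 0.12, 0.05, -0.02, 0.01]

for kappa in (0.0, 0.03, 0.1, 0.5):
    print(f"kappa={kappa}: {surviving_ranks(weights, kappa)} ranks kept")
```

With these weights the count falls from 6 to 4, 3, and 1 as the threshold rises.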

5. Comparative Context within the VLM-CL Taxonomy

Within the broader VLM continual learning taxonomy (Liu et al., 2025), CoDyRA exemplifies Parameter-Efficient Adaptation:

It avoids cross-modal drift (a key failure mode) without multimodal replay by ensuring updates are local, structured, and adaptively minimized.
Shared-module interference is mitigated by rank-sparse, task-informed LoRA adaptation; no parameter or prompt "bloat" occurs at inference.
Zero-shot erosion is directly suppressed through aggressive pruning of spurious local directions, thereby retaining or improving the original CLIP's embedding geometry.

Unlike replay- or distillation-centric methods, CoDyRA achieves high stability entirely through adaptive parameterization, without auxiliary loss terms or additional stored models.

6. Limitations and Open Research Directions

All task updates are merged destructively into the backbone without preserving per-task history (i.e., no possibility of re-activating prior adapters). Hence, inter-task correlations or task-aware reweighting are not explicitly modeled.
The framework does not deliver continual expansion of the representation (e.g., growing vocabulary or architecture).
Potential future improvements include hierarchical/grouped rank selection, module-specific pruning adaptivity, combination with limited memory replay for rare-class stabilization, and extension to scenarios where new modalities or class vocabularies are introduced during CL.

The method’s merge-and-prune paradigm opens directions for future approaches that combine its structural adaptation with lightweight replay, generative pseudo-samples, or hierarchical parameter grouping (Lu et al., 2024; Liu et al., 2025).

7. Summary Table: Core Aspects of CoDyRA

Aspect	Implementation Details	Empirical Impact
Parameter Adaptation	Dynamic, per-module, per-rank LoRA updates with post-task merging	No inference overhead
Stability Mechanism	$\ell_1$ sparsity, annealed soft-thresholding, only significant directions kept	Superior retention
Plasticity Mechanism	Modules/ranks driven by task-specific gradients, not fixed at initialization	Enhanced adaptation
Evaluation Metrics	Transfer, Last, Average (MTIL, X-TAIL)	SOTA on benchmarks
External Memory/Past Data	None	Low resource requirements

CoDyRA exemplifies a new generation of continual updating strategies for VLMs that leverage adaptive, importance-selective low-rank adaptation, showing that strong knowledge retention and flexible adaptation are achievable without increased inference complexity or explicit memory buffers (Lu et al., 2024).

References

Lu et al. (2024). Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA.

Liu et al. (2025). Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting.