CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Published 5 Jun 2026 in cs.CV | (2606.06978v1)

Abstract: Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper presents CL-CLIP, a CLIP-based continual object detection architecture that mitigates background relegation and catastrophic forgetting.
It introduces a cost-volume guided decoupling pathway and a multi-expert RoI head to stabilize spatial category signals and reduce cross-category interference.
Experimental results on PASCAL VOC and MS-COCO validate improved old-category retention and balanced plasticity compared to existing state-of-the-art methods.

CL-CLIP: A Cost-Volume Decoupling Framework for Continual Object Detection with CLIP

Problem Formulation and Motivation

Continual Object Detection (COD) mandates maintaining recognition performance on previously learned categories while sequentially incorporating new categories. Despite the demonstrated zero-shot generalization of recent vision-LLMs (VLMs), such as CLIP, existing open-vocabulary detectors experience severe catastrophic forgetting when applied in continual settings due to label incompleteness and the shared-head design. The core challenge arises from the background relegation effect: unlabeled objects from previous tasks are misclassified as background, which rapidly erodes decision boundaries during sequential fine-tuning.

This paper introduces CL-CLIP, a CLIP-based continual object detection architecture that systematically tackles background relegation and head interference by leveraging cost-volume-guided category decoupling. The method is specifically motivated by the observation that zero-shot image-text alignment priors from frozen CLIP models can provide stable, spatially informative category signals, which, if used for feature routing, can prevent cross-category interference regardless of annotation incompleteness.

Methodological Contributions

CL-CLIP builds upon the F-ViT-style open-vocabulary detection pipeline with two central innovations: (1) a cost-volume-guided decoupling pathway informed by dense category-wise CLIP image-text similarities, and (2) a Multi-Expert RoI head with per-category convolutional experts and a drift-regularized FPN.

Figure 1: Overview of CL-CLIP, highlighting cost-volume-guided decoupling, spatial-attention guided RPN, a Multi-Expert RoI head, and the orthogonality loss.

The architecture proceeds as follows:

Cost-Volume Decoupling: For an input image, dense CLIP visual tokens are aligned with all relevant textual category embeddings to form a cost volume. This category-indexed spatial map is derived from frozen CLIP encoders and provides a stable spatial prior that is invariant to downstream continual fine-tuning. To reduce the co-activation of spatial locations across categories, a Gram matrix-based orthogonality penalty is introduced, directly minimizing response overlap between category masks.
Spatial-Attention RPN: Proposal generation utilizes a residual gating mechanism where the maximum response across all category-wise cost-volume slices serves as a foreground prior. This form of spatial biasing overcomes naive RPN suppression of unlabeled regions and ensures higher proposal recall for both old and new categories.
Multi-Expert RoI Head: The RoI features for each proposal are decoupled via category-specific cost-volume gating and processed by per-category convolutional experts, producing CLIP-aligned scores via cosine similarity to text embeddings. Old-task experts are frozen after their respective training steps; current-task experts are updated, providing architectural isolation and inhibiting catastrophic drift. EWC-style drift regularization on the shared FPN further stabilizes adaptation dynamics.

Experimental Analysis

Baselines and Decoupling

An empirical study across multiple CLIP variants (FineCLIP, FG-CLIP, EVA-CLIP, SigLIP2) on PASCAL VOC and MS-COCO reveals that, while zero-shot open-vocabulary detectors retain high current-class accuracy, they fail to preserve detection capacity on previously learned classes during continual fine-tuning, with mAP for the old categories frequently collapsing to nearly zero.

Main Results: PASCAL VOC and MS-COCO

On standard PASCAL VOC splits (10+10, 15+5, 19+1), CL-CLIP achieves the highest all-class mAP among compared methods, with especially large gains in old-class retention compared to F-ViT baselines and substantial advances over both prompt-based and replay-based COD approaches.

On the 4-step MS-COCO continual protocol, CL-CLIP maintains a balanced accuracy/plasticity tradeoff, outperforming IOR and MMA baselines both in [email protected] and under stricter mAP@[.5:.95] metrics.

Figure 2: CL-CLIP maintains strong plasticity and overall accuracy across a 4-task COCO sequence versus baselines.

Backbone Generalization

CL-CLIP's architectural advances generalize to different CLIP backbones. Improvements in both retention and adaptation hold consistently for FG-CLIP and FineCLIP.

Figure 3: Consistent improvements in retention and accuracy are observed across both FineCLIP and FG-CLIP backbones on PASCAL VOC.

Feature Space Structure

t-SNE analyses of post-training RoI features show that CL-CLIP generates more semantically compact and separated feature clusters than the baseline. This demonstrates that cost-volume gating combined with category expert specialization leads to a more structured and less entangled embedding space.

Figure 4: CL-CLIP forms more compact and separable RoI feature clusters after continual updates, contrasting with the entangled structure of F-ViT and FineCLIP.

Efficiency Tradeoffs

Inference analysis on PASCAL VOC shows that, while per-category experts cause inference FLOPs to grow linearly with the number of seen categories, CL-CLIP remains more efficient than F-ViT baselines due to a lighter cost-volume pathway, notably in low-to-moderate class regimes.

Figure 5: Despite per-category expert growth, CL-CLIP exhibits lower FLOPs than the F-ViT baseline with competitive parameter counts.

Ablation Studies

Component ablations establish that:

Category decoupling is indispensable; drift regularization alone cannot guarantee retention.
Orthogonality loss on the cost-volume substantially reduces response overlap and improves old-category AP. Without it, the network rapidly collapses towards category-agnostic foreground patterns.
The class aggregation module, inherited from CAT-Seg, is detrimental under missing-label supervision and should be omitted.
Proposal recall and detection accuracy are both improved by the residual-gated attention mechanism in the RPN.

Implications and Future Directions

CL-CLIP reframes the continual detection challenge within open-vocabulary models as a structural problem, solvable by explicit spatial category decoupling rather than exclusive reliance on optimization-level extrapolation (distillation, regularization, replay). The work suggests that foundational vision-LLMs possess stable priors exploitable for continual adaptation, provided that destructive cross-class mixing dynamics are mitigated at the architectural level.

Future research should aim to mitigate the linear growth in inference cost by exploring parameter-efficient expert sharing, sparsification strategies, or dynamic expert routing. Leveraging structured cost volumes in other dense prediction tracks (e.g., segmentation, tracking) for continual adaptation is also a promising avenue.

Conclusion

CL-CLIP establishes that cost-volume-guided category separation enables robust continual object detection atop frozen CLIP backbones, outperforming both naive open-vocabulary transfer and state-of-the-art COD-specific methods. This cost-volume routing paradigm offers a principled path forward for scalable, update-tolerant vision systems leveraging large VLMs.

Markdown Report Issue