Overview of "Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models"
The paper "Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models" addresses a critical challenge in continual learning (CL), focusing on vision-language models such as CLIP (Contrastive Language-Image Pre-training). The authors identify that during continual learning, the zero-shot transfer capability of such models degrades significantly due to catastrophic forgetting, the phenomenon in which a model loses knowledge of previously learned tasks as it is trained on new tasks sequentially.
Key Contributions
- Identifying the Challenge: The paper begins by noting that while existing CL methods can mitigate forgetting through data replay, these methods fall short for vision-language models like CLIP, because replay typically requires access to previously used data, which is impractical when datasets are private or proprietary. Furthermore, attempts to maintain performance on previously trained tasks often come at a cost, sacrificing the model's zero-shot capabilities.
- ZSCL Method: To address these issues, the authors propose a novel method called ZSCL (Zero-Shot Continual Learning), which preserves zero-shot transfer ability in both the feature space and the parameter space. In the feature space, they distill knowledge from the frozen pre-trained model using a reference dataset. This dataset requires neither labels nor exact image-text pairing, provided it covers sufficiently diverse semantics (a sketch of the distillation loss appears after this list).
- Parameter Space Intervention: In the parameter space, the method prevents excessive drift from the pre-trained weights through weight averaging: model weights are interpolated throughout training to balance zero-shot capability against new-task learning (see the weight-averaging sketch after this list).
- MTIL Benchmark: The authors also propose a new benchmark, Multi-domain Task Incremental Learning (MTIL). Unlike previous benchmarks, which split the classes of a single dataset into tasks, MTIL draws its tasks from different domains, enabling rigorous assessment across a broader range of generalization and transfer scenarios.
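To make the feature-space component concrete, here is a minimal sketch of a distillation loss on an unlabeled reference batch. It assumes `student` and `teacher` expose CLIP-style `encode_image`/`encode_text` methods; the function name, the temperature `tau`, and the weighting `lam` mentioned in the comment are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, ref_images, ref_texts, tau=2.0):
    """Match the student's image-text similarity distribution on an
    unlabeled reference batch to that of the frozen pre-trained teacher.
    `student`/`teacher` are assumed to expose CLIP-style encoders."""
    with torch.no_grad():  # the pre-trained teacher stays frozen
        t_img = F.normalize(teacher.encode_image(ref_images), dim=-1)
        t_txt = F.normalize(teacher.encode_text(ref_texts), dim=-1)
        targets = F.softmax(t_img @ t_txt.t() / tau, dim=-1)

    s_img = F.normalize(student.encode_image(ref_images), dim=-1)
    s_txt = F.normalize(student.encode_text(ref_texts), dim=-1)
    logits = s_img @ s_txt.t() / tau

    # Cross-entropy between teacher and student similarity distributions.
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# During training on a new task, this term would be added to the task
# loss, e.g. loss = task_loss + lam * distillation_loss(...), where
# `lam` is a hypothetical weighting hyperparameter.
```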
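The parameter-space component can likewise be sketched as a running average of weight snapshots taken during training. This is a minimal illustration of weight interpolation, not the paper's exact procedure; the function name and the sampling interval `k` are assumptions.

```python
import copy
import torch

@torch.no_grad()
def fold_into_average(avg_state, model, n):
    """One step of a running weight average: avg <- (n*avg + theta)/(n+1).
    Only floating-point tensors are averaged; integer buffers are kept."""
    for name, param in model.state_dict().items():
        if param.is_floating_point():
            avg_state[name].mul_(n / (n + 1)).add_(param, alpha=1.0 / (n + 1))
    return n + 1

# Sketch of use inside a training loop (interval `k` is hypothetical):
#
#   avg_state = copy.deepcopy(model.state_dict())  # start from theta_0
#   n = 1
#   for step, batch in enumerate(loader):
#       ...  # optimizer step on the new task plus the distillation loss
#       if step % k == 0:
#           n = fold_into_average(avg_state, model, n)
#
#   model.load_state_dict(avg_state)  # evaluate the averaged weights
```

Keeping the pre-trained weights inside the average is what anchors the model near its zero-shot solution while still absorbing new-task updates.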
Experimental Evaluation
The proposed ZSCL method is evaluated extensively and outperforms existing methods in both traditional class-incremental learning settings and the newly introduced MTIL benchmark. The authors report a 9.7% improvement in average score over competing methods, evidence of ZSCL's efficacy in preserving zero-shot transfer capability while learning new tasks.
Implications and Future Directions
The insights from this paper have significant implications for building robust vision-language models capable of continual learning. ZSCL enables models to be updated incrementally without compromising their foundational knowledge and zero-shot capabilities, which is critical for real-world applications that must adapt continuously to new information without extensive resource consumption.
In future work, addressing the limitations identified, such as the need for a reference dataset, could further broaden ZSCL's applicability. Techniques such as synthetic data generation might remove this dependency, providing flexibility across deployment settings. Extending the principles behind ZSCL to other emerging AI paradigms, such as large multimodal models, is another promising avenue for exploration.
In conclusion, this research represents a considerable advance in continual learning for vision-language models, providing a foundation on which future innovations can build.