Abstract
Research into vision-language (V-L) models has seen rapid progress in generalization across domains and tasks. Vision-language models like CLIP excel at zero-shot recognition, yet fine-tuning them on downstream data tends to erode this strength: after sustained fine-tuning they overfit the known classes of the target dataset, hurting performance on out-of-distribution (OOD) samples. To address this issue, the current paper introduces an approach dubbed OGEN, which enhances OOD generalization in fine-tuned models by combining a class-conditional feature generator with an adaptive self-distillation mechanism for robust regularization.
Introduction
State-of-the-art models in the vision-language sphere, such as CLIP, exhibit promising zero-shot capabilities; however, they often falter on OOD samples, i.e., classes beyond the in-distribution set. Fine-tuning methods have improved both in-distribution and OOD performance, but they have also revealed a tendency toward overfitting. This work addresses the overfitting pitfall commonly encountered in V-L model fine-tuning and demonstrates the benefit of a novel method that jointly leverages synthesized OOD data and a self-distillation strategy.
Methodology
The newly proposed OGEN method tackles the overfitting issue by incorporating two key components:
- A class-conditional feature generator synthesizes image features for unknown classes given only their class names, capitalizing on the well-aligned image-text feature space in CLIP. It employs a lightweight attention mechanism and generates features that capture the complex distribution of OOD samples (a minimal sketch follows this list).
- Adaptive self-distillation regularizes the current state of the model using outputs from its earlier checkpoints, promoting a more generalizable solution and mitigating the risk of overfitting to known classes (see the second sketch below).
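The sketch below illustrates one way such a generator could look in PyTorch: it extrapolates a synthetic image-space feature for an unknown class by attending over the CLIP text embeddings of its nearest known classes. All names here (ClassConditionalFeatureGenerator, unknown_txt, known_txt, the single-head attention and residual projection) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionalFeatureGenerator(nn.Module):
    """Minimal sketch (not the paper's exact design): synthesize an
    image-like feature for an unknown class from its CLIP text embedding,
    by attending over the text embeddings of its k nearest known classes."""

    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        # Lightweight single-head attention: the unknown-class text embedding
        # is the query; nearest known-class embeddings are keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        # Map the attended text feature toward the image-feature space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, unknown_txt: torch.Tensor, known_txt: torch.Tensor):
        # unknown_txt: (B, D) text embeddings of unknown class names
        # known_txt:   (C, D) text embeddings of all known classes
        unknown_txt = F.normalize(unknown_txt, dim=-1)
        known_txt = F.normalize(known_txt, dim=-1)
        # Select the k most similar known classes for each unknown class.
        sim = unknown_txt @ known_txt.t()                  # (B, C)
        topk = sim.topk(self.k, dim=-1).indices            # (B, k)
        neighbors = known_txt[topk]                        # (B, k, D)
        # Extrapolate the unknown-class feature from its neighbors.
        q = unknown_txt.unsqueeze(1)                       # (B, 1, D)
        attended, _ = self.attn(q, neighbors, neighbors)   # (B, 1, D)
        synth = self.proj(attended.squeeze(1)) + unknown_txt
        return F.normalize(synth, dim=-1)  # synthesized OOD "image" feature
```

During fine-tuning, features synthesized this way can be appended to the batch as extra negatives, so the classifier is optimized against unknown classes it never actually observed.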
Together, these components yield a model that is robust and well balanced in performance across known and unknown classes.
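A minimal sketch of the self-distillation component follows, assuming a frozen snapshot of an earlier checkpoint serves as the teacher; the paper's adaptive weighting is simplified here to a fixed coefficient alpha, and all function names are hypothetical.

```python
import copy
import torch
import torch.nn.functional as F

def snapshot_teacher(student):
    # Freeze a copy of the current model to serve as a later teacher.
    teacher = copy.deepcopy(student).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def self_distillation_step(student, teacher, images, labels,
                           alpha: float = 0.5, T: float = 2.0):
    """One training step: cross-entropy on known classes plus a KL term
    that pulls the current model toward its earlier, less overfitted
    checkpoint (simplified; the paper adapts this weighting)."""
    student_logits = student(images)
    with torch.no_grad():
        teacher_logits = teacher(images)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature-scaled distillation term
    return ce + alpha * kd
```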
Experiments and Results
Extensive experiments validate the OGEN approach's effectiveness under two OOD generalization settings: within-dataset (base classes to new classes) and cross-dataset generalization. The results show consistent gains in OOD generalization across various fine-tuning methods applied to CLIP-like architectures. Notably, OGEN improves OOD accuracy by up to 18.77% under certain conditions. This performance can be attributed to the effective use of synthesized OOD features during optimization and to a self-distillation scheme that keeps individual model checkpoints from overfitting.
Conclusion
This paper breaks new ground in improving the OOD generalization of V-L models by pinpointing and addressing a key overfitting pitfall. The dual approach of class-conditional feature synthesis and adaptive self-distillation offers a robust framework for model regularization and yields significantly better performance on OOD samples. Future work could extend OGEN to other fine-tuning methods and explore its application to uncertainty modeling on unseen data for further advances in V-L research.