Abstract
Research into vision-language (V-L) models has seen rapid progress in generalization across domains and tasks. Vision-language models like CLIP excel at zero-shot recognition, yet fine-tuning them on downstream data tends to erode this strength: after sustained fine-tuning they overfit the known classes of the target dataset, hurting performance on out-of-distribution (OOD) samples. To address this issue, the current paper introduces an approach dubbed OGEN, which enhances OOD generalization in fine-tuned models by combining a class-conditional feature generator with an adaptive self-distillation mechanism for robust regularization.
Introduction
State-of-the-art models in the vision-language sphere, such as CLIP, exhibit promising zero-shot capabilities; however, they often falter on OOD samples, i.e., classes beyond the in-distribution set. Fine-tuning methods have improved both in-distribution and OOD performance, but they have also revealed a tendency toward overfitting. This work addresses the overfitting pitfall commonly encountered in V-L model fine-tuning and demonstrates the benefit of a novel method that jointly leverages synthesized OOD data and a self-distillation strategy.
Methodology
The newly proposed OGEN method tackles the overfitting issue by incorporating two key components:
- A class-conditional feature generator synthesizes image features for unknown classes given only their class names, capitalizing on the well-aligned image-text feature space in CLIP. It employs a lightweight attention mechanism and generates features that capture the complex distribution of OOD samples (a minimal sketch follows this list).
- Adaptive self-distillation regularizes the current state of the model using outputs from its earlier checkpoints, promoting a more generalizable solution and mitigating the risk of overfitting to known classes (see the second sketch below).
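The sketch below illustrates one way such a generator could look in PyTorch: it extrapolates a synthetic image-space feature for an unknown class by attending over the CLIP text embeddings of its nearest known classes. All names here (ClassConditionalFeatureGenerator, unknown_txt, known_txt, the single-head attention and residual projection) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionalFeatureGenerator(nn.Module):
    """Minimal sketch (not the paper's exact design): synthesize an
    image-like feature for an unknown class from its CLIP text embedding,
    by attending over the text embeddings of its k nearest known classes."""

    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        # Lightweight single-head attention: the unknown-class text embedding
        # is the query; nearest known-class embeddings are keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        # Map the attended text feature toward the image-feature space.
        self.proj = nn.Linear(dim, dim)

    def forward(self, unknown_txt: torch.Tensor, known_txt: torch.Tensor):
        # unknown_txt: (B, D) text embeddings of unknown class names
        # known_txt:   (C, D) text embeddings of all known classes
        unknown_txt = F.normalize(unknown_txt, dim=-1)
        known_txt = F.normalize(known_txt, dim=-1)
        # Select the k most similar known classes for each unknown class.
        sim = unknown_txt @ known_txt.t()                  # (B, C)
        topk = sim.topk(self.k, dim=-1).indices            # (B, k)
        neighbors = known_txt[topk]                        # (B, k, D)
        # Extrapolate the unknown-class feature from its neighbors.
        q = unknown_txt.unsqueeze(1)                       # (B, 1, D)
        attended, _ = self.attn(q, neighbors, neighbors)   # (B, 1, D)
        synth = self.proj(attended.squeeze(1)) + unknown_txt
        return F.normalize(synth, dim=-1)  # synthesized OOD "image" feature
```

During fine-tuning, features synthesized this way can be appended to the batch as extra negatives, so the classifier is optimized against unknown classes it never actually observed.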
Together, these components yield a model that is robust and well balanced in performance across known and unknown classes.
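A minimal sketch of the self-distillation component follows, assuming a frozen snapshot of an earlier checkpoint serves as the teacher; the paper's adaptive weighting is simplified here to a fixed coefficient alpha, and all function names are hypothetical.

```python
import copy
import torch
import torch.nn.functional as F

def snapshot_teacher(student):
    # Freeze a copy of the current model to serve as a later teacher.
    teacher = copy.deepcopy(student).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def self_distillation_step(student, teacher, images, labels,
                           alpha: float = 0.5, T: float = 2.0):
    """One training step: cross-entropy on known classes plus a KL term
    that pulls the current model toward its earlier, less overfitted
    checkpoint (simplified; the paper adapts this weighting)."""
    student_logits = student(images)
    with torch.no_grad():
        teacher_logits = teacher(images)
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature-scaled distillation term
    return ce + alpha * kd
```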
Experiments and Results
Extensive experiments validate the OGEN approach's effectiveness under two OOD generalization settings: within-dataset (base classes to new classes) and cross-dataset generalization. The results show consistent gains in OOD generalization across various fine-tuning methods applied to CLIP-like architectures. Notably, OGEN improves OOD accuracy by up to 18.77% under certain conditions. This performance can be attributed to the effective use of synthesized OOD features during optimization and to a self-distillation scheme that keeps individual model checkpoints from overfitting.
Conclusion
This paper breaks new ground in improving the OOD generalization of V-L models by pinpointing and addressing a key overfitting pitfall. The dual approach of class-conditional feature synthesis and adaptive self-distillation offers a robust framework for model regularization and yields significantly better performance on OOD samples. Future work could extend OGEN to other fine-tuning methods and explore its application to uncertainty modeling on unseen data for further advances in V-L research.