Finetune like you pretrain: Improved finetuning of zero-shot vision models (2212.00638v1)

Published 1 Dec 2022 in cs.CV and cs.LG

Abstract: Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works like WiseFT (Wortsman et al., 2021) and LP-FT (Kumar et al., 2022) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shifts, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by $2.3\%$ ID and $2.7\%$ OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of $4.2\%$ OOD over standard finetuning and outperforms the current state of the art (LP-FT) by more than $1\%$ both ID and OOD. Similarly, on 3 few-shot learning benchmarks, our approach gives gains up to $4.6\%$ over standard finetuning and $4.4\%$ over the state of the art. In total, these benchmarks establish contrastive finetuning as a simple, intuitive, and state-of-the-art approach for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP.

Improved Finetuning of Zero-Shot Vision Models

The paper "Finetune like you pretrain: Improved finetuning of zero-shot vision models" presents a noteworthy examination of fine-tuning techniques for CLIP and similar image-text models. The research aims to address how modifications in the fine-tuning process impact performance, both for in-distribution (ID) and out-of-distribution (OOD) scenarios. The authors propose a straightforward method aligning the fine-tuning process with the contrastive nature of pretraining, demonstrating its effectiveness across multiple benchmarks.

The method, named Finetune Like You Pretrain (FLYP), casts downstream class labels as text prompts and continues optimizing the contrastive loss between image embeddings and the embeddings of these class-descriptive prompts. Keeping the fine-tuning objective identical to the pretraining objective is the core premise: such congruence, the authors argue, improves model performance.
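
To make the recipe concrete, the following is a minimal sketch of how such a contrastive finetuning objective could be written in PyTorch. The `image_encoder`, `text_encoder`, `logit_scale`, and `tokenizer` names are generic placeholders for the two towers and temperature of a CLIP-like model, not the authors' released implementation.

```python
# Minimal sketch of FLYP-style contrastive finetuning (placeholder components,
# not a specific library's API).
import torch
import torch.nn.functional as F

def contrastive_finetune_loss(image_encoder, text_encoder, logit_scale,
                              images, labels, class_prompts, tokenizer):
    """Symmetric contrastive loss between images and their class prompts.

    class_prompts: list mapping class index -> prompt string,
                   e.g. "a photo of a {class name}".
    """
    # Build one class-descriptive prompt per image in the batch.
    texts = tokenizer([class_prompts[y] for y in labels.tolist()])

    # Embed and L2-normalize, exactly as in contrastive pretraining.
    img_emb = F.normalize(image_encoder(images), dim=-1)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)

    # Cosine-similarity logits scaled by the learned temperature.
    logits = logit_scale.exp() * img_emb @ txt_emb.t()

    # The i-th image matches the i-th prompt; other prompts in the batch act
    # as negatives (standard CLIP objective), averaged over both directions.
    targets = torch.arange(images.shape[0], device=images.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```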

Significant empirical results substantiate the paper's claims:

  • Accuracy Gains: On the WILDS-iWILDCam dataset, FLYP achieved state-of-the-art performance, surpassing the top of the leaderboard by 2.3% ID and 2.7% OOD accuracy. Averaged across seven OOD datasets (two WILDS and five ImageNet-associated shifts), FLYP improved OOD performance by 4.2% over standard fine-tuning and by more than 1% over existing state-of-the-art methods such as LP-FT.
  • Few-shot Learning: On the three few-shot learning benchmarks, FLYP yields gains of up to 4.6% over standard fine-tuning and 4.4% over the prior state of the art.

These results highlight the method's potency in diverse contexts, including distribution shifts, transfer learning, and few-shot learning, suggesting that FLYP is a robust approach for fine-tuning zero-shot classifiers.

Theoretical and Practical Implications

Theoretically, this paper sheds light on the potential advantage of maintaining consistency between pretraining and fine-tuning objectives, particularly when both tasks aim to optimize contrastive loss. This insight could serve to enhance our understanding of fine-tuning pre-trained models, helping refine strategies that leverage pre-trained architectures across various domains.

Practically, the results suggest that practitioners could adopt FLYP as a default strategy for fine-tuning vision-language models when improved robustness is required. Because the method needs no complex adaptations and adds no computational cost beyond standard fine-tuning, it holds promise for wide applicability in real-world settings where robust and efficient adaptation of pre-trained models is crucial.
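
Since the only change relative to standard fine-tuning is the loss, inference can remain the familiar zero-shot pipeline: embed every class prompt once and pick the most similar one. A hedged sketch, reusing the placeholder components from the loss example above:

```python
# Test-time classification after contrastive finetuning: the model is used
# like a zero-shot CLIP classifier, so no separate classification head is
# trained. Components are the same placeholders as in the loss sketch.
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(image_encoder, text_encoder, tokenizer, images, class_prompts):
    # Embed all class prompts once; compare each image against every class.
    txt_emb = F.normalize(text_encoder(tokenizer(class_prompts)), dim=-1)
    img_emb = F.normalize(image_encoder(images), dim=-1)
    return (img_emb @ txt_emb.t()).argmax(dim=-1)  # predicted class indices
```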

Future Directions

The paper invites further exploration into the principles of aligning pretraining and fine-tuning objectives beyond the specific use case of contrastive learning. It suggests a potential reevaluation of current fine-tuning practices across different domains in artificial intelligence, especially in light of the growing scale and complexity of pre-trained models. Future research could include empirical validation across different model architectures and tasks, evaluating whether the benefits observed in vision-language models extend to other modalities and applications. Additionally, understanding why matching pretraining and fine-tuning losses leads to superior performance could offer deeper insights into model generalization and adaptation mechanisms.

Authors (5)
  1. Sachin Goyal (17 papers)
  2. Ananya Kumar (27 papers)
  3. Sankalp Garg (5 papers)
  4. Zico Kolter (38 papers)
  5. Aditi Raghunathan (56 papers)
Citations (112)