
LiT: Zero-Shot Transfer with Locked-image text Tuning (2111.07991v3)

Published 15 Nov 2021 in cs.CV, cs.CL, and cs.LG

Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.

Authors (7)
  1. Xiaohua Zhai
  2. Xiao Wang
  3. Basil Mustafa
  4. Andreas Steiner
  5. Daniel Keysers
  6. Alexander Kolesnikov
  7. Lucas Beyer
Citations (488)

Summary

Overview of Zero-Shot Transfer with Locked-Image Text Tuning

This paper introduces Locked-image Tuning (LiT), a technique for improving zero-shot transfer by pairing a locked pre-trained image model with an unlocked text model. The approach builds on contrastive learning and teaches the text model to read out good representations from the pre-trained image model for new tasks.

Methodology

LiT employs a contrastive-tuning approach in which the image and text models each produce embeddings that are aligned with a contrastive loss. The key design choice is to keep the image model's parameters locked while the text model adapts. This division of labor reuses a powerful pre-trained image model as-is, so only the text tower needs training. With the pre-trained ViT-g/14 image tower, LiT reaches 85.2% zero-shot transfer accuracy on ImageNet and 82.5% on ObjectNet. A minimal sketch of this setup follows.
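The following is a hedged PyTorch sketch of the locked-tower training step. PyTorch, the toy linear towers, and the CLIP-style learnable temperature are illustrative assumptions; the paper evaluates ViT, ResNet, and MLP-Mixer image towers with a Transformer text tower.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiTTowers(nn.Module):
    """Two-tower contrastive model with a locked image tower."""
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Lock the pre-trained image tower: no parameter updates.
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        # Learnable softmax temperature, CLIP-style (an assumption here).
        self.log_temp = nn.Parameter(torch.zeros(()))

    def forward(self, images, texts):
        with torch.no_grad():  # features come from the frozen image tower
            img = self.image_encoder(images)
        txt = self.text_encoder(texts)
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        return img, txt, self.log_temp.exp()

def contrastive_loss(img, txt, temp):
    # Symmetric InfoNCE: the i-th image and i-th text are positives,
    # every other pairing in the batch is a negative.
    logits = temp * img @ txt.t()
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Toy usage with linear stand-ins for the two towers.
model = LiTTowers(nn.Linear(512, 256), nn.Linear(128, 256))
images, texts = torch.randn(8, 512), torch.randn(8, 128)
img_e, txt_e, temp = model(images, texts)
loss = contrastive_loss(img_e, txt_e, temp)
loss.backward()  # gradients flow only into the text tower and temperature
```

Because the image tower is frozen, its embeddings can be precomputed once per image, which accounts for much of LiT's training-cost savings.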

Key Results

The paper empirically evaluates LiT against established methods such as CLIP and ALIGN, highlighting improved data and computational efficiency. On the ImageNet zero-shot transfer task, LiT improves on the previous state of the art by 8.8 percentage points (85.2% vs. ALIGN's 76.4%). Furthermore, LiT achieves strong performance on out-of-distribution datasets without training from scratch or extensive fine-tuning.
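Zero-shot evaluation then reduces to embedding one prompt per class with the tuned text tower and assigning each image to the nearest class embedding. A sketch continuing the toy model above; the prompt template and the pre-encoded class inputs are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, images, class_text_inputs):
    # class_text_inputs: one encoded prompt per class, e.g. a tokenized
    # "a photo of a {class name}" (tokenization not shown here).
    img = F.normalize(model.image_encoder(images), dim=-1)            # [B, D]
    txt = F.normalize(model.text_encoder(class_text_inputs), dim=-1)  # [C, D]
    return (img @ txt.t()).argmax(dim=-1)  # predicted class per image

# With the toy towers above: 8 images, 10 "classes".
preds = zero_shot_classify(model, torch.randn(8, 512), torch.randn(10, 128))
```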

The paper also ablates the design choices between locked and unlocked towers, across various pre-trained image architectures and text encoders. A noteworthy observation is that locking the image tower improves performance: it keeps the generality and robustness of the pre-trained image representation intact while the text embeddings learn to align with it (see the sketch below).
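The ablation's per-tower settings are commonly abbreviated L (locked, pre-trained), U (unlocked, pre-trained), and u (unlocked, randomly initialized); the abstract's finding corresponds to a locked image tower paired with an unlocked text tower. A hypothetical PyTorch helper for toggling these settings:

```python
import torch.nn as nn

def configure_tower(tower, setting):
    """Apply a per-tower setting: 'L' = locked pre-trained,
    'U' = unlocked pre-trained, 'u' = unlocked, randomly re-initialized."""
    if setting == "u":
        for m in tower.modules():
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()  # discard the pre-trained weights
    for p in tower.parameters():
        p.requires_grad_(setting != "L")  # only 'L' freezes the tower
    return tower

# Example: locked image tower, text tower trained from scratch ("Lu").
image_tower = configure_tower(nn.Linear(512, 256), "L")
text_tower = configure_tower(nn.Linear(128, 256), "u")
```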

Implications and Future Research

Practically, LiT turns existing vision backbones into zero-shot learners at significantly lower computational cost. Because the method works even with publicly available datasets and models, it could broaden participation in zero-shot learning research.

Theoretically, LiT highlights the value of decoupling the learning of image descriptors from vision-language alignment. The paper suggests that future work could further refine these decoupled stages, perhaps through hybrid models that combine large-scale learned representations with task-specific knowledge.

Conclusion

LiT stands as a promising method for zero-shot transfer by efficiently harnessing pre-existing models, thereby reducing computational costs and promoting wider accessibility. The results challenge traditional training paradigms and encourage further exploration into balancing existing knowledge with new task requirements across AI research fields.
