CLIP with Quality Captions: A Strong Pretraining for Vision Tasks (2405.08911v1)
Abstract: CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learned by CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate CLIP's weak performance on these downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant gains on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when finetuning for dense prediction tasks.
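The paper's contribution lies in the training data (caption quality), not in a new objective: the pretraining is standard CLIP-style contrastive learning over image-text pairs, after which the image encoder is transferred to dense prediction tasks. For context, below is a minimal, self-contained sketch of that standard symmetric contrastive (InfoNCE) loss. The function name, embedding dimensions, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the standard CLIP contrastive objective (not the paper's code).
# Image and text embeddings are L2-normalized, and a symmetric cross-entropy is
# applied over their cosine-similarity matrix, with matching pairs on the diagonal.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix scaled by temperature: (batch, batch).
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th caption, so targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    # Random embeddings stand in for ViT-B/16 image features and text-encoder features.
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt))
```

Under this objective, better-aligned captions give cleaner positive pairs (the diagonal of the similarity matrix), which is the mechanism the paper credits for the improved visual representations.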
- BEiT: BERT pre-training of image transformers. In ICLR, 2022.
- Billion-scale pretraining with vision transformers for multi-task visual representations. In WACV, 2022.
- Cascade R-CNN: high quality object detection and instance segmentation. 2019.
- Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- MMDetection: OpenMMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
- A simple framework for contrastive learning of visual representations. In ICML, 2020a.
- Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020b.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Vision transformer adapter for dense predictions. In ICLR, 2023.
- Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
- MMSegmentation Contributors. MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
- Clip itself is a strong fine-tuner: Achieving 85.7% and 88.0% top-1 accuracy with ViT-B and ViT-L on ImageNet. arXiv preprint arXiv:2212.06138, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Vse++: Improving visual-semantic embeddings with hard negatives. 2018. URL https://github.com/fartashf/vsepp.
- Improving clip training with language rewrites. In NeurIPS, 2023.
- Data determines distributional robustness in contrastive language image pre-training (CLIP). In ICML, 2022.
- Datacomp: In search of the next generation of multimodal datasets. In ICCV, 2023.
- Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
- Mask r-cnn. In ICCV, 2017.
- Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- Searching for mobilenetv3. In ICCV, 2019.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- Panoptic feature pyramid networks. In CVPR, 2019.
- Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
- Veclip: Improving clip training via visual-enriched captions, 2024.
- Efficientformer: Vision transformers at mobilenet speed. In NeurIPS, 2022.
- Visual instruction tuning. 2023.
- Exploring target representations for masked autoencoders. In ICLR, 2024.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Catlip: Clip-level visual recognition accuracy with 2.7x faster pre-training on web-scale image-text data, 2024.
- Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
- Improving multimodal datasets with image captioning, 2023.
- Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Vision transformers for dense prediction. 2021.
- Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- Laion-5b: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
- Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- Revisiting Weakly Supervised Pre-Training of Visual Perception Models. In CVPR, 2022.
- The effectiveness of mae pre-pretraining for billion-scale pretraining. In ICCV, 2023.
- Llama: Open and efficient foundation language models, 2023.
- Fastvit: A fast hybrid vision transformer using structural reparameterization. In ICCV, 2023a.
- An improved one millisecond mobile backbone. In CVPR, 2023b.
- Mobileclip: Fast image-text models through multi-modal reinforced training. In CVPR, 2024.
- Internimage: Exploring large-scale vision foundation models with deformable convolutions. In CVPR, 2023.
- Improving clip fine-tuning performance. In ICCV, 2023.
- Mofi: Learning image representations from noisy entity annotated images. In ICLR, 2024.
- Unified perceptual parsing for scene understanding. In ECCV, 2018.
- Revealing the dark secrets of masked image modeling. In CVPR, 2023.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Position prediction as an effective pretraining strategy. In ICML, 2022a.
- Scaling vision transformers. In CVPR, 2022b.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Scene parsing through ade20k dataset. In CVPR, 2017.
- ibot: Image bert pre-training with online tokenizer. In ICLR, 2022.
Authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Oncel Tuzel