
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement (2310.14108v1)

Published 21 Oct 2023 in cs.LG, cs.AI, and cs.CV

Abstract: Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.
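The training setup the abstract describes combines CLIP's symmetric contrastive loss with an auxiliary loss against pseudo-labels produced by zoo models. The sketch below illustrates one plausible form of that combination in NumPy; the function names, the dense cross-entropy form of the pseudo-label loss, and the weighting factor `lam` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matched image-text pairs sit on the diagonal.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    diag = np.arange(len(logits))
    def ce(l):
        # Row-wise cross-entropy with the diagonal entry as the target class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()
    return 0.5 * (ce(logits) + ce(logits.T))    # image->text and text->image

def pseudo_label_loss(pred_logits, pseudo_labels):
    # Cross-entropy against dense pseudo-labels from a zoo model
    # (e.g. per-pixel classes from an off-the-shelf segmenter).
    l = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    flat = logp.reshape(-1, logp.shape[-1])
    idx = np.arange(pseudo_labels.size)
    return -flat[idx, pseudo_labels.ravel()].mean()

def total_loss(img_emb, txt_emb, pred_logits, pseudo_labels, lam=1.0):
    # Hypothetical combined objective: contrastive term plus weighted task term.
    return clip_contrastive_loss(img_emb, txt_emb) + lam * pseudo_label_loss(
        pred_logits, pseudo_labels)
```

In this reading, the pseudo-supervision term trains a task head on labels a frozen zoo model emitted for the noisy image-text corpus, while the contrastive term preserves CLIP's zero-shot classification behavior.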

Authors (10)
  1. Mohammadreza Salehi (26 papers)
  2. Mehrdad Farajtabar (56 papers)
  3. Maxwell Horton (18 papers)
  4. Fartash Faghri (32 papers)
  5. Hadi Pouransari (32 papers)
  6. Raviteja Vemulapalli (29 papers)
  7. Oncel Tuzel (62 papers)
  8. Ali Farhadi (138 papers)
  9. Mohammad Rastegari (57 papers)
  10. Sachin Mehta (48 papers)