HyperCLIP: Adapting Vision-Language models with Hypernetworks (2412.16777v1)

Published 21 Dec 2024 in cs.CV and cs.LG

Abstract: Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can generate new zero-shot deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
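The abstract describes the architecture concretely enough to sketch its inference flow. Below is a minimal, illustrative PyTorch rendering: the text encoder embeds class prompts, the hypernetwork maps a summary of those embeddings to weights for part of the small image encoder, and the resulting task-adapted encoder produces zero-shot logits. Every specific here is an assumption for illustration, not the paper's actual design: the module sizes, the `EmbeddingBag` stand-in for a real text tower, the mean-pooled task summary, and the choice to regenerate only a projection head rather than deeper encoder weights.

```python
# Illustrative sketch of a HyperCLIP-style inference flow (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperCLIPSketch(nn.Module):
    """Three components, as in the abstract: a small image encoder, a text
    encoder, and a hypernetwork that emits image-encoder weights per task."""

    def __init__(self, vocab=30000, embed_dim=256, feat_dim=128):
        super().__init__()
        self.embed_dim, self.feat_dim = embed_dim, feat_dim
        # Small image backbone; its final projection is supplied by the hypernetwork.
        self.image_backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Toy stand-in for a contrastively trained text tower.
        self.text_encoder = nn.EmbeddingBag(vocab, embed_dim)
        # Hypernetwork: task summary -> weights of the image projection head.
        self.hypernet = nn.Linear(embed_dim, feat_dim * embed_dim)

    def build_classifier(self, token_ids, offsets):
        # A single forward pass through text encoder + hypernetwork yields the
        # zero-shot classifier: class embeddings plus a task-adapted projection.
        class_embs = F.normalize(self.text_encoder(token_ids, offsets), dim=-1)  # (C, D)
        task_summary = class_embs.mean(dim=0)                                    # (D,)
        proj = self.hypernet(task_summary).view(self.embed_dim, self.feat_dim)
        return class_embs, proj

    def forward(self, images, token_ids, offsets):
        class_embs, proj = self.build_classifier(token_ids, offsets)
        feats = self.image_backbone(images)               # (B, feat_dim)
        img_embs = F.normalize(feats @ proj.t(), dim=-1)  # task-adapted image embedding
        return img_embs @ class_embs.t()                  # (B, C) zero-shot logits

model = HyperCLIPSketch()
imgs = torch.randn(4, 3, 32, 32)
ids = torch.tensor([1, 2, 3, 4, 5])   # two classes, flat token ids
offs = torch.tensor([0, 3])           # EmbeddingBag offsets per class
logits = model(imgs, ids, offs)       # shape (4, 2)
```

Note how `build_classifier` mirrors the abstract's key claim: a new deployment-ready classifier needs only one pass through the text encoder and hypernetwork, with no images required to adapt the encoder.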

Authors (6)
  1. Victor Akinwande (9 papers)
  2. Mohammad Sadegh Norouzzadeh (3 papers)
  3. Devin Willmott (11 papers)
  4. Anna Bair (4 papers)
  5. Madan Ravi Ganesh (13 papers)
  6. J. Zico Kolter (151 papers)