

Recent breakthroughs in vision-language models (VLMs) have opened a new chapter for the vision community. Trained on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. Despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven effective for text encoding, it is unclear whether the same holds for image encoding, especially since the many network types proposed on the ImageNet benchmark are rarely studied in VLMs. Because of the small data and model scale, conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L outperforms ViT-L by 2.0% ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
The ViTamin architecture combines a convolutional stem, MBConv blocks, and Transformer blocks, and outputs a feature map with stride 16.


  • ViTamin introduces a novel architecture for vision-language models (VLMs), aiming to optimize vision models with scalable and high-performance solutions under the contrastive language-image pretraining (CLIP) framework.

  • The paper reevaluates existing vision models, including Vision Transformers (ViTs), ConvNets, and Hybrid architectures, establishing a comprehensive benchmark for VLMs.

  • ViTamin integrates the strengths of ConvNets and Transformers, featuring Mobile Convolution Blocks (MBConv) and Transformer Blocks (TFB) for improved efficiency and performance.

  • The study reveals ViTamin’s superior performance in zero-shot ImageNet accuracy and its scalability, setting a foundation for future research in vision-language tasks.


The paper introduces ViTamin, an architecture designed as the image encoder for vision-language models (VLMs) trained on large-scale image-text pairs. Where most VLMs default to a vanilla Vision Transformer (ViT) as the image encoder, ViTamin is tailored for scalability and performance under the contrastive language-image pretraining (CLIP) framework. The study reevaluates existing vision models, including ViTs, ConvNets, and hybrid architectures, across different model sizes and training data sizes. It then develops ViTamin, whose ViTamin-L variant outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy under the same DataComp-1B data and OpenCLIP training scheme, and it proposes a benchmark for assessing vision models in future VLM work.
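To make "zero-shot classification" under CLIP concrete, here is a minimal sketch of the usual decision rule: embed the image and one text prompt per class, then pick the class whose text embedding has the highest cosine similarity with the image embedding. The function names and the 3-dimensional toy embeddings are hypothetical and chosen only for illustration; in a real system the vectors would come from trained image and text encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return (best class name, per-class cosine similarity).

    class_text_embeddings maps a class name to the text embedding of a prompt
    for that class (for example "a photo of a dog").
    """
    image = l2_normalize(image_embedding)
    scores = {}
    for name, text_embedding in class_text_embeddings.items():
        text = l2_normalize(text_embedding)
        scores[name] = sum(i * t for i, t in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings, made up for illustration (assumption, not model outputs).
toy_image = [0.9, 0.1, 0.2]
toy_classes = {
    "dog": [1.0, 0.0, 0.1],
    "cat": [0.0, 1.0, 0.1],
    "car": [0.1, 0.1, 1.0],
}
prediction, similarity = zero_shot_classify(toy_image, toy_classes)
print(prediction)  # dog
```

No classifier is trained on the target classes. Swapping the image encoder changes only how the image embedding is produced, which is why zero-shot accuracy can serve as a comparison metric across vision backbones.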

Reevaluating Vision Models in the CLIP Setting

The paper starts by challenging the status quo of using vanilla ViTs for image encoding in VLMs. It argues that, despite the effectiveness of ViTs, the ever-growing datasets for VLMs call for a reassessment of architectural choices, including ConvNets and hybrid models. The authors establish a new benchmarking protocol under the CLIP framework and analyze model performance across model and data scales. Key findings include:

  • Increasing the training data size improves performance for all models at all scales, while ViTs scale slightly better than the other architectures as model parameters increase.

  • Higher feature resolution from smaller patch sizes or fine-grained convolutions contributes positively to model performance.

  • Hybrid models, exemplified by CoAtNet, showcase superior performance to pure ConvNet or Transformer architectures, although scalability challenges arise with the largest CoAtNet variant.
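The feature-resolution finding can be illustrated with simple arithmetic: for a square input, the side length of the feature map is the image size divided by the patch size (or total stride), and the token count is that side length squared. The sketch below assumes a 224x224 input, a common resolution that the summary itself does not specify; `token_grid` is a hypothetical helper name.

```python
def token_grid(image_size, stride):
    """Return (side length, token count) of the feature map for a square image."""
    if image_size % stride != 0:
        raise ValueError("image_size must be divisible by stride")
    side = image_size // stride
    return side, side * side

# Assumed 224x224 input; smaller patch size or stride gives a finer grid.
for stride in (32, 16, 8):
    side, tokens = token_grid(224, stride)
    print(f"stride {stride:>2}: {side}x{side} grid, {tokens} tokens")
# stride 32: 7x7 grid, 49 tokens
# stride 16: 14x14 grid, 196 tokens
# stride  8: 28x28 grid, 784 tokens
```

Halving the stride quadruples the number of tokens, so the accuracy gain from higher feature resolution comes with more computation in the attention layers.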

ViTamin: Design and Highlights

Building on these insights, ViTamin introduces a strategic architectural design that integrates the strengths of ConvNets and Transformers. The model is structured into a three-stage network with an initial convolutional stem, followed by Mobile Convolution Blocks (MBConv) in the early stages for local feature extraction, and culminating in Transformer Blocks (TFB) for global context modeling. Key innovations in ViTamin include:

  • MBConv-LN and TFB-GeGLU Blocks: At the micro-level, ViTamin refines the MBConv and TFB blocks for better performance and efficiency. MBConv-LN simplifies the conventional MBConv block by using a single LayerNorm, while TFB-GeGLU uses GELU-gated linear units (GeGLU) in its feed-forward networks (FFNs) for improved accuracy with fewer parameters.

  • Scalability with Simplified Design: ViTamin demonstrates significant scalability both in terms of data volume and model size. Its design allows for effective performance improvement with increased training data and supports straightforward scaling rules for creating larger model variants.

  • Superior Performance: ViTamin-L outperforms its ViT-L counterpart by 2.0% in zero-shot ImageNet accuracy and shows promising results across 60 diverse benchmarks. ViTamin-XL, with only 436M parameters, reaches 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% of EVA-E, which has 4.4B parameters.
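The GeGLU feed-forward idea mentioned above can be sketched in a few lines of plain Python. This is a generic, bias-free GeGLU FFN, not the authors' implementation: the class and helper names are hypothetical, the weights are random, and the hidden widths (2x the model width for GeGLU, 4x for a standard two-layer MLP) are assumptions used only to show how a gated FFN can end up with fewer parameters.

```python
import math
import random

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(weight, vec):
    """Multiply a (rows x cols) weight matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def random_matrix(rows, cols, rng):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GeGLUFFN:
    """Bias-free gated FFN: down(gelu(gate(x)) * up(x))."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w_gate = random_matrix(hidden, dim, rng)
        self.w_up = random_matrix(hidden, dim, rng)
        self.w_down = random_matrix(dim, hidden, rng)

    def __call__(self, token):
        gate = [gelu(g) for g in matvec(self.w_gate, token)]
        up = matvec(self.w_up, token)
        gated = [g * u for g, u in zip(gate, up)]  # element-wise gating
        return matvec(self.w_down, gated)

def geglu_params(dim, hidden):
    """Three weight matrices: gate, up, and down projections."""
    return 3 * dim * hidden

def mlp_params(dim, hidden):
    """Two weight matrices: up and down projections."""
    return 2 * dim * hidden

dim = 8  # toy model width (assumption)
ffn = GeGLUFFN(dim, hidden=2 * dim)
output = ffn([0.5] * dim)
print(len(output))  # 8
print(geglu_params(dim, 2 * dim), mlp_params(dim, 4 * dim))  # 384 512
```

Under these assumed widths the gated FFN uses 6·dim² weights against 8·dim² for the standard MLP. Whether ViTamin uses exactly these ratios is not stated in the summary.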

Implications and Future Directions

The introduction of ViTamin and its promising results prompt a reassessment of architectural preferences in the development of VLMs. The findings encourage exploring beyond the ViT archetype, considering hybrid models that leverage both convolutional and transformer strengths. Additionally, the scalability of ViTamin, both in data and model size, underscores the potential for more resource-efficient yet highly performant VLM architectures. As the paper proposes a new suite of benchmarks for VLMs, it sets a foundation for future research to build upon, aiming for models that excel in a broader range of vision-language tasks, including open-vocabulary detection and segmentation and large multi-modal models.

In conclusion, ViTamin is a step toward vision encoders designed specifically for the VLM setting. Its architectural changes improve on ViT baselines under the same data and training scheme, and its benchmarking protocol gives future work a common basis for comparing vision models under CLIP-style training.


