ViTamin: Designing Scalable Vision Models in the Vision-Language Era (2404.02132v2)

Published 2 Apr 2024 in cs.CV

Abstract: Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision community. VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models, thanks to training on large-scale Internet image-text pairs. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure Transformer has proven its effectiveness for text encoding, it remains questionable whether the same holds for image encoding, especially since many network designs have been proposed on the ImageNet benchmark but are rarely studied in VLMs. Because of the small data and model scales involved, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data size. To this end, we introduce ViTamin, a new family of vision models tailored for VLMs. ViTamin-L significantly outperforms ViT-L, by 2.0% in ImageNet zero-shot accuracy, when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L also presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).

Summary

  • The paper introduces ViTamin, a novel hybrid model that refines MBConv and Transformer blocks to achieve superior zero-shot ImageNet accuracy with fewer parameters.
  • It employs a comprehensive benchmarking protocol comparing ConvNets, ViTs, and hybrids to assess performance scalability across varying data and model sizes.
  • The results suggest that integrating local feature extraction with global context modeling can lead to resource-efficient, high-performing vision-language models.

ViTamin: Advancing Vision Models for Vision-Language Tasks with New Architectures and Training Protocols

Introduction

The paper introduces ViTamin, a novel architecture designed for vision-language models (VLMs), aiming to optimize vision models in the context of large-scale image-text pair training. Distinct from the prevalent use of vanilla Vision Transformers (ViTs) as the default image encoder in VLMs, ViTamin offers a tailored solution to address scalability and performance under the contrastive language-image pretraining (CLIP) framework. The paper carefully reevaluates existing vision models, including ViTs, ConvNets, and hybrid architectures, across different scales of model parameters and training data sizes. It culminates in the development of ViTamin, which shows marked improvements over existing models on zero-shot classification tasks, and it proposes a comprehensive benchmark for future vision model assessments in VLM tasks.
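Because the whole study is framed by CLIP-style training, it is worth recalling the objective that couples the image and text encoders. Below is a minimal sketch of that symmetric contrastive loss in PyTorch; the fixed `logit_scale` value and the helper name `clip_contrastive_loss` are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          logit_scale: float = 100.0) -> torch.Tensor:
    """Symmetric InfoNCE loss over the matched image-text pairs in a batch."""
    image_feats = F.normalize(image_feats, dim=-1)        # unit-normalize embeddings
    text_feats = F.normalize(text_feats, dim=-1)
    logits = logit_scale * image_feats @ text_feats.t()   # (B, B) cosine-similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```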

Reevaluating Vision Models in the CLIP Setting

The paper begins by challenging the status quo of employing vanilla ViTs for image encoding in VLMs. It argues that, despite the effectiveness of ViTs, the ever-growing datasets used for VLMs necessitate a reassessment of architectural choices, including ConvNets and hybrid models. The authors establish a new benchmarking protocol under the CLIP framework, systematically analyzing model performance across a range of scales. Key findings from their analysis indicate the following (a sketch of the zero-shot evaluation step used in the protocol appears after the list):

  • Scaling the training data improves performance for all models at all scales, with ViTs scaling slightly better than the alternatives with respect to model parameters.
  • Higher feature resolution, obtained from smaller patch sizes or finer-grained convolutions, consistently benefits performance.
  • Hybrid models, exemplified by CoAtNet, outperform pure ConvNet or pure Transformer architectures, although the largest CoAtNet variant runs into scalability issues.
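The zero-shot evaluation at the center of this protocol embeds each class name with a text prompt and assigns an image to the class whose text embedding it matches best. The sketch below illustrates this with the OpenCLIP API; the `ViT-B-32` checkpoint, the prompt template, and the tiny label set are placeholders rather than the paper's exact DataComp-1B setup.

```python
import torch
import open_clip

# Load a pretrained CLIP model and its preprocessing pipeline (placeholder checkpoint).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical label set; ImageNet zero-shot evaluation would use all 1000 class names.
class_names = ["golden retriever", "tabby cat", "sports car"]
prompts = tokenizer([f"a photo of a {name}" for name in class_names])

with torch.no_grad():
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def classify(image):
    """Return the index of the class whose prompt embedding best matches the image."""
    with torch.no_grad():
        pixels = preprocess(image).unsqueeze(0)           # image: a PIL.Image
        img_feats = model.encode_image(pixels)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        return (img_feats @ text_feats.T).argmax(dim=-1).item()
```

Zero-shot accuracy on a labeled test set such as ImageNet-1k is then simply the fraction of images for which `classify` returns the ground-truth class.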

ViTamin: Design and Highlights

Building on these insights, ViTamin adopts an architectural design that combines the strengths of ConvNets and Transformers. The model is structured as a three-stage network: an initial convolutional stem, Mobile Convolution Blocks (MBConv) in the early stages for local feature extraction, and Transformer Blocks (TFB) in the final stage for global context modeling. Key innovations in ViTamin include the following (a minimal sketch of the two block types appears after the list):

  • MBConv-LN and TFB-GeGLU Blocks: At the micro-level, ViTamin refines MBConv and TFB blocks for enhanced performance and efficiency. MBConv-LN simplifies conventional MBConv by using a single LayerNorm, while TFB-GeGLU employs Gated Linear Units in FFNs for improved accuracy with fewer parameters.
  • Scalability with Simplified Design: ViTamin demonstrates significant scalability both in terms of data volume and model size. Its design allows for effective performance improvement with increased training data and supports straightforward scaling rules for creating larger model variants.
  • Superior Performance: ViTamin-L outperforms its ViT-L counterpart by 2.0% in zero-shot ImageNet accuracy under the same DataComp-1B data and OpenCLIP training scheme, and it performs robustly across 60 diverse benchmarks. Notably, ViTamin-XL, with only 436M parameters, reaches 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% of the far larger 4.4B-parameter EVA-E model.
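To make the block-level descriptions concrete, the following is a minimal PyTorch sketch of the two ideas: an inverted-bottleneck convolution block normalized by a single LayerNorm, and a Transformer FFN gated with GeGLU. The expansion ratios, normalization placement, and exact layer ordering are illustrative assumptions and may differ from ViTamin's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MBConvLN(nn.Module):
    """Inverted-bottleneck (MBConv-style) block with a single LayerNorm at its input."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.norm = nn.LayerNorm(dim)                          # one LN in place of several BNs
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)    # 1x1 expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)      # 3x3 depthwise conv
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)   # 1x1 projection
        self.act = nn.GELU()

    def forward(self, x):                                      # x: (B, C, H, W)
        shortcut = x
        x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = self.act(self.expand(x))
        x = self.act(self.dwconv(x))
        x = self.project(x)
        return x + shortcut                                    # residual connection

class GeGLUFFN(nn.Module):
    """Transformer FFN that replaces the plain MLP with a gated linear unit (GeGLU)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.proj_in = nn.Linear(dim, 2 * hidden)              # value and gate in one projection
        self.proj_out = nn.Linear(hidden, dim)

    def forward(self, x):                                      # x: (B, N, C) token sequence
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))
```

In the three-stage layout described above, blocks like MBConvLN would populate the early, high-resolution stages, while Transformer blocks using a GeGLU FFN would form the final stage.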

Implications and Future Directions

The introduction of ViTamin and its promising results prompt a reassessment of architectural preferences in the development of VLMs. The findings encourage looking beyond the ViT archetype toward hybrid models that leverage both convolutional and Transformer strengths. Moreover, ViTamin's scalability in both data and model size underscores the potential for more resource-efficient yet highly performant VLM architectures. By proposing a new suite of benchmarks for VLMs, the paper also lays a foundation for future research aimed at models that excel across a broader range of vision-language tasks, including open-vocabulary detection and segmentation as well as large multi-modal models.

In conclusion, ViTamin marks a significant step toward optimizing vision models within the VLM paradigm. Its architectural innovations, together with the comprehensive benchmarking effort, not only advance the state of the art but also broaden the horizon for future exploration of AI's visual and linguistic capabilities.