ViTamin: Designing Scalable Vision Models in the Vision-Language Era (2404.02132v2)
Abstract: Recent breakthroughs in vision-language models (VLMs) open a new chapter in the vision community. Thanks to training on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven effective for text encoding, it remains questionable whether the same holds for image encoding, especially since the many network designs proposed on the ImageNet benchmark have rarely been studied in VLMs. Due to the small data and model scale, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
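The abstract repeatedly refers to zero-shot accuracy under the OpenCLIP scheme. The sketch below is a minimal illustration of what that evaluation entails, assuming the open_clip library and using a standard ViT-L/14 DataComp-1B checkpoint tag as a stand-in (the ViTamin checkpoint identifiers and the image path are placeholders): class names are embedded as text prompts, and an image is classified by cosine similarity between the normalized image and text embeddings.

```python
# Minimal sketch of CLIP-style zero-shot classification with OpenCLIP.
# Assumptions: open_clip is installed; the ViT-L-14 / DataComp-1B checkpoint
# tag stands in for a ViTamin checkpoint; "example.jpg" is a placeholder path.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="datacomp_xl_s13b_b90k")  # assumed checkpoint tag
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

classnames = ["golden retriever", "tabby cat", "sports car"]
text = tokenizer([f"a photo of a {c}" for c in classnames])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Zero-shot prediction: cosine similarity of L2-normalized embeddings,
    # scaled and softmaxed into per-class probabilities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(classnames, probs.squeeze(0).tolist())))
```

Zero-shot ImageNet accuracy then follows by repeating this over prompts for all 1,000 ImageNet classes and counting top-1 matches against the ground-truth labels.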
- OpenAI. Gpt-4v(ision) system card. 2023.
- Getting vit in shape: Scaling laws for compute-optimal model design. NeurIPS, 2023.
- Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. NeurIPS, 2019.
- Winogavil: Gamified association benchmark to challenge vision-and-language models. NeurIPS, 2022.
- Language models are few-shot learners. NeurIPS, 2020.
- Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
- Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2017.
- Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
- Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
- Mobile-former: Bridging mobilenet and transformer. In CVPR, 2022.
- Panoptic-DeepLab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In CVPR, 2020.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
- The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- Coatnet: Marrying convolution and attention for all data sizes. NeurIPS, 2021.
- Language modeling with gated convolutional networks. In ICML, 2017.
- Scaling vision transformers to 22 billion parameters. In ICML, 2023.
- Coconut: Modernizing coco segmentation. In CVPR, 2024.
- Open-vocabulary universal image segmentation with maskclip. In ICML, 2023.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, 2021.
- The pascal visual object classes (voc) challenge. IJCV, 88:303–338, 2010.
- Multiscale vision transformers. In ICCV, 2021.
- Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
- Eva: Exploring the limits of masked visual representation learning at scale. In CVPR, 2023.
- Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
- Scaling open-vocabulary image segmentation with image-level labels. In ECCV, 2022.
- Ross Girshick. Fast r-cnn. In CVPR, 2015.
- Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
- Levit: a vision transformer in convnet’s clothing for faster inference. In CVPR, 2021.
- Open-vocabulary object detection via vision and language knowledge distillation. In ICLR, 2022.
- Vizwiz grand challenge: Answering visual questions from blind people. In CVPR, 2018.
- Deep residual learning for image recognition. In CVPR, 2016.
- Mask r-cnn. In CVPR, 2017.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
- Natural adversarial examples. In CVPR, 2021.
- Rethinking spatial dimensions of vision transformers. In ICCV, 2021.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Searching for mobilenetv3. In ICCV, 2019.
- Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Squeeze-and-excitation networks. In CVPR, 2018.
- Densely connected convolutional networks. In CVPR, 2017.
- Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
- Openclip, 2021.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Panoptic segmentation. In CVPR, 2019.
- Wilds: A benchmark of in-the-wild distribution shifts. In ICML, 2021.
- Imagenet classification with deep convolutional neural networks. NeurIPS, 2012.
- F-vlm: Open-vocabulary object detection upon frozen vision and language models. In ICLR, 2023.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
- Uniformer: Unifying convolution and self-attention for visual recognition. TPAMI, 2023.
- Reclip: Resource-efficient clip by training with small images. arXiv preprint arXiv:2304.06028, 2023.
- Scaling clip training with 81.1% zero-shot imagenet accuracy within a $10,000 budget; an extra $4,000 unlocks 81.8% accuracy. arXiv preprint arXiv:2306.15658, 2023.
- An inverse scaling law for clip training. NeurIPS, 2023.
- Mvitv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022.
- Efficientformer: Vision transformers at mobilenet speed. NeurIPS, 2022.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Scaling language-image pre-training via masking. In CVPR, 2023.
- Microsoft coco: Common objects in context. In ECCV, 2014.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. NeurIPS, 2023.
- Learning customized visual models with retrieval-augmented knowledge. In CVPR, 2023.
- Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- A convnet for the 2020s. In CVPR, 2022.
- Fully convolutional networks for semantic segmentation. In CVPR, 2015.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS, 2022.
- Sieve: Multimodal dataset pruning using image captioning models. arXiv preprint arXiv:2310.02110, 2023.
- Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.
- Simple open-vocabulary object detection. In ECCV, 2022.
- The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
- Slip: Self-supervision meets language-image pre-training. In ECCV, 2022.
- The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
- OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Stand-alone self-attention in vision models. NeurIPS, 2019.
- Do imagenet classifiers generalize to imagenet? In ICML, 2019.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- Extending the wilds benchmark for unsupervised adaptation. In ICLR, 2022.
- Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
- Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
- Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Towards vqa models that can read. In CVPR, 2019.
- Bottleneck transformers for visual recognition. In CVPR, 2021.
- Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
- Remax: Relaxing for better training on efficient panoptic segmentation. NeurIPS, 2024.
- Going deeper with convolutions. In CVPR, 2015.
- Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
- Efficientnetv2: Smaller models and faster training. In ICML, 2021.
- Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- Going deeper with image transformers. In ICCV, 2021.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Maxvit: Multi-axis vision transformer. In ECCV, 2022.
- Attention is all you need. NeurIPS, 2017.
- Learning robust global representations by penalizing local predictive power. NeurIPS, 2019.
- Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In ECCV, 2020.
- Max-deeplab: End-to-end panoptic segmentation with mask transformers. In CVPR, 2021.
- Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021.
- Cvt: Introducing convolutions to vision transformers. In ICCV, 2021.
- Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In ICCV, 2023.
- Clipself: Vision transformer distills itself for open-vocabulary dense prediction. arXiv preprint arXiv:2310.01403, 2023.
- Early convolutions help transformers see better. NeurIPS, 2021.
- Aggregated residual transformations for deep neural networks. In ICCV, 2017.
- Open-vocabulary panoptic segmentation with text-to-image diffusion models. In CVPR, 2023.
- A simple baseline for zero-shot semantic segmentation with pre-trained vision-language model. In ECCV, 2022.
- Moat: Alternating mobile convolution and attention brings strong vision models. In ICLR, 2023.
- Polymax: General dense prediction with mask transformer. arXiv preprint arXiv:2311.05770, 2024.
- From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
- The devil is in the details: A deep dive into the rabbit hole of data filtering. arXiv preprint arXiv:2309.15954, 2023.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Glance-and-gaze vision transformer. NeurIPS, 2021.
- Cmt-deeplab: Clustering mask transformers for panoptic segmentation. In CVPR, 2022.
- k-means Mask Transformer. In ECCV, 2022.
- Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. NeurIPS, 2023.
- Towards open-ended visual recognition with large language model. arXiv preprint arXiv:2311.08400, 2023.
- Metaformer is actually what you need for vision. In CVPR, 2022.
- Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- Exploring the influence of information entropy change in learning systems. arXiv preprint arXiv:2309.10625, 2023.
- Open-vocabulary object detection using captions. In CVPR, 2021.
- The visual task adaptation benchmark. 2019.
- Scaling vision transformers. In CVPR, 2022.
- Lit: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
- Sigmoid loss for language image pre-training. In ICCV, 2023.
- Scene parsing through ade20k dataset. In CVPR, 2017.
- Extract free dense labels from clip. In ECCV, 2022.
- Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.