Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity (2306.16048v3)
Abstract: This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel at tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across levels of semantic granularity and their sensitivity to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvement, it does not fully address these issues, highlighting the need for VLMs with stronger generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.
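The paper's exact benchmark protocol is not reproduced here, but the kind of probe it formalizes is straightforward to illustrate. Below is a minimal sketch using the OpenCLIP library that scores a single image against (a) the same concept phrased at three levels of semantic granularity and (b) a correct caption versus a subtly incorrect one. The image path `dog.jpg`, the prompt sets, and the checkpoint choice are illustrative assumptions, not the paper's evaluation setup.

```python
import torch
import open_clip
from PIL import Image

# Load a pretrained CLIP checkpoint (ViT-B/32 trained on LAION-2B);
# any open_clip model/checkpoint pair would work for this probe.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Hypothetical input image; assumed to show, e.g., a Labrador on grass.
image = preprocess(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)

# (a) The same concept at three granularity levels. A model that is
# consistent across granularity should score all three comparably high.
granularity_prompts = [
    "a photo of an animal",              # coarse
    "a photo of a dog",                  # moderate
    "a photo of a labrador retriever",   # fine-grained
]

# (b) Specificity probe: a correct caption vs. a subtly incorrect one.
specificity_prompts = [
    "a dog lying on the grass",          # correct
    "a cat lying on the grass",          # subtly wrong subject
]

with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    for name, prompts in [("granularity", granularity_prompts),
                          ("specificity", specificity_prompts)]:
        txt_feat = model.encode_text(tokenizer(prompts))
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        # Cosine similarity between the image and each prompt.
        sims = (img_feat @ txt_feat.T).squeeze(0)
        for prompt, score in zip(prompts, sims.tolist()):
            print(f"[{name}] {score:.3f}  {prompt}")
```

A well-behaved model would rank all three granularity prompts similarly and place the correct caption clearly above the mismatched one; the failure modes the paper reports are exactly the cases where these orderings break down, e.g. a strong preference for the moderately fine-grained prompt or a near-tie between the correct and subtly incorrect captions.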
Authors: Zhenlin Xu, Yi Zhu, Tiffany Deng, Abhay Mittal, Yanbei Chen, Manchen Wang, Paolo Favaro, Joseph Tighe, Davide Modolo