
Probing Conceptual Understanding of Large Visual-Language Models (2304.03659v3)

Published 7 Apr 2023 in cs.CV

Abstract: In recent years, large visual-language (V+L) models have achieved great success on various downstream tasks. However, whether these models have a conceptual grasp of the visual content is not well studied. In this work we focus on the conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) relations, 2) composition, and 3) context. Our probes are grounded in cognitive science and help determine whether a V+L model can, for example, recognize that snow garnished with a man is implausible, or identify beach furniture by knowing it is located on a beach. We experiment with many recent state-of-the-art V+L models and observe that they mostly fail to demonstrate conceptual understanding. The study reveals several interesting insights, such as that cross-attention helps models learn conceptual understanding, and that CNNs are better with texture and patterns, while Transformers are better with color and shape. We further use some of these insights and investigate a simple finetuning technique that rewards the three conceptual understanding measures, with promising initial results. The proposed benchmarks will drive the community to delve deeper into conceptual understanding and foster advances in the capabilities of large V+L models. The code and dataset are available at: https://tinyurl.com/vlm-robustness
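
The probing setup described in the abstract amounts to image-text matching between a plausible caption and a minimally perturbed, implausible one (e.g., a swapped relation). The following is a minimal sketch of such a probe using an off-the-shelf CLIP checkpoint; the image path, captions, and the choice of CLIP are illustrative assumptions and not the paper's released benchmark code.

```python
# Minimal sketch of a relation probe: the model should score the plausible
# caption higher than an implausible caption with the relation swapped.
# Assumes a local image file "example.jpg" (hypothetical).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
captions = [
    "a man garnished with snow",   # plausible relation
    "snow garnished with a man",   # implausible relation swap
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image[0]  # image-text similarity scores

# The probe counts a success only if the plausible caption is ranked first.
print("model prefers the plausible caption:", int(logits.argmax()) == 0)
```

Accuracy aggregated over many such perturbed pairs is the kind of measure a benchmark of relations, composition, and context would report; chance performance on a two-caption probe is 50%.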

Authors (7)
  1. Madeline Schiappa (2 papers)
  2. Raiyaan Abdullah (1 paper)
  3. Shehreen Azad (5 papers)
  4. Jared Claypoole (3 papers)
  5. Michael Cogswell (19 papers)
  6. Ajay Divakaran (43 papers)
  7. Yogesh Rawat (7 papers)
Citations (12)
