If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions (2403.16442v2)

Published 25 Mar 2024 in cs.CL, cs.CV, and cs.LG

Abstract: Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align an LLM with VLM preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify features that contribute to VLM representations. Using EX2, we find that spurious descriptions have a major role in VLM representations despite providing no helpful information, e.g., "Click to enlarge photo of CONCEPT." More importantly, among informative descriptions, VLMs rely significantly on non-visual attributes like habitat (e.g., North America) to represent visual concepts. Also, our analysis reveals that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene descriptions and that non-visual or even spurious descriptions significantly influence their representations.
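To make the approach concrete, below is a minimal sketch (not the authors' code) of the kind of VLM-based scoring EX2 relies on: candidate descriptions generated by an LLM are scored by a CLIP-style model against concept images, and such scores can serve as the reward signal for the RL alignment step described in the abstract. The Hugging Face CLIP checkpoint, the image path, and the candidate descriptions are illustrative assumptions; the paper's actual reward definition and PPO setup follow its own implementation.

```python
# Sketch: scoring candidate concept descriptions with a CLIP-style VLM.
# Assumes the `transformers` CLIP checkpoint below; the image path and
# candidate descriptions are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("examples/indigo_bunting.jpg")  # placeholder concept image
candidates = [
    "A small, vibrant blue songbird with a conical beak.",    # visual attributes
    "A bird found across North America during the summer.",   # non-visual (habitat)
    "Click to enlarge photo of indigo bunting.",               # spurious description
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image[i, j] is the scaled cosine similarity of image i and text j.
scores = outputs.logits_per_image.squeeze(0)
for text, score in zip(candidates, scores.tolist()):
    print(f"{score:7.2f}  {text}")

# In EX2, similarity scores of this kind reward an RL-fine-tuned LLM, so the
# descriptions it converges to reveal which textual features the VLM prioritizes.
```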

Authors (3)
  1. Reza Esfandiarpoor (8 papers)
  2. Cristina Menghini (13 papers)
  3. Stephen H. Bach (33 papers)
Citations (3)