Linear Spaces of Meanings: Compositional Structures in Vision-Language Models (2302.14383v3)

Published 28 Feb 2023 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs.
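For intuition, the kind of linear algebraic operation the abstract alludes to can be sketched with off-the-shelf CLIP text embeddings. The snippet below is illustrative only and is not the authors' implementation: the checkpoint name, prompt template, and mean-centering scheme are assumptions. It estimates per-factor "ideal word" vectors by averaging embeddings over a small grid of composed prompts, then adds them to approximate a concept that was never embedded directly.

```python
# Minimal sketch (assumed checkpoint, prompts, and additive decomposition):
# estimate per-factor "ideal word" vectors from a grid of CLIP text embeddings,
# then compose them linearly and compare against the directly embedded concept.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts):
    """Return L2-normalized CLIP text embeddings for a list of prompts."""
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

colors = ["red", "blue", "green"]
objects = ["car", "bird", "shirt"]

# Embed the full grid of composed prompts "a photo of a {color} {object}".
grid = embed([f"a photo of a {c} {o}" for c in colors for o in objects])
grid = grid.view(len(colors), len(objects), -1)

mean = grid.mean(dim=(0, 1))          # shared "context" vector
color_vecs = grid.mean(dim=1) - mean  # one "ideal word" per color
object_vecs = grid.mean(dim=0) - mean # one "ideal word" per object

# Compose "green car" from its factors and compare with the direct embedding.
composed = mean + color_vecs[2] + object_vecs[0]
direct = embed(["a photo of a green car"])[0]
cos = torch.dot(torch.nn.functional.normalize(composed, dim=-1), direct)
print(f"cosine(composed, direct) = {cos.item():.3f}")
```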

Authors (6)
  1. Matthew Trager (30 papers)
  2. Pramuditha Perera (23 papers)
  3. Luca Zancato (21 papers)
  4. Alessandro Achille (60 papers)
  5. Parminder Bhatia (50 papers)
  6. Stefano Soatto (179 papers)
Citations (21)