Iterated Learning Improves Compositionality in Large Vision-Language Models (2404.02145v2)

Published 2 Apr 2024 in cs.CV

Abstract: A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, recent investigations find that most, if not all, of our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover, prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission, the need to teach a new generation, as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically, we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent, and operationalize cultural transmission by iteratively resetting one of the agents' weights during training. After every iteration, this training paradigm induces representations that become "easier to learn", a property of compositional languages: e.g., our model trained on CC3M and CC12M improves standard CLIP by 4.7% and 4.0% respectively on the SugarCrepe benchmark.
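The training scheme described in the abstract can be summarized in a short sketch. The snippet below is a minimal, illustrative PyTorch version (not the paper's code): a CLIP-style contrastive loop between a vision agent and a language agent in which one agent is periodically re-initialized, mimicking cultural transmission. The encoder architecture, feature dimensions, batch size, and reset cadence (`steps_per_generation`) are all assumptions made for illustration.

```python
# Minimal sketch of the iterated-learning idea: CLIP-style contrastive training
# between a vision agent and a language agent, with one agent's weights
# periodically re-initialized ("cultural transmission"). Names, sizes, and the
# reset schedule are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Stand-in agent: maps pre-extracted features to a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def clip_loss(img_emb, txt_emb, temperature: float = 0.07):
    # Symmetric InfoNCE loss over matching image-text pairs within a batch.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def make_optimizer(vision_agent, language_agent):
    return torch.optim.Adam(
        list(vision_agent.parameters()) + list(language_agent.parameters()), lr=1e-4
    )

vision_agent = Encoder(in_dim=512)    # assumed feature dimensionality
language_agent = Encoder(in_dim=512)
optimizer = make_optimizer(vision_agent, language_agent)

steps_per_generation = 1000  # assumed: reset cadence is a hyperparameter
for step in range(5000):
    # Dummy batch of pre-extracted image/text features, for illustration only.
    img_feats, txt_feats = torch.randn(32, 512), torch.randn(32, 512)
    loss = clip_loss(vision_agent(img_feats), language_agent(txt_feats))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # "Cultural transmission": at the end of each generation, re-initialize one
    # agent (here the language agent) so it must re-learn the other's
    # representation, pressuring that representation to be easy to learn.
    if (step + 1) % steps_per_generation == 0:
        language_agent = Encoder(in_dim=512)
        optimizer = make_optimizer(vision_agent, language_agent)
```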
