Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem (2411.00238v2)
Abstract: Recent work has documented striking heterogeneity in the performance of state-of-the-art vision language models (VLMs), including both multimodal language models and text-to-image models. These models are able to describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks -- such as counting, localization, and simple forms of visual analogy -- that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience. The binding problem is a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., multiple objects in an image), which necessitates serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising from the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.
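The interference described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or experimental setup. It assumes a hypothetical encoding in which each object is a one-hot color vector concatenated with a one-hot shape vector, and it compares two ways of representing a scene: summing all objects into one shared vector, or keeping one vector ("slot") per object. All names (`encode_object`, `encode_scene_shared`, `encode_scene_slots`) are illustrative.

```python
# Toy illustration of the binding problem (assumed encoding, not from the paper).
COLORS = ["red", "blue"]
SHAPES = ["square", "circle"]


def encode_object(color, shape):
    """One-hot color features followed by one-hot shape features."""
    vec = [0] * (len(COLORS) + len(SHAPES))
    vec[COLORS.index(color)] = 1
    vec[len(COLORS) + SHAPES.index(shape)] = 1
    return vec


def encode_scene_shared(objects):
    """Sum every object into a single shared feature vector."""
    encoded = [encode_object(c, s) for c, s in objects]
    return [sum(column) for column in zip(*encoded)]


def encode_scene_slots(objects):
    """Keep one vector per object so features stay tied to their object."""
    return [encode_object(c, s) for c, s in sorted(objects)]


scene_a = [("red", "square"), ("blue", "circle")]
scene_b = [("red", "circle"), ("blue", "square")]

# The shared code only records which features are present, not which
# color goes with which shape, so the two scenes become indistinguishable.
shared_collide = encode_scene_shared(scene_a) == encode_scene_shared(scene_b)

# Per-object slots preserve the color-shape pairing, so the scenes differ.
slots_collide = encode_scene_slots(scene_a) == encode_scene_slots(scene_b)

print(shared_collide)  # True
print(slots_collide)   # False
```

In this toy setting the shared vector cannot tell "red square, blue circle" apart from "red circle, blue square", which is the kind of confusion the binding problem predicts when distinct entities share one set of features. Handling objects one at a time, or in separate slots, avoids the collision at the cost of extra processing.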