Visual cognition in multimodal large language models (2311.16093v3)

Published 27 Nov 2023 in cs.LG

Abstract: A chief goal of artificial intelligence is to build machines that think like people. Yet it has been argued that deep neural network architectures fail to accomplish this. Researchers have asserted these models' limitations in the domains of causal reasoning, intuitive physics, and intuitive psychology. Yet recent advancements, namely the rise of LLMs, particularly those designed for visual processing, have rekindled interest in the potential to emulate human-like cognitive abilities. This paper evaluates the current state of vision-based LLMs in the domains of intuitive physics, causal reasoning, and intuitive psychology. Through a series of controlled experiments, we investigate the extent to which these modern models grasp complex physical interactions, causal relationships, and intuitive understanding of others' preferences. Our findings reveal that, while some of these models demonstrate a notable proficiency in processing and interpreting visual data, they still fall short of human capabilities in these areas. Our results emphasize the need for integrating more robust mechanisms for understanding causality, physical dynamics, and social cognition into modern-day, vision-based LLMs, and point out the importance of cognitively-inspired benchmarks.

In recent years, advances in AI have produced highly sophisticated models that can interpret and respond to visual and textual information, sophisticated enough that we might wonder whether these models have started to "think" like humans. In particular, vision-based LLMs, which pair language modeling with visual processing, have demonstrated impressive capabilities. However, research indicates that these models still do not fully emulate human cognitive processes in key areas.

The paper in focus evaluates the capabilities of several modern vision LLMs across three specific cognitive domains: intuitive physics, causal reasoning, and intuitive psychology. Intuitive physics involves predicting and understanding physical interactions; causal reasoning deals with understanding cause-and-effect relationships; and intuitive psychology involves inferring the mental states and intentions of others. Despite their complexity, these are areas where even young children demonstrate significant proficiency, suggesting that understanding and replicating these abilities is crucial for developing AI that truly mimics human thinking.

Through a series of experiments, the researchers investigated the models' performance on tasks such as predicting the stability of block towers and inferring the outcomes of removing particular blocks. GPT-4, one of the largest models with a visual processing component (denoted GPT-4V), was put to the test alongside several other models. They found that although models like GPT-4V were proficient at elementary tasks such as identifying colors or counting objects in an image, they struggled when the tasks required more complex reasoning about physics and causality. None of the models matched human performance in these cognitive domains.
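To make the setup concrete, the following is a minimal sketch of how such a block-tower stability query could be posed to a vision-language model through the OpenAI chat API. This is an illustration, not the authors' code: the model identifier, the stimulus image file, and the prompt wording are all assumptions.

    # Minimal sketch (not the paper's actual code) of querying a
    # vision-language model about block-tower stability.
    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Hypothetical stimulus image showing a tower of blocks.
    with open("block_tower.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # GPT-4V-era model name; an assumption
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here is a tower of blocks. Will the tower remain "
                         "standing or fall over? Answer 'stable' or "
                         "'unstable'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=10,
    )
    print(response.choices[0].message.content)

Evaluating such an experiment then amounts to comparing the model's one-word answers against ground truth (from a physics simulation or from human judgments) over many stimuli.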

The models also failed to demonstrate any significant aptitude in intuitive psychology tasks, which require inferring others' preferences from visual cues. This shortfall was consistent across all models tested.

The upshot is that, while modern vision-based LLMs have become quite adept at processing visual information, their capacity for deep reasoning and understanding of intuitive human concepts remains limited. The paper concludes that integrating more advanced mechanisms for causality, physical dynamics, and social cognition is necessary for further advancement. It also highlights the importance of developing benchmarks inspired by cognitive science to appropriately evaluate these AI models.

The research is a critical step in the continued effort to improve AI systems. It sheds light on current limitations and paves the way for future work exploring a broader range of cognitive domains and model variations. Nonetheless, the complexity of human cognition continues to pose a challenge to the current state of technology, reflecting the nuanced and multifaceted nature of our intellect. As AI models evolve, so too must the methods and benchmarks we use to measure their approximation of the human mind.

Authors (4)
  1. Luca M. Schulze Buschoff (8 papers)
  2. Elif Akata (4 papers)
  3. Matthias Bethge (103 papers)
  4. Eric Schulz (33 papers)
Citations (3)