In recent years, advances in AI have produced highly sophisticated models that can interpret and respond to visual and textual information. They are sophisticated enough, in fact, that we might wonder whether these models have started to "think" like humans. Vision LLMs in particular, which augment large language models with visual processing, have demonstrated impressive capabilities. However, research indicates that these models still do not fully emulate human cognitive processes in key areas.
The paper in focus evaluates the capabilities of several modern vision LLMs across three specific cognitive domains: intuitive physics, causal reasoning, and intuitive psychology. Intuitive physics involves predicting and understanding physical interactions; causal reasoning deals with cause-and-effect relationships; and intuitive psychology involves inferring the mental states and intentions of others. Complex as these abilities are, even young children demonstrate significant proficiency in them, which suggests that understanding and replicating them is crucial for developing AI that truly mimics human thinking.
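To make "causal reasoning" concrete, consider the blicket-detector paradigm, a classic setup from developmental psychology (used here purely as an illustration, not necessarily the paper's exact task): some objects, the "blickets," activate a machine, and the learner must infer which objects are blickets from a handful of demonstrations. A minimal sketch of that inference, assuming a deterministic machine that lights up whenever at least one blicket is on it (function names and scenario are invented for illustration):

```python
from itertools import chain, combinations

def consistent_blicket_sets(objects, observations):
    """Enumerate the sets of 'blickets' consistent with what was observed,
    assuming a deterministic machine that lights up iff at least one
    blicket is placed on it. (Illustrative toy, not the paper's task.)

    observations: list of (objects_placed, machine_lit) pairs.
    """
    def powerset(items):
        return chain.from_iterable(
            combinations(items, r) for r in range(len(items) + 1))

    consistent = []
    for candidate in powerset(objects):
        blickets = set(candidate)
        if all(bool(blickets & set(placed)) == lit
               for placed, lit in observations):
            consistent.append(blickets)
    return consistent

# A and B together activate the machine, but B alone does not.
obs = [({'A', 'B'}, True), ({'B'}, False)]
print(consistent_blicket_sets(['A', 'B'], obs))  # [{'A'}] -- only A is a blicket
```

Seeing B fail on its own retroactively explains away B's role in the joint trial, leaving A as the only candidate cause; this kind of hypothesis elimination is exactly the cause-and-effect inference the domain refers to.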
Through a series of experiments, the researchers tested the models on tasks such as predicting whether a block tower will remain standing and inferring what would happen if particular blocks were removed. GPT-4V, the vision-enabled variant of GPT-4 and one of the largest models evaluated, was put to the test alongside several other models. The researchers found that although models like GPT-4V were proficient at elementary tasks such as identifying colors or counting objects in an image, they struggled when the tasks required more complex reasoning about physics and causality. Surprisingly, none of the models matched human performance in these cognitive domains.
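To give a feel for the intuitive-physics task, here is a minimal sketch of how ground truth for a tower-stability judgment can be computed. The one-dimensional block representation and function names are illustrative assumptions, not the paper's actual stimuli or code; the physics is the standard static criterion that, at every interface, the combined center of mass of the blocks above must lie over the contact region below:

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A block in a 1D side view: left-edge position and horizontal width."""
    x: float
    width: float

def is_stable(stack: list[Block]) -> bool:
    """Return True if a tower (listed bottom to top) is statically stable.

    At every interface, the combined center of mass of all blocks above
    must lie over the contact region between the two adjacent blocks;
    otherwise the upper part of the tower topples. Equal masses and
    uniform density are assumed for simplicity.
    """
    for i in range(len(stack) - 1):
        above = stack[i + 1:]
        com = sum(b.x + b.width / 2 for b in above) / len(above)
        # Contact region: overlap of block i's top face and block i+1's bottom.
        lo = max(stack[i].x, stack[i + 1].x)
        hi = min(stack[i].x + stack[i].width,
                 stack[i + 1].x + stack[i + 1].width)
        if lo > hi or not (lo <= com <= hi):
            return False  # the resultant weight falls outside the support
    return True

print(is_stable([Block(0.0, 2.0), Block(0.2, 2.0), Block(0.4, 2.0)]))  # True: slight stagger
print(is_stable([Block(0.0, 2.0), Block(0.5, 2.0), Block(2.0, 2.0)]))  # False: top block overhangs too far
```

A physics engine can label any such scene automatically, which is part of what makes these tasks attractive as benchmarks: human-like judgments can be compared against exact ground truth.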
The models also failed to demonstrate any significant aptitude in the intuitive psychology tasks, which require inferring others' preferences from visual cues. This failure was consistent across all models tested.
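The flavor of these intuitive psychology tasks can be captured with a toy inverse-planning example in the spirit of the naive utility calculus: an agent who takes a costlier path to one item must value it more, and an observer can invert a noisy-rational choice model to estimate how much more. The scenario, numbers, and logistic choice rule below are all invented for illustration; they are not the paper's model:

```python
import math

def choice_prob(utility_gap: float, extra_cost: float, beta: float = 1.0) -> float:
    """P(agent picks the costlier item | it is worth `utility_gap` more),
    under a logistic (softmax) noisy-rational choice rule."""
    net = beta * (utility_gap - extra_cost)  # net advantage of the costlier item
    return 1.0 / (1.0 + math.exp(-net))

# Observation: the agent walked 7 extra steps to reach the red object
# instead of the nearer blue one. Invert the choice model with a flat
# prior over candidate utility gaps to infer the preference.
extra_cost = 7.0
gaps = [g / 2 for g in range(41)]                    # candidate gaps 0.0 .. 20.0
likelihood = [choice_prob(g, extra_cost) for g in gaps]
z = sum(likelihood)                                  # flat prior cancels out
posterior = [l / z for l in likelihood]

expected_gap = sum(g * p for g, p in zip(gaps, posterior))
print(f"Inferred preference for red over blue: ~{expected_gap:.1f} effort units")
```

The point of the toy is the direction of inference: from observed behavior back to hidden preferences, which is exactly the step the tested models struggled with.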
The upshot is that, while modern vision LLMs have become quite adept at processing visual information, their capacity for deeper reasoning about intuitive human concepts remains limited. The paper concludes that further progress will require integrating more advanced mechanisms for reasoning about causality, physical dynamics, and social cognition. It also highlights the importance of developing benchmarks inspired by cognitive science to evaluate these models appropriately.
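As a sketch of what a cognitively inspired benchmark harness might compute, the snippet below scores a model's categorical judgments against ground truth and correlates its graded confidence with mean human ratings on the same stimuli. The trial format and field names are hypothetical scaffolding, not an actual benchmark API:

```python
from statistics import mean

def evaluate(trials: list[dict]) -> dict:
    """Score a model on benchmark trials. Each trial (hypothetical format)
    carries a ground-truth label, the model's predicted label, the model's
    graded confidence, and the mean human confidence for the same stimulus.
    """
    accuracy = mean(t['model'] == t['truth'] for t in trials)
    # Pearson correlation between model and human graded judgments,
    # a common "human alignment" measure alongside raw accuracy.
    m = [t['model_conf'] for t in trials]
    h = [t['human_conf'] for t in trials]
    mm, hm = mean(m), mean(h)
    cov = sum((a - mm) * (b - hm) for a, b in zip(m, h))
    sm = sum((a - mm) ** 2 for a in m) ** 0.5
    sh = sum((b - hm) ** 2 for b in h) ** 0.5
    return {'accuracy': accuracy, 'human_alignment_r': cov / (sm * sh)}

trials = [
    {'truth': True,  'model': True,  'model_conf': 0.9, 'human_conf': 0.95},
    {'truth': False, 'model': True,  'model_conf': 0.6, 'human_conf': 0.20},
    {'truth': False, 'model': False, 'model_conf': 0.3, 'human_conf': 0.10},
]
print(evaluate(trials))
```

Reporting both numbers matters: a model can be accurate for the wrong reasons, and comparing its graded judgments against human ones reveals whether it errs where people err.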
The research is a critical step in the continued effort to improve AI systems. It sheds light on current limitations and paves the way for future work exploring a broader range of cognitive domains and model variations. Nonetheless, the complexity of human cognition continues to pose a challenge to the current state of technology, reflecting the nuanced and multifaceted nature of our intellect. As AI models evolve, so too must the methods and benchmarks we use to measure their approximation of the human mind.