
Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models (2507.16572v1)

Published 22 Jul 2025 in cs.CL

Abstract: This paper presents a systematic evaluation of state-of-the-art multimodal LLMs (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the LLM, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLM development.



Summary

  • The paper demonstrates that advanced MLLMs consistently score near chance levels (around 54%) on intuitive physics tasks using GRASP and IntPhys 2.
  • A probing analysis reveals that while vision encoders efficiently capture physical cues, their integration into language models is inadequate for complex reasoning.
  • The study emphasizes the need for improved vision-language alignment to enhance MLLMs' capabilities in distinguishing plausible from implausible physical scenarios.

Probing Intuitive Physics Understanding in MLLMs

This essay explores the systematic evaluation of state-of-the-art multimodal LLMs (MLLMs) on intuitive physics tasks as presented in the paper "Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models." The paper employs the GRASP and IntPhys 2 datasets to assess models such as InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and Gemini 2.0 Flash Thinking, highlighting the persistent difficulty these models have in distinguishing physically plausible from implausible scenarios.

Evaluation of Multimodal LLMs

The paper examines MLLMs' capabilities in reasoning about intuitive physics using datasets of simulated videos. The findings indicate that even advanced MLLMs struggle to distinguish between plausible and implausible physics scenarios, performing only marginally above chance.

Figure 1: Average accuracy (%) across the intuitive physics tests of GRASP and IntPhys 2 for the largest models. The dashed line represents chance performance.

Despite human performance being estimated at roughly 80% on similar tasks, models such as Qwen 2.5 VL, InternVL 2.5, LLaVA-OneVision, and Gemini 2.0 Flash Thinking fail to exceed 54% accuracy. This indicates a fundamental limitation in MLLMs' current ability to integrate vision and language information effectively.
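
The paper reports accuracies rather than significance tests, but a quick sanity check shows how close 54% sits to the 50% chance line for a binary plausible/implausible judgment. The sketch below is a minimal illustration using SciPy's binomial test; the test-set size is invented, not taken from the paper.

```python
from scipy.stats import binomtest

# Invented numbers for illustration: 400 test videos, 216 judged correctly (54%).
n_videos = 400
n_correct = 216

# Two-sided binomial test against the 50% chance baseline of a
# binary plausible/implausible judgment.
result = binomtest(n_correct, n_videos, p=0.5)
print(f"accuracy = {n_correct / n_videos:.1%}, p vs. chance = {result.pvalue:.3f}")
```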

Probing Analysis

In response to these limitations, the paper performs a probing analysis of the models' internal embeddings. The probing reveals a critical vision-language misalignment that limits MLLM task performance: the vision encoders capture physical plausibility cues proficiently, but this information is inadequately leveraged by the LLM components, leading to failures in intuitive physics reasoning.

Figure 2: Comparison of model performance across three model families, highlighting alignment and performance variations.
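
The paper does not include its extraction code, but a common way to capture representations at each processing stage in PyTorch is to register forward hooks on the relevant modules. The sketch below is a minimal illustration on a toy stand-in model; the module names and dimensions are assumptions, not the actual architecture of any evaluated model.

```python
import torch
import torch.nn as nn

# Toy stand-in for an MLLM pipeline (vision encoder -> projector -> LLM).
# Real models such as InternVL 2.5 or LLaVA-OneVision expose analogous
# modules under different attribute names.
class ToyMLLM(nn.Module):
    def __init__(self, d_in=768, d_vis=512, d_llm=1024):
        super().__init__()
        self.vision_encoder = nn.Linear(d_in, d_vis)
        self.projector = nn.Linear(d_vis, d_llm)
        self.llm = nn.Linear(d_llm, d_llm)

    def forward(self, x):
        return self.llm(self.projector(self.vision_encoder(x)))

captured = {}

def capture(name):
    # Returns a hook that stores the module's output for later probing.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model = ToyMLLM()
model.vision_encoder.register_forward_hook(capture("vision"))
model.projector.register_forward_hook(capture("projection"))
model.llm.register_forward_hook(capture("llm"))

_ = model(torch.randn(8, 768))  # one batch of (toy) frame features
print({name: t.shape for name, t in captured.items()})
```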

The paper extracts intermediate representations at key processing stages and applies t-SNE to visualize how well task-relevant information is retained through the model layers.
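
As a rough illustration of this step, the sketch below runs t-SNE over per-video embeddings and colors the points by plausibility label. The features here are synthetic stand-ins; in the real analysis they would be the embeddings captured above, pooled per video.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Synthetic stand-ins for per-video embeddings from one processing stage.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, (100, 512)),   # implausible videos
                      rng.normal(0.8, 1.0, (100, 512))])  # plausible videos
labels = np.array([0] * 100 + [1] * 100)

# Project to 2D; a perplexity of 30 is a common choice for a few hundred points.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

for lab, name in [(0, "implausible"), (1, "plausible")]:
    mask = labels == lab
    plt.scatter(coords[mask, 0], coords[mask, 1], s=10, label=name)
plt.legend()
plt.title("t-SNE of video embeddings (synthetic illustration)")
plt.show()
```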

Vision-Language Alignment

The misalignment between vision and language components is identified as a bottleneck. Vision-encoder embeddings cleanly separate plausible from implausible scenarios, but this structure deteriorates once the information is processed by the LLM, suggesting that the degradation occurs primarily in the projection and alignment of vision embeddings with language representations.

Figure 3: 2D t-SNE visualizations of video features (GRASP, level 2) for LLaVA-OneVision-72B, illustrating distinctions between plausible and implausible video features.

Despite advances in MLLM architecture, the failure to preserve meaningful clustering of physics concepts through the LLM decoder highlights the need for improved integration strategies.
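
One way to quantify this stage-wise degradation is to train a linear probe on the embeddings captured at each stage and compare how well the plausibility label can be decoded. The sketch below uses synthetic features whose signal is deliberately weakened at later stages to mimic the reported pattern; all dimensions and signal strengths are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 400, 64
labels = rng.integers(0, 2, n)

# Synthetic embeddings per stage; the plausibility signal is weakened at
# later stages to mimic the degradation the paper reports.
stages = {}
for name, strength in [("vision encoder", 2.0),
                       ("after projection", 0.5),
                       ("LLM layers", 0.1)]:
    feats = rng.normal(0.0, 1.0, (n, d))
    feats[:, 0] += strength * labels  # signal lives along one direction
    stages[name] = feats

# A linear probe per stage: high accuracy means the plausibility label is
# still linearly decodable from that stage's embeddings.
for name, feats in stages.items():
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          feats, labels, cv=5).mean()
    print(f"{name:>16}: probe accuracy = {acc:.2f}")
```

High probe accuracy at the vision encoder followed by a drop after projection would reproduce the misalignment pattern the paper describes.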

Elementary Visual Understanding vs. Physics Reasoning

The paper further distinguishes between simple visual understanding tasks and more complex intuitive physics reasoning. It finds that models perform significantly better on simpler visual tasks, achieving near-perfect accuracy, which underscores the particular difficulty presented by intuitive physics tasks.

These findings suggest the need for more intricate alignment mechanisms to bridge the gap between vision and language modalities for complex reasoning tasks.

Figure 4: Average accuracy across intuitive physics tests for smaller model sizes, showing performance closer to chance.

Conclusion

In conclusion, while MLLMs have made progress on intuitive physics tasks, substantial challenges remain. The key limitation identified is vision-language alignment: the integration of visual cues into language reasoning needs significant refinement. Future work should focus on enhancing alignment methodologies, potentially drawing on developments in both LLM architectures and vision processing paradigms. Addressing these limitations could substantially improve the efficacy of MLLMs in tasks requiring sophisticated reasoning about the physical world.