Understanding Perspective through MMPerspective Benchmarking
The paper "MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness" systematically evaluates how well multimodal large language models (MLLMs) understand perspective geometry, a fundamental element of human visual perception. It introduces MMPerspective, the first benchmark built to assess MLLMs' competence in perspective understanding along three core dimensions: perspective perception, reasoning, and robustness.
Overview of the Benchmark
MMPerspective comprises 10 carefully designed tasks across these dimensions. Its dataset contains 2,711 real-world and synthetic images paired with 5,083 question-answer pairs that probe abilities such as vanishing point perception, perspective type reasoning, line relationship understanding, and consistency under perspective-preserving transformations. The authors evaluate 43 state-of-the-art MLLMs on the benchmark and find that, while many models handle basic perceptual tasks well, they struggle with compositional reasoning and with maintaining spatial consistency under perturbation.
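To make the evaluation setup concrete, the sketch below shows how a multiple-choice item and an accuracy loop for a benchmark of this kind could be organized. The schema and the query_model callable are illustrative assumptions, not the field names or harness used in the MMPerspective release.

```python
from dataclasses import dataclass

@dataclass
class PerspectiveQA:
    """One multiple-choice item probing a perspective skill (hypothetical schema)."""
    image_path: str      # real-world or synthetic image
    task: str            # e.g. "vanishing_point_perception" or "perspective_type_reasoning"
    question: str
    choices: list[str]   # candidate answers, e.g. ["one-point", "two-point", "three-point"]
    answer: int          # index of the correct choice

def accuracy(items: list[PerspectiveQA], query_model) -> float:
    """Score a model that maps (image_path, question, choices) to a chosen index."""
    correct = sum(
        query_model(item.image_path, item.question, item.choices) == item.answer
        for item in items
    )
    return correct / len(items) if items else 0.0
```

Framing every task as a discrete-choice item keeps scoring exact-match and directly comparable across models of very different sizes.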
Key Findings and Contributions
1. Perspective Perception: MLLMs show varying proficiency in identifying geometric cues such as vanishing points and horizon lines (see the geometry sketch after this list). Although larger models generally perform better, drops on line-detection tasks highlight persistent limitations in basic perspective perception.
2. Perspective Reasoning: This dimension assesses higher-order geometric reasoning, such as determining the perspective type (e.g., one-point vs. two-point) and reasoning about spatial relationships. Performance improves substantially with model size, yet specific reasoning tasks still expose gaps in interpreting complex spatial cues and keeping interpretations of 3D structure consistent.
3. Robustness: Robustness measures whether a model's answers stay consistent when images undergo perspective-preserving manipulations such as flipping or cropping (a minimal consistency-scoring sketch also follows this list). The paper finds that robustness correlates only weakly with vision encoder scaling and depends more on overall model size, suggesting that architectures need stronger built-in spatial priors to sustain stable geometric interpretations.
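To ground the perception and reasoning tasks in items 1 and 2, the following worked example shows the underlying geometry: in homogeneous image coordinates, the line through two points is their cross product, and the images of a family of parallel scene lines converge at a vanishing point given by the cross product of those lines. Counting the finite vanishing points is also what separates one-point from two- or three-point perspective. The coordinates below are invented purely for illustration.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([*p, 1.0], [*q, 1.0])

def intersection(l1, l2):
    """Intersection of two homogeneous lines; None if they meet at infinity."""
    p = np.cross(l1, l2)
    if abs(p[2]) < 1e-9:          # parallel in the image: no finite vanishing point
        return None
    return float(p[0] / p[2]), float(p[1] / p[2])

# Two images of parallel scene edges (e.g. the sides of a corridor), illustrative coordinates:
left_edge  = line_through((100, 600), (300, 400))
right_edge = line_through((700, 600), (500, 400))
print(intersection(left_edge, right_edge))   # (400.0, 300.0), the shared vanishing point
```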
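For the robustness dimension in item 3, one simple way to quantify consistency, sketched here as answer-level agreement rather than the paper's exact metric, is the fraction of perspective-preserving variants on which a model repeats its original answer.

```python
def consistency_score(model_answer, image, perturbations) -> float:
    """Fraction of perturbed views on which the model's answer matches the original.

    `model_answer` maps an image to a discrete answer; `perturbations` is a list of
    perspective-preserving transforms (e.g. horizontal flip, mild crop). Hypothetical API.
    """
    reference = model_answer(image)
    variants = [model_answer(transform(image)) for transform in perturbations]
    return sum(answer == reference for answer in variants) / len(variants)
```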
Implications and Future Prospects
The findings have significant implications for both practical applications and theoretical advances in AI. Practically, understanding and improving MLLMs' geometric reasoning could strengthen applications that demand spatial intelligence, such as autonomous navigation, AR/VR, and architectural design automation. Theoretically, the paper underscores the need for more geometry-aware design in multimodal systems: the observed weaknesses in compositional reasoning and spatial consistency call for architectural innovations that better align vision encoders and language models for holistic spatial understanding.
Furthermore, the paper highlights the benefits of chain-of-thought (CoT) prompting, which consistently enhances reasoning fidelity and robustness across tasks by encouraging stepwise deduction on complex spatial questions.
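As a concrete illustration, the phrasings below contrast a direct question with a stepwise CoT variant for a perspective-type question; the wording is an assumed example, not a prompt taken from the paper.

```python
# Hypothetical prompt phrasings for a perspective-type question (not from the paper).
direct_prompt = (
    "How many finite vanishing points does this scene exhibit? "
    "Answer with exactly one of: one, two, three."
)

cot_prompt = (
    "Identify the dominant families of parallel edges in the scene "
    "(e.g., floor lines, wall edges, rooflines). For each family, decide whether its "
    "image lines converge to a finite point. Count those convergence points, then "
    "answer with exactly one of: one, two, three."
)
```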
In conclusion, the MMPerspective benchmark sets the stage for diagnosing and advancing spatial understanding in vision-language systems. Future research could expand upon this foundation by exploring adaptive evaluation protocols that accommodate model-specific reasoning strategies, diversifying image datasets to cover broader contexts, and integrating open-ended tasks to better capture the full spectrum of spatial reasoning required in dynamic environments.