- The paper introduces All-Angles Bench to assess MLLMs' multi-view understanding using over 2,100 QA pairs from 90 real-world scenes.
- It reveals a substantial performance gap: top MLLMs score roughly 50-60% accuracy versus 82% for humans, with camera pose estimation being an especially weak area.
- The evaluation highlights high inconsistency in paired questions, underscoring the need for domain-specific enhancements in geometric and spatial reasoning.
This paper introduces All-Angles Bench, a new benchmark designed to evaluate the multi-view understanding capabilities of Multimodal LLMs (MLLMs). The authors argue that while MLLMs show promise in high-level reasoning, they often fail at tasks requiring geometric consistency and cross-view correspondence, which are crucial for embodied AI applications like navigation and manipulation. Existing benchmarks primarily focus on single-view or temporal aspects, leaving multi-view understanding largely unevaluated.
All-Angles Bench Construction:
1. The 90 real-world scenes, each captured from multiple camera viewpoints, were manually selected.
2. Initial questions were generated using an MLLM (GPT-4o).
3. Human annotators reviewed, refined, and validated questions and answers for clarity, correctness, and relevance.
4. Paired questions were created by rephrasing or altering view perspectives while preserving the core visual correspondence, designed to test model robustness and consistency. A final human check ensured quality. 85.3% of questions (excluding counting) have paired counterparts.
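To make the pairing step concrete, here is a minimal sketch of how it could be scripted, assuming a generic `generate` callable that wraps an MLLM API; the data fields, helper names, and prompt wording are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch of the QA pairing step; `generate` is a placeholder
# for an MLLM API call, and the field names are assumptions rather than
# the benchmark's released code.
from dataclasses import dataclass

@dataclass
class QAItem:
    scene_id: str
    views: list[str]        # paths to the multi-view images
    question: str
    choices: list[str]
    answer: str             # letter of the correct choice
    category: str           # e.g. "relative_direction", "camera_pose"

REPHRASE_PROMPT = (
    "Rewrite the following multiple-choice question so that it tests the "
    "same cross-view correspondence from a different viewpoint or wording, "
    "keeping the correct answer unchanged:\n\n{question}"
)

def make_paired_question(item: QAItem, generate) -> QAItem:
    """Create the paired variant of a question for consistency testing."""
    rephrased = generate(REPHRASE_PROMPT.format(question=item.question))
    return QAItem(
        scene_id=item.scene_id,
        views=item.views,
        question=rephrased,
        choices=item.choices,
        answer=item.answer,   # the underlying visual correspondence is preserved
        category=item.category,
    )

# Paired items would then go through the same human review pass described above.
```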
Experiments and Findings:
- Evaluation: 27 MLLMs (including GPT-4o, Gemini-2.0-Flash, Claude-3.7-Sonnet, InternVL2.5-38B, Qwen2.5-VL-72B, Ovis2-34B) were evaluated. Human performance was also measured on a subset of 250 questions.
- Performance Gap: A substantial gap exists between MLLM performance and human-level understanding (Humans: 82.0% avg accuracy vs. top MLLMs around 50-60%). MLLMs perform particularly poorly on tasks like Camera Pose Estimation, often worse than random guessing.
- Open vs. Closed Source: Some open-source models (e.g., Ovis2-34B, Qwen2.5-VL-72B) outperformed top closed-source models on orientation-sensitive tasks (Relative Direction, Manipulation), possibly due to specialized video-focused training.
- Paired Question Inconsistency: Analysis of paired questions revealed high inconsistency rates, i.e. cases where a model answers one version correctly but fails the rephrased counterpart (a sketch of this metric follows this list). GPT-4o showed roughly 70% inconsistency on Relative Distance, and all models exceeded 40% inconsistency on Relative Direction. This suggests models often guess correctly rather than genuinely understanding the scene.
- Failure Analysis:
- Cross-View Correspondence: MLLMs struggle to identify the same object across views, especially with partial occlusion. In counting tasks, they sometimes defaulted to reporting the maximum count from a single view instead of reconciling individuals across views.
- Coarse Camera Estimation: Models fail to accurately estimate relative camera poses, hindering performance on tasks requiring spatial reasoning like relative direction and manipulation. Visualization prompts showed models reconstructing single views moderately well but failing to align multiple perspectives correctly.
- Reasoning Injection (CoT): Chain-of-Thought prompting variants (Zero-Shot CoT, Self-Consistency, and a proposed Identification CoT; a generic self-consistency sketch also follows this list) showed only limited and inconsistent improvements, particularly for models that were already somewhat proficient. This suggests that linguistic reasoning strategies alone are insufficient and that domain-specific architectural or training-data enhancements are needed.
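To make the paired-question inconsistency metric concrete, the following is a minimal sketch of one reasonable way to compute it from per-question results; the `pair_id`/`correct` record layout is an assumption for illustration, not the paper's released evaluation code.

```python
from collections import defaultdict

def inconsistency_rate(results):
    """Fraction of question pairs where the model gets exactly one of the
    two variants right (right on one phrasing, wrong on the rephrased pair).

    `results` is an iterable of dicts with assumed keys:
        pair_id: identifier shared by the two paired questions
        correct: bool, whether the model answered this variant correctly
    """
    by_pair = defaultdict(list)
    for r in results:
        by_pair[r["pair_id"]].append(r["correct"])

    pairs = [v for v in by_pair.values() if len(v) == 2]
    if not pairs:
        return 0.0
    inconsistent = sum(1 for a, b in pairs if a != b)
    return inconsistent / len(pairs)

# Example: two pairs, one answered consistently, one not -> 0.5
demo = [
    {"pair_id": "p1", "correct": True},  {"pair_id": "p1", "correct": False},
    {"pair_id": "p2", "correct": True},  {"pair_id": "p2", "correct": True},
]
print(inconsistency_rate(demo))  # 0.5
```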
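For reference, the self-consistency variant of CoT prompting generally samples several reasoning chains and majority-votes the final answer. Below is a generic sketch under that assumption; `query_model` and the option-letter extraction are placeholders, not the paper's Identification CoT implementation.

```python
import re
from collections import Counter

COT_SUFFIX = (
    "\nLet's think step by step, and finish with the final answer "
    "as a single option letter."
)

def self_consistency_answer(question, query_model, n_samples=5):
    """Sample several chain-of-thought completions and majority-vote the answer.

    `query_model(prompt)` is a placeholder for whichever MLLM API is being
    evaluated; it should return the model's text response. The option-letter
    extraction below is a simplistic heuristic for illustration only.
    """
    votes = []
    for _ in range(n_samples):
        response = query_model(question + COT_SUFFIX)
        letters = re.findall(r"\b[A-D]\b", response)
        if letters:
            votes.append(letters[-1])   # take the last option letter mentioned
    return Counter(votes).most_common(1)[0][0] if votes else None
```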
Conclusion:
The paper concludes that current MLLMs lack robust multi-view understanding. The All-Angles Bench effectively highlights these deficiencies, particularly in cross-view correspondence and camera pose estimation. The authors emphasize the need for domain-specific refinements or modules incorporating stronger multi-view awareness to achieve human-level performance in complex spatial reasoning tasks. The benchmark is publicly available.