Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs (2504.15280v2)

Published 21 Apr 2025 in cs.CV and cs.CL

Abstract: Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal LLMs (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs particularly underperform in two respects: (1) cross-view correspondence for partially occluded views and (2) establishing coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.

Summary

  • The paper introduces All-Angles Bench to assess MLLMs' multi-view understanding using over 2,100 QA pairs from 90 real-world scenes.
  • It demonstrates a significant performance gap, with top MLLMs scoring 50-60% accuracy compared to humans’ 82%, especially in camera pose estimation.
  • The evaluation highlights high inconsistency in paired questions, underscoring the need for domain-specific enhancements in geometric and spatial reasoning.

This paper introduces All-Angles Bench, a new benchmark designed to evaluate the multi-view understanding capabilities of Multimodal LLMs (MLLMs). The authors argue that while MLLMs show promise in high-level reasoning, they often fail at tasks requiring geometric consistency and cross-view correspondence, which are crucial for embodied AI applications like navigation and manipulation. Existing benchmarks primarily focus on single-view or temporal aspects, leaving multi-view understanding largely unevaluated.

All-Angles Bench Construction:

  • Data: The benchmark consists of over 2,100 question-answer pairs derived from 90 diverse real-world scenes sourced from the Ego-Exo4D (2403.18814) and EgoHumans (2309.16609) datasets. Each scene includes footage from at least three viewpoints.
  • Tasks: Six task categories are designed to test different facets of multi-view reasoning:

    1. Counting: Enumerating objects across views without errors due to occlusion or double-counting.
    2. Attribute Identification: Recognizing object properties consistently across different perspectives.
    3. Relative Distance: Estimating object distances from multiple viewpoints.
    4. Relative Direction: Understanding directional relationships between objects across views.
    5. Object Manipulation: Inferring changes in object states (position, orientation) across views.
    6. Camera Pose Estimation: Estimating viewpoint arrangements or scene layouts.
  • Format: Questions are multiple-choice with three options, exactly one of which is correct.

  • Annotation Pipeline:

    1. Scenes were manually selected.
    2. Initial questions were generated using an MLLM (GPT-4o).
    3. Human annotators reviewed, refined, and validated questions and answers for clarity, correctness, and relevance.
    4. Paired questions were created by rephrasing or altering view perspectives while preserving the core visual correspondence, in order to test model robustness and consistency.

  A final human check ensured quality; 85.3% of questions (excluding counting) have paired counterparts.
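
To make the question format concrete, below is a minimal Python sketch of how one benchmark item and its multiple-choice scoring could be represented. The field names (`scene_id`, `view_paths`, `answer_idx`, `pair_id`) are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MultiViewQA:
    """Hypothetical container for one All-Angles-style item (field names are illustrative)."""
    scene_id: str               # scene identifier (e.g., from Ego-Exo4D or EgoHumans)
    view_paths: List[str]       # paths to at least three synchronized viewpoint images
    task: str                   # counting, attribute, distance, direction, manipulation, or camera_pose
    question: str
    options: List[str]          # exactly three answer choices
    answer_idx: int             # index of the single correct option
    pair_id: Optional[str] = None  # links a question to its rephrased counterpart, if any

def accuracy(items: List[MultiViewQA], predictions: List[int]) -> float:
    """Multiple-choice accuracy: fraction of items whose predicted option index is correct."""
    if not items:
        return 0.0
    correct = sum(pred == item.answer_idx for item, pred in zip(items, predictions))
    return correct / len(items)
```

With three answer options, a random-guessing baseline sits around 33% accuracy, which is the reference point for the finding below that some models score below chance on camera pose estimation.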

Experiments and Findings:

  • Evaluation: 27 MLLMs (including GPT-4o, Gemini-2.0-Flash, Claude-3.7-Sonnet, InternVL2.5-38B, Qwen2.5-VL-72B, Ovis2-34B) were evaluated. Human performance was also measured on a subset of 250 questions.
  • Performance Gap: A substantial gap exists between MLLM performance and human-level understanding (Humans: 82.0% avg accuracy vs. top MLLMs around 50-60%). MLLMs perform particularly poorly on tasks like Camera Pose Estimation, often worse than random guessing.
  • Open vs. Closed Source: Some open-source models (e.g., Ovis2-34B, Qwen2.5-VL-72B) outperformed top closed-source models on orientation-sensitive tasks (Relative Direction, Manipulation), possibly due to specialized video-focused training.
  • Paired Question Inconsistency: Analysis of paired questions revealed high inconsistency rates, i.e., a model answers one version correctly but fails the rephrased counterpart (a sketch of this metric follows this list). GPT-4o showed roughly 70% inconsistency on Relative Distance, and all models struggled with Relative Direction consistency (>40% inconsistency). This suggests models often guess correctly rather than demonstrating genuine multi-view understanding.
  • Failure Analysis:
    • Cross-View Correspondence: MLLMs struggle to identify the same object across views, especially with partial occlusion. In counting tasks, they sometimes defaulted to reporting the maximum count from a single view instead of reconciling individuals across views.
    • Coarse Camera Estimation: Models fail to accurately estimate relative camera poses, hindering performance on tasks requiring spatial reasoning like relative direction and manipulation. Visualization prompts showed models reconstructing single views moderately well but failing to align multiple perspectives correctly.
    • Reasoning Injection (CoT): Chain-of-Thought (CoT) prompting (Zero-Shot, Self-Consistency, and a proposed Identification CoT) showed only limited and inconsistent improvements, especially for models already somewhat proficient. This suggests that linguistic reasoning strategies alone are insufficient and domain-specific architectural or training data enhancements are needed.
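
One plausible way to compute the paired-question inconsistency rate referenced above is sketched below: group answers by pair, keep only complete pairs, and count the fraction for which exactly one of the two phrasings is answered correctly. This is an assumed reading of the metric, not the authors' released evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def inconsistency_rate(results: List[Tuple[str, bool]]) -> float:
    """
    results: (pair_id, is_correct) for every paired question a model answered.
    A pair counts as inconsistent if exactly one of its two versions is answered correctly.
    Returns the fraction of complete pairs that are inconsistent.
    """
    by_pair: Dict[str, List[bool]] = defaultdict(list)
    for pair_id, is_correct in results:
        by_pair[pair_id].append(is_correct)

    complete_pairs = [v for v in by_pair.values() if len(v) == 2]
    if not complete_pairs:
        return 0.0
    inconsistent = sum(1 for first, second in complete_pairs if first != second)
    return inconsistent / len(complete_pairs)

# Toy example: one consistent pair (both correct), one inconsistent pair -> rate of 0.5.
print(inconsistency_rate([("p1", True), ("p1", True), ("p2", True), ("p2", False)]))
```

Under this definition, GPT-4o's roughly 70% inconsistency on Relative Distance means that for most paired items it answers only one of the two phrasings correctly.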

Conclusion:

The paper concludes that current MLLMs lack robust multi-view understanding. The All-Angles Bench effectively highlights these deficiencies, particularly in cross-view correspondence and camera pose estimation. The authors emphasize the need for domain-specific refinements or modules incorporating stronger multi-view awareness to achieve human-level performance in complex spatial reasoning tasks. The benchmark is publicly available.
