Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation (2504.17207v1)

Published 24 Apr 2025 in cs.CV

Abstract: We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking, the ability to perceive an environment or situation from an alternative viewpoint, is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning, named Abstract Perspective Change (APC), that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, compared with various VLMs, demonstrate significant improvements in perspective-aware reasoning with our framework, further outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.

Summary

  • The paper introduces APC, a framework that simulates mental imagery to enable VLMs to reason from arbitrary perspectives.
  • APC employs scene abstraction by extracting 3D positions and orientations using multiple vision models before reorienting the query.
  • Evaluations on COMFORT++ and 3DSRBench show APC outperforms baselines, achieving up to 89.67% accuracy on complex spatial tasks.

This paper addresses the significant challenge of enabling Vision-Language Models (VLMs) to perform perspective-aware spatial reasoning. Existing VLMs exhibit a strong bias towards the egocentric viewpoint of the camera, struggling to answer questions posed from alternative perspectives (allocentric reasoning). This limitation hinders their potential for applications requiring understanding environments from different viewpoints, such as autonomous navigation, robotics, and human-agent collaboration.

Inspired by the human cognitive process of mental imagery, where we form abstract internal representations of scenes to facilitate perspective shifts, the authors propose a framework called Abstract Perspective Change (APC). APC empowers VLMs to adopt arbitrary perspectives by simulating this mental imagery process and transforming the scene representation before querying the VLM.

The APC framework consists of three main stages:

  1. Scene Abstraction: Given an input image and a spatial reasoning question, APC first identifies the objects of interest relevant to the question using the VLM itself. Then, it leverages off-the-shelf vision foundation models to extract the 3D position and orientation for each identified object, including the camera. Specifically, it uses GroundingDINO [liu2024grounding] for 2D bounding boxes, SAM [kirillov2023segment] for segmentation masks, DepthPro [bochkovskii2024depth] for metric depth, and OrientAnything [wang2024orient] for object orientation. The output is a coarse 3D abstraction of the scene represented as a set of objects, each with a description, 3D position, and orientation (t_i, c_i, p_i) in the camera's egocentric coordinate system S_E.
  2. Perspective Change: The VLM is used to determine the reference object or entity from whose perspective the question is asked. APC then performs a coordinate transformation on the scene abstraction S_E to align it with the reference viewer's egocentric coordinate system. In this new system S_A, the reference viewer is placed at the origin, and its forward direction is aligned with the z-axis (see the sketch after this list). This transforms the original allocentric problem into an egocentric one from the reference viewer's viewpoint.
  3. Perspective Prompting: The transformed scene abstraction S_A is used to generate a new prompt for the VLM. Two methods are explored for representing this information:
    • Numerical (Textual) Prompt: The 3D coordinates c'_i and orientations p'_i of the objects in S_A are directly included in a structured text prompt along with the perspective-agnostic reformulation of the original question.
    • Visual Prompt: An abstract visualization of the transformed scene abstraction S_A is rendered. Each object is represented by a colored cube placed at its 3D position in S_A. The scene is rendered from the origin (the reference viewer's position) looking down the z-axis (the reference viewer's direction). Objects behind the viewer (negative z) are not rendered. A mapping between the original object names and their assigned cube colors is also provided. The VLM is then given the rendered image and a reformulated question where object names are replaced by their corresponding cube colors (e.g., "red box").
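
To make stages 2 and 3 concrete, here is a minimal sketch of the perspective change and the numerical prompt. It is an illustration under stated assumptions, not the authors' implementation: the `AbstractObject` record, the helpers `rotation_to_z`, `change_perspective`, and `numerical_prompt`, the `reference_name` argument, and the axis convention (x right, y up, z forward, positions in meters) are all illustrative choices, and the rotation only aligns the viewer's forward direction with +z, leaving the roll unconstrained.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class AbstractObject:
    name: str              # t_i: short object description, e.g. "wooden chair"
    position: np.ndarray   # c_i: 3D position in the camera frame S_E
    forward: np.ndarray    # p_i: unit facing direction in S_E

def rotation_to_z(forward: np.ndarray) -> np.ndarray:
    """Rotation that maps `forward` onto the +z axis (Rodrigues' rotation formula)."""
    f = forward / np.linalg.norm(forward)
    z = np.array([0.0, 0.0, 1.0])
    v, c = np.cross(f, z), float(np.dot(f, z))
    if np.isclose(c, 1.0):    # already looking along +z
        return np.eye(3)
    if np.isclose(c, -1.0):   # looking along -z: rotate 180 degrees about x
        return np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def change_perspective(objects, reference_name):
    """Re-express the scene abstraction S_E in the reference viewer's frame S_A."""
    ref = next(o for o in objects if o.name == reference_name)
    R = rotation_to_z(ref.forward)
    transformed = []
    for o in objects:
        if o.name == reference_name:
            continue  # the reference viewer sits at the origin of S_A
        transformed.append(AbstractObject(
            name=o.name,
            position=R @ (o.position - ref.position),  # c'_i
            forward=R @ o.forward))                     # p'_i
    return transformed

def numerical_prompt(objects, question):
    """Format S_A as a structured textual prompt (the numerical variant of stage 3)."""
    lines = ["You are at the origin, looking along the +z axis.",
             "Objects (x right, y up, z forward), positions in meters:"]
    for o in objects:
        x, y, z = (round(float(v), 2) for v in o.position)
        fwd = [round(float(v), 2) for v in o.forward]
        lines.append(f"- {o.name}: position=({x}, {y}, {z}), facing={fwd}")
    lines.append(question)
    return "\n".join(lines)
```

Once the scene is re-expressed in S_A this way, the original allocentric question can be asked as a plain egocentric one against the listed coordinates.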

The VLM used as the backbone in the experiments is Qwen2.5-VL [Qwen2.5-VL]. Additional implementation details include VLM-based refinement for initial object detections from GroundingDINO, outlier filtering for 3D point clouds obtained from depth maps, egocentric rephrasing of the question using the VLM, and specific rendering/normalization steps for the visual prompt.
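
Two of these implementation details lend themselves to a brief sketch. The functions below are illustrative stand-ins rather than the paper's code: `filter_point_cloud` assumes a simple median-plus-MAD rule for rejecting outliers in an object's back-projected point cloud (the exact rule is not specified here), and `render_visual_prompt` reuses the `AbstractObject` records from the previous sketch, projects each c'_i with an assumed pinhole camera placed at the origin of S_A looking along +z, skips objects behind the viewer, and stands in flat colored square markers for the paper's colored cubes, returning the name-to-color mapping used to rewrite the question.

```python
import numpy as np
import matplotlib.pyplot as plt

def filter_point_cloud(points: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Drop outliers from an object's (N, 3) point cloud with a median + MAD
    distance rule, then return a robust 3D center (one candidate for c_i)."""
    center = np.median(points, axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    mad = np.median(np.abs(dist - np.median(dist))) + 1e-6
    keep = dist <= np.median(dist) + k * mad
    return np.median(points[keep], axis=0)

def render_visual_prompt(objects, out_path="abstract_view.png", focal=1.0,
                         palette=("red", "green", "blue", "orange", "purple")):
    """Render a minimal abstract view of S_A seen from the origin along +z.
    Returns the {object name: color} mapping used to rephrase the question."""
    fig, ax = plt.subplots(figsize=(4, 4))
    color_map = {}
    for obj, color in zip(objects, palette):
        x, y, z = obj.position
        if z <= 0:
            continue  # behind the reference viewer: not rendered
        u, v = focal * x / z, focal * y / z   # pinhole projection onto the image plane
        ax.scatter(u, v, s=2000.0 / z, marker="s", color=color)  # farther => smaller
        color_map[obj.name] = color
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect("equal")
    ax.axis("off")
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return color_map
```

The reformulated question then refers to objects by the returned colors (e.g., "Is the red box to the left of the blue box?") and is given to the VLM together with the rendered image.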

The authors evaluate APC on two benchmarks: COMFORT++ [zhang2024vision] and 3DSRBench [ma20243dsrbench]. These benchmarks include spatial reasoning tasks like determining left/right relationships, closeness, visibility, and facing direction, all posed from a perspective different from the camera. Baselines include various pure VLMs (LLaVA-NeXT, LLaVA-OneVision, Molmo, Qwen2.5-VL, GPT-4o, Gemini-2.0-Flash), grounded VLMs (SpatialVLM, SpatialRGPT, SpatialPIN), and dense reconstruction-based methods (SpatialPIN*, ViewCrafter).

The results demonstrate that APC significantly outperforms all baselines on perspective-aware tasks across both benchmarks. For example, on COMFORT++'s left/right task, most baselines score near chance (around 50%), while APC-Vis achieves 89.67% and APC-Num 88.67%. On 3DSRBench's left/right task with real images, APC-Vis and APC-Num score 72.78% and 71.92% respectively, whereas baseline performance often falls below 50%. The visual prompt generally outperforms the numerical prompt by a small margin, especially on the visibility and facing tasks; the authors attribute this to the VLM making fewer logical errors when reasoning over a rendered image than over raw numerical coordinates.

Further analysis of accuracy versus the angular offset between the camera and reference viewpoint on COMFORT++ confirms APC's strong perspective awareness. While baselines show a clear performance degradation as the angular difference increases (moving from egocentric to allocentric), APC maintains consistently high accuracy across all angles.

The paper highlights that abstracting the scene into a minimal 3D representation is more effective and efficient for perspective-aware reasoning than relying on dense 3D reconstruction and novel view synthesis, which often results in noisy and inaccurate renderings that confuse the VLM.

A limitation of the current APC framework is the reliance on multiple external vision foundation models, which increases memory usage and inference time compared to running the VLM alone. Future work could explore richer scene abstractions, such as using 3D bounding boxes or coarse semantic 3D reconstructions.

In summary, APC introduces a novel and effective framework for endowing VLMs with perspective-aware reasoning capabilities by simulating mental imagery through scene abstraction and perspective-transformed prompting, showcasing significant improvements over existing methods on complex spatial tasks.