LRR-Bench: Spatial Reasoning Benchmark

Updated 3 July 2026

LRR-Bench is a synthetic benchmark designed to assess vision-language models on both absolute 2D and complex 3D spatial understanding.
It uses controlled synthetic imagery and multi-frame tasks to keep contamination low and evaluations repeatable, targeting applications such as robotics and autonomous driving.
The benchmark exposes model shortcomings by rigorously testing object position, motion, rotation, and depth to highlight gaps versus human-level performance.

LRR-Bench is a synthetic benchmark designed to rigorously evaluate vision-language models (VLMs) on absolute and 3D spatial understanding, covering both static and dynamic spatial relations relevant to robotics, autonomous driving, and embodied perception. The benchmark introduces a low-contamination, systematically generated suite of tasks probing model performance on object position, movement, rotation, and depth, structured to expose failures and quantify progress toward robust, human-level spatial reasoning (Kong et al., 27 Jul 2025).

1. Motivation and Scope

Spatial understanding as addressed by LRR-Bench refers to a model’s capacity to accurately perceive and deduce absolute and relative positions, orientations, and trajectories of objects and cameras within single images and sequences. This competence is foundational for safety-critical applications in robotics, navigation, and manipulation, yet existing VLM benchmarks predominantly assess only basic 2D, pairwise positional queries with natural images. Major gaps identified include the absence of rigorous testing for motion and 3D rotation, limited evaluation of multi-frame and composite image scenarios, and the risk of test set contamination in natural-image-based datasets.

LRR-Bench was developed to close these gaps by providing:

Comprehensive coverage of both absolute (2D, multi-frame) and 3D (motion, rotation) spatial tasks.
Synthetic imagery and controlled environments, ensuring low cost, repeatability, and no overlap with VLM pretraining datasets.
A suite of tasks tailored to stress-test both the perception and reasoning components of modern large VLMs.

2. Task Taxonomy

LRR-Bench comprises nine discrete tasks, stratified into two categories: absolute spatial understanding and 3D spatial understanding. Each task is instantiated with 200 synthetic samples (100 positive, 100 negative), ensuring balanced evaluation.

Category	Task Name	Description
Absolute Spatial (2D)	Position (Pos.)	Is object X at position P in a single image?
	Position Combination (Pos. C.)	Which corner does object X occupy within multiple grid cells?
	Position Sequence (Pos. S.)	Object X's position in each frame of an image sequence
3D Spatial Understanding	Depth (Dep.)	Is object A in front of object B in a single image?
	Camera Rotation (Ca. R.)	Consistency and directionality of camera rotations across frames
	Camera Movement (Ca. M.)	Tracking camera movement directions over image sequences
	Object Heading Direction (Obj. H. D.)	Direction of object’s “head” (e.g., sheep in Minecraft)
	Object Movement Direction (Obj. M. D.)	Whether object movement aligns with heading across frames
	Object Movement (Obj. M.)	Detecting absolute object displacement across scenes

Absolute spatial tasks probe localization and compositional spatial inference in both isolated and multi-image layouts, while 3D tasks involve dynamic reasoning over camera or object pose and motion, exploiting synthetic Minecraft environments for controlled manipulation.

3. Dataset Construction and Synthetic Pipeline

To ensure both scalability and contamination-free evaluation, LRR-Bench employs the following dataset construction methodology:

Absolute tasks (2D): Synthetic natural images are generated via the Flux.1-S diffusion model prompted for specific objects and locations. GroundingDINO provides zero-shot object detection and bounding boxes to ensure positional ground-truth and task prompt compliance. For Position Combination, composite images are constructed with random placement within a 3×3 grid.
Depth: Segmentation with the Segment Anything Model (SAM) and depth estimation with Depth-Anything-V2 yield ground-truth depth ordering for binary queries.
3D and motion tasks: Minecraft serves as a controlled simulation environment, with API-level access to camera and object parameters. Controlled camera rotation (by angle θ), translation, and object heading or movement sequences are rendered as image sequences. Occlusion is algorithmically filtered. Each object-centric task (heading, movement) is available in both “Clear” (uncluttered) and standard scene contexts.
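The positional ground truth described above can be illustrated by mapping a detector's bounding box to a cell of a 3×3 grid. This is a minimal sketch, not the benchmark's actual pipeline: the function names `grid_cell` and `matches_prompt`, the `(x_min, y_min, x_max, y_max)` pixel box format, and the choice of the box center as the reference point are all assumptions made for illustration.

```python
def grid_cell(box, width, height, rows=3, cols=3):
    """Return (row, col) of the grid cell containing the box center.

    box is (x_min, y_min, x_max, y_max) in pixels; this format is an
    assumption, not something the benchmark specifies.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so a center lying on the right/bottom edge stays in the last cell.
    col = min(int(cx * cols / width), cols - 1)
    row = min(int(cy * rows / height), rows - 1)
    return row, col


def matches_prompt(box, width, height, expected_cell):
    """Hypothetical check: does the detected box agree with the prompted cell?"""
    return grid_cell(box, width, height) == expected_cell
```

For example, `grid_cell((400, 0, 500, 100), 900, 900)` returns `(0, 1)`, the top-center cell. A generator built along these lines could discard any image whose detected box disagrees with the prompted location.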

Sample generation is automated, providing deterministic partitioning, total control over confounding factors, and protection against inadvertent VLM memorization or leakage from pretraining data.

4. Evaluation Protocols and Metrics

For each of the nine tasks, model accuracy $p_i$ (in percent) is assessed over the curated 200-sample test set. The task-wise score $s_i$ is computed as:

$s_i = 2 \cdot (p_i - 50) \cdot 1[p_i \geq 50]$

where $1[\cdot]$ is the indicator function, so any task with accuracy below the 50% chance level scores zero. The overall model score is $S = \sum_{i} s_i$, with a nominal maximum of 900 (weighted to 1050 in the original paper). This design gives no credit for chance-level or sub-chance guessing and rewards only above-baseline performance.
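The scoring rule can be sketched in a few lines of Python. The function names `task_score` and `overall_score` are illustrative, and the accuracies in the example are made-up values rather than reported results. The sketch computes only the unweighted sum; the weighting that yields 1050 is not specified here, so it is not reproduced.

```python
def task_score(accuracy_pct):
    """s_i = 2 * (p_i - 50) * 1[p_i >= 50], with p_i given in percent."""
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy must be a percentage in [0, 100]")
    return 2.0 * (accuracy_pct - 50.0) if accuracy_pct >= 50.0 else 0.0


def overall_score(accuracies_pct):
    """Unweighted sum S = sum_i s_i over the per-task accuracies."""
    return sum(task_score(p) for p in accuracies_pct)


# Hypothetical accuracies for nine tasks (illustrative values, not results).
example = [90.0, 55.0, 50.0, 70.0, 45.0, 48.0, 60.0, 52.0, 50.0]
print(overall_score(example))  # 154.0
```

Perfect accuracy on all nine tasks gives `overall_score([100.0] * 9) == 900.0`, matching the nominal maximum, while any task at or below 50% contributes nothing.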

Two model prompting protocols are implemented:

Direct prompting: Models answer binary queries with a “Yes”/“No”.
Chain-of-Thought prompting: Models are required to generate a stepwise spatial reasoning process before issuing an answer.
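Both protocols end in a binary answer that has to be read out of free-form text. The following is a minimal sketch of one way to do that, assuming the last standalone "yes" or "no" in the response is the final answer; this extraction rule and the names `parse_yes_no` and `accuracy_pct` are assumptions for illustration, not the procedure used by the benchmark.

```python
import re


def parse_yes_no(response):
    """Return 'yes', 'no', or None from a free-form model response.

    Assumption: the last standalone yes/no token is the final answer,
    so a chain-of-thought response is read from its conclusion.
    """
    tokens = re.findall(r"\b(yes|no)\b", response.lower())
    return tokens[-1] if tokens else None


def accuracy_pct(responses, labels):
    """Percent of responses whose parsed answer equals the label.

    Unparseable responses (None) count as incorrect.
    """
    correct = sum(parse_yes_no(r) == y for r, y in zip(responses, labels))
    return 100.0 * correct / len(labels)
```

The word-boundary pattern avoids false matches inside words such as "not" or "cannot". A known limitation of the last-token rule is a conclusion like "Yes, because there is no occlusion", which would be read as "no"; a stricter answer format in the prompt would be needed to avoid that.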

Human baseline is computed from 10 volunteers (40 random samples per task), demonstrating near-perfect performance (90–100% accuracy) across all tasks, furnishing an empirical ceiling for automated models.

5. Experimental Analysis and Error Modes

Benchmarking covers 20+ state-of-the-art VLMs, including GPT-4o (mini/full), Qwen-VL2, InternVL2.5, Ovis, LLaVA, and SpaceOM variants, with larger models ($>$40B) quantized for tractability. No external tooling (e.g., off-board depth estimation) is permitted; all spatial reasoning is end-to-end.

Key findings:

Aggregate performance: Human $S \approx 1050$ (ceiling); the best-performing model, GPT-4o with stepwise reasoning, achieves $S \approx 272.5$ ($\sim 25\%$ of human).
2D tasks: Position (Pos.) allows near-human performance for top models ($p \approx 90\%$). The more compositional and sequence-based tasks (Pos. C., Pos. S.) exhibit sharp drops in $s_i$ relative to the human $s_i$.
Depth prediction: The best models' $s_i$ indicates reasonable single-frame depth sense but remains below the human $s_i$.
Complex 3D tasks:
- Camera Movement (Ca. M.): Nearly all VLMs perform at near-chance or sub-chance levels, with the highest $s_i$ observed for GPT-4o.
- Camera Rotation (Ca. R.): The best $s_i$ comes from InternVL2.5-72B with direct prompting and drops to near zero with reasoning; it remains below the human $s_i$.
- Object Heading/Movement Direction: The best $s_i$ values remain limited, with frequent hallucination of directions outside the discrete label set.
- Object Movement (Obj. M.): The best model $s_i$ remains below the human $s_i$.

Error analyses reveal widespread overreliance on language priors (defaulting to “Yes”), failure to distinguish discrete geometric states, and a tendency to introduce spurious details in chain-of-thought completions. Model size, chain-of-thought prompting, and 3D data finetuning (as in LLaVA-3D, SpaceQwen25, and SpaceOM) do not guarantee improvements and can introduce regressions or new failure modes.
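The "Yes" default interacts directly with the balanced test sets: a model that always answers "Yes" on 100 positive and 100 negative samples lands at exactly 50% accuracy and therefore scores zero. The sketch below makes that concrete; the function name `yes_bias_report` and the returned dictionary keys are hypothetical, and the predictions are synthetic.

```python
def yes_bias_report(predictions, labels):
    """Summarize accuracy, task score, and the fraction of 'yes' answers.

    predictions and labels are equal-length lists of 'yes'/'no' strings.
    """
    n = len(labels)
    yes_rate = sum(p == "yes" for p in predictions) / n
    accuracy = 100.0 * sum(p == y for p, y in zip(predictions, labels)) / n
    # Same rule as the task-wise score: zero at or below chance.
    score = 2.0 * (accuracy - 50.0) if accuracy >= 50.0 else 0.0
    return {"yes_rate": yes_rate, "accuracy_pct": accuracy, "score": score}


# A synthetic model that always answers "yes" on a balanced 200-sample task.
labels = ["yes"] * 100 + ["no"] * 100
report = yes_bias_report(["yes"] * 200, labels)
print(report)  # {'yes_rate': 1.0, 'accuracy_pct': 50.0, 'score': 0.0}
```

A yes rate far from 0.5 on a balanced task is a quick signal that a low score comes from answer bias rather than from near-random perception.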

6. Significance and Impact

LRR-Bench demonstrates stark limitations in current VLMs’ spatial reasoning:

Human-level performance is achieved only on the simplest 2D, single-image localization tasks.
Even high-parameter models and advanced prompting strategies fail systematically on composite, sequential, or 3D spatial tasks, indicating that they have not learned to comprehend geometric and causal structure.
Performance does not scale reliably with model size; preference optimization (MPO), chain-of-thought prompting, and 3D-oriented finetuning provide inconsistent benefits and can even hurt performance on certain tasks.

This strongly suggests that present VLM architectures lack inductive biases or priors for multi-frame geometric perception, cannot robustly propagate spatial state over time, and default to textual heuristics absent actual scene parsing.

7. Future Research Directions

A set of prospective avenues for overcoming these deficiencies is outlined:

Architectural innovations: Explicit integration of 3D scene representations or multi-view geometric reasoning.
Training objectives: Emphasizing spatio-temporal consistency, motion cues, and scene continuity rather than isolated frame understanding.
Hybrid paradigms: Neuro-symbolic models enforcing geometric or physical constraints (e.g., rigid-body transformations) alongside neural representations.
Benchmark expansion: Incorporating blended synthetic and real-world data (e.g., from driving or robotics) to bridge simulation-to-reality gaps and further stress-test VLMs in settings aligned with real-world demands.

LRR-Bench provides a low-cost, easily extensible experimental platform enabling methodical tracking of progress in spatial reasoning and constitutes a cornerstone resource for VLM advancement in embodied AI and beyond (Kong et al., 27 Jul 2025).

References (1)

LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks (2025)