
VRRPI-Bench: 3D Pose Benchmark for VLMs

Updated 2 February 2026
  • VRRPI-Bench is a benchmark assessing vision-language models' spatial reasoning by estimating 3D relative camera poses using paired image comparisons.
  • It leverages egocentric RGB-D datasets and angular bins to rigorously test discrete classification of translation and rotation directions.
  • The integrated diagnostic suite, VRRPI-Diag, isolates failure modes along individual degrees of freedom, highlighting challenges in depth, roll, and projective geometry.

VRRPI-Bench is a benchmark specifically designed to evaluate the spatial reasoning capabilities of vision-language models (VLMs) in the context of relative camera pose estimation (RCPE), a fundamental computer vision task that requires inferring the 3D translation and rotation between two camera viewpoints observing the same object. Unlike prior vision-language benchmarks, VRRPI-Bench emphasizes realistic scenarios where both translation and rotation occur simultaneously around a shared object, directly reflecting challenges encountered in embodied perception and robotics. The benchmark further provides a diagnostic suite, VRRPI-Diag, to isolate failure modes along individual degrees of freedom. Through rigorous comparisons against geometric baselines, human annotators, and leading VLMs, VRRPI-Bench reveals significant deficiencies in 3D multi-view reasoning within current state-of-the-art VLMs (Deng et al., 29 Jan 2026).

1. Dataset Construction and Annotation

VRRPI-Bench is curated from publicly available egocentric RGB-D video datasets: 7 Scenes (Shotton et al. 2013; 43K frames), ScanNet (Dai et al. 2017; 2.5M frames across 1,500 scenes), and ScanNet++ (Liu et al. 2023; 460 scenes, RGB only, used for single-DoF analysis). The benchmark samples frame pairs $(I_i, I_j)$ that view a central object with varied camera baselines, using a sliding window to scan video streams. Candidate pairs are retained if the reprojection error $\bar d$ falls below a task-specific threshold $d^+$ and if their angular deviation

$$\tau = \arccos\left( \frac{(\mathbf{o}_i - \mathbf{p}_w)\cdot(\mathbf{o}_j - \mathbf{p}_w)}{\|\mathbf{o}_i - \mathbf{p}_w\|\,\|\mathbf{o}_j - \mathbf{p}_w\|} \right)$$

falls within one of four bins centered at 15°, 30°, 45°, or 60°. Each bin is balanced, with up to 100 frame pairs per bin per dataset. The final benchmark comprises, for 7 Scenes: 100/100/55/42 pairs at ∼15°, ∼30°, ∼45°, ∼60° (total 297); for ScanNet: 100 per bin (total 400).
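The selection criterion above can be sketched as follows; the bin half-width of 5° is an illustrative assumption, as the source does not state the exact tolerance:

```python
import numpy as np

def angular_deviation(o_i, o_j, p_w):
    """Angle tau in degrees between camera centers o_i and o_j as seen
    from the shared object point p_w (the selection criterion above)."""
    u = np.asarray(o_i, dtype=float) - np.asarray(p_w, dtype=float)
    v = np.asarray(o_j, dtype=float) - np.asarray(p_w, dtype=float)
    cos_tau = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_tau, -1.0, 1.0)))

def assign_bin(tau_deg, centers=(15.0, 30.0, 45.0, 60.0), half_width=5.0):
    """Return the bin center whose interval contains tau_deg, else None.
    half_width is an assumed tolerance, not stated in the source."""
    for c in centers:
        if abs(tau_deg - c) <= half_width:
            return c
    return None
```

A pair whose deviation falls outside every bin interval is simply discarded, which is what balances the bins independently of the raw frame-rate of the source videos.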

For each valid pair, the exact SE(3) relative pose

$$\mathbf{T}_{c_i c_j} = (\mathbf{T}_{w c_i})^{-1}\, \mathbf{T}_{w c_j} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$

is decomposed into $(\theta, \phi, \psi, t_x, t_y, t_z)$, where $(\theta, \phi, \psi)$ are Euler angles (pitch, yaw, roll) and $(t_x, t_y, t_z)$ are translations in meters. This information is verbalized into compact text descriptions reflecting the dominant translation and rotation direction for the pair, e.g., “move left while yawing right.” The testbed is provided as a single, held-out evaluation set, as zero-shot evaluation is the standard for VLMs (Deng et al., 29 Jan 2026).
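A minimal sketch of the decomposition and verbalization step, assuming an intrinsic X-Y-Z Euler convention and the sign/axis conventions noted in the comments (the source specifies neither):

```python
import numpy as np

def relative_pose(T_wci, T_wcj):
    """SE(3) relative pose T_{c_i c_j} = (T_{w c_i})^{-1} T_{w c_j}."""
    return np.linalg.inv(T_wci) @ T_wcj

def decompose(T):
    """Split a 4x4 pose into Euler angles (pitch, yaw, roll) in degrees,
    assuming R = Rx(pitch) Ry(yaw) Rz(roll), plus a translation in meters."""
    R, t = T[:3, :3], T[:3, 3]
    pitch = np.degrees(np.arctan2(-R[1, 2], R[2, 2]))
    yaw = np.degrees(np.arcsin(np.clip(R[0, 2], -1.0, 1.0)))
    roll = np.degrees(np.arctan2(-R[0, 1], R[0, 0]))
    return (pitch, yaw, roll), t

def verbalize(angles, t):
    """Compact label for the dominant translation and rotation; the
    direction names assume +x right, +y down, +z forward, positive yaw
    to the right (illustrative conventions, not from the source)."""
    pitch, yaw, roll = angles
    axis = int(np.argmax(np.abs(t)))
    trans = [("move right", "move left"),
             ("move down", "move up"),
             ("move forward", "move backward")][axis][int(t[axis] < 0)]
    rots = [("pitch down", "pitch up", pitch),
            ("yaw right", "yaw left", yaw),
            ("roll right", "roll left", roll)]
    pos, neg, val = max(rots, key=lambda r: abs(r[2]))
    return f"{trans} & {neg if val < 0 else pos}"
```

Under these conventions, a pair whose relative pose is dominated by a negative lateral translation and a positive yaw would be labeled “move left & yaw right”, matching the candidate-label format used in Section 2.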

2. Task Definition: Relative Camera Pose Estimation

The RCPE task requires, given two images $(I_1, I_2)$ with known ground-truth transformation

$$T_{gt} = \begin{bmatrix} R_{gt} & t_{gt} \\ 0 & 1 \end{bmatrix} \in SE(3)$$

(where $R_{gt}$ is an SO(3) rotation expressed as Euler angles $(\theta, \phi, \psi)$ and $t_{gt}$ is a translation vector $(t_x, t_y, t_z)$), to select a discrete label summarizing the dominant sign and axis of translation and rotation (e.g., “move left & yaw right” or “move right & yaw left”). The emphasis is on inferring whether the relative motion primarily involves lateral, forward, or vertical translation, or pitch/yaw/roll rotation, and their directionality.

For evaluation, VLMs must select between two candidate descriptions that best match the “ground-truth” dominant direction, reducing continuous pose reasoning to a binary, semantically coherent classification problem suitable for both human and model assessment (Deng et al., 29 Jan 2026).

3. Evaluation Metrics and Protocol

Classical pose estimation is assessed with continuous angular and translational errors; VRRPI-Bench instead adopts a discrete macro-F1 evaluation over the two candidate labels for VLMs, humans, and baselines. The standard continuous metrics are:

  • Rotation error: $e_R = \arccos\left( \frac{\mathrm{trace}(R_{est}^T R_{gt}) - 1}{2} \right)$ (radians)
  • Translation error: $e_t = \left\| \frac{t_{est}}{\|t_{est}\|} - \frac{t_{gt}}{\|t_{gt}\|} \right\|$ (angular) or $\|t_{est} - t_{gt}\|$ (metric)

Performance is reported per dataset and angle bin. Classical geometric pipelines (e.g., SIFT+RANSAC, LoFTR+RANSAC) and humans are also evaluated in this binary setting for consistent comparison. Success rates (e.g., percent correct for thresholds such as $<30^\circ$) are reported as appropriate (Deng et al., 29 Jan 2026).
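The continuous errors and the macro-F1 protocol follow directly from the definitions above; a minimal sketch:

```python
import numpy as np

def rotation_error(R_est, R_gt):
    """Geodesic rotation error in radians:
    arccos((trace(R_est^T R_gt) - 1) / 2)."""
    cos_e = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos_e, -1.0, 1.0))

def translation_errors(t_est, t_gt):
    """Direction error (difference of unit vectors) and metric error."""
    e_dir = np.linalg.norm(t_est / np.linalg.norm(t_est)
                           - t_gt / np.linalg.norm(t_gt))
    e_metric = np.linalg.norm(t_est - t_gt)
    return e_dir, e_metric

def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1 over the candidate labels (binary here)."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights both candidate labels equally, so a model that always picks one label scores well below a balanced one even if the label distribution is skewed.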

4. VRRPI-Diag: Diagnostic Failure Mode Analysis

VRRPI-Diag extends VRRPI-Bench by constructing axis-isolated test subsets, in which frame pairs undergo a significant motion along precisely one degree of freedom (DoF), i.e., one of $\{\theta, \phi, \psi, t_x, t_y, t_z\}$, with the respective DoF above a large threshold $\delta^+$ and all others below a small threshold $\delta^-$. The following parameterizations are used:

  • Pitch/yaw: $(\delta^-, \delta^+) = (5^\circ, 15^\circ)$
  • Roll: $(3^\circ, 10^\circ)$
  • $t_x$: $(0.15, 0.4)$ m
  • $t_y$: $(0.1, 0.3)$ m
  • $t_z$: $(0.15, 0.4)$ m

Subsets include 58 pitch-only pairs (7 Scenes) and 100 yaw-only pairs (ScanNet), among others. Each axis-specific task is a binary classification (e.g., “rotate left” vs. “rotate right”) with macro-F1 scores reported per axis and averaged (Deng et al., 29 Jan 2026).
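The axis-isolation filter reduces to a threshold test over the six DoFs, using the $(\delta^-, \delta^+)$ values listed above; a sketch:

```python
# Per-axis (delta_minus, delta_plus) thresholds from VRRPI-Diag;
# angles in degrees, translations in meters.
THRESHOLDS = {
    "pitch": (5.0, 15.0), "yaw": (5.0, 15.0), "roll": (3.0, 10.0),
    "tx": (0.15, 0.4), "ty": (0.1, 0.3), "tz": (0.15, 0.4),
}

def isolated_axis(pose):
    """Return the DoF name if exactly one DoF magnitude exceeds its
    delta_plus while all others stay below their delta_minus; else None.
    `pose` maps each DoF name to its signed magnitude."""
    dominant = [a for a, (lo, hi) in THRESHOLDS.items()
                if abs(pose[a]) >= hi]
    quiet = [a for a, (lo, hi) in THRESHOLDS.items()
             if abs(pose[a]) <= lo]
    if len(dominant) == 1 and len(quiet) == len(THRESHOLDS) - 1:
        return dominant[0]
    return None
```

Pairs whose motion falls into the gap between $\delta^-$ and $\delta^+$ on any axis are rejected, which keeps the diagnostic subsets unambiguous.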

5. Baseline Results and Comparative Findings

The principal baseline categories are geometric pipelines, human annotators, and VLMs. The following table summarizes key average macro-F1 scores:

Method          7 Scenes   ScanNet
LoFTR           0.97       0.91
Human (n=16)    0.92       0.92
GPT-5 (VLM)     0.64       0.61

Key observations:

  • Geometric pipelines (SIFT + RANSAC, LoFTR + RANSAC) achieve near-perfect classification (LoFTR: 0.97 on 7 Scenes, 0.91 on ScanNet).
  • Human annotators (N=16) score 0.92 macro-F1, improving with larger angular baselines.
  • VLMs lag substantially; GPT-5 achieves 0.64 (7 Scenes) and 0.61 (ScanNet), while other VLMs fall in the 0.35–0.55 range.
  • Multi-image consistency, measured by performance when source/target is swapped, peaks at 59.7% (GPT-5), with many models at or near random-chance (50%).
  • In VRRPI-Diag, GPT-5 achieves an average macro-F1 of 0.90, but accuracy drops sharply to 0.47 for roll estimation. $t_x$ and $t_y$ translations are reliably recognized (F1 ≈ 0.91), but $t_z$ (looming, optical-axis translation) and roll remain challenging (F1 ≈ 0.83 and 0.47, respectively) (Deng et al., 29 Jan 2026).
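The swap-consistency measure above (peaking at 59.7% for GPT-5, with 50% as chance) can be sketched as follows, where `inverse_label` is a hypothetical mapping from each candidate label to its logical inverse:

```python
def swap_consistency(preds_fwd, preds_rev, inverse_label):
    """Fraction of pairs where the prediction for the swapped pair
    (I2, I1) is the logical inverse of the prediction for (I1, I2).
    0.5 is chance level for a binary label set."""
    hits = sum(inverse_label[f] == r
               for f, r in zip(preds_fwd, preds_rev))
    return hits / len(preds_fwd)
```

A geometrically grounded model should score near 1.0 here, since swapping source and target exactly inverts the SE(3) transform; scores near 0.5 indicate the model is not encoding that inverse relationship.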

6. Analysis of Failure Modes and Recommendations

Several key insights emerge regarding the limitations of current VLMs:

  1. Shallow image-plane heuristics: Success on lateral translation and yaw is explained by sensitivity to 2D image-plane pixel shifts, but failure occurs for motions requiring projective-geometric reasoning (depth, roll).
  2. Depth and roll ambiguity: Motions along the optical axis (z-translation causing scale changes/“looming”; roll inducing image rotation) require encoding of projective geometry, which is largely absent in VLMs.
  3. Logical-symmetry breakdown: VLMs often do not encode the inverse logical relationship between object and camera motion, as seen in poor consistency under source/target swaps.
  4. Disconnect between visual and textual reasoning: While chain-of-thought prompting can salvage performance, it circumvents genuine perceptual grounding in 3D.

The benchmark authors recommend a set of strategies to address VLM limitations in 3D grounding:

  • Integrating explicit geometric modules (essential-matrix solvers, depth-aware attention).
  • Augmenting training with multi-view consistency and projective-geometry constraints.
  • Incorporating learned depth and surface-normal predictors for improved optical-axis motion reasoning.
  • Designing self-supervised objectives to support recovery of SE(3) transforms across consecutive frames.

These findings unequivocally expose a gap between contemporary semantic 2D understanding in VLMs and robust 3D multi-view pose reasoning, positioning VRRPI-Bench and VRRPI-Diag as stress-tests for future VLM architecture and training developments (Deng et al., 29 Jan 2026).
