
EmbSpatial-Bench: LVLM & Diagnostic Benchmark

Updated 9 January 2026
  • EmbSpatial-Bench names two distinct platforms: an automated benchmark for LVLM spatial understanding built from embodied 3D scenes, and a physical ion-beam bench for spatial and emittance diagnostics.
  • The LVLM benchmark tests six egocentric spatial relations across 277 scenes and 294 object categories, using multiple answer extraction strategies.
  • The physical diagnostics setup features modular instrumentation achieving sub-mm positioning accuracy and 0.01 mrad angular resolution for advanced beam profiling.

EmbSpatial-Bench denotes two distinct but technically rigorous platforms: one evaluating spatial reasoning in Large Vision-Language Models (LVLMs) within embodied 3D environments, and another constituting an advanced physical diagnostics bench for spatial and emittance measurements at low ion-beam energies. Each domain leverages precise spatial quantification, automated data curation, and standardized assessment protocols. The following sections cover both facets and their complementary roles.

1. EmbSpatial-Bench for LVLM Spatial Understanding

EmbSpatial-Bench is a benchmark automatically derived from embodied 3D scenes, designed for rigorous evaluation of spatial understanding capabilities in LVLMs (Du et al., 2024). Scene data sources include Matterport3D (MP3D), AI2-THOR, and ScanNet, with test and validation sets sampled across 277 scenes and 294 object categories. RGB-D frames are captured from randomly sampled camera poses. Object annotation uses camera intrinsics/extrinsics to project 3D coordinates onto image-plane bounding boxes; only non-overlapping object pairs (and object sets for “close”/“far” relations) are considered.
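The projection step can be sketched as follows, assuming a standard pinhole camera model; the function name, array shapes, and the per-object mean depth (used later for “close”/“far” relations) are illustrative assumptions rather than the paper’s exact implementation.

```python
import numpy as np

def project_bbox(points_3d_world, K, T_world_to_cam):
    """Project an object's 3D points into an image-plane bounding box.

    points_3d_world: (N, 3) array of the object's 3D points (world frame).
    K:               (3, 3) camera intrinsics matrix.
    T_world_to_cam:  (4, 4) extrinsics mapping world -> camera coordinates.
    Returns a (min_x, min_y, max_x, max_y) pixel bounding box and mean depth.
    """
    # Homogeneous world points -> camera frame.
    pts_h = np.hstack([points_3d_world, np.ones((len(points_3d_world), 1))])
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Pinhole projection: u = f_x * X/Z + c_x, v = f_y * Y/Z + c_y.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]

    bbox = (uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max())
    mean_depth = pts_cam[:, 2].mean()  # used for "close"/"far" relations
    return bbox, mean_depth
```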

Six egocentric spatial relationships are formally defined:

  • “O₁ is left of O₂”: $\max_x(B_1) < \min_x(B_2)$, equivalently $x_2 - x_1 > \Delta x_0$
  • “O₁ is right of O₂”: $\min_x(B_1) > \max_x(B_2)$, equivalently $x_1 - x_2 > \Delta x_0$
  • “O₁ is above O₂”: $\max_y(B_1) < \min_y(B_2)$, equivalently $y_2 - y_1 > \Delta y_0$
  • “O₁ is below O₂”: $\min_y(B_1) > \max_y(B_2)$, equivalently $y_1 - y_2 > \Delta y_0$
  • “O₁ is close” (among set S): $d_1 = \min_{j \in S}(d_j)$
  • “O₁ is far” (among set S): $d_1 = \max_{j \in S}(d_j)$

Thresholds $\Delta x_0$ and $\Delta y_0$ are defined to avoid ambiguous overlaps, but in practice they are omitted because only strictly non-overlapping pairs are sampled.
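The six predicates reduce to simple comparisons on the projected boxes; a minimal Python sketch, assuming each box is a (min_x, min_y, max_x, max_y) tuple in image coordinates (origin at the top-left, so smaller y means higher in the frame) and each object carries a precomputed mean depth:

```python
def is_left(b1, b2):
    # O1 is left of O2: B1's right edge lies left of B2's left edge.
    return b1[2] < b2[0]

def is_right(b1, b2):
    return b1[0] > b2[2]

def is_above(b1, b2):
    # Image origin is top-left, so smaller y means higher in the frame.
    return b1[3] < b2[1]

def is_below(b1, b2):
    return b1[1] > b2[3]

def is_close(depths):
    # O1 (index 0) is the closest object among the candidate set S.
    return depths[0] == min(depths)

def is_far(depths):
    # O1 (index 0) is the farthest object among the candidate set S.
    return depths[0] == max(depths)
```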

2. Dataset Composition and Automated Benchmark Construction

The automated annotation pipeline yields 3,640 QA pairs across 2,181 unique images, covering 294 object categories sampled from the three indoor 3D sources. Object pairs and sets are algorithmically derived, ensuring exhaustive coverage of egocentric spatial relations without manual annotation. This systematic construction establishes EmbSpatial-Bench as the first large-scale, scene-grounded benchmark for embodied spatial understanding (Du et al., 2024).
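A toy illustration of how a multiple-choice QA pair might be assembled from these annotations; the question template, field names, and distractor sampling are assumptions, not the paper’s exact pipeline:

```python
import random

def make_qa(obj1, obj2, relation, all_categories, n_distractors=3):
    """Build one multiple-choice QA pair from an annotated object pair."""
    question = f"Which object is {relation} the {obj2['category']}?"
    # Distractors are drawn from other categories visible in the same image.
    pool = [c for c in all_categories
            if c not in (obj1["category"], obj2["category"])]
    options = random.sample(pool, n_distractors) + [obj1["category"]]
    random.shuffle(options)
    return {"question": question, "options": options,
            "answer": obj1["category"]}
```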

3. Evaluation Methodology and Performance Metrics

LVLMs are evaluated via multiple-choice queries generated in five template variants per geometric relation. For “close/far”, N-way choices require identifying the extremal object by average depth. Two answer extraction strategies are implemented:

  • Generation: Parsing the LVLM’s text output for the explicit chosen option
  • Likelihood: Computing $P_\theta(c_i \mid v, q)$ for each candidate option $c_i$ given image $v$ and question $q$, and selecting the $\arg\max$
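The likelihood strategy can be sketched for a causal LM with Hugging Face transformers; the image conditioning is omitted for brevity, and the prompt/candidate boundary handling is simplified (real tokenizers may merge tokens at the join):

```python
import torch

@torch.no_grad()
def likelihood_answer(model, tokenizer, prompt, candidates):
    """Return the candidate option with the highest conditional log-likelihood."""
    scores = []
    for cand in candidates:
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + " " + cand, return_tensors="pt").input_ids
        logits = model(full_ids).logits  # (1, seq_len, vocab_size)

        # Logits at position i predict token i+1, so candidate tokens
        # full_ids[:, prompt_len:] are scored by logits[:, prompt_len-1:-1].
        cand_ids = full_ids[0, prompt_len:]
        log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1], dim=-1)
        scores.append(log_probs[torch.arange(len(cand_ids)), cand_ids].sum().item())

    return candidates[int(torch.tensor(scores).argmax())]
```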

The primary metric is accuracy:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[\hat{y}_i = y_i]$$

Additional metrics (precision, recall, $F_1$) are available for fine-grained breakdown per spatial relation, but the benchmark’s principal scoring is correct-option counting.

4. Experimental Findings and Instruction Tuning

Zero-shot evaluation reveals substantial shortcomings in embodied spatial reasoning:

| Model | Overall Accuracy (Generation) |
|---|---|
| Qwen-VL-Max | 49.11% |
| GPT-4V | 36.07% |
| InstructBLIP | 38.85% |
| BLIP2 | 37.99% |
| Lynx | 29.09% |
| MiniGPT-v2 | 23.93% |
| Human Ceiling | 90.33% |

Qualitative failure analysis (e.g., of GPT-4V) exposes persistent bottlenecks in both object localization and spatial relation judgment. Depth-based relations (“close”/“far”) remain the most challenging: under the likelihood strategy they yield only 20–30% accuracy, versus 80–90% for geometric (“above”/“below”) relations.

Instruction tuning is operationalized via EmbSpatial-SFT, a training dataset of 25K QA pairs (spatial relation identification) plus object-localization grounding tasks. Fine-tuning MiniGPT-v2 with LoRA adapters and cross-entropy loss on the answer tokens substantially boosts benchmark accuracy (generation: 23.93% → 32.97%; likelihood: 43.85% → 78.10%). Auxiliary localization data provides further incremental gains, and a LoRA ablation shows negligible effect when the LLM backbone is frozen.
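A hedged sketch of such a LoRA setup with the peft library; the backbone, rank, scaling, dropout, and target modules below are illustrative stand-ins, not the paper’s reported configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder backbone; in EmbSpatial-SFT the target is MiniGPT-v2's LLM component.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection; LLaMA-style backbones use q_proj/v_proj
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Training then minimizes cross-entropy on the answer tokens only: label positions
# covering the question context are set to -100 so they are excluded from the loss.
```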

5. Physical Diagnostics Platform: EmbSpatial-Bench (LEEx-B)

The physical EmbSpatial-Bench (formerly LEEx-B) at IPHC-CNRS Strasbourg is a low-energy ion beamline for spatial and emittance diagnostics, featuring a modular architecture, a Cs⁺ ion gun on a high-voltage platform, and a suite of detectors (motorized wire-grid beam profiler, Allison-type emittance meter, Faraday cups) (Bouquerel et al., 2024). Beam axis alignment rails provide precise positioning, and mechanical fiducials ensure sub-mm accuracy.

The ion gun (a HeatWave HWIG-250) delivers beams up to 25 keV, at currents down to 1 nA. The HV and safety circuits use a FUG HCN 35 power supply and isolation transformers; operational interlocks protect users. The vacuum system achieves base pressures of $10^{-6}$–$10^{-7}$ mbar via scroll and turbo-molecular pumps.

Diagnostics are calibrated using reference Faraday cups and background uniformity correction. Real-time display and EPICS-integrated control software provide on-the-fly computation of the beam profile $I(x)$ and phase-space maps $I(x, x')$, along with derived emittances ($\epsilon_{rms}$, $\epsilon_n$) and Twiss parameters ($\alpha$, $\beta$, $\gamma$).

Emittance theory follows

$$\epsilon_{rms} = \sqrt{\langle x^2\rangle \langle x'^2\rangle - \langle x x'\rangle^2}$$

and normalization by relativistic factors ($\epsilon_n = \beta_{rel}\gamma_{rel}\,\epsilon_{rms}$). Recent results demonstrate high sensitivity (10 pA current floor, 0.01 mrad angular resolution) and normalized emittance measurements of $\epsilon_n \approx 0.06\,\text{mm}\cdot\text{mrad}$ at 25 kV.
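Given intensity-weighted phase-space samples $(x, x')$ from the scanner, the rms emittance and Twiss parameters follow directly from second central moments; a minimal numpy sketch with illustrative names:

```python
import numpy as np

def rms_emittance_twiss(x, xp, w=None):
    """Compute rms emittance and Twiss parameters from phase-space samples.

    x : positions (mm); xp : divergences (mrad); w : optional intensity weights.
    """
    w = np.ones_like(x) if w is None else w
    x0 = x - np.average(x, weights=w)    # centre the distribution
    xp0 = xp - np.average(xp, weights=w)

    # Second central moments <x^2>, <x'^2>, <x x'>.
    x2 = np.average(x0**2, weights=w)
    xp2 = np.average(xp0**2, weights=w)
    xxp = np.average(x0 * xp0, weights=w)

    eps = np.sqrt(x2 * xp2 - xxp**2)     # rms emittance (mm·mrad)
    # Twiss parameters satisfy beta*gamma - alpha^2 = 1.
    alpha, beta, gamma = -xxp / eps, x2 / eps, xp2 / eps
    return eps, alpha, beta, gamma

# Normalization: eps_n = beta_rel * gamma_rel * eps for the beam's velocity.
```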

6. Limitations, Challenges, and Future Directions

Current LVLM architectures show poor egocentric spatial reasoning, particularly for depth-based relations, largely because their 2D image-text pre-training provides inadequate 3D geometric grounding. Instruction compliance, especially for multiple-choice output formats, is often insufficient. Expanding the benchmark to outdoor, multilingual, and more diverse annotation-light datasets is necessary, and incorporating explicit depth cues or 3D modules into future LVLMs is recommended.

In physical diagnostics, EmbSpatial-Bench’s modularity and sensitivity support future calibration upgrades and student training. Expansion to broader beam types and advanced diagnostic devices is plausible.

EmbSpatial-Bench establishes a new standard for embodied spatial reasoning evaluation, uncovering substantial gaps in state-of-the-art LVLMs—even those as advanced as GPT-4V (Du et al., 2024). The instruction-tuned EmbSpatial-SFT dataset demonstrates the efficacy of targeted fine-tuning for spatial tasks. The physical bench complements this by providing ground-truth spatial and emittance diagnostics, supporting validation pipelines for embodied scene simulators. This dual focus advances both algorithmic benchmarking and hardware measurement, serving the needs of spatial cognition research, robotics, and diagnostic instrumentation.
