GroundedAR-Bench: AR Evaluation Framework

Updated 6 December 2025
  • GroundedAR-Bench is an evaluation framework that rigorously assesses AR agents’ language-driven spatial localization and relation grounding using real-world AR data.
  • It defines structured tasks—3D localization, relational grounding, and measurement—with precise geometric and semantic metrics for robust performance evaluation.
  • The framework integrates multimodal data from devices like the Meta Quest 3 to benchmark both baseline and hybrid models under realistic sensing conditions.

GroundedAR-Bench is an evaluation framework designed to rigorously assess augmented reality (AR) agents’ capacity for language-driven spatial localization and relation grounding in real physical environments. Developed in response to the limitations of 2D-centric vision-language model (VLM) evaluations and monolithic 3D reconstruction benchmarks, GroundedAR-Bench provides a comprehensive suite of tasks, a real-world AR dataset with synchronized multimodal streams, and specialized metrics for quantifying both geometric and semantic reasoning success under realistic headset sensing conditions (Guo et al., 29 Nov 2025). The benchmark is intended for the zero-shot evaluation of modular AR agents that integrate multimodal LLMs (MLLMs) with grounded vision and spatial perception tools, facilitating reproducible research across a diverse set of practical AR scenarios.

1. Framework Goals and Structure

GroundedAR-Bench aims to close the methodological gap in AR agent evaluation by focusing on three core capacities: (a) object localization in metric 3D space from open-vocabulary language queries, (b) inference and grounding of typed object–object relations, and (c) execution and validation of higher-level measurement and tool-based tasks. The benchmark comprises four main task categories:

  • T1: 3D Localization Accuracy — evaluating meter-accurate object anchoring and bounding volume estimation.
  • T2: Relation Grounding and Scene-Graph Accuracy — assessing directed, typed edge prediction across nine relation types.
  • T3: End-to-End Language-Guided Spatial Retrieval — examining the agent’s performance on complex spatial queries and measurements.
  • T4: Ablation and Latency Analysis — decomposing end-to-end latency, pipeline variants, and sensor utilization efficacy.

Each task is precisely defined, with both geometric and semantic criteria for success, allowing for systematic ablation and comparison of spatial intelligence modules and agent architectures (Guo et al., 29 Nov 2025).
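In an evaluation harness, each structured task could be represented by a small record combining the query with its geometric and semantic success criteria. The following is a minimal sketch under that assumption; the class and field names (BenchTask, geometric_criteria, semantic_criteria) are hypothetical illustrations, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical task record; field names are illustrative and not taken
# from the benchmark's released schema.
@dataclass
class BenchTask:
    task_id: str                  # "T1" .. "T4"
    scene_id: str                 # scene the query is evaluated against
    query: str                    # open-vocabulary natural-language prompt
    target_labels: List[str]      # object categories the query refers to
    geometric_criteria: Dict = field(default_factory=dict)  # e.g. {"success_threshold_m": 0.10}
    semantic_criteria: Dict = field(default_factory=dict)   # e.g. {"relation_types": ["spatial"]}
```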

2. Dataset Composition and Diversity

GroundedAR-Bench deploys a dataset of AR captures acquired with the Meta Quest 3 headset, featuring synchronized RGB images, depth maps, and head-pose information. The dataset comprises 214 scenes and 1,736 annotated object instances, spanning three primary environments:

  • Industrial workbenches (tools, mechanical parts)
  • Assistive contexts (kitchens, living rooms)
  • Desk/office setups (laptops, peripherals)

Each environment includes both tidy scenes (low occlusion, few objects) and cluttered arrangements (dense occlusion, numerous objects). Annotation encompasses 2D bounding boxes, 3D object centers (with calibrated markers), categorical labels, and directed scene graphs according to a taxonomy comprising spatial, functional, causal, semantic, and sequential relations. The query set comprises 1,120 natural language prompts covering identification, spatial filtering, relational reasoning, measurement, and navigation (Guo et al., 29 Nov 2025).
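A single annotated scene of the kind described above might be serialized roughly as follows. Every key and value here is an illustrative assumption about the schema, not the released annotation format.

```python
# Illustrative structure for one annotated scene; keys are hypothetical,
# chosen only to mirror the annotation types described in the text.
example_scene = {
    "scene_id": "desk_042",
    "environment": "desk_office",          # industrial | assistive | desk_office
    "difficulty": "cluttered",             # tidy | cluttered
    "objects": [
        {
            "instance_id": 7,
            "label": "mug",
            "bbox_2d": [412, 230, 486, 318],   # pixel coords [x1, y1, x2, y2]
            "center_3d": [0.42, -0.11, 0.87],  # meters, world frame from head pose
        },
    ],
    "scene_graph": [
        # directed, typed edges over the relation taxonomy
        {"subject": 7, "relation": "on", "object": 3, "relation_class": "spatial"},
    ],
    "queries": [
        {"text": "Which mug is left of the keyboard?", "type": "relational"},
    ],
}
```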

Diversity factors include object size and shape variance (from mugs to monitors), occlusion patterns, surface geometry (tables, shelves, uneven workbenches), and sensing artefacts typical of commodity AR devices (variable lighting, depth noise).

3. Formal Task Definitions and Mathematical Metrics

GroundedAR-Bench specifies rigorous quantitative metrics for evaluating each agent output against ground truth; a minimal computation sketch follows the list below.

  • Object Localization (T1):
    • 3D position error: $d_{\mathrm{err}} = \|\hat p - p^*\|_2$
    • Angular error: $\theta_{\mathrm{err}} = \arccos\bigl(\frac{\hat p \cdot p^*}{\|\hat p\|\,\|p^*\|}\bigr)$
    • Success@τ: $\mathrm{S@}\tau = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}(d_{\mathrm{err},i}\le\tau)$
    • 2D IoU: $\mathrm{IoU} = \frac{\mathrm{area}(\hat b\cap b^*)}{\mathrm{area}(\hat b\cup b^*)}$
  • Relation Grounding (T2):
    • Edge precision: $P = \frac{\#\mathrm{TP}}{\#\mathrm{TP}+\#\mathrm{FP}}$
    • Edge recall: $R = \frac{\#\mathrm{TP}}{\#\mathrm{TP}+\#\mathrm{FN}}$
    • F1 score: $\mathrm{F1} = \frac{2PR}{P+R}$
    • Relation-type accuracy: $\mathrm{Acc}_{\mathrm{rel}} = \frac{1}{N_{\mathrm{corr}}}\sum_{k=1}^{N_{\mathrm{corr}}}\mathbf{1}(\hat r_k = r^*_k)$
    • Relational-query success: binary match to the ground-truth answer set
  • Measurement/Tool Tasks (T3):
    • Mean absolute error: $\mathrm{MAE} = \frac{1}{M}\sum_{i=1}^{M}|d_{\mathrm{pred},i}-d_{\mathrm{gt},i}|$
    • Median absolute error
    • Task success rate within fixed tolerances
  • Ablation/Latency (T4):
    • Pipeline steps: capture/encode, network transfer, MLLM inference, detection, depth-based versus planar 2D→3D grounding, overlay (Guo et al., 29 Nov 2025).
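
Several of these metrics can be computed directly from the definitions above. The sketch below assumes NumPy arrays for 3D positions, [x1, y1, x2, y2] pixel boxes, and sets of directed (subject, relation, object) triples for scene-graph edges; the function names are ours, not part of any released benchmark toolkit.

```python
import numpy as np

def position_error(p_hat: np.ndarray, p_star: np.ndarray) -> float:
    """3D position error d_err = ||p_hat - p_star||_2, in meters."""
    return float(np.linalg.norm(p_hat - p_star))

def angular_error(p_hat: np.ndarray, p_star: np.ndarray) -> float:
    """Angular error arccos(p_hat . p* / (|p_hat| |p*|)), in radians."""
    cos = np.dot(p_hat, p_star) / (np.linalg.norm(p_hat) * np.linalg.norm(p_star))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def success_at_tau(errors: np.ndarray, tau: float) -> float:
    """Success@tau: fraction of instances with d_err <= tau."""
    return float(np.mean(errors <= tau))

def iou_2d(b_hat, b_star) -> float:
    """2D IoU for boxes given as [x1, y1, x2, y2] in pixels."""
    x1 = max(b_hat[0], b_star[0]); y1 = max(b_hat[1], b_star[1])
    x2 = min(b_hat[2], b_star[2]); y2 = min(b_hat[3], b_star[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(b_hat) + area(b_star) - inter
    return inter / union if union > 0 else 0.0

def edge_prf1(pred_edges: set, gt_edges: set):
    """Edge precision, recall, and F1 over directed (subject, relation, object) triples."""
    tp = len(pred_edges & gt_edges)
    p = tp / len(pred_edges) if pred_edges else 0.0
    r = tp / len(gt_edges) if gt_edges else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1
```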

4. Protocols for Evaluation and Validation

The benchmark operates in a zero-shot paradigm: all scenes are held out for testing (there is no train/test split), and agent baselines access only the sensor inputs, never the ground-truth annotations, at evaluation time. Annotator-verified queries and relational ground truth are templated and cross-checked to ensure they are unambiguous. Metrics are aggregated by environment type, scene difficulty, and standardized thresholds (Success@10 cm, Success@20 cm). Human-in-the-loop mechanisms are embedded in the protocol to ensure reproducibility and clarity of query interpretation, particularly for relational and measurement tasks (Guo et al., 29 Nov 2025).
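
The per-group aggregation can be expressed with a few lines of code. The sketch below assumes each per-query result carries environment, difficulty, and d_err fields; these key names are hypothetical stand-ins for whatever the released evaluation scripts use.

```python
from collections import defaultdict
import numpy as np

def aggregate_success(results, thresholds=(0.10, 0.20)):
    """Group per-query 3D errors by (environment, difficulty) and report
    Success@tau for each threshold. `results` is a list of dicts with
    hypothetical keys: "environment", "difficulty", "d_err" (meters)."""
    groups = defaultdict(list)
    for r in results:
        groups[(r["environment"], r["difficulty"])].append(r["d_err"])
    report = {}
    for key, errs in groups.items():
        errs = np.asarray(errs)
        report[key] = {f"S@{int(t * 100)}cm": float(np.mean(errs <= t)) for t in thresholds}
    return report
```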

5. Baseline Methods and Comparative Results

GroundedAR-Bench establishes quantitative benchmarks across canonical agent and pipeline variants:

| Method | T1 3D Err (cm) | S@10 cm (%) | T2 Edge F1 | T2 Rel-Query (%) | T3 Success (%) |
|---|---|---|---|---|---|
| 2D VLM baseline | 24.1 ± 9.6 | 36.8 | 0.60 | 64.1 | 65.9 |
| No-Depth (planar lifting) | 11.3 ± 5.8 | 64.1 | 0.63 | 74.8 | 80.1 |
| Words into World (hybrid) | 5.4 ± 2.1 | 88.7 | 0.79 | 81.3 | 90.6 |

  • The hybrid Words into World agent achieves mean 3D positional error of ≈5 cm and Success@10 cm of ≈89%.
  • Edge F1 (T2) rises to 0.79 for geometric–semantic hybrid inference (versus 0.60–0.63 for baselines).
  • End-to-end task success (T3) approaches 91% for identification, relational, and measurement queries (Guo et al., 29 Nov 2025).

Depth-based 2D→3D grounding proves essential for centimeter-level localization; planar heuristics yield substantial degradation. Task-adaptive agent coordination reduces median latency from ≈6.6 s (fully serial pipeline) to ≈4.7 s without accuracy loss. Clutter and occlusion remain significant challenges for 2D-centric baselines, with performance drop especially notable in complex scenes; the hybrid agent maintains robust accuracy under these conditions.
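
The latency figures above come from the T4 decomposition. The toy sketch below only illustrates how such a per-stage breakdown can be collected and how an adaptive pipeline might skip stages; the stage bodies are time.sleep stubs, not the actual agent components.

```python
import time

def run_pipeline(stages):
    """stages: ordered dict of {stage_name: zero-arg callable}.
    Runs each stage in order and returns per-stage wall-clock seconds."""
    latencies = {}
    for name, fn in stages.items():
        t0 = time.perf_counter()
        fn()
        latencies[name] = time.perf_counter() - t0
    return latencies

# Stub stages named after the pipeline steps listed in Section 3 (T4).
stages = {
    "capture_encode":   lambda: time.sleep(0.01),
    "network_transfer": lambda: time.sleep(0.01),
    "mllm_inference":   lambda: time.sleep(0.05),
    "detection":        lambda: time.sleep(0.02),
    "depth_grounding":  lambda: time.sleep(0.01),
    "overlay":          lambda: time.sleep(0.005),
}
print(run_pipeline(stages))
```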

6. Integration with LaMAR and AR-Specific Ground Truth

GroundedAR-Bench incorporates lessons from LaMAR (Sarlin et al., 2022), which establishes the need for heterogeneous device capture (factory-calibrated multi-camera rigs, hand-helds), sensor fusion (RGB, ToF/lidar, IMU, radio), large-scale, diverse scene coverage with day/night and structural changes, and an automated, laser-aided ground-truth pipeline yielding sub-5 cm and sub-1° pose accuracy. Core metrics such as Absolute Trajectory Error (ATE), Relative Pose Error (RPE), and AR-centric criteria (tight pose thresholds, visual overlap, time-to-recall) are adopted as foundational elements.
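
For reference, simplified translation-only versions of ATE and RPE can be written as below. These follow the standard textbook definitions rather than LaMAR's exact implementation, and assume the estimated trajectory has already been rigidly aligned to ground truth.

```python
import numpy as np

def ate_rmse(est_positions: np.ndarray, gt_positions: np.ndarray) -> float:
    """Absolute Trajectory Error: RMSE of per-frame translation differences,
    assuming the estimated trajectory is already aligned to ground truth."""
    diffs = est_positions - gt_positions                 # shape (N, 3)
    return float(np.sqrt(np.mean(np.sum(diffs ** 2, axis=1))))

def rpe_translation(est_positions: np.ndarray, gt_positions: np.ndarray, delta: int = 1) -> float:
    """Simplified translational Relative Pose Error over frame offset `delta`:
    compares estimated against ground-truth per-step displacements
    (the full RPE uses relative SE(3) poses, not just translations)."""
    est_step = est_positions[delta:] - est_positions[:-delta]
    gt_step = gt_positions[delta:] - gt_positions[:-delta]
    return float(np.mean(np.linalg.norm(est_step - gt_step, axis=1)))
```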

LaMAR’s multi-stage ground-truth alignment, mesh-based bundle adjustment, and visual overlap estimation provide reference implementations for persistent and stable AR anchoring. Recommendations for GroundedAR-Bench include open data release, pipeline transparency, and public leaderboards to track continual progress toward AR-grade spatial intelligence (Sarlin et al., 2022).

7. Current Challenges and Future Directions

Key findings highlight depth-based grounding, geometric–semantic relation reasoning, and adaptive orchestration as critical drivers of accuracy and responsiveness. Persistent challenges include:

  • Dynamic scenes (user/object motion), depth holes due to reflective/transparent materials
  • Robustness to lighting and sensor noise
  • Resolving ambiguity in user-driven disambiguation and interaction workflows

A plausible implication is further research into fusing MLLMs with physics-aware perception tools, enhancing scene graph taxonomies, and leveraging multi-device, cross-modal data for increased generalizability. The expansion of benchmark datasets, inclusion of outdoor and structurally variant environments, and real-time evaluation protocols are anticipated to drive progress in real-world AR agent intelligence (Guo et al., 29 Nov 2025, Sarlin et al., 2022).
