E3D-Bench: 3D Geometric Model Evaluation

Updated 3 December 2025
  • E3D-Bench is a unified benchmark suite for evaluating 3D geometric foundation models that directly infer depth, point clouds, and camera parameters from unposed images.
  • It standardizes data handling, metric computation, and protocol enforcement across tasks like sparse-view depth estimation, 3D reconstruction, pose estimation, and view synthesis.
  • The platform facilitates fair comparisons between feed-forward and diffusion-based architectures, highlighting trade-offs in efficiency and domain generalization.

E3D-Bench is a unified, end-to-end benchmark suite for the evaluation of 3D Geometric Foundation Models (GFMs), focusing on direct and efficient inference of core 3D representations—depth, point clouds, camera parameters, and synthesized novel views—from unposed and unstructured visual input. E3D-Bench, established first for event-based 3D reconstruction (Baudron et al., 2020) and subsequently extended to RGB-based end-to-end GFMs (Cong et al., 2 Jun 2025), aims to replace brittle, multi-stage geometric pipelines with a standardized evaluation of single-pass feed-forward and diffusion architectures on a broad spectrum of 3D perception tasks and datasets, including out-of-distribution regimes. The platform emphasizes fair, reproducible comparison by automating data handling, metric computation, and protocol enforcement, thereby catalyzing the development and deployment of real-time, robust 3D spatial intelligence models.

1. Historical Context and Motivation

Conventional 3D reconstruction and perception pipelines rely on Structure-from-Motion or dense SLAM, requiring cascaded modules for feature detection, matching, pose graph optimization, and depth fusion. These approaches suffer from high latency and brittleness, especially under limited view overlap, dynamic or adverse environmental conditions, and the tight computational budgets common on edge and robotics platforms. The emergence of end-to-end 3D foundation models, inspired by foundation models in NLP and 2D vision, marks a critical paradigm shift: input images (pairs, sequences, or multi-views) are directly mapped to dense 3D outputs without reliance on precomputed camera parameters or expensive keypoint matching.

E3D-Bench addresses the resultant lack of systematic evaluation protocols for these models, offering the first comprehensive suite spanning core 3D tasks, diverse domains, and challenging out-of-distribution conditions, with the further capacity to handle event-based and RGB-based data (Baudron et al., 2020, Cong et al., 2 Jun 2025).

2. Core Tasks and Benchmark Coverage

E3D-Bench systematically evaluates GFMs over five representative tasks to probe both geometric accuracy and readiness for real-world deployment:

  1. Sparse-View Depth Estimation: Given 2–5 unposed RGB images, models infer per-pixel depth under extremely limited multi-view geometric constraints.
  2. Video Depth Estimation: Models estimate temporally consistent depth maps from monocular RGB videos (30+ frames), maintaining temporal coherence amidst motion and occlusion.
  3. Multi-View 3D Reconstruction: From either extremely sparse (2–5 images) or dense (10–50 images) unposed views, models produce globally aligned 3D point clouds, accommodating both object- and scene-scale geometry.
  4. Multi-View Relative Pose Estimation: The task requires inferring relative camera trajectories, up to a global Sim(3) transformation, from unordered RGB frames (10–40 per sequence), covering indoor, outdoor, aerial, and mixed air-ground scenarios.
  5. Novel View Synthesis: With two unposed source images, models synthesize a novel view at a requested camera pose, testing scene geometry and appearance modeling under both in-domain and cross-domain conditions.

These tasks are selected to rigorously evaluate end-to-end geometric reasoning under minimal supervision, limited view overlap, and diverse data modalities (Cong et al., 2 Jun 2025).
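To make the end-to-end input/output contract concrete, the sketch below shows what a generic inference interface over such GFMs might look like; the class, field, and method names are illustrative assumptions rather than any benchmarked model's actual API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GFMPrediction:
    """Dense 3D outputs regressed directly from unposed RGB images."""
    depth_maps: List[np.ndarray]   # one (H, W) depth map per input view
    points: np.ndarray             # (M, 3) globally aligned point cloud
    poses: List[np.ndarray]        # (4, 4) camera-to-world matrices, up to a global Sim(3)
    intrinsics: List[np.ndarray]   # (3, 3) estimated camera intrinsics per view


class GeometricFoundationModel:
    """Illustrative single-pass interface: images in, 3D geometry out.

    No precomputed poses, keypoints, or SfM initialization are assumed;
    every quantity is inferred in one forward pass.
    """

    def infer(self, images: List[np.ndarray]) -> GFMPrediction:
        raise NotImplementedError("wrapped per model, e.g. DUSt3R, VGGT, or CUT3R")
```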

3. Datasets, Protocols, and Metric Definitions

E3D-Bench standardizes data selection, processing, and evaluation to enforce comparability. Datasets include DTU, ETH3D, KITTI, ScanNet, Tanks & Temples for sparse/dense view depth and reconstruction; Bonn, TUM Dynamics, KITTI-VO, Sintel, PointOdyssey, Syndrone for video depth; CO3Dv2, RealEstate10K, ADT, ACID, ULTRRA for multi-view pose estimation and aerial/air-ground scenarios; and RealEstate10K, ScanNet++, ACID for novel view synthesis. All subsets are processed by toolkit scripts for consistent frame selection, input resizing, and view arrangement.

Metric computations are fully specified and vectorized:

| Task | Metrics | Key Formulas/Criteria |
|---|---|---|
| Depth Estimation (all) | AbsRel, RMSE, $\delta(\tau)$ | $\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N} \frac{\lvert d_i - \hat d_i \rvert}{\hat d_i}$; $\delta(\tau) = \#\{i : \max(d_i/\hat d_i,\ \hat d_i/d_i) < \tau\}/N$ |
| 3D Reconstruction | Acc, Comp, NC | $\mathrm{Acc} = \frac{1}{\lvert P \rvert}\sum_{p \in P} \min_{q \in P^*} \lVert p - q \rVert$; $\mathrm{Comp} = \frac{1}{\lvert P^* \rvert}\sum_{q \in P^*} \min_{p \in P} \lVert p - q \rVert$; NC as mean normal dot product |
| Pose Estimation | ATE, RPE (trans/rot) | $\mathrm{ATE} = \frac{1}{N}\sum_{i=1}^{N} \lVert t_i - \hat t_i \rVert$; RPE as relative (frame-to-frame) trajectory difference |
| View Synthesis | PSNR, SSIM, LPIPS | PSNR (dB), SSIM ($[0, 1]$), LPIPS (lower is better) |

All metric implementations are open-sourced, and per-sequence or per-scene aggregation is used (Cong et al., 2 Jun 2025). In event-based E3D-Bench, silhouettes, Chamfer distances, and pose errors supplement visual tasks (Baudron et al., 2020).
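As a concrete reading of the tabulated definitions, the following NumPy sketch mirrors the depth, reconstruction, and pose formulas above; the function names, masking conventions, and brute-force nearest-neighbour search are assumptions for illustration, not the released implementations.

```python
import numpy as np


def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """AbsRel = mean(|d_i - d̂_i| / d̂_i) over pixels with valid ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def delta_inlier_ratio(pred: np.ndarray, gt: np.ndarray, tau: float = 1.03) -> float:
    """delta(tau) = fraction of pixels with max(d/d̂, d̂/d) < tau."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < tau))


def accuracy_completeness(pred_pts: np.ndarray, gt_pts: np.ndarray) -> tuple:
    """Chamfer-style Acc (prediction -> GT) and Comp (GT -> prediction).

    Brute-force pairwise distances for clarity; a KD-tree would be used in practice.
    """
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    acc = float(dists.min(axis=1).mean())    # mean distance from each predicted point to GT
    comp = float(dists.min(axis=0).mean())   # mean distance from each GT point to prediction
    return acc, comp


def ate(pred_t: np.ndarray, gt_t: np.ndarray) -> float:
    """Absolute trajectory error on translations, assuming trajectories are already aligned."""
    return float(np.mean(np.linalg.norm(pred_t - gt_t, axis=-1)))
```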

4. Toolkit Architecture and Workflow

E3D-Bench introduces a modular toolkit comprising:

  • data/: Dataset downloaders, preprocessors, and view-selection scripts.
  • models/: Inference wrappers for 16 GFMs, abstracting input normalization, frame padding, and model-specific post-processing (global alignment, median scaling).
  • metrics/: Unified, auto-vectorized routines for every benchmark metric, enforcing correct experimental conditions.
  • configs/: Declarative task and dataset configurations via YAML files.

A single command launches specified task-benchmarks, orchestrating data loading, batch inference, metric evaluation, and result serialization (JSON, CSV), with optional result plots (Cong et al., 2 Jun 2025).
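A minimal sketch of that workflow is given below, assuming a YAML task configuration and injected loader callables; the config keys, directory mapping, and function signatures are hypothetical, not the toolkit's documented interface.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable

import yaml  # PyYAML; configs/ holds declarative task and dataset definitions


def run_benchmark(
    config_path: str,
    load_dataset: Callable[[dict], Iterable[dict]],  # data/: frame selection, resizing
    load_model: Callable[[dict], object],            # models/: per-GFM inference wrapper
    metric_fns: Dict[str, Callable],                 # metrics/: vectorized scoring routines
    output_dir: str = "results",
) -> None:
    """Run one task benchmark: load config, infer, score per scene, serialize to JSON."""
    cfg = yaml.safe_load(Path(config_path).read_text())
    dataset = load_dataset(cfg["dataset"])
    model = load_model(cfg["model"])

    rows = []
    for sample in dataset:
        pred = model.infer(sample["images"])
        scores = {name: fn(pred, sample["ground_truth"]) for name, fn in metric_fns.items()}
        rows.append({"scene": sample["scene_id"], **scores})

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / f"{cfg['task']}_{cfg['model']['name']}.json").write_text(json.dumps(rows, indent=2))
```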

5. Model Coverage and Comparative Results

E3D-Bench currently benchmarks 16 GFMs, spanning:

  • Pair-view Models: DUSt3R, MASt3R, LSM, MonST3R, NoPoSplat, Align3R, Splatt3R, Easi3R
  • Sequence Models: Spann3R, CUT3R, Aether, Geo4D, GeometryCrafter
  • Multi-view Models: Fast3R, VGGT
  • Sparse-view Specialized: FLARE

Select highlights:

  • Sparse-View Depth: VGGT yields the lowest AbsRel (≈1.1% on DTU) and the highest inlier ratio at $\delta < 1.03$ (≈94%), outperforming pure feed-forward ViT variants on this metric.
  • Video Depth: VGGT again leads on normalized metrics across synthetic and real domains; diffusion models (Geo4D, Aether) match its temporal consistency, but robust absolute metric-scale depth is achieved only by CUT3R.
  • Multi-View Pose: Aether (diffusion) achieves best performance on unconstrained dynamic sequences, while MASt3R/DUSt3R lead in static indoor/object settings. Specialized fine-tuning is critical for out-of-distribution air-ground cases.
  • 3D Reconstruction: DUSt3R/LSM and VGGT achieve the lowest object-space errors for extremely sparse input; VGGT leads in dense view fusion.
  • Novel View Synthesis: On in-domain sets, NoPoSplat and FLARE yield PSNR≈24–25 dB, SSIM≈0.80–0.85, LPIPS≈0.16; significant performance loss (~30%) is observed on OOD appearance domains (Cong et al., 2 Jun 2025).

In event-based reconstruction (Baudron et al., 2020), the E3D pipeline (E2S+Pose, PyTorch3D renderer, 3D→Event simulation) outperforms photometric-mesh-optimization and SLAM baselines when evaluated on ShapeNet Cars and Chairs, robustly handling motion blur and sparse features.

6. Limitations, Insights, and Future Directions

Empirical analysis supplies the following insights (Cong et al., 2 Jun 2025):

  • Task Difficulty: Two-view depth and pose prediction is tractable; full metric-scale scene reconstruction is substantially harder, with error amplification in global point alignment.
  • Domain Generalization: GFMs generalize to moderate OOD regimes (street, drone), but struggle under extreme domain shifts, e.g., mixed altitude air-ground tasks.
  • Architectural Trade-offs: No single architecture dominates—feed-forward ViTs offer speed; diffusion-based denoisers enable refined generative output. Pretrained 2D vision backbones (e.g., DINO-ViT) yield marked gains in 3D accuracy.
  • Efficiency Constraints: Even efficient online GFMs (Spann3R, CUT3R) require up to tens of seconds to process long sequences (256 views on A100 GPUs), motivating research into model compression, sparse computation, and hybrid pipelines.

E3D-Bench's event-based branch (Baudron et al., 2020) is positioned for extensions in multi-category generalization, real-world pseudo-GT event data, and integration with silhouette, shading, and SDF-based priors, suggesting increased applicability in low-power and high-dynamic-range contexts.

7. Availability and Impact

All code, evaluation scripts, and processed datasets are publicly released with E3D-Bench to foster open, reproducible research. Baseline implementations for both RGB- and event-based 3D reconstruction are supported, together with evaluation utilities for silhouette, pose, Chamfer, and standard 3D scene metrics. E3D-Bench establishes the reference suite for benchmarking modern 3D geometric foundation models, supporting advances in spatial intelligence for robotics, aerial mapping, and extended reality (Cong et al., 2 Jun 2025, Baudron et al., 2020).
