Omni3D-Bench: 3D Scene Understanding
- Omni3D-Bench is a comprehensive benchmark suite for large-scale 3D scene understanding, spanning 3D object detection in the wild and omnidirectional reconstruction from multi-view 360° imagery.
- It integrates Omni3D for 3D object detection across diverse domains and OB3D for synthetic omnidirectional scene reconstruction and precise camera calibration.
- Its diverse datasets and rigorous evaluation protocols drive advancements in robust cross-domain 3D vision and distortion-compensating reconstructions.
Omni3D-Bench refers collectively to the suite of datasets, benchmarks, and models designed for large-scale 3D scene understanding and reconstruction, emphasizing both 3D object detection in the wild and omnidirectional 3D reconstruction from multi-view 360° imagery. Its two most prominent instances are Omni3D, a benchmark for large-scale 3D object detection covering diverse domains and camera types, and OB3D, a synthetic evaluation suite focused on omnidirectional scene reconstruction, camera calibration, and photorealistic synthesis. Each pushes the boundaries of scale, annotation protocol, modality coverage, and evaluation rigor for robust, cross-domain 3D vision and novel-view synthesis.
1. Dataset Composition and Scope
Omni3D integrates data across indoor, outdoor, and object-centric domains. Source datasets include SUN RGB-D (10k images), ARKitScenes (~61k), Hypersim (~75k), Objectron (~47k), KITTI (7k), and nuScenes (34k), aggregated into 234,152 unique images containing over 3 million 3D bounding boxes annotated across 98 unified categories (Brazil et al., 2022). All imagery and 3D labels are re-projected to a common camera-centric convention (+x right, +y down, +z forward), harmonizing intrinsics and class vocabularies to enable cross-domain training and evaluation.
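The common convention is a simple axis change applied during aggregation. Below is a minimal sketch, assuming a hypothetical source dataset whose cameras use an OpenGL/ARKit-style convention (+x right, +y up, +z backward); it illustrates the axis harmonization only and is not the released conversion code:

```python
import numpy as np

# Hypothetical source convention: +x right, +y up, +z backward (OpenGL/ARKit-style camera).
# Target (Omni3D) convention:     +x right, +y down, +z forward.
# Flipping the y and z axes maps one onto the other (det = +1, so still a proper rotation).
SRC_TO_OMNI = np.diag([1.0, -1.0, -1.0])

def box_to_omni3d(center_src, R_src):
    """Re-express a 3D box (center + orientation) in the Omni3D camera convention."""
    center_omni = SRC_TO_OMNI @ np.asarray(center_src, dtype=float)
    R_omni = SRC_TO_OMNI @ np.asarray(R_src, dtype=float)  # re-base the box orientation
    return center_omni, R_omni

# Example: a box 5 m in front of an OpenGL-style camera (-z forward) lands at +z forward.
center, R = box_to_omni3d([0.0, 0.0, -5.0], np.eye(3))
print(center)  # [0. 0. 5.]
```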
OB3D leverages synthetic scene generation in Blender, comprising 12 complex scenes (5 indoor, 7 outdoor). Scenes range from architecturally rich indoor spaces to expansive outdoor plazas, temples, and foliage-heavy environments. Data acquisition is performed with a virtual Python-driven equirectangular camera traversing both egocentric (spiral) and non-egocentric (spline-interpolated) paths, yielding 200 omnidirectional images per scene (100 per trajectory type). Pixel-aligned RGB, depth, and normal maps (1600×800 px), together with sparse 3D point clouds, are produced at each pose, with ground-truth camera intrinsics/extrinsics recorded for quantitative benchmarking (Ito et al., 26 May 2025).
| Dataset | Domains | Images | Modalities |
|---|---|---|---|
| Omni3D | Indoor, outdoor, object-centric | 234,152 | RGB, 3D boxes, camera params |
| OB3D | Synthetic indoor/outdoor | 2,400* | RGB, depth, normals, camera params, sparse SfM |
*OB3D image count computed as 12 scenes × 2 trajectories × 100 views.
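The two trajectory types described above can be pictured with a toy generator. The sketch below produces 100 camera centers on a small vertical spiral, mirroring the egocentric capture idea; the radius, height, and turn count are illustrative, and OB3D's actual Blender/Python pipeline is not reproduced here:

```python
import numpy as np

def spiral_trajectory(n_views=100, radius=0.5, height=0.3, turns=3.0, center=(0.0, 0.0, 1.5)):
    """Toy egocentric capture path: camera centers on a small vertical spiral."""
    t = np.linspace(0.0, 1.0, n_views)
    angle = 2.0 * np.pi * turns * t
    cx, cy, cz = center
    x = cx + radius * np.cos(angle)
    y = cy + radius * np.sin(angle)
    z = cz + height * (t - 0.5)            # slow vertical drift along the spiral
    return np.stack([x, y, z], axis=1)     # (n_views, 3) camera centers

poses = spiral_trajectory()
print(poses.shape)  # (100, 3)
```

A non-egocentric path would instead interpolate a spline through waypoints spread across the scene, giving wider baselines between views.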
2. Annotation Protocols and Camera Modeling
Omni3D benchmarks employ a detailed annotation strategy for 3D object detection. Each object is parameterized by its 3D center (X, Y, Z), physical dimensions (w, h, l), and allocentric rotation encoded as a continuous 6D vector (Brazil et al., 2022). Annotations are merged into the unified category vocabulary, mapped to the common camera-centric coordinate system, and support camera-intrinsic normalization via the virtual depth transform.
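The continuous 6D rotation parameterization is conventionally decoded into a rotation matrix by Gram-Schmidt orthonormalization of its two 3-vectors; a minimal sketch of that standard construction (not Cube R-CNN's exact code):

```python
import numpy as np

def rotation_from_6d(d6):
    """Decode a continuous 6D rotation representation into a 3x3 rotation matrix.

    Standard construction: Gram-Schmidt on the two 3-vectors, with the third
    column given by their cross product (right-handed frame).
    """
    a1, a2 = np.asarray(d6[:3], dtype=float), np.asarray(d6[3:], dtype=float)
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1          # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=1)  # columns form an orthonormal basis

R = rotation_from_6d([1, 0, 0, 0, 1, 0])
print(np.allclose(R, np.eye(3)))  # True
```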
OB3D specifies omnidirectional equirectangular camera intrinsics with image width $W = 1600$ and height $H = 800$, focal lengths $f_x = W/2\pi$ and $f_y = H/\pi$, principal point $(c_x, c_y) = (W/2, H/2)$, and zero distortion coefficients (Ito et al., 26 May 2025). The spherical pixel-to-direction mapping is defined for pixel $(u, v)$ by
$$\theta = \frac{u - c_x}{f_x}, \qquad \phi = \frac{v - c_y}{f_y},$$
yielding the unit ray direction $\mathbf{d} = (\cos\phi\,\sin\theta,\ \sin\phi,\ \cos\phi\,\cos\theta)$.
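A minimal NumPy sketch of this pixel-to-ray mapping, using the conventions above (the benchmark's own loader may differ in axis ordering or half-pixel offsets):

```python
import numpy as np

def equirect_rays(W=1600, H=800):
    """Per-pixel unit ray directions for an ideal equirectangular camera."""
    fx, fy = W / (2.0 * np.pi), H / np.pi       # pixels per radian
    cx, cy = W / 2.0, H / 2.0
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    theta = (u - cx) / fx                       # longitude in [-pi, pi)
    phi = (v - cy) / fy                         # latitude in [-pi/2, pi/2)
    d = np.stack([np.cos(phi) * np.sin(theta),
                  np.sin(phi),
                  np.cos(phi) * np.cos(theta)], axis=-1)
    return d                                    # (H, W, 3), unit norm

rays = equirect_rays()
print(rays.shape, np.allclose(np.linalg.norm(rays, axis=-1), 1.0))  # (800, 1600, 3) True
```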
All ground truth in OB3D is pixel-aligned across all modalities, facilitating direct correspondence for multi-modal supervised learning and error analysis.
3. Benchmarking Metrics and Evaluation Protocols
Omni3D evaluates 3D detection with average precision (AP) computed over a sweep of 3D IoU thresholds and stratified by object depth into near, medium, and far ranges (in meters). Domain-wise metrics are reported for indoor (SUN RGB-D, ARKitScenes, Hypersim), outdoor (KITTI, nuScenes), and the combined test set. A batched C++/CUDA 3D IoU kernel is used for evaluation, with heavily truncated or occluded instances excluded.
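The aggregation over IoU thresholds can be sketched in a few lines. The simplified example below assumes each detection's best-match 3D IoU with ground truth has already been computed and matching resolved, unlike the benchmark's batched C++/CUDA pipeline; the threshold grid is illustrative:

```python
import numpy as np

def average_precision(scores, ious, num_gt, iou_thresh):
    """AP at one 3D IoU threshold (101-point interpolated precision-recall)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = (np.asarray(ious, dtype=float)[order] >= iou_thresh).astype(float)
    recall = np.cumsum(tp) / max(num_gt, 1)
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)
    return float(np.mean([precision[recall >= r].max(initial=0.0)
                          for r in np.linspace(0.0, 1.0, 101)]))

def mean_ap_over_thresholds(scores, ious, num_gt, thresholds=np.arange(0.05, 0.55, 0.05)):
    """Average AP over a sweep of 3D IoU thresholds (grid shown here is illustrative)."""
    return float(np.mean([average_precision(scores, ious, num_gt, t) for t in thresholds]))

print(mean_ap_over_thresholds(scores=[0.9, 0.8, 0.3], ious=[0.6, 0.4, 0.1], num_gt=3))
```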
OB3D quantifies reconstruction, synthesis, and camera estimation with fine-grained measures (Ito et al., 26 May 2025):
- Camera Estimation: Relative rotation error (RRE), relative translation angle (RTE), AUC@5°, Absolute Trajectory Error (ATE).
- Novel View Synthesis: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and LPIPS with AlexNet and VGG backbones.
- 3D Reconstruction: depth RMSE and MAE, AbsRel, δ accuracy, and surface-normal angular error (a minimal depth-metric sketch follows the summary table below).
| Metric | Description | Scope |
|---|---|---|
| AP | Average Precision over 3D IoU | 3D object detection |
| PSNR/SSIM/LPIPS | Fidelity of novel view synthesis | Image rendering |
| RRE/RTE/ATE | Camera trajectory evaluation | Camera calibration |
| RMSE/AbsRel/δ | Depth, normal accuracy | 3D reconstruction |
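To make the reconstruction measures above concrete, the sketch below computes the standard depth errors over valid pixels; the δ threshold of 1.25 is the conventional choice and may differ from OB3D's exact protocol:

```python
import numpy as np

def depth_metrics(pred, gt, delta_thresh=1.25):
    """Standard depth error measures (RMSE, MAE, AbsRel, delta accuracy) on valid pixels."""
    pred = np.asarray(pred, dtype=float).ravel()
    gt = np.asarray(gt, dtype=float).ravel()
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    mae = np.mean(np.abs(pred - gt))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < delta_thresh)       # fraction of pixels within the threshold
    return {"RMSE": rmse, "MAE": mae, "AbsRel": abs_rel, "delta": delta}

print(depth_metrics(pred=[1.0, 2.1, 3.5], gt=[1.0, 2.0, 3.0]))
```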
4. Model Architectures and Baselines
Cube R-CNN is Omni3D’s main model, built atop a DLA-34 + FPN backbone and implemented in Detectron2 (Brazil et al., 2022). Its RPN is enhanced with “IoUness” (2D IoU regression per anchor), and a 3D head operating on RoI-pooled features regresses the full 3D box parameters, rotation, and a learned uncertainty. Depth predictions are made invariant to camera intrinsics using a virtual depth transform, allowing robust cross-domain generalization and effective scale augmentation.
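One way to read the virtual-depth idea: rescale metric depth so that apparent object size matches what a fixed canonical camera would observe, making the regression target independent of the source camera. The sketch below follows that reading; the canonical constants are illustrative, not Cube R-CNN's actual choices:

```python
def virtual_depth(z, f, H, f_virtual=512.0, H_virtual=512.0):
    """Map metric depth z (seen with focal length f, image height H) to a
    camera-invariant virtual depth under a canonical camera (f_virtual, H_virtual)."""
    return z * (f_virtual / f) * (H / H_virtual)

def metric_depth(z_virtual, f, H, f_virtual=512.0, H_virtual=512.0):
    """Invert the transform at inference time to recover metric depth."""
    return z_virtual * (f / f_virtual) * (H_virtual / H)

# Two cameras observing the same normalized apparent object size yield the same
# virtual depth, which is the quantity the network regresses.
print(virtual_depth(10.0, f=500.0, H=480.0))   # 9.6
print(virtual_depth(10.0, f=1000.0, H=960.0))  # 9.6
```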
OB3D provides results for radiance-field and 3DGS-based models in novel-view synthesis and scene reconstruction:
- EgoNeRF, OmniGS, ODGS, and op43dgs are compared over both indoor egocentric and outdoor non-egocentric captures (Ito et al., 26 May 2025).
- Camera estimation is benchmarked with OpenSfM and OpenMVG; 3D reconstruction is assessed for COLMAP (cube maps), NeuS*, and OmniSDF.
Significant performance variability is observed across environments and trajectories: for example, 3DGS-based methods show 4–30% metric swings when switching between environments or trajectory types, and NeRF-based reconstructions benefit from wide-baseline, non-egocentric poses, suggesting that baseline diversity aids scene coverage and geometric reasoning. A minimal sketch of the trajectory-error measures used in the camera-estimation comparison follows.
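The sketch below computes the relative rotation error and absolute trajectory error used to score estimated camera poses; it assumes the estimated and ground-truth trajectories are already expressed in the same frame, whereas benchmark protocols typically align them (e.g., with a similarity transform) first:

```python
import numpy as np

def relative_rotation_error_deg(R_est, R_gt):
    """Geodesic angle (degrees) between an estimated and a ground-truth rotation."""
    cos = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def absolute_trajectory_error(t_est, t_gt):
    """RMSE between estimated and ground-truth camera centers (pre-aligned trajectories)."""
    t_est = np.asarray(t_est, dtype=float)
    t_gt = np.asarray(t_gt, dtype=float)
    return float(np.sqrt(np.mean(np.sum((t_est - t_gt) ** 2, axis=1))))

print(relative_rotation_error_deg(np.eye(3), np.eye(3)))            # 0.0
print(absolute_trajectory_error([[0, 0, 0], [1, 0, 0]],
                                [[0, 0, 0.1], [1, 0, 0.1]]))        # 0.1
```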
5. Challenges, Limitations, and Open Problems
Omni3D’s primary challenge arises from the wide diversity of imaging hardware, resolutions, and camera intrinsics across its source datasets, which necessitates robust, domain-agnostic representation learning and intrinsic-invariant features such as virtual depth.
OB3D’s challenges are specific to omnidirectional imagery:
- Severe equirectangular distortion, most pronounced at image poles, induces latitude-varying stretching and non-uniform pixel-to-solid-angle mapping.
- Standard multi-view stereo and NeRF protocols require nontrivial adaptation, e.g., cube-mapping or custom ray generation, to accommodate non-pinhole camera models (a minimal cube-mapping sketch follows this list).
- OB3D’s fully synthetic nature omits real-world lens aberrations, sensor noise, and material subtleties—limiting generalization for real-world deployment.
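The cube-mapping adaptation mentioned above amounts to resampling the panorama along per-face ray directions. A minimal sketch for the front (+z) face only, with nearest-neighbor sampling; full pipelines cover all six faces and interpolate more carefully:

```python
import numpy as np

def equirect_to_front_face(pano, face_size=256):
    """Resample the front (+z) cube-map face from an equirectangular panorama."""
    H, W = pano.shape[:2]
    # Rays through the front face: x, y in [-1, 1], z = 1.
    s = (np.arange(face_size) + 0.5) / face_size * 2.0 - 1.0
    x, y = np.meshgrid(s, s)
    z = np.ones_like(x)
    theta = np.arctan2(x, z)                          # longitude
    phi = np.arctan2(y, np.sqrt(x ** 2 + z ** 2))     # latitude
    # Map angles back to equirectangular pixel coordinates (same convention as above).
    u = (theta / (2.0 * np.pi) + 0.5) * W
    v = (phi / np.pi + 0.5) * H
    ui = np.clip(u.astype(int), 0, W - 1)
    vi = np.clip(v.astype(int), 0, H - 1)
    return pano[vi, ui]

face = equirect_to_front_face(np.random.rand(800, 1600, 3))
print(face.shape)  # (256, 256, 3)
```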
Both benchmarks highlight the open problem of unified mesh-mesh distance metrics and call for reconstruction/rendering algorithms that explicitly model omnidirectional distortions. Joint optimization of camera models and neural radiance fields under these conditions is designated a major future direction (Ito et al., 26 May 2025).
6. Impact and Future Directions
Omni3D and OB3D provide comprehensive annotation, diverse modalities, and rigorous evaluation for cross-domain 3D vision tasks. Omni3D’s scale, unified class vocabulary, and virtual-depth formulation make it a strong pre-training corpus, accelerating convergence and yielding high performance in few-shot settings (Brazil et al., 2022). OB3D’s controlled synthetic scenes and rich ground truth facilitate granular analysis and drive algorithmic improvements for omnidirectional reconstruction and view synthesis.
Planned extensions include:
- Integration of real-world omnidirectional imagery in OB3D, with corresponding depth/normal ground truth and explicit lens distortion modeling.
- Expanded evaluation protocols toward full-scene mesh accuracy and omnidirectional radiance-field learning.
- Continued research into distortion-compensating reconstructions and mesh-mesh similarity for partial spherical views.
Omni3D-Bench thereby establishes new standards for scale, diversity, and evaluation rigor in both object-centric and scene-centric 3D vision, serving as a foundation for future algorithmic development and comprehensive, reproducible benchmarking.