
Omni3D Benchmark: Monocular 3D Object Detection

Updated 29 November 2025
  • Omni3D Benchmark is a large-scale unified evaluation suite for 3D object detection that consolidates six major datasets with over 3 million 3D annotations.
  • It standardizes camera intrinsics, coordinate conventions, and 3D IoU metrics to enable consistent performance comparisons across indoor, outdoor, and wild scenes.
  • The benchmark has catalyzed advanced model designs and demonstrated significant improvements in zero-shot transfer and cross-domain generalization.

The Omni3D Benchmark is a large-scale standardized evaluation suite for monocular 3D object detection, designed to unify and extend the evaluation of detection algorithms across broad real-world domains, varied camera intrinsics, and many object categories. Developed amid the growing capability and diversity of 2D and 3D visual recognition systems, Omni3D brings together diverse existing datasets, re-annotates them under a single protocol, and establishes a rigorous, multi-faceted evaluation methodology for 3D object detection in the wild (Brazil et al., 2022). Its emergence has catalyzed a wave of new model developments and benchmarking protocols, serving as a reference point for recent vision-language and foundation models evaluated in a 3D geometry-aware manner (Man et al., 25 Nov 2025).

1. Dataset Construction and Scope

Omni3D re-purposes and unifies six major public monocular 3D detection datasets: KITTI, nuScenes, ARKitScenes, SUN RGB-D, Objectron, and Hypersim. The aggregate result is 234,000 RGB images with over 3 million annotated 3D bounding boxes and 98 raw object categories, with 50 categories containing at least 1,000 samples. The benchmark includes both indoor (e.g., SUN RGB-D, ARKitScenes, Hypersim) and outdoor (e.g., KITTI, nuScenes) scenes, as well as "in the wild" close-ups (Objectron).

All images are re-annotated into a single camera coordinate convention (+x right, +y down, +z forward), with consistent 2D and 3D box encodings and camera intrinsics recorded for every image. Camera parameters vary widely: image heights range from roughly 370 to 1920 pixels and focal lengths from 500 to 1700 px, with each image carrying explicit intrinsics $(f_x, f_y, p_x, p_y)$. Train/validation/test splits follow the source datasets' protocols or, where unspecified, respect video-sequence boundaries: 175,573 train, 19,127 validation, and 39,452 test images.
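Because intrinsics are provided per image, a detector can reason in metric camera space regardless of the capturing sensor. The following is a minimal sketch of pinhole projection under Omni3D's coordinate convention (+x right, +y down, +z forward); the function name and the example intrinsics are illustrative, not from the benchmark code:

```python
def project_to_image(point, fx, fy, px, py):
    """Project a 3D point in camera coordinates (+x right, +y down,
    +z forward) onto the image plane with pinhole intrinsics."""
    x, y, z = point
    u = fx * x / z + px
    v = fy * y / z + py
    return u, v

# A point 10 m in front of the camera and 1 m to the right:
u, v = project_to_image((1.0, 0.0, 10.0), fx=700.0, fy=700.0, px=640.0, py=360.0)
# u = 700 * 1/10 + 640 = 710.0; v = 360.0
```

Handling such per-image intrinsics explicitly is what allows a single model to train on images from all six source datasets at once.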

A summary table of key dataset statistics:

| Attribute | Value |
| --- | --- |
| Images | 234,000 |
| 3D box annotations | >3,000,000 |
| Object categories | 98 (50 with ≥1,000 instances) |
| Train / val / test split | 175,573 / 19,127 / 39,452 |
| Source datasets | KITTI, nuScenes, ARKitScenes, SUN RGB-D, Objectron, Hypersim |

In size and diversity, Omni3D is roughly 20× larger than prior single-domain benchmarks such as KITTI or SUN RGB-D (Brazil et al., 2022).

2. Benchmark Tasks, Protocols, and Metrics

The main Omni3D task is monocular 3D object detection from a single RGB image $I$ and known camera intrinsics $\mathbf{K}$. Each detection $B_i$ comprises a 3D cuboid: center $\mathbf{X}_i \in \mathbb{R}^3$, size $(w_i, h_i, \ell_i)$, and orientation $R_i \in SO(3)$.

The canonical box representation is
$$B_{3D}(u, v, z, \bar w, \bar h, \bar\ell, \mathbf{p}) = R(\mathbf{p})\, D(\bar w, \bar h, \bar\ell)\, B_{\text{unit}} + \mathbf{X}(u, v, z)$$
where $B_{\text{unit}}$ is a unit cube, $D$ is a diagonal scaling with category mean dimensions, and $\mathbf{p} \in \mathbb{R}^6$ is a continuous ("allocentric") rotation vector.
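The composition above (rotate a scaled unit cube, then translate) can be sketched in code. The helper names are hypothetical; the 6D-to-rotation recovery below uses a standard Gram-Schmidt construction, and the example passes full box dimensions directly rather than scaling category means, so it illustrates the structure rather than the paper's exact parameterization:

```python
import numpy as np

def rotation_from_6d(p):
    """Recover a rotation matrix from a continuous 6D representation
    (two 3-vectors, orthonormalized by Gram-Schmidt)."""
    a, b = p[:3], p[3:]
    r1 = a / np.linalg.norm(a)
    b = b - (r1 @ b) * r1
    r2 = b / np.linalg.norm(b)
    r3 = np.cross(r1, r2)
    return np.stack([r1, r2, r3], axis=0)

def box_corners(center, dims, p):
    """B_3D = R(p) D(dims) B_unit + X: corners of an oriented 3D cuboid."""
    # Corners of a unit cube centered at the origin (edge length 1).
    unit = np.array([[x, y, z] for x in (-0.5, 0.5)
                                for y in (-0.5, 0.5)
                                for z in (-0.5, 0.5)])
    R = rotation_from_6d(np.asarray(p, dtype=float))
    scaled = unit * np.asarray(dims)           # D(w, h, l) B_unit
    return scaled @ R.T + np.asarray(center)   # rotate, then translate

corners = box_corners(center=(0, 0, 10), dims=(2, 1, 4), p=(1, 0, 0, 0, 1, 0))
# 8 corners; with this identity rotation they span [-1,1] x [-0.5,0.5] x [8,12]
```

The 6D rotation representation is continuous, which avoids the discontinuities of angle-based encodings during regression.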

Evaluation uses 3D Intersection-over-Union (IoU), with average precision (AP) reported over a spectrum of relaxed thresholds $\mathcal{T} = \{0.05, 0.10, \dots, 0.50\}$:
$$\mathrm{AP}_{3D} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \mathrm{AP}_{3D}(\tau)$$
Additional metrics include depth-stratified scores (near/mid/far zones) and per-category results.
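The averaging over thresholds can be sketched as follows, assuming a per-threshold AP function is available (the toy AP curve here is purely illustrative):

```python
def mean_ap3d(ap_at_threshold):
    """Average AP_3D over the relaxed IoU thresholds T = {0.05, ..., 0.50},
    given a callable returning AP at a single IoU threshold."""
    thresholds = [round(0.05 * k, 2) for k in range(1, 11)]
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)

# Toy example: AP that falls off linearly as the IoU threshold tightens.
ap = mean_ap3d(lambda t: max(0.0, 1.0 - t))
```

Averaging over a range of thresholds rewards both loose localization (low τ) and precise box fitting (high τ), analogous to COCO-style AP averaging in 2D.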

3. Model Baselines and Performance Results

Omni3D serves as the testbed for a range of representative monocular 3D detectors, both legacy and recent. Baselines evaluated include M3D-RPN, SMOKE, FCOS3D, PGD, ImVoxelNet, GUPNet, and the unified Cube R-CNN. On both the outdoor-only (KITTI + nuScenes) and full Omni3D benchmarks, Cube R-CNN reports state-of-the-art scores:

| Method | Outdoor AP$_{3D}$ | Full AP$_{3D}$ |
| --- | --- | --- |
| M3D-RPN | 13.7 | – |
| SMOKE | 19.5 | 9.6 |
| FCOS3D | 17.6 | 9.8 |
| PGD | 22.9 | 11.2 |
| ImVoxelNet | 21.5 | 9.4 |
| GUPNet | 19.9 | – |
| Cube R-CNN | 31.9 | 23.3 |

For reference, Cube R-CNN also outperforms prior methods on individual test sets, e.g., SUN RGB-D and KITTI.

Recent works leverage Omni3D to evaluate vision-LLMs with explicit 3D output. LocateAnything3D achieves 49.89 AP$_{3D}$, an improvement of +15.51 absolute points over the previous best baseline given ground-truth 2D boxes (DetAny3D + GT-2D at 34.38 AP$_{3D}$), demonstrating the benchmark's role in tracking the state of the art (Man et al., 25 Nov 2025).

4. Architectural and Methodological Advances

Omni3D's diversity and explicit annotation protocols drive several key architectural changes in 3D recognition systems. Cube R-CNN is the first unified, end-to-end monocular 3D detector with broad generalization across domains, handling arbitrary camera intrinsics, rotations, and indoor/outdoor scenes via:

  • A backbone of DLA-34 with Feature Pyramid Network, pretrained on ImageNet.
  • An RPN with a learned "IoUness" regressor in place of objectness.
  • A dedicated "cube head" that regresses RoI features to 13 outputs per class: 2D center, virtual depth $z_v$ (mapped back to metric depth $z$), size, 6D rotation, and a learned uncertainty $\mu$.
  • Loss functions disentangle 3D parameters, include a joint Chamfer loss, and utilize uncertainty-aware weighting.
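The uncertainty-aware weighting mentioned in the last bullet can be illustrated with a standard aleatoric-uncertainty form, where a confident prediction (small $\mu$) keeps its full loss and an uncertain one is down-weighted but pays a regularization penalty. This is a sketch of the general technique, not necessarily Cube R-CNN's exact loss:

```python
import math

def uncertainty_weighted_loss(l3d, mu):
    """Weight a 3D regression loss by a learned uncertainty mu
    (aleatoric-uncertainty style): exp(-mu) down-weights the loss
    for uncertain predictions, while the +mu term discourages the
    network from claiming high uncertainty everywhere."""
    return math.exp(-mu) * l3d + mu

# mu = 0 leaves the loss unchanged; larger mu trades loss for penalty.
full = uncertainty_weighted_loss(1.0, 0.0)
damped = uncertainty_weighted_loss(1.0, 1.0)
```

At inference, the same $\mu$ can be reused to rescale detection confidence, so poorly localized boxes are ranked lower.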

LocateAnything3D introduces a "Chain-of-Sight" (CoS) factorization within a vision-LLM: 2D object prediction followed by 3D box emission, autoregressively sorted near-to-far and factored into center, dimensions, and rotation. This protocol aligns with how depth cues are acquired, leading to large empirical gains (Man et al., 25 Nov 2025).
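The near-to-far factorization can be sketched schematically. The token structure and field names below are hypothetical and only illustrate the ordering idea (2D evidence first, then 3D boxes emitted in order of increasing depth, each factored into center, dimensions, and rotation):

```python
def serialize_chain_of_sight(detections):
    """Hypothetical sketch of a Chain-of-Sight output sequence:
    2D boxes first, then 3D boxes sorted near-to-far by center depth,
    each factored into center, dimensions, and rotation tokens."""
    ordered = sorted(detections, key=lambda d: d["center"][2])  # near to far
    tokens = [("box2d", d["box2d"]) for d in ordered]
    for d in ordered:
        tokens.append(("center", d["center"]))
        tokens.append(("dims", d["dims"]))
        tokens.append(("rot", d["rot"]))
    return tokens

dets = [
    {"box2d": (10, 10, 50, 50), "center": (0.0, 0.0, 20.0), "dims": (1, 1, 1), "rot": (0, 0, 0)},
    {"box2d": (60, 10, 90, 40), "center": (0.0, 0.0, 5.0),  "dims": (2, 1, 1), "rot": (0, 0, 0)},
]
tokens = serialize_chain_of_sight(dets)  # the depth-5 object is emitted first
```

Ordering emissions by depth lets earlier (nearer, usually easier) boxes condition the autoregressive prediction of farther ones.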

5. Pretraining, Transfer, and Zero-Shot Cross-Domain Utility

Omni3D is explicitly designed as a universal pretraining corpus for 3D detection. Pretraining Cube R-CNN on full Omni3D and fine-tuning on a target dataset provides significant transfer improvements:

  • ARKitScenes: 38.6 → 41.2 AP$_{3D}$ (+2.6 points)
  • KITTI: 37.1 → 42.4 AP$_{3D}$ (+5.3 points)

In low-shot regimes (5% of the target data), fine-tuning from Omni3D pretraining recovers over 70% of the full-data AP$_{3D}$, well ahead of ImageNet-based initialization. Cross-dataset zero-shot transfer (e.g., train on KITTI, test on nuScenes) also shows best-in-class generalization. This universal property is now directly exploited by vision-LLMs such as LocateAnything3D, which show strong zero-shot robustness to held-out categories (Man et al., 25 Nov 2025).

6. Significance, Impact, and Future Outlook

Omni3D is the first large-scale, multi-domain, many-category monocular 3D detection evaluation suite, comparable in impact to COCO for 2D recognition. It supports step-changes in model design, evaluation standardization, and research on generalization to novel domains and long-tailed object classes (Brazil et al., 2022).

Recent developments using Omni3D as a foundation include VLM-native models with open-vocabulary and visual-prompting, directly leveraging its diversity and annotation protocol (Man et al., 25 Nov 2025). The benchmark's protocol for camera intrinsics, coordinate conventions, and 3D IoU is now widely adopted.

This suggests Omni3D will remain central in evaluating and driving advances in both traditional and foundation-style multi-modal 3D recognition pipelines. Continued expansion to cover 3D generative evaluation, open-vocabulary detection, and richer modalities (e.g., multi-view, omnidirectional, and real-scanned 3D) is a plausible direction for successor benchmarks.
