OmniObject3D: 3D Object Dataset
- OmniObject3D is a large-scale, diverse dataset of 3D objects that provides detailed scans, textured meshes, multi-view images, and 360° videos for comprehensive 3D analysis.
- Its OO3D-9D extension enhances the core dataset with per-frame photorealistic RGB-D renders and precise 9D pose, size, and symmetry annotations for robust category-level pose estimation.
- Implicit-Zoo transforms OmniObject3D objects into neural implicit representations, enabling effective 3D pose regression and fostering cross-modal learning in computer vision.
The OmniObject3D dataset and its derivatives represent a foundational resource for research in 3D object recognition, reconstruction, and generative modeling in computer vision, robotics, and graphics. The core dataset and its major extensions—most notably OO3D-9D and its inclusion in Implicit-Zoo—enable large-scale and diverse benchmarking for tasks ranging from category-level object pose estimation to 3D neural implicit representation learning.
1. Definition and Scope
OmniObject3D is a large-scale, professionally scanned dataset comprising 6,000 distinct real-world objects spanning 190 everyday categories, each aligned to canonical orientations and taxonomy standards adopted from major 2D datasets such as ImageNet and LVIS (Wu et al., 2023). Objects are scanned using high-fidelity sensors, with the data modalities encompassing textured 3D meshes, multiview photorealistic renders, point clouds, and 360° object-centered videos (with aligned camera parameters), supporting multimodal and cross-modal research. With fine-grained category labeling (mean ≈30 objects per class), OmniObject3D is constructed to facilitate exploration of robustness, generalization, and cross-modality in large-vocabulary 3D vision.
Derivatives such as OO3D-9D (Cai et al., 19 Mar 2024) and its use in Implicit-Zoo (Ma et al., 25 Jun 2024) significantly expand the dataset's utility: OO3D-9D augments OmniObject3D with per-frame photorealistic RGB-D renderings, ground-truth 9D pose and size annotations, and symmetry metadata for category-level recognition and pose estimation; Implicit-Zoo provides a large-scale collection of neural implicit representations (fitted NeRFs) using OmniObject3D objects as inputs.
2. Data Modalities and Annotation Protocols
Core Data Modalities
| Modality | Properties | Notes |
|---|---|---|
| 3D Mesh | Textured, watertight (when possible), OBJ/PLY/MTL | 50K–2M faces, real scans |
| Point Cloud | 1K–16K points, multi-resolution, Open3D generated | Provided per-mesh |
| Multiview Images | 100 views/object (800×800 px, RGB, depth, normals) | Blender, known extrinsics |
| 360° Videos | 200 frames/object, per-frame COLMAP pose, masks | iPhone 12 footage, scale-calibrated |
Annotation includes per-object canonical orientation and per-category scale normalization, with rigorous quality control (manual inspection and blur/mask filtering retaining more than 80% of scans).
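The per-mesh point clouds are generated from the scanned meshes with Open3D. A minimal sketch of that sampling step follows; the file paths and sample counts here are illustrative assumptions, not the dataset's actual processing script.

```python
import open3d as o3d

# Load a scanned, textured mesh (path is illustrative) and sample
# multi-resolution point clouds, as provided per object in the dataset.
mesh = o3d.io.read_triangle_mesh("omniobject3d/toy_truck/mesh.obj")
for n_points in (1024, 4096, 16384):
    pcd = mesh.sample_points_uniformly(number_of_points=n_points)
    o3d.io.write_point_cloud(f"toy_truck_{n_points}.ply", pcd)
```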
OO3D-9D Annotation Protocol
OO3D-9D extends OmniObject3D by adding:
- Per-frame single-object and cluttered multi-object photorealistic RGB-D renders (BlenderProc2, with BOP format compatible outputs).
- 9D pose annotation for every object in every image: rotation $R \in SO(3)$, translation $\mathbf{t} \in \mathbb{R}^3$, and size $\mathbf{s} \in \mathbb{R}^3$ (per-axis bounding-box extents).
- Symmetry metadata, categorized into:
- Non-symmetric
- Discrete symmetric (e.g., $n$-fold rotations, given as a set of rotation matrices $\{S_1, \dots, S_n\}$)
- Continuous symmetric (symmetry axes specified as unit vectors $\mathbf{a} \in \mathbb{R}^3$)
- BOP-format extensions in `models_info.json` with new symmetry fields.
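In the BOP convention that OO3D-9D extends, `models_info.json` stores discrete symmetries as flattened 4×4 transforms and continuous symmetries as axis/offset entries. A minimal parsing sketch under that assumption (field names follow the standard BOP format; the OO3D-9D additions may differ):

```python
import json
import numpy as np

def load_symmetries(models_info_path):
    """Parse per-model symmetry annotations from a BOP-style models_info.json."""
    with open(models_info_path) as f:
        models_info = json.load(f)

    symmetries = {}
    for obj_id, info in models_info.items():
        # Discrete symmetries: each entry is a flattened 4x4 rigid transform.
        discrete = [np.array(m, dtype=np.float64).reshape(4, 4)
                    for m in info.get("symmetries_discrete", [])]
        # Continuous symmetries: each entry gives a rotation axis and an offset.
        continuous = [{"axis": np.array(s["axis"], dtype=np.float64),
                       "offset": np.array(s["offset"], dtype=np.float64)}
                      for s in info.get("symmetries_continuous", [])]
        symmetries[obj_id] = {"discrete": discrete, "continuous": continuous}
    return symmetries
```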
Implicit-Zoo Representation
OmniObject3D objects are converted for Implicit-Zoo by fitting a vanilla NeRF (Ma et al., 25 Jun 2024) to each object:
- Input: 96 center-cropped RGB images at 400×400 px (downsampled from the original 800×800).
- Output: NeRF weights (MLP with four layers, width 128, 5.96 GB for all objects), plus per-view camera parameters.
Scenes failing to reach a PSNR of at least 25 dB are excluded, reducing the 5,914 initial fits to 5,287 high-quality INR object scenes released for downstream tasks.
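For reference, a minimal PyTorch sketch of an MLP with the stated capacity (four layers, width 128). The positional-encoding frequencies, output activations, and absence of view-direction conditioning are assumptions for illustration and may differ from the actual Implicit-Zoo fitting code.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP matching the stated capacity (4 layers, width 128).

    Maps a positionally encoded 3D point to (RGB, density). Illustrative sketch,
    not the exact Implicit-Zoo architecture.
    """

    def __init__(self, num_freqs=6, width=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 + 3 * 2 * num_freqs  # xyz + sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 4),  # RGB (3) + density (1)
        )

    def positional_encoding(self, x):
        feats = [x]
        for i in range(self.num_freqs):
            feats += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, xyz):
        h = self.mlp(self.positional_encoding(xyz))
        rgb = torch.sigmoid(h[..., :3])
        sigma = torch.relu(h[..., 3:])
        return rgb, sigma
```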
3. Dataset Scale, Splits, and Benchmarks
OmniObject3D
- 6,000 objects, 190 categories, 10–40 instances per class.
- Four benchmark tracks: robust 3D perception, novel-view synthesis, surface reconstruction, 3D object generation.
- Category taxonomy aligned with 2D datasets, sharing 85 classes with ImageNet and 130 with LVIS.
OO3D-9D
- 5,371 unique object instances, 216 categories, 1,000 single-object RGB-D views per instance (≈5.37M frames)
- 100,000 multi-object clutter scenes, 5 camera views each (≈500,000 frames; each frame has 5–20 objects)
- Symmetry instance breakdown: 2,000 non-symmetric, 1,130 discrete-symmetric, 2,065 continuous-symmetric
- Test splits:
- 10 “unseen” categories (230 instances)
- 214 “novel instances” (held-out models within existing categories)
Dataset Comparison Table
| Dataset | # instances | # categories | # images | GT (T, s)? |
|---|---|---|---|---|
| CAMERA25 | 184 | 6 | 300,000 | ✓ |
| REAL275 | 24 | 6 | 8,000 | ✓ |
| Wild6D (train) | 1,560 | 5 | 1,000,000 | ✗ |
| Wild6D (test) | 162 | 5 | 100,000 | ✓ |
| FS6D | 12,490 | 51 | 800,000 | ✓ |
| **OO3D-9D** | 5,371 | 216 | 5,371,000 | ✓ |
Implicit-Zoo
- Initial fits: 5,914 NeRF-scene pairs from 5,998 OmniObject3D objects
- After filtering for PSNR ≥ 25 dB: 5,287 released high-quality INR objects
- Storage: 5.96 GB; mean PSNR 31.5 dB, std. dev. 3.87 dB
- GPU cost: 70 days (RTX-2080)
4. Evaluation Protocols and Metrics
OO3D-9D: 9D Object Pose and Size Estimation
Task: Given a scene image $I$ and a text description $T$ of the target object, predict position $\mathbf{t} \in \mathbb{R}^3$, orientation $R \in SO(3)$, and size $\mathbf{s} \in \mathbb{R}^3$ for arbitrary novel objects.
Supervision: Per-pixel SmoothL1 loss between predicted and ground-truth NOCS maps, with symmetry-augmented loss minimization.
- NOCS normalization: following the standard NOCS convention, each object point $\mathbf{x}$ is mapped as $\mathbf{x}_{\mathrm{NOCS}} = (\mathbf{x} - \mathbf{c})/d + 0.5$, where $\mathbf{c}$ is the center of the object's tight bounding box and $d$ its diagonal length, so that all coordinates lie in $[0,1]^3$.
- Symmetry handling:
- Discrete symmetry: supervise over the set of equivalent ground-truth NOCS maps generated by the symmetry rotations $\{S_i\}$; the loss-minimal assignment is selected per training pair.
- Continuous symmetry: sample small rotations about the symmetry axis $\mathbf{a}$ to form approximately equivalent targets (see the loss sketch after this list).
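A minimal PyTorch sketch of the symmetry-augmented supervision described above. The tensor names and shapes (`pred_nocs`, `gt_nocs`, `sym_rotations`) are illustrative assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

def symmetry_aware_nocs_loss(pred_nocs, gt_nocs, sym_rotations):
    """Min-over-symmetries SmoothL1 loss on NOCS maps (illustrative sketch).

    pred_nocs:     (N, 3) predicted NOCS coordinates for foreground pixels
    gt_nocs:       (N, 3) ground-truth NOCS coordinates
    sym_rotations: (K, 3, 3) rotation matrices of equivalent object symmetries
                   (identity only for non-symmetric objects; sampled rotations
                   about the symmetry axis for continuous symmetries)
    """
    center = 0.5  # NOCS coordinates are centered at 0.5 before rotating
    losses = []
    for R in sym_rotations:
        # Rotate the ground truth into each symmetry-equivalent frame.
        gt_rot = (gt_nocs - center) @ R.T + center
        losses.append(F.smooth_l1_loss(pred_nocs, gt_rot, reduction="mean"))
    # Supervise against the equivalent ground truth with the smallest loss.
    return torch.stack(losses).min()
```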
Pose/Size Recovery
The Umeyama algorithm with RANSAC is used to solve for the similarity transform $(s, R, \mathbf{t})$ from 3D-3D correspondences between predicted NOCS coordinates $\{\mathbf{p}_i\}$ and back-projected depth points $\{\mathbf{q}_i\}$:
$\min_{s, R, \mathbf{t}} \sum_i \left\| \mathbf{q}_i - \left( s R \mathbf{p}_i + \mathbf{t} \right) \right\|^2$
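A compact NumPy sketch of the Umeyama solve, without the RANSAC outlier loop (which would wrap repeated calls on sampled correspondence subsets):

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst.

    src, dst: (N, 3) corresponding 3D points (e.g., predicted NOCS coordinates
    and back-projected depth points).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between centered target and source points.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```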
Evaluation Metrics
- 3D IoU@50: fraction of predictions whose axis-aligned 3D bounding-box IoU with the ground truth is $\geq 0.5$
- Absolute rotation–translation precision ($n^\circ\,m$ cm): fraction of predictions with rotation error below $n^\circ$ and translation error below $m$ cm against the ground-truth pose (e.g., 5° 5 cm; see the sketch below)
- Relative pose consistency: the same $n^\circ\,m$ cm criterion evaluated on relative rather than absolute poses
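A small sketch of how the rotation and translation errors behind these thresholds are typically computed; in practice, symmetry-equivalent ground-truth rotations are additionally handled by taking the minimum error over the symmetry set.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error (degrees) and translation error between two poses."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err = np.linalg.norm(t_pred - t_gt)
    return rot_err_deg, trans_err

def n_deg_m_cm(rot_err_deg, trans_err_m, n=5.0, m_cm=5.0):
    """True if a prediction satisfies the n-degree, m-centimeter criterion."""
    return rot_err_deg < n and trans_err_m * 100.0 < m_cm
```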
Performance on unseen-category split for OV9D backbone:
- abs IoU@50 = 87.8%
- abs 5° 5 cm = 15.8%
- rel 5° 5 cm = 26.8%
- PCA baseline: 53%; closed-set SOTA (IST-Net): 4.5% on relative 5° 5 cm
Implicit-Zoo: 3D Pose Regression from Neural Implicit Representations
Task: Given a color image $I$ of one of the 5,287 released OmniObject3D scenes and the corresponding fitted NeRF $f_\theta$, regress the unknown 6DoF camera pose $(R, \mathbf{t}) \in SE(3)$.
Pipeline:
- Sample the NeRF on a regular 3D grid and tokenize the samples via 3D convolution or a learnable tokenizer
- A ViT backbone (12 layers, 3 heads, embedding dim 192) fuses 3D and 2D tokens and outputs a coarse pose
- Further refinement by minimizing a NeRF-based photometric loss (see the sketch after this list)
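A minimal sketch of the photometric refinement step. `nerf_render` is a hypothetical differentiable rendering function, and the axis-angle update parameterization is an assumption for illustration, not the exact Implicit-Zoo procedure.

```python
import torch

def apply_delta(pose_init, delta):
    """Compose a small axis-angle + translation update with a 4x4 initial pose."""
    omega, t = delta[:3], delta[3:]
    zero = torch.zeros((), dtype=delta.dtype)
    skew = torch.stack([
        torch.stack([zero, -omega[2], omega[1]]),
        torch.stack([omega[2], zero, -omega[0]]),
        torch.stack([-omega[1], omega[0], zero]),
    ])
    R = torch.matrix_exp(skew) @ pose_init[:3, :3]
    t_new = pose_init[:3, 3] + t
    bottom = torch.tensor([[0.0, 0.0, 0.0, 1.0]], dtype=pose_init.dtype)
    return torch.cat([torch.cat([R, t_new.unsqueeze(-1)], dim=1), bottom], dim=0)

def refine_pose(nerf_render, image, pose_init, num_steps=100, lr=1e-2):
    """Gradient-based refinement of a coarse pose against a NeRF photometric loss.

    nerf_render(pose) is assumed to differentiably render the scene's fitted
    NeRF from a 4x4 camera-to-world pose; pose_init is the coarse ViT output.
    """
    delta = torch.zeros(6, requires_grad=True)  # [axis-angle (3), translation (3)]
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        rendered = nerf_render(apply_delta(pose_init, delta))  # (H, W, 3)
        loss = ((rendered - image) ** 2).mean()                # photometric MSE
        loss.backward()
        optimizer.step()
    return apply_delta(pose_init, delta).detach()
```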
Metrics:
- Translation Error (TE, cm)
- Rotation Error (RE, deg), reported both as a mean and as the percentage of frames with RE below a threshold of β degrees
- Best performance on “seen” scenes: TE = 3.12 cm, RE = 14.59°, reduced to 4.22° after NeRF-based photometric refinement (RE+Ref)
- On “unseen” scenes: TE = 5.99 cm, RE = 20.02°, RE+Ref = 8.09°
This constitutes the first large-scale, generalizable benchmark for 3D pose regression from neural implicit functions.
5. Distinguishing Features and Advances
Compared to prior datasets:
- Scale: OO3D-9D offers the largest category count (216) and instance count (5,371) with full 9D pose-size ground truth, surpassing prior datasets by at least an order of magnitude in combined scope and diversity.
- Category symmetry: Systematic per-instance symmetry annotation (both discrete and continuous), directly encoded in BOP format, enables symmetry-aware training, loss, and evaluation workflows not previously standardized in category-level pose datasets.
- Photorealistic rendering: Large-scale multi-view image synthesis and domain randomization contribute to high scene diversity.
- Neural implicit integration: Implicit-Zoo’s use of OmniObject3D establishes the first large-scale library of per-object NeRFs for generic objects, facilitating research into implicit 3D understanding, pose regression, and tokenization strategies.
6. Practical Availability and Usage
OmniObject3D and its derivatives are released for research under licenses restricting use to non-commercial purposes. The official project site provides dataset downloads, documentation, and scripts for benchmarking across the various tasks (https://omniobject3d.github.io/).
- OO3D-9D: BOP-format images, masks, depth, ground-truth pose/size, and symmetry metadata enable drop-in use for object pose, category-level recognition, and open-vocabulary 3D understanding tasks.
- Implicit-Zoo: Released as MLP-weight dumps, with scripts for tokenization and downstream evaluation; the original meshes are not included, but approximate point clouds can be recovered via density queries (see the sketch after this list).
- Processing pipelines: BlenderProc2 for rendering, Open3D for point sampling, and COLMAP for video camera trajectory recovery are all integrated into dataset construction.
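A minimal sketch of the density-query route to point clouds mentioned above. The `density_fn` callable, grid bound, and occupancy threshold are illustrative assumptions.

```python
import numpy as np
import torch

def nerf_to_point_cloud(density_fn, resolution=128, bound=1.0, threshold=10.0):
    """Recover an approximate point cloud from a fitted NeRF's density field.

    density_fn: callable mapping an (N, 3) tensor of points inside
                [-bound, bound]^3 to (N,) volume densities (sigma).
    """
    axis = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    points = grid.reshape(-1, 3)

    with torch.no_grad():
        sigma = density_fn(points)

    # Keep grid points whose density exceeds the occupancy threshold.
    occupied = points[sigma > threshold]
    return occupied.cpu().numpy()
```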
7. Research Impact and Current Limitations
OmniObject3D and its OO3D-9D and Implicit-Zoo derivatives have established benchmark standards and large-scale, high-fidelity resources for studying category-level 3D perception, open-vocabulary recognition, and implicit 3D representation learning. Notable outcomes include:
- Disentangling category symmetry in pose recovery and annotation pipelines.
- Advancing open-vocabulary pose estimation with text-conditioned models leveraging synthetic photorealistic data and strong visual-language priors.
- Enabling token-based transformer models for 3D understanding via neural implicit volumetric representations.
Limitations observed include semantic imbalance among categories, generation bias in mesh/texture synthesis, challenges in pose and shape recovery for highly concave or low-texture shapes, and limited robustness on sparse-view geometric and photometric benchmarks. A plausible implication is the need for further research into shape-texture disentanglement and cross-category generalization.
Further advances are anticipated in large-scale category-level 3D learning, robust implicit modeling, and the integration of textual and visual priors for open-set, real-world object understanding.