OmniObject3D: 3D Object Dataset
- OmniObject3D is a large-scale, diverse dataset of 3D objects that provides detailed scans, textured meshes, multi-view images, and 360° videos for comprehensive 3D analysis.
- Its OO3D-9D extension enhances the core dataset with per-frame photorealistic RGB-D renders and precise 9D pose, size, and symmetry annotations for robust category-level pose estimation.
- Implicit-Zoo transforms OmniObject3D objects into neural implicit representations, enabling effective 3D pose regression and fostering cross-modal learning in computer vision.
The OmniObject3D dataset and its derivatives represent a foundational resource for research in 3D object recognition, reconstruction, and generative modeling in computer vision, robotics, and graphics. The core dataset and its major extensions—most notably OO3D-9D and its inclusion in Implicit-Zoo—enable large-scale and diverse benchmarking for tasks ranging from category-level object pose estimation to 3D neural implicit representation learning.
1. Definition and Scope
OmniObject3D is a large-scale, professionally scanned dataset comprising 6,000 distinct real-world objects spanning 190 everyday categories, each aligned to canonical orientations and taxonomy standards adopted from major 2D datasets such as ImageNet and LVIS (Wu et al., 2023). Objects are scanned using high-fidelity sensors, with the data modalities encompassing textured 3D meshes, multiview photorealistic renders, point clouds, and 360° object-centered videos (with aligned camera parameters), supporting multimodal and cross-modal research. With fine-grained category labeling (mean ≈30 objects per class), OmniObject3D is constructed to facilitate exploration of robustness, generalization, and cross-modality in large-vocabulary 3D vision.
Derivatives such as OO3D-9D (Cai et al., 19 Mar 2024) and its use in Implicit-Zoo (Ma et al., 25 Jun 2024) significantly expand the dataset's utility: OO3D-9D augments OmniObject3D with per-frame photorealistic RGB-D renderings, ground-truth 9D pose and size annotations, and symmetry metadata for category-level recognition and pose estimation; Implicit-Zoo provides a large-scale collection of neural implicit representations (fitted NeRFs) using OmniObject3D objects as inputs.
2. Data Modalities and Annotation Protocols
Core Data Modalities
| Modality | Properties | Notes |
|---|---|---|
| 3D Mesh | Textured, watertight (when possible), OBJ/PLY/MTL | 50K–2M faces, real scans |
| Point Cloud | 1K–16K points, multi-resolution, Open3D generated | Provided per-mesh |
| Multiview Images | 100 views/object (800×800 px, RGB, depth, normals) | Blender, known extrinsics |
| 360° Videos | 200 frames/object, per-frame COLMAP pose, masks | iPhone 12 footage, scale-calibrated |
Annotation includes per-object canonical orientation and per-category scale normalization, with rigorous quality control (manual inspection and blur/mask filtering retaining more than 80% of scans).
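The per-mesh point clouds are generated from the scanned meshes with Open3D. A minimal sketch of that sampling step follows; the file paths and sample counts here are illustrative assumptions, not the dataset's actual processing script.

```python
import open3d as o3d

# Load a scanned, textured mesh (path is illustrative) and sample
# multi-resolution point clouds, as provided per object in the dataset.
mesh = o3d.io.read_triangle_mesh("omniobject3d/toy_truck/mesh.obj")
for n_points in (1024, 4096, 16384):
    pcd = mesh.sample_points_uniformly(number_of_points=n_points)
    o3d.io.write_point_cloud(f"toy_truck_{n_points}.ply", pcd)
```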
OO3D-9D Annotation Protocol
OO3D-9D extends OmniObject3D by adding:
- Per-frame single-object and cluttered multi-object photorealistic RGB-D renders (BlenderProc2, with BOP format compatible outputs).
- 9D pose annotation for every object in every image: rotation $R \in SO(3)$, translation $\mathbf{t} \in \mathbb{R}^3$, and size $\mathbf{s} \in \mathbb{R}^3$ (per-axis bounding-box extents).
- Symmetry metadata, categorized into:
- Non-symmetric
- Discrete symmetric (e.g., $n$-fold rotations, given as a set of rotation matrices $\{S_1, \dots, S_n\}$)
- Continuous symmetric (symmetry axes specified as unit vectors $\mathbf{a} \in \mathbb{R}^3$)
- BOP-format extensions in `models_info.json` with new symmetry fields.
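In the BOP convention that OO3D-9D extends, `models_info.json` stores discrete symmetries as flattened 4×4 transforms and continuous symmetries as axis/offset entries. A minimal parsing sketch under that assumption (field names follow the standard BOP format; the OO3D-9D additions may differ):

```python
import json
import numpy as np

def load_symmetries(models_info_path):
    """Parse per-model symmetry annotations from a BOP-style models_info.json."""
    with open(models_info_path) as f:
        models_info = json.load(f)

    symmetries = {}
    for obj_id, info in models_info.items():
        # Discrete symmetries: each entry is a flattened 4x4 rigid transform.
        discrete = [np.array(m, dtype=np.float64).reshape(4, 4)
                    for m in info.get("symmetries_discrete", [])]
        # Continuous symmetries: each entry gives a rotation axis and an offset.
        continuous = [{"axis": np.array(s["axis"], dtype=np.float64),
                       "offset": np.array(s["offset"], dtype=np.float64)}
                      for s in info.get("symmetries_continuous", [])]
        symmetries[obj_id] = {"discrete": discrete, "continuous": continuous}
    return symmetries
```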
Implicit-Zoo Representation
OmniObject3D objects are converted for Implicit-Zoo by fitting a vanilla NeRF (Ma et al., 25 Jun 2024) to each object:
- Input: 96 center-cropped RGB images at 400×400 px (downsampled from the original 800×800).
- Output: NeRF weights (MLP with four layers, width 128, 5.96 GB for all objects), plus per-view camera parameters.
Scenes failing to reach a PSNR of at least 25 dB are excluded, reducing the 5,914 initial fits to 5,287 high-quality INR object scenes released for downstream tasks.
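For reference, a minimal PyTorch sketch of an MLP with the stated capacity (four layers, width 128). The positional-encoding frequencies, output activations, and absence of view-direction conditioning are assumptions for illustration and may differ from the actual Implicit-Zoo fitting code.

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP matching the stated capacity (4 layers, width 128).

    Maps a positionally encoded 3D point to (RGB, density). Illustrative sketch,
    not the exact Implicit-Zoo architecture.
    """

    def __init__(self, num_freqs=6, width=128):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 + 3 * 2 * num_freqs  # xyz + sin/cos encodings
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 4),  # RGB (3) + density (1)
        )

    def positional_encoding(self, x):
        feats = [x]
        for i in range(self.num_freqs):
            feats += [torch.sin((2.0 ** i) * x), torch.cos((2.0 ** i) * x)]
        return torch.cat(feats, dim=-1)

    def forward(self, xyz):
        h = self.mlp(self.positional_encoding(xyz))
        rgb = torch.sigmoid(h[..., :3])
        sigma = torch.relu(h[..., 3:])
        return rgb, sigma
```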
3. Dataset Scale, Splits, and Benchmarks
OmniObject3D
- 6,000 objects, 190 categories, 10–40 instances per class.
- Four benchmark tracks: robust 3D perception, novel-view synthesis, surface reconstruction, 3D object generation.
- Category taxonomy aligned with 2D datasets, sharing 85 classes with ImageNet and 130 with LVIS.
OO3D-9D
- 5,371 unique object instances, 216 categories, 1,000 single-object RGB-D views per instance (≈5.37M frames)
- 100,000 multi-object clutter scenes, 5 camera views each (≈500,000 frames; each frame has 5–20 objects)
- Symmetry instance breakdown: 2,000 non-symmetric, 1,130 discrete-symmetric, 2,065 continuous-symmetric
- Test splits:
- 10 “unseen” categories (230 instances)
- 214 “novel instances” (held-out models within existing categories)
Dataset Comparison Table
| Dataset | # instances | # categories | # images | GT (T, s)? |
|---|---|---|---|---|
| CAMERA25 | 184 | 6 | 300,000 | ✓ |
| REAL275 | 24 | 6 | 8,000 | ✓ |
| Wild6D (train) | 1,560 | 5 | 1,000,000 | ✗ |
| Wild6D (test) | 162 | 5 | 100,000 | ✓ |
| FS6D | 12,490 | 51 | 800,000 | ✓ |
| **OO3D-9D** | 5,371 | 216 | 5,371,000 | ✓ |
Implicit-Zoo
- Initial fits: 5,914 NeRF-scene pairs from 5,998 OmniObject3D objects
- After filtering for PSNR ≥ 25 dB: 5,287 released high-quality INR objects
- Storage: 5.96 GB; mean PSNR 31.5 dB, std. dev. 3.87 dB
- GPU cost: 70 days (RTX-2080)
4. Evaluation Protocols and Metrics
OO3D-9D: 9D Object Pose and Size Estimation
Task: Given a scene image $I$ and a text description $T$ of the target object, predict position $\mathbf{t} \in \mathbb{R}^3$, orientation $R \in SO(3)$, and size $\mathbf{s} \in \mathbb{R}^3$ for arbitrary novel objects.
Supervision: Per-pixel SmoothL1 loss between predicted and ground-truth NOCS maps, with symmetry-augmented loss minimization.
- NOCS normalization: following the standard NOCS convention, each object point $\mathbf{x}$ is mapped as $\mathbf{x}_{\mathrm{NOCS}} = (\mathbf{x} - \mathbf{c})/d + 0.5$, where $\mathbf{c}$ is the center of the object's tight bounding box and $d$ its diagonal length, so that all coordinates lie in $[0,1]^3$.
- Symmetry handling:
- Discrete symmetry: supervise over the set of equivalent ground-truth NOCS maps generated by the symmetry rotations $\{S_i\}$; the loss-minimal assignment is selected per training pair.
- Continuous symmetry: sample small rotations about the symmetry axis $\mathbf{a}$ to form approximately equivalent targets (see the loss sketch after this list).
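A minimal PyTorch sketch of the symmetry-augmented supervision described above. The tensor names and shapes (`pred_nocs`, `gt_nocs`, `sym_rotations`) are illustrative assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

def symmetry_aware_nocs_loss(pred_nocs, gt_nocs, sym_rotations):
    """Min-over-symmetries SmoothL1 loss on NOCS maps (illustrative sketch).

    pred_nocs:     (N, 3) predicted NOCS coordinates for foreground pixels
    gt_nocs:       (N, 3) ground-truth NOCS coordinates
    sym_rotations: (K, 3, 3) rotation matrices of equivalent object symmetries
                   (identity only for non-symmetric objects; sampled rotations
                   about the symmetry axis for continuous symmetries)
    """
    center = 0.5  # NOCS coordinates are centered at 0.5 before rotating
    losses = []
    for R in sym_rotations:
        # Rotate the ground truth into each symmetry-equivalent frame.
        gt_rot = (gt_nocs - center) @ R.T + center
        losses.append(F.smooth_l1_loss(pred_nocs, gt_rot, reduction="mean"))
    # Supervise against the equivalent ground truth with the smallest loss.
    return torch.stack(losses).min()
```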
Pose/Size Recovery
The Umeyama algorithm with RANSAC is used to solve for the similarity transform $(s, R, \mathbf{t})$ from 3D-3D correspondences between predicted NOCS coordinates $\{\mathbf{p}_i\}$ and back-projected depth points $\{\mathbf{q}_i\}$:
$\min_{s, R, \mathbf{t}} \sum_i \left\| \mathbf{q}_i - \left( s R \mathbf{p}_i + \mathbf{t} \right) \right\|^2$
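A compact NumPy sketch of the Umeyama solve, without the RANSAC outlier loop (which would wrap repeated calls on sampled correspondence subsets):

```python
import numpy as np

def umeyama_similarity(src, dst):
    """Least-squares similarity transform (s, R, t) mapping src -> dst.

    src, dst: (N, 3) corresponding 3D points (e.g., predicted NOCS coordinates
    and back-projected depth points).
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between centered target and source points.
    cov = dst_c.T @ src_c / src.shape[0]
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```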
Evaluation Metrics
- 3D IoU@50: fraction of predictions whose axis-aligned 3D bounding-box IoU with the ground truth is $\geq 0.5$
- Absolute rotation–translation precision ($n^\circ\,m$ cm): fraction of predictions with rotation error below $n^\circ$ and translation error below $m$ cm against the ground-truth pose (e.g., 5° 5 cm; see the sketch below)
- Relative pose consistency: the same $n^\circ\,m$ cm criterion evaluated on relative rather than absolute poses
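A small sketch of how the rotation and translation errors behind these thresholds are typically computed; in practice, symmetry-equivalent ground-truth rotations are additionally handled by taking the minimum error over the symmetry set.

```python
import numpy as np

def pose_errors(R_pred, t_pred, R_gt, t_gt):
    """Rotation error (degrees) and translation error between two poses."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err = np.linalg.norm(t_pred - t_gt)
    return rot_err_deg, trans_err

def n_deg_m_cm(rot_err_deg, trans_err_m, n=5.0, m_cm=5.0):
    """True if a prediction satisfies the n-degree, m-centimeter criterion."""
    return rot_err_deg < n and trans_err_m * 100.0 < m_cm
```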
Performance on unseen-category split for OV9D backbone:
- abs IoU@50 = 87.8%
- abs 5° 5 cm = 15.8%
- rel 5° 5 cm = 26.8%
- PCA baseline: 53%; closed-set SOTA (IST-Net): 4.5% on relative 5° 5 cm
Implicit-Zoo: 3D Pose Regression from Neural Implicit Representations
Task: Given a color image $I$ of one of the 5,287 released OmniObject3D scenes and the corresponding fitted NeRF $f_\theta$, regress the unknown 6DoF camera pose $(R, \mathbf{t}) \in SE(3)$.
Pipeline:
- Sample the NeRF on a regular 3D grid and tokenize the samples via 3D convolution or a learnable tokenizer
- A ViT backbone (12 layers, 3 heads, embedding dim 192) fuses 3D and 2D tokens and outputs a coarse pose
- Further refinement by minimizing a NeRF-based photometric loss (see the sketch after this list)
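A minimal sketch of the photometric refinement step. `nerf_render` is a hypothetical differentiable rendering function, and the axis-angle update parameterization is an assumption for illustration, not the exact Implicit-Zoo procedure.

```python
import torch

def apply_delta(pose_init, delta):
    """Compose a small axis-angle + translation update with a 4x4 initial pose."""
    omega, t = delta[:3], delta[3:]
    zero = torch.zeros((), dtype=delta.dtype)
    skew = torch.stack([
        torch.stack([zero, -omega[2], omega[1]]),
        torch.stack([omega[2], zero, -omega[0]]),
        torch.stack([-omega[1], omega[0], zero]),
    ])
    R = torch.matrix_exp(skew) @ pose_init[:3, :3]
    t_new = pose_init[:3, 3] + t
    bottom = torch.tensor([[0.0, 0.0, 0.0, 1.0]], dtype=pose_init.dtype)
    return torch.cat([torch.cat([R, t_new.unsqueeze(-1)], dim=1), bottom], dim=0)

def refine_pose(nerf_render, image, pose_init, num_steps=100, lr=1e-2):
    """Gradient-based refinement of a coarse pose against a NeRF photometric loss.

    nerf_render(pose) is assumed to differentiably render the scene's fitted
    NeRF from a 4x4 camera-to-world pose; pose_init is the coarse ViT output.
    """
    delta = torch.zeros(6, requires_grad=True)  # [axis-angle (3), translation (3)]
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        rendered = nerf_render(apply_delta(pose_init, delta))  # (H, W, 3)
        loss = ((rendered - image) ** 2).mean()                # photometric MSE
        loss.backward()
        optimizer.step()
    return apply_delta(pose_init, delta).detach()
```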
Metrics:
- Translation Error (TE, cm)
- Rotation Error (RE, deg), reported both as a mean and as the percentage of frames with RE below a threshold of β degrees
- Best performance on “seen” scenes: TE = 3.12 cm, RE = 14.59°, reduced to 4.22° after NeRF-based photometric refinement (RE+Ref)
- On “unseen” scenes: TE = 5.99 cm, RE = 20.02°, RE+Ref = 8.09°
This constitutes the first large-scale, generalizable benchmark for 3D pose regression from neural implicit functions.
5. Distinguishing Features and Advances
Compared to prior datasets:
- Scale: OO3D-9D offers the largest category count (216) and instance count (5,371) with full 9D pose-size ground truth, surpassing prior datasets by at least an order of magnitude in combined scope and diversity.
- Category symmetry: Systematic per-instance symmetry annotation (both discrete and continuous), directly encoded in BOP format, enables symmetry-aware training, loss, and evaluation workflows not previously standardized in category-level pose datasets.
- Photorealistic rendering: Large-scale multi-view image synthesis and domain randomization contribute to high scene diversity.
- Neural implicit integration: Implicit-Zoo’s use of OmniObject3D establishes the first large-scale library of per-object NeRFs for generic objects, facilitating research into implicit 3D understanding, pose regression, and tokenization strategies.
6. Practical Availability and Usage
OmniObject3D and its derivatives are released for research under licenses restricting use to non-commercial purposes. The official project site provides dataset downloads, documentation, and scripts for benchmarking across the various tasks (https://omniobject3d.github.io/).
- OO3D-9D: BOP-format images, masks, depth, ground-truth pose/size, and symmetry metadata enable drop-in use for object pose, category-level recognition, and open-vocabulary 3D understanding tasks.
- Implicit-Zoo: Released as MLP-weight dumps, with scripts for tokenization and downstream evaluation; the original meshes are not included, but approximate point clouds can be recovered via density queries (see the sketch after this list).
- Processing pipelines: BlenderProc2 for rendering, Open3D for point sampling, and COLMAP for video camera trajectory recovery are all integrated into dataset construction.
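A minimal sketch of the density-query route to point clouds mentioned above. The `density_fn` callable, grid bound, and occupancy threshold are illustrative assumptions.

```python
import numpy as np
import torch

def nerf_to_point_cloud(density_fn, resolution=128, bound=1.0, threshold=10.0):
    """Recover an approximate point cloud from a fitted NeRF's density field.

    density_fn: callable mapping an (N, 3) tensor of points inside
                [-bound, bound]^3 to (N,) volume densities (sigma).
    """
    axis = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    points = grid.reshape(-1, 3)

    with torch.no_grad():
        sigma = density_fn(points)

    # Keep grid points whose density exceeds the occupancy threshold.
    occupied = points[sigma > threshold]
    return occupied.cpu().numpy()
```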
7. Research Impact and Current Limitations
OmniObject3D and its OO3D-9D and Implicit-Zoo derivatives have established benchmark standards and large-scale, high-fidelity resources for studying category-level 3D perception, open-vocabulary recognition, and implicit 3D representation learning. Notable outcomes include:
- Disentangling category symmetry in pose recovery and annotation pipelines.
- Advancing open-vocabulary pose estimation with text-conditioned models leveraging synthetic photorealistic data and strong visual-language priors.
- Enabling token-based transformer models for 3D understanding via neural implicit volumetric representations.
Limitations observed include semantic imbalance among categories, generation bias in mesh/texture synthesis, challenges in pose and shape recovery for highly concave or low-texture shapes, and limited robustness on sparse-view geometric and photometric benchmarks. A plausible implication is the need for further research into shape-texture disentanglement and cross-category generalization.
Further advances are anticipated in large-scale category-level 3D learning, robust implicit modeling, and the integration of textual and visual priors for open-set, real-world object understanding.