Open3DHOI: In-The-Wild 3D HOI Dataset
- Open3DHOI is an open-vocabulary, in-the-wild dataset for 3D human–object interactions, featuring 2,561 annotated images with 370 unique human–object pairs.
- The dataset employs a multi-stage annotation pipeline combining automated mesh extraction and manual filtering to ensure accurate 3D reconstruction and contact mapping.
- A novel Gaussian-HOI optimizer refines contact estimation and spatial alignment, establishing new benchmarks for open-vocabulary 3D HOI understanding and LLM-based tasks.
Open3DHOI is an open-vocabulary, in-the-wild 3D human–object interaction (HOI) dataset, comprising over 2,500 rigorously annotated single-image instances, along with a novel Gaussian-HOI optimizer for fine-grained 3D reconstruction and contact region analysis. The dataset significantly expands the object and action diversity typical of contemporary 3D HOI corpora by leveraging advances in single-image 3D reconstruction, and defines new benchmarks for 3D HOI understanding—including LLM–based reasoning and generative tasks—from single images (Wen et al., 20 Mar 2025).
1. Dataset Composition and Scope
Open3DHOI consists of 2,561 annotated images featuring 370 unique 3D human–object pairs. Each image contains a single object instance, summing to 2,561 object assets and supporting 3,671 HOI triplets involving 120 annotated action categories. Object diversity is a central property: 133 categories spanning animals, food, tools, furniture, electronics, and sports equipment, many drawn from the WordNet hierarchy and absent from prior 3D HOI datasets (“teddy bear,” “wine glass,” “goat,” etc.).
Data are sourced from HAKE-Large (~12,000 images), SWIG-HOI (~3,000 images), and a limited number of additional web-sourced images. Stringent selection criteria require clear, direct-contact HOIs (e.g., holding, sitting), at most moderate occlusion, and a single person per image, with manual filtering to ensure 3D reconstructability and to exclude severe crowding or occlusion.
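To make the composition concrete, a per-sample record plausibly bundles the image, the HOI triplet, the SMPL-X human, the reconstructed object, and part-level contact labels. The dataclass below is a hypothetical sketch of such a schema; the field names are illustrative and not the released format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HOIAnnotation:
    """Hypothetical per-image record for an Open3DHOI-style sample.

    Field names are illustrative; the released dataset may differ.
    """
    image_path: str                    # source RGB image
    action: str                        # one of ~120 action categories, e.g. "hold"
    object_category: str               # one of 133 object categories, e.g. "wine glass"
    smplx_params: dict                 # SMPL-X pose/shape parameters for the human
    object_mesh_path: str              # reconstructed object mesh (InstantMesh-style)
    object_pose: List[float] = field(default_factory=lambda: [0.0] * 6)  # 6-DoF object pose
    object_scale: float = 1.0          # rigid scale relative to the human
    contact_parts: List[int] = field(default_factory=list)  # indices into 34 SMPL-X body parts
```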
2. Multi-stage Annotation and Quality Assurance Pipeline
The Open3DHOI annotation workflow combines automated and manual procedures for accurate mesh construction and interaction mapping:
- Human mesh extraction: OSX (one-stage whole-body transformer), producing SMPL-X parameters.
- Object mesh reconstruction: InstantMesh, a single-image 3D object mesh recovery method.
- Segmentation and depth: SAM for masks and ZoeDepth for depth estimation, yielding a point cloud partitioned into human and object components.
- Occlusion handling: AmodalMask for predicting occluded regions, Stable Diffusion 1.5 for inpainting.
- Manual tools: a custom web application for filtering and contact annotation, a Blender add-on for 3D alignment, and an ImageNet3D-based fine-tuning tool.
Key pipeline steps:
- Coarse 3D reconstruction uses mesh-to-depth alignment (Algorithm 1), where human and object meshes are iteratively fit to point clouds derived from the inferred depth and segmentation masks; a rigid scale is fit to match each mesh to its depth-derived point cloud (see the sketch after this list).
- Occlusion completion combines amodal mask prediction, optional manual brush correction, and inpainting followed by mesh re-estimation.
- Manual filtering employs multi-view mesh rendering, interactive mesh “pass/delete,” and, if necessary, user-guided mask re-annotation.
- Contact annotation utilizes a 34-part SMPL-X body subdivision for region-level contact labeling.
- Iterative coarse–fine 3D alignment in Blender and a novel web tool, allowing 6-DoF object adjustments for pixel-accurate compositing.
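As a minimal sketch of the rigid scale-fitting step referenced above, assuming mesh vertices and mask-derived depth points have been brought into rough correspondence in camera coordinates, a closed-form least-squares scale can be computed as follows (illustrative, not the paper's exact Algorithm 1):

```python
import numpy as np

def fit_rigid_scale(mesh_points: np.ndarray, depth_points: np.ndarray) -> float:
    """Fit a single scale factor aligning a mesh to a depth-derived point cloud.

    Illustrative least-squares form, not the paper's exact Algorithm 1.
    Both point sets are assumed to be in camera coordinates and roughly in
    correspondence (e.g., matched by nearest neighbors during iteration).

    mesh_points:  (N, 3) vertices sampled from the human/object mesh
    depth_points: (N, 3) points back-projected from ZoeDepth within the SAM mask
    """
    mesh_c = mesh_points - mesh_points.mean(axis=0)
    depth_c = depth_points - depth_points.mean(axis=0)
    # Closed-form scale minimizing || s * mesh_c - depth_c ||^2
    return float((mesh_c * depth_c).sum() / (mesh_c ** 2).sum())

# Usage: inside an iterative loop, re-estimate correspondences, fit the scale,
# then translate the scaled mesh onto the centroid of the depth point cloud.
```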
Quality assurance uses multi-view IoU metrics (human: 0.621, object: 0.384, combined: 0.634) and a human–object penetration metric (3.26% for the ground-truth annotations, versus 4.26% for the PHOSA baseline), with roughly 10% of samples cross-checked for annotator consensus.
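The multi-view IoU check amounts to comparing rendered mesh silhouettes against reference 2D masks; the mask-IoU computation itself is shown below (the multi-view rendering step and the exact averaging used for QA are assumptions).

```python
import numpy as np

def mask_iou(rendered_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between a rendered mesh silhouette and a reference 2D mask (both boolean H x W)."""
    inter = np.logical_and(rendered_mask, gt_mask).sum()
    union = np.logical_or(rendered_mask, gt_mask).sum()
    return float(inter / union) if union > 0 else 1.0

# QA plausibly averages this over several rendered views, separately for the
# human mask, the object mask, and their union ("combined").
```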
3. Gaussian-HOI Optimizer: Representation and Objective Functions
Each human or object instance is parameterized as a set of 3D Gaussians $\{G_i\}_{i=1}^{N}$, with $G_i = (\mu_i, \Sigma_i, \alpha_i, s_i)$:
- $\mu_i \in \mathbb{R}^3$: 3D Gaussian mean,
- $\Sigma_i$: covariance,
- $\alpha_i$: learned opacity,
- $s_i$: learned contact score (for the human mesh only).

Rendering for a pixel $p$ follows standard Gaussian-splatting alpha compositing,
$$C(p) = \sum_{i=1}^{N} \mathbf{c}_i\, \tilde{\alpha}_i \prod_{j=1}^{i-1} \left(1 - \tilde{\alpha}_j\right),$$
with projected opacity $\tilde{\alpha}_i$ obtained by splatting $(\mu_i, \Sigma_i, \alpha_i)$ onto the image plane, the Gaussians sorted front to back, and $\mathbf{c}_i$ the rendered per-Gaussian attribute (e.g., color).
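For reference, a minimal front-to-back compositing sketch for a single pixel, assuming the projected opacities have already been computed by splatting; this mirrors the standard Gaussian-splatting blend rather than the paper's exact renderer.

```python
import numpy as np

def composite_pixel(attributes: np.ndarray, proj_opacities: np.ndarray) -> np.ndarray:
    """Blend per-Gaussian attributes (e.g., RGB) for one pixel, front to back.

    attributes:     (N, C) attribute of each Gaussian overlapping the pixel
    proj_opacities: (N,)   projected opacities, already depth-sorted front to back
    """
    out = np.zeros(attributes.shape[1])
    transmittance = 1.0
    for attr, alpha in zip(attributes, proj_opacities):
        out += attr * alpha * transmittance   # c_i * alpha_i * prod_{j<i}(1 - alpha_j)
        transmittance *= (1.0 - alpha)
    return out
```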
HOI interactions are encoded by combining the human and object Gaussian fields under their respective rigid rotation, scale, and translation transforms. The optimizer further learns contact regions via a composite score of the form
$$s_i = \lambda_1\, f(d_i) + \lambda_2\, \hat{\alpha}_i,$$
where $d_i$ is the Chamfer distance to the closest object Gaussian (with $f$ decreasing in $d_i$), $\lambda_1$ and $\lambda_2$ are scalar weights (usually $0.5$), and $\hat{\alpha}_i$ denotes the opacity score normalized to $[0,1]$.
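A sketch of this composite score under the reconstruction above: a proximity term derived from the Chamfer distance to the nearest object Gaussian plus a normalized-opacity term, each weighted 0.5 by default. The exponential distance-to-proximity mapping and the min–max opacity normalization are assumptions, not the paper's stated choices.

```python
import numpy as np

def contact_scores(human_means: np.ndarray, object_means: np.ndarray,
                   human_opacities: np.ndarray,
                   lam1: float = 0.5, lam2: float = 0.5, tau: float = 0.05) -> np.ndarray:
    """Composite contact score per human Gaussian (sketch, not the exact paper formula).

    human_means:     (N, 3) human Gaussian centers
    object_means:    (M, 3) object Gaussian centers
    human_opacities: (N,)   learned opacities of the human Gaussians
    tau:             assumed length scale (meters) turning distance into proximity
    """
    # One-sided Chamfer term: distance from each human Gaussian to its closest object Gaussian
    d = np.linalg.norm(human_means[:, None, :] - object_means[None, :, :], axis=-1).min(axis=1)
    proximity = np.exp(-d / tau)                       # assumed decreasing map of distance
    # Min-max normalization of opacities to [0, 1]
    rng = human_opacities.max() - human_opacities.min()
    alpha_hat = (human_opacities - human_opacities.min()) / (rng + 1e-8)
    return lam1 * proximity + lam2 * alpha_hat
```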
The optimization objective combines a rendering loss $\mathcal{L}_{\mathrm{render}}$ (with L1, L2, SSIM, and LPIPS terms) and a spatial loss $\mathcal{L}_{\mathrm{spatial}}$:
$$\mathcal{L} = \mathcal{L}_{\mathrm{render}} + \mathcal{L}_{\mathrm{spatial}},$$
where
$$\mathcal{L}_{\mathrm{spatial}} = \lambda_{\mathrm{con}}\,\mathcal{L}_{\mathrm{contact}} + \lambda_{\mathrm{col}}\,\mathcal{L}_{\mathrm{collision}} + \lambda_{\mathrm{ord}}\,\mathcal{L}_{\mathrm{depth}},$$
with contact (Chamfer distance), collision (mesh-penetration penalty), and ordinal-depth constraints.
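A sketch of how this objective could be assembled, assuming the individual terms (image comparisons, Chamfer contact, penetration penalty, ordinal depth) are computed elsewhere; the term weights and helper names are illustrative.

```python
def gaussian_hoi_loss(pred_img, gt_img, contact_term, collision_term, depth_term,
                      l1, l2, ssim, lpips,
                      w_con: float = 1.0, w_col: float = 1.0, w_ord: float = 1.0):
    """Illustrative composition of L = L_render + L_spatial (not the released weights).

    l1/l2/ssim/lpips: callables comparing rendered and ground-truth images
    contact_term, collision_term, depth_term: precomputed scalar spatial penalties
    """
    l_render = (l1(pred_img, gt_img) + l2(pred_img, gt_img)
                + (1.0 - ssim(pred_img, gt_img))   # SSIM is a similarity, so use 1 - SSIM
                + lpips(pred_img, gt_img))
    l_spatial = w_con * contact_term + w_col * collision_term + w_ord * depth_term
    return l_render + l_spatial
```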
4. 3D HOI Understanding: Benchmarks and Metrics
Open3DHOI introduces new evaluation paradigms for 3D HOI analysis:
- 3D HOI Understanding: Given a human–object point cloud, predict the interaction verb. Evaluated with PointLLM-7B.
- HOI Pose Chat: Generate an SMPL pose for a specified action–object prompt given the corresponding image. Assessed using the ChatPose model.
Metrics for these tasks include:
- Action/object prediction: Top-1 accuracy over 120 actions/133 objects,
- Pose generation: Mean per-joint position error (MPJPE), mean per-vertex position error (MPVPE),
- Contact region estimation: Micro F1-score, Hamming loss, Jaccard index (multi-label classification over 34 body parts); see the sketch after this list.
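The contact-region metrics treat each image as a 34-dimensional binary label vector, so standard multi-label metrics apply directly; the snippet below (using scikit-learn, with an assumed averaging mode for the Jaccard index) illustrates the computation.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

# y_true / y_pred: (num_images, 34) binary matrices, one column per SMPL-X body part
y_true = np.random.randint(0, 2, size=(100, 34))   # placeholder ground-truth contact labels
y_pred = np.random.randint(0, 2, size=(100, 34))   # placeholder predicted contact labels

micro_f1 = f1_score(y_true, y_pred, average="micro")
hamming = hamming_loss(y_true, y_pred)
jaccard = jaccard_score(y_true, y_pred, average="samples")  # averaging choice is assumed
print(f"Micro F1: {micro_f1:.4f}  Hamming: {hamming:.4f}  Jaccard: {jaccard:.4f}")
```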
5. Quantitative Results
Performance comparisons across methods and optimizer configurations are summarized below.

3D alignment and reconstruction errors (scale, translation, rotation, Chamfer distance; lower is better):

| Method | Scale | Trans. (cm) | Rot. | Chamfer (cm) |
|---|---|---|---|---|
| PHOSA | 0.39 | 77.79 | 0.95 | 49.1 |
| Ours w/o HOI | 0.25 | 38.66 | 0.45 | 16.9 |
| Ours (full) | 0.16 | 38.44 | 0.41 | 19.3 |
Ablation of the spatial losses (depth, collision, contact) relative to the PHOSA baseline (lower is better):

| Setup | Co | Collision | Contact |
|---|---|---|---|
| PHOSA | 0.431 | 0.105 | 0.326 |
| + depth + collision + contact | 0.181 | 0.053 | 0.128 |
LLM-based understanding and pose-generation benchmarks:

| Task (Model/Input) | Action acc. | Object acc. | MPJPE (mm) | MPVPE (mm) |
|---|---|---|---|---|
| PointLLM-7B (w/ object) | 0.47 | - | - | - |
| ChatPose (action + object) | - | - | 103.4 | 130.9 |
Contact region estimation over 34 body parts:

| Method | Micro F1 | Hamming loss | Jaccard |
|---|---|---|---|
| 2D only | 0.6118 | 0.0874 | 0.4303 |
| 2D+3D | 0.6207 | 0.0844 | 0.4561 |
Qualitatively, the Gaussian-HOI optimizer demonstrates recovery of accurate object tilt, scale, and human–object contact even in visually complex, cluttered scenes. Failure cases typically involve heavy self-occlusion, indicating the importance of improved structural priors and occlusion modeling.
6. Significance and Future Directions
Open3DHOI is the first in-the-wild, open-vocabulary 3D HOI dataset with comprehensive object and action coverage and principled contact annotation. Its diversity enables generalization beyond the rigid CAD-model objects of traditional benchmarks and supports LLM-based reasoning on point clouds and pose generation. The multi-stage pipeline provides a template for high-quality 3D HOI annotation built on contemporary 2D detection and single-image reconstruction backbones.
A plausible implication is that subsequent research in 3D HOI understanding, human–object contact estimation, and in-the-wild 3D scene understanding can leverage both the dataset and the Gaussian-HOI optimizer as robust baselines and benchmarks. The inclusion of categories and actions absent from prior benchmarks facilitates evaluation of open-vocabulary and generalizable HOI methods, and the new LLM-driven tasks suggest expanding the scope of embodied intelligence benchmarks (Wen et al., 20 Mar 2025).