
Open3DHOI: In-The-Wild 3D HOI Dataset

Updated 4 February 2026
  • Open3DHOI is an open-vocabulary, in-the-wild dataset for 3D human–object interactions, featuring 2,561 annotated images with 370 unique human–object pairs.
  • The dataset employs a multi-stage annotation pipeline combining automated mesh extraction and manual filtering to ensure accurate 3D reconstruction and contact mapping.
  • A novel Gaussian-HOI optimizer refines contact estimation and spatial alignment, establishing new benchmarks for open-vocabulary 3D HOI understanding and LLM-based tasks.

Open3DHOI is an open-vocabulary, in-the-wild 3D human–object interaction (HOI) dataset, comprising over 2,500 rigorously annotated single-image instances, along with a novel Gaussian-HOI optimizer for fine-grained 3D reconstruction and contact region analysis. The dataset significantly expands the object and action diversity typical of contemporary 3D HOI corpora by leveraging advances in single-image 3D reconstruction, and defines new benchmarks for 3D HOI understanding—including LLM–based reasoning and generative tasks—from single images (Wen et al., 20 Mar 2025).

1. Dataset Composition and Scope

Open3DHOI consists of 2,561 annotated images featuring 370 unique 3D human–object pairs. Each image contains a single object instance, summing to 2,561 object assets and supporting 3,671 HOI triplets involving 120 annotated action categories. Object diversity is a central property: 133 categories spanning animals, food, tools, furniture, electronics, and sports equipment, many drawn from the WordNet hierarchy and absent from prior 3D HOI datasets (“teddy bear,” “wine glass,” “goat,” etc.).

Data are sourced from HAKE-Large (~12,000 images), SWIG-HOI (~3,000 images), and a small number of additional web-sourced images. Stringent selection criteria require clear, direct-contact HOIs (e.g., holding, sitting), at most moderate occlusion, and a single person per image, with manual filtering to ensure 3D reconstructability and to exclude severe crowding or occlusion.

2. Multi-stage Annotation and Quality Assurance Pipeline

The Open3DHOI annotation workflow combines automated and manual procedures for accurate mesh construction and interaction mapping (a schematic sketch of how these stages compose follows the list):

  • Human mesh extraction: OS-X (One-Stage Whole-Body Transformer, SMPL-X model).
  • Object mesh reconstruction: InstantMesh, a single-image 3D object mesh recovery method.
  • Segmentation and depth: SAM for masks, ZoeDepth for depth estimation, yielding a point cloud S partitioned into human/object components.
  • Occlusion handling: AmodalMask to predict occluded regions, Stable Diffusion 1.5 for inpainting.
  • Manual tools: Custom filtering/contact web application, Blender add-on for 3D alignment, ImageNet3D-based fine-tuner.
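
A minimal sketch of how the automated stages might compose. Every function below (extract_human_mesh, segment, estimate_depth, backproject, and so on) is a hypothetical stand-in for the OS-X, InstantMesh, SAM, and ZoeDepth calls, not their real APIs:

```python
import numpy as np

def extract_human_mesh(image):
    """OS-X stand-in: SMPL-X vertices (SMPL-X has 10,475 vertices)."""
    return np.zeros((10475, 3))

def reconstruct_object_mesh(image, mask):
    """InstantMesh stand-in: object mesh vertices from a single image."""
    return np.zeros((5000, 3))

def segment(image):
    """SAM stand-in: boolean human and object masks."""
    h, w = image.shape[:2]
    return np.zeros((h, w), bool), np.zeros((h, w), bool)

def estimate_depth(image):
    """ZoeDepth stand-in: per-pixel depth map."""
    return np.ones(image.shape[:2])

def backproject(depth, mask):
    """Lift masked pixels into a point cloud."""
    ys, xs = np.nonzero(mask)
    return np.stack([xs, ys, depth[ys, xs]], axis=-1).astype(float)

def coarse_annotation(image):
    """One pass of the automated stages; manual filtering follows."""
    human_mask, object_mask = segment(image)
    depth = estimate_depth(image)
    S_h = backproject(depth, human_mask)    # human part of point cloud S
    S_o = backproject(depth, object_mask)   # object part of point cloud S
    H_0 = extract_human_mesh(image)
    O_0 = reconstruct_object_mesh(image, object_mask)
    return {"S_h": S_h, "S_o": S_o, "H_0": H_0, "O_0": O_0}

out = coarse_annotation(np.zeros((480, 640, 3)))
```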

Key pipeline steps:

  1. Coarse 3D reconstruction uses mesh-to-depth alignment (Algorithm 1), where human/object meshes are iteratively fit to inferred depth and mask-derived point clouds. Rigid + scale fitting (a code sketch follows this list) is performed using:

\text{scale} = \frac{\text{avg}_{i,j}\|S_h^i - S_h^j\|}{\text{avg}_{i,j}\|H_0^i - H_0^j\|}, \quad \text{translation} = \text{mean}(S_h) - \text{scale} \cdot \text{mean}(H_0)

  2. Occlusion completion merges mask prediction, optional manual brush correction, and inpainting/mesh re-estimation.
  3. Manual filtering employs multi-view mesh rendering, interactive mesh “pass/delete,” and, if necessary, user-guided mask re-annotation.
  4. Contact annotation utilizes a 34-part SMPL-X body subdivision for region-level contact labeling.
  5. Iterative coarse–fine 3D alignment in Blender and a novel web tool allows 6D object adjustments for pixel-accurate compositing.
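
The coarse scale-and-translation fit of step 1 is simple to sketch. The pairwise subsampling and toy usage below are illustrative choices, not the paper's Algorithm 1, which additionally iterates against the depth-aligned point cloud:

```python
import numpy as np

def coarse_align(H_0, S_h, n_pairs=2048, seed=0):
    """Fit scale + translation mapping mesh vertices H_0 onto the
    mask-derived point cloud S_h via the ratio of average pairwise
    distances (a sketch; only a subsample of pairs is used)."""
    rng = np.random.default_rng(seed)

    def avg_pairwise(P):
        i = rng.integers(0, len(P), n_pairs)
        j = rng.integers(0, len(P), n_pairs)
        return np.linalg.norm(P[i] - P[j], axis=1).mean()

    scale = avg_pairwise(S_h) / avg_pairwise(H_0)
    translation = S_h.mean(axis=0) - scale * H_0.mean(axis=0)
    return scale, translation

# Toy usage: recover a known scale/offset from synthetic data
H_0 = np.random.rand(1000, 3)
S_h = 2.5 * H_0 + np.array([1.0, -0.5, 3.0])
s, t = coarse_align(H_0, S_h)   # s ≈ 2.5, t ≈ (1.0, -0.5, 3.0)
```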

Quality assurance uses multi-view IoU metrics (human: 0.621, object: 0.384, combined: 0.634) and a human–object penetration metric (3.26% for ground-truth annotations versus 4.26% for the PHOSA baseline), with ~10% of samples cross-checked for annotator consensus.
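
A minimal sketch of the multi-view mask-IoU check, assuming boolean masks have already been rendered per view (the penetration test and the rendering step itself are not shown):

```python
import numpy as np

def multiview_iou(pred_masks, gt_masks):
    """Mean mask IoU across views, the kind of check behind the
    reported multi-view IoU numbers."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```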

3. Gaussian-HOI Optimizer: Representation and Objective Functions

Each human/object instance is parameterized as a set of N^s 3D Gaussians, g^s = {(μ_i^s, Σ_i^s, α_i^s, c_i^s) | i = 1…N^s}, s ∈ {human, object} (a sketch of this parameterization appears after the list):

  • μ_i: 3D Gaussian mean,
  • Σ_i: 3×3 covariance,
  • α_i: learned opacity,
  • c_i: learned contact score (for the human mesh only).
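
This parameterization might be held in a small container like the following; the dataclass and array shapes are illustrative assumptions, not the paper's data layout:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GaussianSet:
    """One instance's 3D Gaussians."""
    mu: np.ndarray                         # (N, 3)    means
    sigma: np.ndarray                      # (N, 3, 3) covariances
    alpha: np.ndarray                      # (N,)      learned opacities
    contact: Optional[np.ndarray] = None   # (N,)      contact scores (human only)

N = 8  # toy count
human = GaussianSet(
    mu=np.random.randn(N, 3),
    sigma=np.tile(np.eye(3) * 1e-2, (N, 1, 1)),
    alpha=np.random.rand(N),
    contact=np.zeros(N),
)
obj = GaussianSet(
    mu=np.random.randn(N, 3) + 0.5,
    sigma=np.tile(np.eye(3) * 1e-2, (N, 1, 1)),
    alpha=np.random.rand(N),
)
```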

Rendering for pixel x is defined as:

C(x) = \sum_{n=1}^{N} c_n\,\alpha_n^{\prime} \prod_{j=1}^{n-1}(1-\alpha_j^{\prime}),

with projected opacity

\alpha_n^{\prime} = \alpha_n \cdot \exp\left[ -\frac{1}{2}(x' - \mu_n')^\top \Sigma_n^{\prime\,-1} (x'-\mu_n') \right].
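
A direct transcription of the two equations above into a per-pixel compositing loop, assuming projection to 2D and front-to-back sorting happen upstream:

```python
import numpy as np

def render_pixel(x, mu2d, sigma2d_inv, alpha, color):
    """Front-to-back alpha compositing of N projected Gaussians at pixel x,
    implementing C(x) = sum_n c_n a'_n prod_{j<n}(1 - a'_j)."""
    C = np.zeros(3)
    T = 1.0  # accumulated transmittance prod_{j<n}(1 - a'_j)
    for n in range(len(alpha)):
        d = x - mu2d[n]
        a_n = alpha[n] * np.exp(-0.5 * d @ sigma2d_inv[n] @ d)
        C += color[n] * a_n * T
        T *= 1.0 - a_n
    return C

# Toy usage: two Gaussians near the query pixel
x = np.array([0.0, 0.0])
mu2d = np.array([[0.0, 0.0], [0.5, 0.5]])
sigma2d_inv = np.tile(np.eye(2) * 4.0, (2, 1, 1))
print(render_pixel(x, mu2d, sigma2d_inv,
                   alpha=np.array([0.8, 0.9]),
                   color=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])))
```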

Interactions are encoded by combining the human and object Gaussian fields under their respective rigid, scale, and translation transforms. The optimizer further learns contact regions via a composite score:

c_i = w_\alpha \cdot \mathrm{Norm}(\alpha_i^h) + w_d \cdot d_C(p_i^h, p^o),

where d_C is the Chamfer distance to the closest object Gaussian, w_α and w_d are scalar weights (usually 0.5 each), and Norm(·) normalizes opacity scores to [0, 1].
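
A sketch of the composite contact score, taking d_C as the distance to the nearest object Gaussian and min–max normalization for Norm(·); the latter is an assumption, since the text only says opacities are normalized to [0, 1]:

```python
import numpy as np

def contact_scores(alpha_h, mu_h, mu_o, w_alpha=0.5, w_d=0.5):
    """c_i = w_a * Norm(alpha_i^h) + w_d * d_C(p_i^h, p^o)."""
    # Min-max normalize human opacities to [0, 1] (assumed normalization)
    a = alpha_h - alpha_h.min()
    span = a.max()
    norm_alpha = a / span if span > 0 else np.zeros_like(a)
    # One-sided Chamfer term: nearest object-Gaussian distance per human Gaussian
    d = np.linalg.norm(mu_h[:, None, :] - mu_o[None, :, :], axis=-1).min(axis=1)
    return w_alpha * norm_alpha + w_d * d
```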

The optimization objective combines rendering (L_r, with L1, L2, SSIM, and LPIPS terms) and spatial (L_hoi) losses:

L = w_r L_r + w_{\text{hoi}} L_{\text{hoi}},

where

L_{\text{hoi}} = L_{\text{cont}} + L_{\text{colli}} + L_{\text{depth}},

with contact (Chamfer distance), collision (mesh-penetration penalty), and ordinal-depth constraints.
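
As a bookkeeping sketch, assuming each loss term has already been computed as a scalar (float or autograd tensor); the unit weights are placeholders, not the paper's settings:

```python
def total_loss(L_l1, L_l2, L_ssim, L_lpips,
               L_cont, L_colli, L_depth,
               w_r=1.0, w_hoi=1.0):
    """L = w_r * L_r + w_hoi * L_hoi, with L_r the sum of rendering terms
    and L_hoi = L_cont + L_colli + L_depth."""
    L_r = L_l1 + L_l2 + L_ssim + L_lpips
    L_hoi = L_cont + L_colli + L_depth
    return w_r * L_r + w_hoi * L_hoi
```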

4. 3D HOI Understanding: Benchmarks and Metrics

Open3DHOI introduces new evaluation paradigms for 3D HOI analysis:

  1. 3D HOI Understanding: Given a human–object point cloud, predict the interaction verb. Evaluated with PointLLM-7B.
  2. HOI Pose Chat: Generate an SMPL pose for a specified action–object prompt given the corresponding image. Assessed using the ChatPose model.

Metrics for these tasks include:

  • Action/object prediction: Top-1 accuracy over 120 actions/133 objects,
  • Pose generation: Mean per-joint position error (MPJPE), mean per-vertex position error (MPVPE),
  • Contact region estimation: Micro F1-score, Hamming loss, Jaccard index (for multi-label classification over 34 body parts; a metric sketch follows this list).
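
A sketch of the contact-region metrics using scikit-learn; the averaging modes are assumptions, since the section does not specify them:

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

def contact_metrics(y_true, y_pred):
    """Multi-label metrics over the 34 SMPL-X body parts."""
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "hamming": hamming_loss(y_true, y_pred),
        "jaccard": jaccard_score(y_true, y_pred, average="samples"),
    }

# Toy usage: 5 images x 34 parts, binary contact labels
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(5, 34))
y_pred = rng.integers(0, 2, size=(5, 34))
print(contact_metrics(y_true, y_pred))
```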

5. Quantitative Results

The performance comparison across methods and optimizer configurations is as follows:

| Method | Scale ↓ | Trans. (cm) ↓ | Rot. ↓ | Chamfer (cm) ↓ |
|---|---|---|---|---|
| PHOSA | 0.39 | 77.79 | 0.95 | 49.1 |
| Ours w/o HOI | 0.25 | 38.66 | 0.45 | 16.9 |
| Ours (full) | 0.16 | 38.44 | 0.41 | 19.3 |

| Setup | Co² ↓ | Collision ↓ | Contact ↓ |
|---|---|---|---|
| PHOSA | 0.431 | 0.105 | 0.326 |
| + depth + collision + contact | 0.181 | 0.053 | 0.128 |

| Task (Model/Input) | Action ↑ | Object ↑ | MPJPE (mm) ↓ | MPVPE (mm) ↓ |
|---|---|---|---|---|
| PointLLM-7B (w/ obj) | 0.47 | – | – | – |
| ChatPose (a+o) | – | – | 103.4 | 130.9 |

| Method | Micro F1 ↑ | Hamming ↓ | Jaccard ↑ |
|---|---|---|---|
| 2D only | 0.6118 | 0.0874 | 0.4303 |
| 2D+3D | 0.6207 | 0.0844 | 0.4561 |

Qualitatively, the Gaussian-HOI optimizer demonstrates recovery of accurate object tilt, scale, and human–object contact even in visually complex, cluttered scenes. Failure cases typically involve heavy self-occlusion, indicating the importance of improved structural priors and occlusion modeling.

6. Significance and Future Directions

Open3DHOI is the first in-the-wild, open-vocabulary 3D HOI dataset with comprehensive object and action coverage and principled contact annotation. Its diversity enables generalization beyond traditional rigid CAD models and supports LLM-based reasoning on point clouds and pose generation. The multi-stage pipeline provides a template for high-quality 3D HOI annotation that leverages contemporary 2D detector/reconstructor backbones.

A plausible implication is that subsequent research in 3D HOI understanding, human–object contact estimation, and in-the-wild 3D scene understanding can leverage both the dataset and the Gaussian-HOI optimizer as robust baselines and benchmarks. The inclusion of categories and actions absent from prior benchmarks facilitates evaluation of open-vocabulary and generalizable HOI methods, and the new LLM-driven tasks suggest expanding the scope of embodied intelligence benchmarks (Wen et al., 20 Mar 2025).

References

  1. Wen et al., 20 Mar 2025.
