Hoi! Dataset: Multimodal Articulated Manipulation

Updated 4 March 2026

The Hoi! Dataset is a comprehensive multimodal benchmark that integrates visual, kinesthetic, and tactile signals for evaluating articulated object manipulation.
It enables rigorous testing of cross-view and cross-embodiment transfer, supporting research in force-from-vision and sim-to-real policy optimization.
The dataset provides precise calibration, detailed annotations, and real-world capture of manipulation episodes to facilitate robust force and tactile estimation.

The Hoi! Dataset provides a highly multimodal benchmark for force-grounded, cross-view articulated manipulation, coupling visual, kinesthetic, and tactile sensing across both human and robotic embodiments. Targeted at the study of physical interaction with articulated objects in natural environments, Hoi! offers synchronized RGB-D video, force/torque, tactile, pose, and 3D scan data for a variety of open/close operations. This supports evaluation of transfer between human and robotic manipulation and facilitates research into force-from-vision, cross-view learning, and sim-to-real policy optimization (Engelbracht et al., 4 Dec 2025).

1. Scope, Conception, and Motivation

Hoi! is designed to fill the methodological gap between large-scale first-person activity datasets (which generally omit forces and tactile signals) and robot-centric datasets (which are restricted in embodiment diversity and usually lack dense multimodal correspondences between human and robot agents). The dataset addresses the need for benchmarks where “what is seen,” “what is done,” and “what is felt” are linked explicitly for the same articulated object, under real-world conditions and standardized cross-device calibration (Engelbracht et al., 4 Dec 2025).

2. Dataset Composition and Embodiment Conditions

Each data point in Hoi! consists of a manipulation sequence in which an articulated part (drawer, cabinet, fridge door, etc.) is operated with one of four distinct end-effector embodiments:

Embodiment	End-effector	Capture Modalities
Human hand	5-finger hand	Exo (2× RGB-D iPhone), Ego (Aria)
Human hand + wrist camera	5-finger hand	Exo + Ego + wrist-mounted Aria (RGB); wrist IMU
UMI gripper	Antipodal UMI gripper	Exo + Ego + wrist-mounted RGB; wrist IMU
Hoi! gripper (custom)	2-finger, parallel robot	Exo + Ego + wrist-mounted ZED Mini (RGB-D); IMU, 6-DoF F/T, tactile

The dataset comprises 3,048 manipulation episodes on 381 unique articulated parts, captured in 38 furnished indoor scenes (kitchens, bathrooms, offices, living rooms). Each interaction is observed from multiple viewpoints: two calibrated exocentric (static) RGB-D views, one egocentric view (Project Aria glasses; includes SLAM and hand/eye pose), and, for the relevant conditions, a wrist-mounted camera. For gripper embodiments, haptic data include 6-DoF force-torque at the wrist, synchronized Digit (GelSight) tactile, and motor current readouts (Engelbracht et al., 4 Dec 2025).

3. Acquisition Modalities, Calibration, and Ground Truth

All data streams are spatially and temporally synchronized:

Temporal: Sessions are synchronized using a dynamic QR code that displays Unix timestamps at 25 Hz for all cameras, providing device clock offsets with ±10–25 ms accuracy.
Spatial: All camera poses (Aria, ZED, UMI) are registered to a global “world” frame computed by hierarchical localization (hloc) and aligned to dense Leica RTC360 scans (pose RMSE ≈ 5 mm, 0.01 rad relative to Qualisys motion capture).
Annotation: Each articulated part is labeled with joint type (prismatic/revolute), joint axis/position (via Werby et al., CoRL ’25), and per-part 3D masks using SAM v2. Sequence-level language descriptions (part names, goals) are also provided (Engelbracht et al., 4 Dec 2025).
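
The temporal synchronization step can be illustrated with a minimal sketch. It assumes each camera yields pairs of (device timestamp, Unix timestamp decoded from the QR code); the input format and the function names `estimate_clock_offset` and `to_unix` are hypothetical and not part of the dataset tooling. The median is used here as a simple robust estimator; the sketch ignores display and exposure latency, so it is not a reproduction of the authors' procedure.

```python
from statistics import median


def estimate_clock_offset(observations):
    """Estimate a device clock offset from QR-code sightings.

    observations: iterable of (device_time_s, qr_unix_time_s) pairs, one per
    frame in which the QR code was decoded (hypothetical input format).
    Returns an offset such that unix_time is approximately device_time + offset.
    """
    diffs = [qr_time - device_time for device_time, qr_time in observations]
    if not diffs:
        raise ValueError("need at least one QR observation")
    # The median is robust to occasional misreads or dropped frames.
    return median(diffs)


def to_unix(device_time_s, offset_s):
    """Map a device timestamp onto the shared Unix timeline."""
    return device_time_s + offset_s


if __name__ == "__main__":
    # Toy sightings with a few tens of milliseconds of jitter.
    sightings = [(10.00, 110.02), (10.04, 110.05), (10.08, 110.10)]
    offset = estimate_clock_offset(sightings)
    print(round(offset, 3), round(to_unix(12.5, offset), 3))
```

Once every device has an offset, frames from different cameras can be matched by nearest Unix timestamp.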

Force/torque signals come from the Bota SensONE sensor and are transformed from the sensor frame to the gripper interaction frame:

$f^{\mathrm{meas}}_S = f^{\mathrm{ext}}_S + f^g_S + b_f$
$f^{\mathrm{ext}}_i = R_{i \leftarrow S} f^{\mathrm{ext}}_S$

where $S$ is the force/torque sensor frame, $i$ is the interaction frame, $f^g$ includes gravity, and $b_f$ is the sensor bias. The dataset computes tangential, normal, and combined force magnitudes, with the sums over $k \in \{\mathrm{L}, \mathrm{R}\}$ running over the left and right sides of the gripper:

$F^{\mathrm{(tang)}} = \left\|\sum_{k \in \{\mathrm{L}, \mathrm{R}\}} [f_{i,k,x}^{\mathrm{ext}}, f_{i,k,y}^{\mathrm{ext}}] \right\|_2$
$F^{\mathrm{(norm)}} = \sum_{k \in \{\mathrm{L}, \mathrm{R}\}} |f_{i,k,z}^{\mathrm{ext}}|$
$F^{\mathrm{(comb)}} = \sqrt{(F^{\mathrm{(tang)}})^2 + (F^{\mathrm{(norm)}})^2}$

(Engelbracht et al., 4 Dec 2025).
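
The force equations above can be sketched in a few lines of plain Python. The function names, the per-side dictionary layout, and the toy numbers are assumptions made for illustration; only the formulas themselves come from the text.

```python
import math


def external_force_sensor(f_meas, f_gravity, bias):
    """Solve f_meas_S = f_ext_S + f_g_S + b_f for f_ext_S (all 3-vectors)."""
    return [m - g - b for m, g, b in zip(f_meas, f_gravity, bias)]


def rotate(R, v):
    """Apply a 3x3 rotation (list of rows) to a 3-vector: f_i = R_{i<-S} f_S."""
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]


def force_magnitudes(f_ext_by_side):
    """Tangential, normal, and combined magnitudes in the interaction frame.

    f_ext_by_side: {"L": [fx, fy, fz], "R": [fx, fy, fz]} (hypothetical layout).
    """
    forces = list(f_ext_by_side.values())
    sum_x = sum(f[0] for f in forces)
    sum_y = sum(f[1] for f in forces)
    tang = math.hypot(sum_x, sum_y)           # norm of the summed x/y components
    norm = sum(abs(f[2]) for f in forces)     # sum of absolute z components
    comb = math.hypot(tang, norm)             # sqrt(tang^2 + norm^2)
    return tang, norm, comb


if __name__ == "__main__":
    tang, norm, comb = force_magnitudes({"L": [3.0, 0.0, 2.0], "R": [0.0, 4.0, -1.0]})
    print(tang, norm, round(comb, 3))  # 5.0 3.0 5.831
```

Note that the tangential term sums the vectors before taking the norm, so opposing tangential forces on the two sides cancel, whereas the normal term sums absolute values and does not cancel.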

4. Data Structure, Splits, and Usage

Each sequence contains:

Synchronized RGB-D video streams: Exo1.mp4, Exo2.mp4, Ego.mp4, Wrist.mp4 (where available)
Depth frames: Exo1_depth.ply, ZED_depth.bag
Force data: ft_sensor.csv (6D), Digit tactile data, motor_current.csv
Pose data: aria_poses.json, umi_poses.txt, camera calibration files
3D scan and annotation: world_mesh.ply, annotations.json
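
As a rough aid for working with the per-sequence files listed above, the following sketch reports which of the named files are present in a sequence folder. It assumes a flat directory per sequence, which is a hypothetical layout; the actual release may organize files differently. Tactile and calibration files are omitted because the text does not give their filenames.

```python
from pathlib import Path

# Filenames taken from the list above; the grouping keys are illustrative.
EXPECTED_FILES = {
    "video": ["Exo1.mp4", "Exo2.mp4", "Ego.mp4", "Wrist.mp4"],
    "depth": ["Exo1_depth.ply", "ZED_depth.bag"],
    "force": ["ft_sensor.csv", "motor_current.csv"],
    "pose": ["aria_poses.json", "umi_poses.txt"],
    "scan": ["world_mesh.ply", "annotations.json"],
}


def inventory(sequence_dir):
    """Return, per modality group, the expected files that exist on disk."""
    root = Path(sequence_dir)
    return {
        group: [name for name in names if (root / name).is_file()]
        for group, names in EXPECTED_FILES.items()
    }


def has_haptics(sequence_dir):
    """True if the sequence has force/torque data (gripper embodiments only)."""
    return "ft_sensor.csv" in inventory(sequence_dir)["force"]
```

A check like `has_haptics` is useful because only some embodiments record force data, so loaders should treat those streams as optional.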

No fixed train/val/test split is enforced; an environment-held-out protocol is recommended for robust generalization. Cross-view and cross-embodiment transfer protocols are explicitly encouraged: e.g., train on exocentric, test on egocentric, or train on human, test on robot (Engelbracht et al., 4 Dec 2025).
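
One way to realize the recommended protocols is sketched below. The episode records and their `scene`, `view`, and `embodiment` keys are hypothetical metadata fields, and the 20% test fraction is an arbitrary choice, not a value prescribed by the dataset.

```python
import random


def split_by_scene(episodes, test_fraction=0.2, seed=0):
    """Environment-held-out split: no scene appears in both train and test."""
    scenes = sorted({ep["scene"] for ep in episodes})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])
    train = [ep for ep in episodes if ep["scene"] not in test_scenes]
    test = [ep for ep in episodes if ep["scene"] in test_scenes]
    return train, test


def split_by_condition(episodes, key, train_value, test_value):
    """Cross-view or cross-embodiment split, e.g. key='embodiment'."""
    train = [ep for ep in episodes if ep[key] == train_value]
    test = [ep for ep in episodes if ep[key] == test_value]
    return train, test


if __name__ == "__main__":
    episodes = [
        {"scene": f"scene_{i}", "embodiment": emb}
        for i in range(5)
        for emb in ("human", "hoi_gripper")
    ]
    train, test = split_by_scene(episodes)
    print(len(train), len(test))  # 8 2
```

Splitting at the scene level rather than the episode level avoids leaking the same furniture and room appearance into the test set.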

5. Supported Tasks and Benchmarks

Hoi! enables evaluation of both low-level physical reasoning and high-level transfer:

Articulated Object Estimation: Predict joint type and axis from RGB(-D) video. Baselines: GPT-5 (used as a vision-language model, VLM), ArtGS, and ArtiPoint. GPT-5 achieves 87.5% accuracy on prismatic joints and 70.3% on revolute joints (exocentric view).
Tactile Force Estimation: Infer tangential, normal, and combined forces from GelSight tactile data; Sparsh (DINO backbone) yields a combined RMSE of ≈ 3.86 N on Hoi!.
Visual Force Estimation: Predict the exerted force from RGB-D input and a prompt; ForceSight achieves RMSE_projected ≈ 2.23 N on Hoi! (compared to 0.40 N on prior datasets).
Egocentric Hand Pose Estimation (exploratory): Pavlakos et al. model achieves [email protected] of ≈ 0.70, underperforming relative to existing egocentric benchmarks (0.84–0.89).
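
The per-joint-type accuracies and force RMSE values quoted above follow standard definitions, sketched here for reference. This is a generic illustration with made-up toy inputs; it does not reproduce the paper's exact evaluation code, and variants such as RMSE_projected are not implemented.

```python
import math
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Accuracy computed separately for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences (e.g. in N)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


if __name__ == "__main__":
    truth = ["prismatic", "prismatic", "revolute", "revolute"]
    guess = ["prismatic", "revolute", "revolute", "revolute"]
    print(per_class_accuracy(truth, guess))  # {'prismatic': 0.5, 'revolute': 1.0}
    print(round(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), 4))  # 1.1547
```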

A plausible implication is that the substantial domain gap and physical complexity of real-world articulated furniture, coupled with unusual viewpoint and haptic combinations, challenge current state-of-the-art methods and leave considerable room for methods that transfer across views and embodiments (Engelbracht et al., 4 Dec 2025).

6. Data Access, Licensing, and Protocols

Hoi! is released under CC BY 4.0 and is publicly downloadable (e.g., dataset.hoi.ethz.ch). All synchronization and calibration files are included. Users are advised to leverage the provided world–device transforms for multimodal alignment, normalize force ranges for model compatibility, and use cross-view/embodiment testing to fully exploit the dataset’s structure (Engelbracht et al., 4 Dec 2025).
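
The advice to normalize force ranges can be followed in several ways. The sketch below uses z-score normalization with statistics fitted on training data only, which is one common choice and an assumption here, not a procedure specified by the dataset authors. All function names are illustrative.

```python
from statistics import mean, pstdev


def fit_normalizer(train_values):
    """Compute mean and standard deviation from training forces only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant signals
    return mu, sigma


def normalize(values, mu, sigma):
    """Map raw forces (N) to zero-mean, unit-variance model inputs/targets."""
    return [(v - mu) / sigma for v in values]


def denormalize(values, mu, sigma):
    """Map model outputs back to newtons before computing RMSE."""
    return [v * sigma + mu for v in values]


if __name__ == "__main__":
    mu, sigma = fit_normalizer([2.0, 4.0, 6.0])
    scaled = normalize([4.0, 6.0], mu, sigma)
    print(mu, [round(s, 3) for s in scaled])
```

Fitting the statistics on the training split alone keeps held-out scenes or embodiments from influencing the scaling, and denormalizing before evaluation keeps reported errors in physical units.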

7. Significance and Impact

Hoi! supports the study of physically grounded perception across visual and haptic modalities, bridging human and robot learning with real, articulated objects under naturalistic manipulation. The dataset’s explicitly multimodal, multi-embodiment protocol provides a platform for developing and benchmarking sim-to-real transfer, force estimation from vision, and manipulation policy learning that is robust to embodiment and viewpoint. By anchoring every episode in metric 3D scans with ground-truth force-torque and tactile signals, Hoi! enables quantitative analysis of cross-modal and cross-agent generalization and establishes a reference for future research in physically interactive AI (Engelbracht et al., 4 Dec 2025).

References (1)

Hoi! -- A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation (2025)
