DexYCB: Multi-View Hand–Object Pose Dataset

Updated 22 March 2026

DexYCB is a marker-less, multi-view RGB-D benchmark for joint 3D hand and 6D object pose estimation under realistic interactions.
It provides comprehensive annotations including 2D/3D keypoints, SE(3) object poses, and MANO-based hand models across 4.6 million images.
The dataset defines three benchmark tasks and supports cross-dataset evaluation, targeting research in grasp understanding and safe human-to-robot handover.

DexYCB is a large-scale, marker-less, real multi-view RGB-D dataset specifically designed for benchmarking and advancing joint 3D hand and object pose estimation under real hand–object interactions. It addresses a fundamental gap in existing datasets, which typically capture either static objects or bare hands, often employ intrusive markers, or rely on synthetic data that do not reflect natural grasp patterns. By capturing dynamic hand–object interactions with dense, human-verified annotations, DexYCB provides a benchmark standard for research in grasp understanding, 3D vision, and human–robot handover (Chao et al., 2021).

1. Dataset Construction and Capture Protocol

DexYCB comprises multi-view RGB-D sequences of 10 human subjects (5 male, 5 female), each performing 100 grasp trials, resulting in 1,000 trials in total. Each trial involves grasping a target object (chosen from 20 standardized YCB objects—examples include "003_cracker_box", "021_bleach_cleanser") in the presence of 2–4 distractors, replicating challenging realistic environments. Trials proceed from an initial relaxed hand pose, followed by reaching, grasping, object lifting, and in some cases, performing a simulated "hand-over" across the tabletop workspace.

Capture is conducted using 8 Intel RealSense D415 RGB-D cameras (640×480 at 30 fps), hardware-synchronized to within ±1 ms and extrinsically calibrated to a shared world coordinate system. The cameras are arranged around the table to minimize occlusions and ensure comprehensive scene coverage. Each trial is repeated 5 times per target, alternating grasping hand for handedness diversity, and both left- and right-hand interactions are captured.

With 582,000 frames per camera view and 8 viewpoints, DexYCB comprises approximately 4.6 million RGB-D images, forming one of the largest datasets of its kind for hand–object interaction (Chao et al., 2021).
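
The counts quoted in this section can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the figures as stated in the text (the variable names are illustrative); note that the product comes out slightly above the rounded 4.6 million figure.

```python
# Figures as quoted in the text above; variable names are illustrative.
subjects = 10
objects_per_subject = 20      # 20 YCB target objects
repeats_per_object = 5        # each target is grasped 5 times
camera_views = 8
frames_per_view = 582_000

trials_per_subject = objects_per_subject * repeats_per_object
total_trials = subjects * trials_per_subject
total_images = frames_per_view * camera_views

print(trials_per_subject, total_trials, total_images)
# prints: 100 1000 4656000
```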

2. Annotation Schema and Data Organization

Annotations in DexYCB are comprehensive and multi-modal:

2D Object Instance Masks & Bounding Boxes: Provided per camera view.
2D Keypoints: For hands, 21 keypoints are specified (the wrist, plus 3 joints and a fingertip per finger, each labeled for visibility); for each object, two user-defined landmarks are specified per instance and tracked across frames.
3D Hand Pose: Recovered via the MANO hand model with pose parameters $\theta \in \mathbb{R}^{51}$ and shape parameters $\beta \in \mathbb{R}^{10}$ generating a 778-vertex mesh and a canonical 21-joint skeleton for each hand.
6D Object Pose: Represented as frame-wise SE(3) transforms $T = [R|t]$ for each object, where $R \in SO(3)$, $t \in \mathbb{R}^3$.
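
To make the $T = [R|t]$ notation concrete, here is a minimal NumPy sketch of packing a pose and applying it to model points. It assumes the pose maps object-model coordinates into the camera frame and that points are stored as rows; the helper names (`make_pose`, `apply_pose`, `is_rotation`) are hypothetical and not part of any DexYCB toolkit.

```python
import numpy as np

def make_pose(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 3x4 matrix [R|t]."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float).reshape(3, 1)
    return np.hstack([R, t])

def apply_pose(T, points):
    """Map (N, 3) points through the pose: x' = R x + t for every row x."""
    R, t = T[:, :3], T[:, 3]
    return np.asarray(points, dtype=float) @ R.T + t

def is_rotation(R, tol=1e-6):
    """Check membership in SO(3): orthonormal columns and determinant +1."""
    R = np.asarray(R, dtype=float)
    orthonormal = np.allclose(R @ R.T, np.eye(3), atol=tol)
    return bool(orthonormal and abs(np.linalg.det(R) - 1.0) < tol)

# Example: rotate 90 degrees about z, then translate.
R = np.array([[0.0, -1.0, 0.0],
              [1.0,  0.0, 0.0],
              [0.0,  0.0, 1.0]])
t = np.array([0.1, 0.0, 0.5])
T = make_pose(R, t)
moved = apply_pose(T, np.array([[1.0, 0.0, 0.0]]))  # -> [[0.1, 1.0, 0.5]]
```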

The annotation pipeline employs a human-in-the-loop approach via MTurk, leveraging a VATIC-based tool for manual 2D keypoint labeling on all 8 views per sequence. A multi-view optimization refines these annotations by jointly minimizing reprojection error (for visible 2D keypoints), signed distance function (SDF) error over depth, and an L2 regularization on hand articulation:

$E(P) = E_{\text{depth}}(P) + E_{\text{kpt}}(P) + E_{\text{reg}}(P)$

where

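$E_{\text{depth}} = \frac{1}{N_D}\sum_i |\mathrm{SDF}(d_i, M(P))|^2$ is the mean squared signed distance from the $N_D$ observed depth points $d_i$ to the posed model surface $M(P)$,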
$E_{\text{kpt}}$ is the reprojection error of 2D keypoints,
$E_{\text{reg}} = \lambda \sum_{h} \|\theta_h\|^2$ regularizes hand articulation.
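
As a rough illustration of how the three terms combine, the sketch below evaluates the objective for precomputed quantities. It is not the authors' optimizer: the signed distances and projected keypoints are assumed to be computed elsewhere, the squared-pixel form of $E_{\text{kpt}}$ is one plausible choice rather than the one specified by the source, and the default value of `lam` is arbitrary.

```python
import numpy as np

def depth_energy(sdf_values):
    """E_depth: mean squared signed distance of depth points to the model surface."""
    sdf_values = np.asarray(sdf_values, dtype=float)
    return float(np.mean(sdf_values ** 2))

def keypoint_energy(projected, annotated, visible):
    """E_kpt (assumed form): mean squared pixel error over visible keypoints only."""
    projected = np.asarray(projected, dtype=float)
    annotated = np.asarray(annotated, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    squared = np.sum((projected - annotated) ** 2, axis=-1)
    return float(np.mean(squared[visible]))

def reg_energy(thetas, lam):
    """E_reg: lambda times the sum over hands of the squared norm of theta_h."""
    return float(lam * sum(np.sum(np.asarray(th, dtype=float) ** 2) for th in thetas))

def total_energy(sdf_values, projected, annotated, visible, thetas, lam=0.01):
    """E(P) = E_depth + E_kpt + E_reg for one candidate pose P."""
    return (depth_energy(sdf_values)
            + keypoint_energy(projected, annotated, visible)
            + reg_energy(thetas, lam))
```

In a real pipeline these terms would be differentiated with respect to the pose parameters and minimized jointly over all views; the functions here only show what is being summed.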

The dataset is structured by sequence into /color/ and /depth/ directories (PNG frames), /calib/ (camera intrinsics and extrinsics), and /labels/ (JSON annotations per frame specifying object instance and hand IDs, bounding boxes, masks, keypoints, 6D poses, and MANO parameters).

3. Benchmark Tasks and Evaluation Protocols

DexYCB defines three principal tasks, with standardized data splits supporting rigorous generalization assessment:

A. 2D Object & Keypoint Detection

Task: Detect objects (20 YCB classes plus "hand" class) and localize the 21 hand joints per frame. Metrics: Standard COCO metrics,

$\mathrm{AP}^{\text{box}}$, $\mathrm{AP}^{\text{mask}}$ (object detection and instance segmentation),
$\mathrm{AP}^{\text{kp}}$ (Object Keypoint Similarity).
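
For readers unfamiliar with Object Keypoint Similarity, the following sketch computes a COCO-style OKS for a single hand instance: each visible keypoint contributes $\exp(-d_i^2 / (2 s^2 k_i^2))$, and the contributions are averaged. The per-keypoint constants `k` for hand joints are not given in the text, so any value passed in (such as the uniform 0.1 in the test call) is purely a placeholder.

```python
import numpy as np

def object_keypoint_similarity(pred, gt, visible, area, k):
    """COCO-style OKS for one instance.

    pred, gt : (J, 2) keypoint pixel coordinates
    visible  : (J,) booleans, True where the ground-truth keypoint is labeled
    area     : instance scale s^2 (e.g. segment area in pixels^2)
    k        : (J,) per-keypoint falloff constants (placeholder values here)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    k = np.asarray(k, dtype=float)
    if not visible.any():
        return 0.0
    d2 = np.sum((pred - gt) ** 2, axis=1)          # squared pixel distances
    similarity = np.exp(-d2 / (2.0 * area * k ** 2))
    return float(similarity[visible].mean())        # average over labeled joints

# A perfect prediction scores 1.0; the score decays as joints drift.
gt = np.zeros((21, 2))
score = object_keypoint_similarity(gt, gt, np.ones(21, bool), area=100.0, k=np.full(21, 0.1))
```

$\mathrm{AP}^{\text{kp}}$ then thresholds this similarity in the same way box AP thresholds IoU.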

B. 6D Object Pose Estimation

Task: Given RGB or RGB-D image input, estimate the SE(3) pose ($T_o$) of all visible object instances. Metrics: Average Recall (AR) over three BOP pose-error functions,

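Symmetric ADD-S:

$\text{ADD-S} = \frac{1}{|M|} \sum_{x_1 \in M} \min_{x_2 \in M} \| (R x_1 + t) - (\hat{R} x_2 + \hat{t}) \|$

where $M$ is the set of model points, $(R, t)$ the ground-truth pose, and $(\hat{R}, \hat{t})$ the estimated pose. The minimum over $x_2$ is what makes the measure insensitive to object symmetries.

MSSD, MSPD: the Maximum Symmetry-Aware Surface Distance and the Maximum Symmetry-Aware Projection Distance.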

Final score is mean recall over the three metrics.
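
The closest-point form of ADD-S, and the averaging of per-metric recalls into a final score, can be sketched as follows. This is a brute-force illustration for small point sets, not the BOP toolkit implementation; the function names are hypothetical, and the recall values in the example are made up solely to show the averaging step.

```python
import numpy as np

def add_s(model_pts, R_gt, t_gt, R_est, t_est):
    """Mean distance from each ground-truth-posed point to the nearest estimated-pose point."""
    model_pts = np.asarray(model_pts, dtype=float)
    gt = model_pts @ np.asarray(R_gt, float).T + np.asarray(t_gt, float)
    est = model_pts @ np.asarray(R_est, float).T + np.asarray(t_est, float)
    # Pairwise distances, shape (N, N); O(N^2) memory, so only for small N.
    dists = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

def mean_recall(recalls):
    """Final score: unweighted mean of the per-metric average recalls."""
    return float(np.mean(recalls))

# A square in the xy-plane is unchanged by a 90-degree turn about z,
# so ADD-S reports zero error for that rotation.
square = np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]], dtype=float)
Rz90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)
symmetric_error = add_s(square, np.eye(3), zero, Rz90, zero)
shifted_error = add_s(square, np.eye(3), zero, np.eye(3), np.array([0.0, 0.0, 0.1]))
example_score = mean_recall([0.5, 0.25, 0.75])  # made-up recalls for illustration
```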

C. 3D Hand Pose Estimation

Task: Regress the 3D position of the 21 hand joints from a single view (RGB or depth). Metrics:

MPJPE (mean per-joint position error):

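$\mathrm{MPJPE} = \frac{1}{J} \sum_{j=1}^{J} \| \hat{p}_j - p_j \|_2$

where $\hat{p}_j$ and $p_j$ are the predicted and ground-truth 3D positions of joint $j$, and $J = 21$.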

PCK-AUC: Area under the Percentage of Correct Keypoints curve over [0, 50] mm.
Evaluation in three alignment modes: absolute, root-relative (wrist-centered), and Procrustes (scale, rotation, translation removed).
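
The three alignment modes differ only in what is done to the prediction before MPJPE is computed. The sketch below shows one common way to implement them with NumPy, assuming joints are stored as (21, 3) arrays with the wrist at index 0 (an assumption about joint ordering); the Procrustes step uses the standard SVD solution, and the PCK-AUC uses a simple trapezoidal rule. Function names and the number of thresholds are illustrative.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def root_relative(joints, root=0):
    """Express joints relative to the root joint (wrist assumed at index 0)."""
    return joints - joints[root]

def procrustes_align(pred, gt):
    """Similarity-align pred to gt, removing scale, rotation and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Flip the last singular direction if needed so R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    scale = float((S * np.diag(D)).sum() / (P ** 2).sum())
    return scale * P @ R + mu_g

def pck_auc(errors_mm, max_mm=50.0, steps=101):
    """Area under the PCK curve on [0, max_mm], normalized to [0, 1]."""
    errors_mm = np.asarray(errors_mm, dtype=float)
    thresholds = np.linspace(0.0, max_mm, steps)
    pck = np.array([(errors_mm <= th).mean() for th in thresholds])
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return float(area / max_mm)

def evaluate(pred, gt):
    """MPJPE under the absolute, root-relative and Procrustes modes."""
    return {
        "absolute": mpjpe(pred, gt),
        "root_relative": mpjpe(root_relative(pred), root_relative(gt)),
        "procrustes": mpjpe(procrustes_align(pred, gt), gt),
    }
```

A prediction that is correct up to a global offset scores zero in the root-relative mode but not in the absolute mode, which is why the three numbers are usually reported side by side.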

Evaluation Splits:

S0 (“default”): all subjects/views/objects (train/val/test by sequence).
S1: unseen subjects.
S2: unseen views.
S3: unseen grasped objects.

4. Cross-Dataset Generalization Analysis

DexYCB enables robust cross-dataset evaluation, directly comparing with HO-3D—the only other multi-view, real hand+object dataset. Using the single-image RGB model of Spurr et al. (HANDS 2019 winner, ResNet50 or HRNet32 backbone), two cross-domain scenarios are evaluated by training on DexYCB or HO-3D alone and testing on both. The primary metric is MPJPE in root-relative and Procrustes alignment modes.

Networks trained on DexYCB generalize better to HO-3D (31.8 mm / 15.2 mm MPJPE) than networks trained on HO-3D generalize to DexYCB (48.3 mm / 24.2 mm). Combining the two datasets for training further improves HO-3D test accuracy but yields no advantage on the DexYCB test set. This suggests that DexYCB covers a broader portion of the hand–object grasp space than HO-3D (Chao et al., 2021).

5. Safe Human–to–Robot Handover Evaluation

DexYCB facilitates evaluation of robotics-relevant tasks, specifically safe human-to-robot handover. The task is to generate SE(3) parallel-jaw gripper poses that achieve a valid, non-colliding grasp of an object held by a human. Reference grasp sets per object are pre-sampled (100 per object, via farthest-point sampling from an offline database [Eppner et al., ISRR 2019]), transformed to the camera frame using ground-truth or estimated pose, and filtered to exclude grasps that penetrate the object or hand mesh.

Predicted grasp sets are generated by:

Estimating the object’s 6D pose,
Transforming candidate grasps to the camera frame,
Removing grasps intersecting with depth-map-based hand point cloud or object mesh.
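
The three steps above can be sketched as follows, with heavy simplification. Grasps and the object pose are assumed to be 4x4 homogeneous matrices, and collision checking is replaced by a crude proxy: a grasp is dropped when its origin lies closer than `min_clearance` to any obstacle point. A real pipeline would test the full gripper geometry against the hand point cloud and object mesh; the function names and the clearance value are hypothetical.

```python
import numpy as np

def transform_grasps(T_obj, grasps_obj):
    """Move (N, 4, 4) object-frame grasp poses into the camera frame via the 4x4 object pose."""
    return np.einsum("ij,njk->nik", T_obj, grasps_obj)

def filter_by_clearance(grasps_cam, obstacle_pts, min_clearance):
    """Keep grasps whose origin stays at least min_clearance from every obstacle point.

    This is a stand-in for proper collision checking, not the method used in the paper.
    """
    origins = grasps_cam[:, :3, 3]
    dists = np.linalg.norm(origins[:, None, :] - obstacle_pts[None, :, :], axis=2)
    keep = dists.min(axis=1) >= min_clearance
    return grasps_cam[keep]

# Two candidate grasps in the object frame; the object sits 1 m in front of the camera.
grasps_obj = np.stack([np.eye(4), np.eye(4)])
grasps_obj[1, :3, 3] = [0.5, 0.0, 0.0]
T_obj = np.eye(4)
T_obj[:3, 3] = [0.0, 0.0, 1.0]                 # estimated or ground-truth object pose
hand_points = np.array([[0.5, 0.0, 1.01]])      # hand point cloud from the depth map

grasps_cam = transform_grasps(T_obj, grasps_obj)
safe = filter_by_clearance(grasps_cam, hand_points, min_clearance=0.05)
```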

Evaluation utilizes Coverage and Precision:

$\text{Coverage} = \frac{|\{g \in \mathcal{R} : \exists h \in \chi \text{ s.t. } \text{dist}(g, h) < \tau\}|}{|\mathcal{R}|}$
$\text{Precision} = \frac{|\{h \in \chi : \exists g \in \mathcal{R} \text{ s.t. } \text{dist}(g, h) < \tau\}|}{|\chi|}$

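Here $\mathcal{R}$ is the reference grasp set and $\chi$ the predicted grasp set. Two grasps count as matching, written $\text{dist}(g, h) < \tau$, when their translations differ by less than $\sigma_t = 5$ cm and their orientations by less than $\sigma_q = 15^\circ$.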
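
A minimal sketch of the two metrics, using the translation and orientation tolerances quoted above as the matching test. Grasps are assumed to be 4x4 homogeneous poses with translations in meters, orientation difference is measured as the geodesic rotation angle, and any gripper symmetry is ignored; these are assumptions of the sketch rather than details confirmed by the source.

```python
import numpy as np

def rotation_angle_deg(Ra, Rb):
    """Geodesic angle in degrees between two rotation matrices."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def grasps_match(g, h, sigma_t=0.05, sigma_q=15.0):
    """True when two 4x4 grasp poses agree within both tolerances."""
    dt = np.linalg.norm(g[:3, 3] - h[:3, 3])
    dq = rotation_angle_deg(g[:3, :3], h[:3, :3])
    return bool(dt < sigma_t and dq < sigma_q)

def coverage_and_precision(reference, predicted, sigma_t=0.05, sigma_q=15.0):
    """Coverage: share of reference grasps matched by some prediction.
    Precision: share of predicted grasps matched by some reference grasp."""
    if len(reference) == 0 or len(predicted) == 0:
        return 0.0, 0.0
    m = np.array([[grasps_match(g, h, sigma_t, sigma_q) for h in predicted]
                  for g in reference])
    return float(m.any(axis=1).mean()), float(m.any(axis=0).mean())

# One of two reference grasps is recovered by the single prediction.
ref_far = np.eye(4)
ref_far[:3, 3] = [1.0, 0.0, 0.0]
pred_near = np.eye(4)
pred_near[:3, 3] = [0.01, 0.0, 0.0]
cov, prec = coverage_and_precision([np.eye(4), ref_far], [pred_near])
```

Sweeping the number of predicted grasps kept (for example by a confidence ranking) and recomputing both values traces out a precision–coverage curve.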

Precision–Coverage curves indicate that advanced 6D pose algorithms (e.g. CosyPose [ECCV 2020]) deliver higher safe-handover coverage and precision versus earlier methods (e.g. PoseCNN [RSS 2018]). Failure cases are frequently attributed to minor object pose errors or hand segmentation inaccuracies, particularly under significant occlusion (Chao et al., 2021).

6. Access, Licensing, and Recommended Usage

DexYCB is publicly available at https://dex-ycb.github.io under the Creative Commons CC BY-NC-SA license (non-commercial research use). Best practices emphasize the use of S0–S3 data splits for evaluating subject, viewpoint, and object generalization, as well as joint training on 2D detection, 6D pose, and 3D hand pose tasks to optimize handover and grasp understanding performance. For robotics research, combining DexYCB with limited “in-the-wild” data can enhance robustness in hand segmentation and real-world generalization (Chao et al., 2021).

DexYCB establishes the first large-scale, marker-less, real multi-view RGB-D dataset of dynamic hand–object interaction annotated with joint 6D object and 3D hand pose. Its breadth (582,000 frames; 10 subjects × 20 objects × 8 views) and annotation accuracy position it as a central benchmark for machine perception and robotics research on human-like grasping and collaboration.

References

Chao et al. (2021). DexYCB: A Benchmark for Capturing Hand Grasping of Objects.
