- The paper presents HRDexDB, a dataset capturing paired human and robotic grasps with high spatiotemporal resolution and sub-millimeter tracking accuracy.
- It employs a unified multi-camera system and robust 3D pose reconstruction to enable detailed cross-embodiment analysis with synchronized visual, kinematic, and tactile data.
- The dataset benchmarks diverse robotic hands, revealing embodiment-specific failure modes and offering actionable insights for adaptive dexterous manipulation policy development.
HRDexDB: A Large-Scale Dataset for Cross-Embodiment Dexterous Manipulation
Introduction
HRDexDB addresses a fundamental limitation in dexterous manipulation research: the lack of comprehensive, paired human-robot datasets capturing physical interaction at high spatiotemporal resolution. By providing 1.4K manipulation sequences spanning human hands and three distinct robotic hand platforms—each sequence performed across 100 diverse objects and synchronized among visual, kinematic, geometric, and tactile sensing modalities—HRDexDB supplies a unique resource for embodiment-agnostic policy learning, fine-grained grasp analysis, and cross-domain imitation. The dataset’s design enables rigorous study of physical interaction by pairing every robot grasp with a semantically matched human demonstration, all expressed in a unified metric coordinate frame. HRDexDB is accompanied by robust quantitative benchmarks demonstrating sub-millimeter reconstruction accuracy, enhanced tracking fidelity via dense multi-view video, and explicit annotation of both successful and unsuccessful grasp attempts, supporting fine-grained robustness analysis.
Figure 1: Visual overview of HRDexDB, including representative objects, aligned human-robot grasps, and relational contact maps, capturing the breadth of physical phenomena and geometries in the dataset.
Capture System and Protocol
HRDexDB’s data generation relies on a unified multi-camera capture stage with 21 fixed RGB cameras to resolve hand-object occlusions omnipresent in contact-rich manipulation. This high-coverage visual setup is complemented by two egocentric views (robotic or human helmet-mounted) for authentic first-person observations, and it precisely synchronizes all frames via hardware triggers and global timestamp alignment. Robotic sequences are teleoperated using Xsens/Manus IMU gloves, which offer stable and metrically consistent pose tracking for both full-body and finger articulation—this setup avoids vision-based tracking jitter and occlusion-induced failures.
Figure 2: System overview, detailing the dense camera array for multi-view recording, the protocol for human and robot trials, and the sensor-equipped teleoperation system for high-fidelity grasp execution.
Paired data acquisition is accomplished through a two-phase protocol:
- Human demonstration: A subject naturally grasps the target object, generating the exemplar sequence.
- Robot teleoperation: A human operator reenacts the demonstrated strategy with each robotic hand embodiment, generating the paired sequence. This process retains grasp intent and environmental factors while exposing embodiment-specific effects on manipulation.
Figure 3: Illustration of a semantically aligned paired sequence, demonstrating consistent grasping intent across human and robot hands on the same object.
Multi-Modal Data Reconstruction
HRDexDB reconstructs all 3D hand poses using multi-view fitting of the MANO model and refines object 6D pose via a hybrid stereo-depth and global registration pipeline. Hand pose parameters are triangulated using 21-view keypoint detection and RANSAC; personalized mesh models are produced for each human subject to ensure anatomical fidelity. For objects, FoundationStereo-initialized depth and segmentation refine the pose, with geometric cross-view rendering constraints that substantially reduce drift, especially in long and occluded sequences.
The paired nature of the data allows for all spatial signals—including hand, robot, and object poses—to be mapped into a unified world frame, enabling direct contact analysis, embodiment-to-embodiment imitation, and metric evaluation of grasp similarity.
Dataset Analysis and Quantitative Results
Tracking Consistency
In dexterous grasp modeling, sub-millimeter tracking consistency is critical. HRDexDB quantitatively demonstrates that increasing the number of views from 4 to 21 monotonically decreases object pose variance between repeated sequences (mean vertex error reduces from 1.71 mm to 0.83 mm), an operational regime unattainable with previous datasets constrained to sparse camera views.
Figure 4: Consistency of pose projections and reduction in Mean Vertex Distance as camera count is increased, highlighting the necessity of dense view capture for sub-millimeter accuracy.
The dataset enables construction of high-fidelity contact heatmaps, showing consistent affordance regions across human and robot hands despite significant differences in mechanical actuation and morphology. The geometric precision of mesh-to-mesh contact modeling allows for robust affordance transfer and physical grounding of manipulation strategies.
Figure 5: High-resolution contact affordances computed from mesh proximity, revealing both inter- and intra-embodiment consistency in contact semantics.
Embodiment-Specific Failure
By annotating both successful and failed robotic grasps, HRDexDB explicitly reveals the physical limits inherent to different robot hands. For instance, in grasping a “blue metallic vase,” the Inspire F1 achieved 71% success via sufficient force closure, whereas the Allegro hand failed in all attempts due to insufficient actuation strength and friction. These outcomes highlight that kinematic feasibility does not guarantee robustness—a trait inseparable from hardware characteristics.
Figure 6: Analysis of embodiment-specific grasping failures, with stable force closure on Inspire F1 contrasted by consistent slippage from the Allegro hand.
Limitations and Implications
While HRDexDB establishes a new benchmark for cross-embodiment, contact-rich manipulation data, it presents several open challenges. The tactile sensing array is not fully consistent across all robot hands, and tactile measurements are absent in both human and Allegro trials. This sensor heterogeneity complicates cross-modality analysis and motivates future work on shared latent representations for touch. The dataset’s focus remains on establishing robust, semantically paired trajectories rather than formalizing exact pointwise correspondence; defining a mathematically rigorous trajectory equivalence across morphological gaps remains an unsolved problem with implications for policy transfer and grounding.
On practical grounds, HRDexDB is positioned as a foundation for training multimodal, generalist manipulation policies and data-driven foundation models for robotics. The dataset’s paired structure, measurement precision, and inclusion of negative outcomes directly support research in cross-domain imitation, grasp robustness prediction, and adaptive policy architecture. Its annotated failures uniquely enable learning physically aware, embodiment-specific manipulation skills.
Conclusion
HRDexDB provides the robotics and embodied AI communities with an unprecedented dataset of paired human-robot dexterous manipulation, integrating high-resolution visual, kinematic, tactile, and egocentric streams under tight spatial and temporal alignment. Its explicit recording of successful and failed grasps, dense multi-view reconstruction, and cross-embodiment alignment make it an indispensable resource for advancing physically grounded generalist manipulation, transfer learning, and contact-rich policy development. Future expansions will further improve the diversity and complexity of the dataset, with the explicit aim of enabling scalable training and evaluation of versatile, robust, and human-aligned dexterous agents.