UniHand-2.0: Multimodal Robotics Dataset
- The UniHand-2.0 dataset provides a framework integrating 35,000 hours of human, robot, and vision-language data to set new standards in cross-embodiment skill transfer.
- It features rich multimodal inputs including RGB, depth, proprioception, hand pose, and task-language data collected from 30 diverse robotic platforms.
- The unified action space mapping and synchronized sensor modalities enhance simulation-to-real transfer and support robust task planning in robotics.
UniHand-2.0 is a large-scale multimodal dataset designed for embodied robot learning, providing a comprehensive corpus to support robust cross-embodiment generalization across diverse robotic platforms. The dataset integrates 35,000 hours of human and robot demonstration videos, trajectories, and vision–language supervision, unified under a consistent action and annotation schema. UniHand-2.0 serves as the foundation for the Being-H0.5 Vision-Language-Action (VLA) model, establishing new standards for cross-domain skill transfer, sample diversity, and task coverage in human-centric robotics (Luo et al., 19 Jan 2026).
1. Scale, Modalities, and Robotic Embodiments
UniHand-2.0 encompasses approximately 400 million samples and over 120 billion multimodal tokens, making it the largest embodied pre-training dataset to date. The data composition comprises three core sources:
- Egocentric human video data: 16,000 hours, approximately 134 million clips, with hand pose annotations aligned to MANO parameters.
- Robot manipulation trajectories: 14,000 hours (13,817 h of combined real and simulated data, with simulation capped at 26%), totaling ~1.5 billion frames.
- General vision–language corpora: 5,000 hours equivalent, including visual question answering (VQA), spatial grounding, and task planning.
Thirty robotic embodiments are included, categorized as follows:
| Category | Number of Platforms | Example Platforms |
|---|---|---|
| Single-arm | 13 | Franka, Kuka-iiwa, UR5E, WidowX, PR2 |
| Dual-arm | 5 | RMC Aida L, Galaxea R1 Lite, Agilex ALOHA |
| Portable/education arms | 2 | BeingBeyond D1, LeRobot SO-101 |
| Half-humanoid & Humanoid | 10 | PND Adam-U, Agibot-G1, Unitree G1 |
Sensor modalities include RGB and depth video, proprioceptive measures, end-effector and joint state recordings, and comprehensive task-language data.
2. Sensor Modalities and Data Acquisition
Human demonstrations utilize the UniCraftor system, collecting:
- RGB: Intel RealSense D435 at 30 Hz.
- Depth: Active infrared-stereo at 30 Hz, direct from hardware (no learned estimation).
- Camera pose: Computed via five AprilTags and PnP for ground-truth extrinsics.
- Hand pose: Estimated as MANO parameters (via HaWoR with multi-view refinement).
- Foot-pedal events: Timestamped, indicating contact/release actions.
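Ground-truth extrinsics from fiducials are typically recovered with a PnP solver (e.g., OpenCV's `solvePnP`) on detected tag corners. As a self-contained illustration of the same idea, the sketch below assumes the tag corners are already available as matched 3D points in world and camera frames (e.g., back-projected via depth) and recovers the rigid transform by Kabsch alignment; the function name and this depth-assisted variant are illustrative assumptions, not the dataset's actual pipeline:

```python
import numpy as np

def extrinsics_from_tags(world_pts, cam_pts):
    """Recover (R, t) such that cam = R @ world + t via Kabsch alignment.

    world_pts, cam_pts: (N, 3) arrays of matched AprilTag corner positions.
    """
    cw, cc = world_pts.mean(axis=0), cam_pts.mean(axis=0)
    H = (world_pts - cw).T @ (cam_pts - cc)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cc - R @ cw
    return R, t
```

With five tags (20 corners), the least-squares fit is well over-determined, which is the point of using multiple fiducials rather than a single tag.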
Robot data acquisition features multi-view RGB (ego, third-person, wrist, top), depth (co-registered), proprioception (joint angles/velocities at 50–100 Hz), and control signals (end-effector pose, gripper/finger positions at 10–50 Hz, platform-specific). Planned extensions include force/torque and tactile arrays.
Vision–language tasks rely on images at their original resolutions (capped at the model's maximum input size), together with text including instructions, VQA pairs, referring expressions, and affordance labels.
3. Annotation Schema and Task Coverage
Manipulation tasks are provided for both human and robot traces:
- Human data: In-the-wild, egocentric activities (cooking, tool use, chores), plus 43 curated tabletop tasks in the UniCraftor subset (200+ h).
- Robot data: Pick-and-place, open/close, stacking, hand-over, packaging, wiping, scanning, long-horizon and bimanual procedures, sourced from at least 15 public corpora (e.g., OpenX-Embodiment, AgiBot-World, RoboMIND) and new data for embodiments such as PND Adam-U and BeingBeyond D1.
- Vision–language: General VQA (LLaVA-v1.5, LLaVA-Video), 2D spatial grounding (RefCOCO, RoboPoint, PixMo-Points), and task planning (ShareRobot, EO1.5M-QA).
Labeling conventions:
- Human videos: Per-second fine-grained instructions and 10-s chunk-level intents; paraphrased via Gemini-2.5 LLM to reduce template bias.
- Robot data: Raw state/action sequences grouped into fixed-length action chunks, aligned with vision frames.
- Temporal alignment: All modalities timestamped by a central clock, synchronizing video, depth, hand pose, proprioception, and control signals.
Task-segment boundaries are defined by synchronized pedal events (human) or log markers (robot teleoperation).
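The central-clock alignment can be sketched as nearest-timestamp resampling of each sensor stream onto the reference clock. The function name and the 20 ms gap tolerance below are illustrative assumptions:

```python
import numpy as np

def align_to_clock(clock_ts, stream_ts, stream_vals, max_gap=0.02):
    """Match each central-clock tick to the nearest sample of a sensor stream.

    Returns the matched values and a validity mask marking ticks whose
    nearest sample is within max_gap seconds (stale matches are flagged).
    """
    idx = np.searchsorted(stream_ts, clock_ts)
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    pick = np.where(np.abs(clock_ts - left) <= np.abs(right - clock_ts),
                    idx - 1, idx)
    gap = np.abs(stream_ts[pick] - clock_ts)
    return stream_vals[pick], gap <= max_gap
```

Because proprioception (50–100 Hz) runs faster than video (30 Hz), this direction of matching (clock ticks to stream samples) avoids dropping video frames while tolerating jitter in the faster channels.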
4. Data Collection, Processing, and File Structure
Acquisition pipelines involve comprehensive calibration and post-processing:
- Calibration: Camera intrinsics per device; extrinsics via AprilTags and PnP; hardware depth to minimize learning artifacts.
- Synchronization: Centralized timestamping across all modalities and sensor channels; pedal I/O synchronized.
- Post-processing steps:
- Inpainting of AprilTag regions in human videos with Grounded-SAM2 and DiffuEraser.
- Filtering of hand pose estimates by confidence and DBA error; jitter removal.
- Semantic screening with Gemini-2.5 to omit non-manipulative segments.
- Left–right mirroring applied to human data to debias handedness.
- Robot data deduplicated and frame-downsampled to 30% for increased diversity.
File and directory formats are standardized. For example, the structure contains separate subdirectories for human/robot/VLM/metadata, with detailed breakdowns for video, depth, action/state, calibration, instructions, events, and aggregate statistics.
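Since the exact directory names are not reproduced here, the layout check below uses hypothetical subdirectory names purely to illustrate the kind of standardized human/robot/VLM/metadata structure described:

```python
from pathlib import Path

# Hypothetical top-level layout; names are illustrative assumptions,
# not the dataset's actual on-disk schema.
EXPECTED = {
    "human": ["video", "depth", "hand_pose", "instructions", "events"],
    "robot": ["video", "depth", "state_action", "calibration"],
    "vlm": ["images", "annotations"],
    "metadata": ["statistics"],
}

def missing_dirs(root):
    """Return the expected subdirectories absent under root."""
    root = Path(root)
    return [f"{top}/{sub}"
            for top, subs in EXPECTED.items()
            for sub in subs
            if not (root / top / sub).is_dir()]
```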
5. Unified Action Space Mapping
To harmonize heterogeneous control schemes, UniHand-2.0 introduces a fixed-length unified action and state vector composed of semantic slots. The mapping is as follows:
- State/action projection: a per-embodiment mapping $\phi_e$ sends raw states and actions into the unified vector, $(s_u, a_u) = \phi_e(s_e, a_e)$, where $\phi_e$ projects embodiment-specific data into semantically consistent subspaces (end-effector delta position, gripper width, finger articulation, base velocity), with zero-padding of unused slots.
- Action chunking: fixed-length chunks of commands are mapped into the unified space.
- Physical parameterization:
- End-effector: position in $\mathbb{R}^3$ (world frame), orientation as axis–angle in $\mathbb{R}^3$.
- Joint angles: absolute radians per revolute joint.
- Gripper/finger: linear width or articulation in SI units.
- Raw magnitudes are preserved; only outlier clipping is used for scale management.
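The slot-based unified vector can be sketched as below; the slot names mirror the subspaces listed above, but the sizes, ordering, and total dimension are assumptions for illustration:

```python
import numpy as np

# Illustrative slot layout; sizes and ordering are assumptions,
# not the dataset's actual schema.
SLOTS = {
    "ee_delta_pos": 3,        # end-effector delta position (world frame)
    "ee_axis_angle": 3,       # orientation change as axis-angle
    "gripper_width": 1,       # parallel-jaw opening, metres
    "finger_articulation": 12,  # dexterous-hand joint values
    "base_velocity": 3,       # mobile-base command
}
OFFSETS, DIM = {}, 0
for name, size in SLOTS.items():
    OFFSETS[name] = (DIM, DIM + size)
    DIM += size

def to_unified(parts):
    """Project embodiment-specific signals into the fixed-length vector;
    slots the embodiment does not use stay zero-padded."""
    u = np.zeros(DIM)
    for name, values in parts.items():
        lo, hi = OFFSETS[name]
        u[lo:hi] = values
    return u
```

A single-arm gripper platform fills only the end-effector and gripper slots; a humanoid with dexterous hands additionally fills the finger and base slots, so morphologically different robots share one action interface.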
6. Statistics, Serialization, and Losses
Summarized source and robot hours:
| Source | Hours | Tokens (B) | % Hours |
|---|---|---|---|
| Human video (MANO-aligned) | 16,000 | 25.6 | 46% |
| Robot manipulation | 14,000 | 45.7 | 40% |
| Vision–language corpora | 5,000 | 50.2 | 14% |
| Total | 35,000 | 121.5 | 100% |
An excerpt for robot embodiment hours:
| Category | Platform | Views | Hours |
|---|---|---|---|
| Single-arm | Franka A1/A2 | ego, 3rd×2, wrist | 2,196.4 |
| Single-arm | Google Robot | ego, 3rd | 1,195.2 |
| Dual-arm | RMC Aida L | high, 2×wrist | 325.7 |
| Portable/Edu | BeingBeyond D1 | ego | 100.0 |
| Half-humanoid | PND Adam-U | ego | 200.0 |
| Humanoid | Unitree G1edu-u3 | ego, 2×wrist | 135.7 |
Multimodal samples are serialized into a single token sequence under a unified schema, with non-text modalities wrapped in dedicated <state>…</state> or <action>…</action> delimiter tokens.
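A minimal sketch of such delimiter-token serialization; the exact token vocabulary and number formatting are assumptions:

```python
def serialize(segments):
    """Flatten (kind, payload) segments into one token string.

    Non-text payloads are wrapped in <state>...</state> or
    <action>...</action> delimiter tokens; the token names and the
    3-decimal formatting are illustrative, not the dataset's spec.
    """
    out = []
    for kind, payload in segments:
        if kind == "text":
            out.append(payload)
        elif kind in ("state", "action"):
            body = " ".join(f"{x:.3f}" for x in payload)
            out.append(f"<{kind}>{body}</{kind}>")
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return " ".join(out)
```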
The action-chunk loss combines a continuous forward-model (FM) loss and a discrete mask loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \mathcal{L}_{\mathrm{mask}}$$

where:
- $\mathcal{L}_{\mathrm{FM}}$ regresses the predicted continuous vector $\hat{v}$ against the target action chunk $a$,
- $\mathcal{L}_{\mathrm{mask}}$ is a classification loss over the discrete label $c$.
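A minimal sketch of a combined continuous-plus-discrete objective of this shape, assuming an MSE stand-in for the FM regression term, cross-entropy for the mask term, and a unit weight between them (all three are assumptions, not the paper's exact losses):

```python
import numpy as np

def combined_loss(v_pred, chunk_target, logits, label, lam=1.0):
    """Continuous FM-style regression plus discrete mask cross-entropy.

    v_pred, chunk_target: flattened continuous action-chunk vectors.
    logits: unnormalized scores over discrete labels; label: target index.
    The MSE stand-in and the weight lam are illustrative assumptions.
    """
    fm = np.mean((v_pred - chunk_target) ** 2)
    # numerically stable log-softmax cross-entropy
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    mask = -log_probs[label]
    return fm + lam * mask
```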
7. Significance and Accessibility
UniHand-2.0 establishes a new baseline for integrated, human-centric robotics datasets by balancing rich human hand-motion priors, heterogeneous robot trajectories, and vision–language understanding within a unified, time-aligned framework. By constraining simulated data to 26% and emphasizing hand pose and semantic alignment, the dataset minimizes the sim-to-real gap and supports robust generalization across morphologically divergent platforms. All dataset resources, including weights and training code, are openly available at https://research.beingbeyond.com/being-h05 (Luo et al., 19 Jan 2026).