UniHand-2.0: Multimodal Robotics Dataset
- The UniHand-2.0 dataset provides a framework integrating 35,000 hours of human, robot, and vision-language data to set new standards in cross-embodiment skill transfer.
- It features rich multimodal inputs including RGB, depth, proprioception, hand pose, and task-language data collected from 30 diverse robotic platforms.
- The unified action space mapping and synchronized sensor modalities enhance simulation-to-real transfer and support robust task planning in robotics.
UniHand-2.0 is a large-scale multimodal dataset designed for embodied robot learning, providing a comprehensive corpus to support robust cross-embodiment generalization across diverse robotic platforms. The dataset integrates 35,000 hours of human and robot demonstration videos, trajectories, and vision–language supervision, unified under a consistent action and annotation schema. UniHand-2.0 serves as the foundation for the Being-H0.5 Vision-Language-Action (VLA) model, establishing new standards for cross-domain skill transfer, sample diversity, and task coverage in human-centric robotics (Luo et al., 19 Jan 2026).
1. Scale, Modalities, and Robotic Embodiments
UniHand-2.0 encompasses approximately 400 million samples and over 120 billion multimodal tokens, making it the largest embodied pre-training dataset to date. The data composition comprises three core sources:
- Egocentric human video data: 16,000 hours, approximately 134 million clips, with hand pose annotations aligned to MANO parameters.
- Robot manipulation trajectories: 14,000 hours (13,817 h of combined real and simulated data, with simulation capped at 26%), totaling ~1.5 billion frames.
- General vision–language corpora: 5,000 hours equivalent, including visual question answering (VQA), spatial grounding, and task planning.
Thirty robotic embodiments are included, categorized as follows:
| Category | Number of Platforms | Example Platforms |
|---|---|---|
| Single-arm | 13 | Franka, Kuka-iiwa, UR5E, WidowX, PR2 |
| Dual-arm | 5 | RMC Aida L, Galaxea R1 Lite, Agilex ALOHA |
| Portable/education arms | 2 | BeingBeyond D1, LeRobot SO-101 |
| Half-humanoid & Humanoid | 10 | PND Adam-U, Agibot-G1, Unitree G1 |
Sensor modalities include RGB and depth video, proprioceptive measures, end-effector and joint state recordings, and comprehensive task-language data.
2. Sensor Modalities and Data Acquisition
Human demonstrations utilize the UniCraftor system, collecting:
- RGB: Intel RealSense D435 at 30 Hz.
- Depth: Active infrared-stereo at 30 Hz, direct from hardware (no learned estimation).
- Camera pose: Computed via five AprilTags and PnP for ground-truth extrinsics.
- Hand pose: Estimated as MANO parameters (via HaWoR with multi-view refinement).
- Foot-pedal events: Timestamped, indicating contact/release actions.
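Ground-truth extrinsics from fiducials are typically recovered with a PnP solver (e.g., OpenCV's `solvePnP`) on detected tag corners. As a self-contained illustration of the same idea, the sketch below assumes the tag corners are already available as matched 3D points in world and camera frames (e.g., back-projected via depth) and recovers the rigid transform by Kabsch alignment; the function name and this depth-assisted variant are illustrative assumptions, not the dataset's actual pipeline:

```python
import numpy as np

def extrinsics_from_tags(world_pts, cam_pts):
    """Recover (R, t) such that cam = R @ world + t via Kabsch alignment.

    world_pts, cam_pts: (N, 3) arrays of matched AprilTag corner positions.
    """
    cw, cc = world_pts.mean(axis=0), cam_pts.mean(axis=0)
    H = (world_pts - cw).T @ (cam_pts - cc)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cc - R @ cw
    return R, t
```

With five tags (20 corners), the least-squares fit is well over-determined, which is the point of using multiple fiducials rather than a single tag.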
Robot data acquisition features multi-view RGB (ego, third-person, wrist, top), depth (co-registered), proprioception (joint angles/velocities at 50–100 Hz), and control signals (end-effector pose, gripper/finger positions at 10–50 Hz, platform-specific). Planned extensions include force/torque and tactile arrays.
Vision–language tasks rely on images at their original resolutions (capped at the model's maximum input size), together with text including instructions, VQA pairs, referring expressions, and affordance labels.
3. Annotation Schema and Task Coverage
Manipulation tasks are provided for both human and robot traces:
- Human data: In-the-wild, egocentric activities (cooking, tool use, chores), plus 43 curated tabletop tasks in the UniCraftor subset (200+ h).
- Robot data: Pick-and-place, open/close, stacking, hand-over, packaging, wiping, scanning, long-horizon and bimanual procedures, sourced from at least 15 public corpora (e.g., OpenX-Embodiment, AgiBot-World, RoboMIND) and new data for embodiments such as PND Adam-U and BeingBeyond D1.
- Vision–language: General VQA (LLaVA-v1.5, LLaVA-Video), 2D spatial grounding (RefCOCO, RoboPoint, PixMo-Points), and task planning (ShareRobot, EO1.5M-QA).
Labeling conventions:
- Human videos: Per-second fine-grained instructions and 10-s chunk-level intents; paraphrased via Gemini-2.5 LLM to reduce template bias.
- Robot data: Raw state/action sequences grouped into fixed-length action chunks, aligned with vision frames.
- Temporal alignment: All modalities timestamped by a central clock, synchronizing video, depth, hand pose, proprioception, and control signals.
Task-segment boundaries are defined by synchronized pedal events (human) or log markers (robot teleoperation).
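The central-clock alignment can be sketched as nearest-timestamp resampling of each sensor stream onto the reference clock. The function name and the 20 ms gap tolerance below are illustrative assumptions:

```python
import numpy as np

def align_to_clock(clock_ts, stream_ts, stream_vals, max_gap=0.02):
    """Match each central-clock tick to the nearest sample of a sensor stream.

    Returns the matched values and a validity mask marking ticks whose
    nearest sample is within max_gap seconds (stale matches are flagged).
    """
    idx = np.searchsorted(stream_ts, clock_ts)
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    pick = np.where(np.abs(clock_ts - left) <= np.abs(right - clock_ts),
                    idx - 1, idx)
    gap = np.abs(stream_ts[pick] - clock_ts)
    return stream_vals[pick], gap <= max_gap
```

Because proprioception (50–100 Hz) runs faster than video (30 Hz), this direction of matching (clock ticks to stream samples) avoids dropping video frames while tolerating jitter in the faster channels.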
4. Data Collection, Processing, and File Structure
Acquisition pipelines involve comprehensive calibration and post-processing:
- Calibration: Camera intrinsics per device; extrinsics via AprilTags and PnP; hardware depth to minimize learning artifacts.
- Synchronization: Centralized timestamping across all modalities and sensor channels; pedal I/O synchronized.
- Post-processing steps:
- Inpainting of AprilTag regions in human videos with Grounded-SAM2 and DiffuEraser.
- Filtering of hand pose estimates by confidence and DBA error; jitter removal.
- Semantic screening with Gemini-2.5 to omit non-manipulative segments.
- Left–right mirroring applied to human data to debias handedness.
- Robot data deduplicated and frame-downsampled to 30% for increased diversity.
File and directory formats are standardized. For example, the structure contains separate subdirectories for human/robot/VLM/metadata, with detailed breakdowns for video, depth, action/state, calibration, instructions, events, and aggregate statistics.
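Since the exact directory names are not reproduced here, the layout check below uses hypothetical subdirectory names purely to illustrate the kind of standardized human/robot/VLM/metadata structure described:

```python
from pathlib import Path

# Hypothetical top-level layout; names are illustrative assumptions,
# not the dataset's actual on-disk schema.
EXPECTED = {
    "human": ["video", "depth", "hand_pose", "instructions", "events"],
    "robot": ["video", "depth", "state_action", "calibration"],
    "vlm": ["images", "annotations"],
    "metadata": ["statistics"],
}

def missing_dirs(root):
    """Return the expected subdirectories absent under root."""
    root = Path(root)
    return [f"{top}/{sub}"
            for top, subs in EXPECTED.items()
            for sub in subs
            if not (root / top / sub).is_dir()]
```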
5. Unified Action Space Mapping
To harmonize heterogeneous control schemes, UniHand-2.0 introduces a fixed-length unified action and state vector composed of semantic slots. The mapping is as follows:
- State/action projection: a per-embodiment mapping $\phi_e$ sends raw states and actions into the unified vector, $(s_u, a_u) = \phi_e(s_e, a_e)$, where $\phi_e$ projects embodiment-specific data into semantically consistent subspaces (end-effector delta position, gripper width, finger articulation, base velocity), with zero-padding of unused slots.
- Action chunking: fixed-length chunks of commands are mapped into the unified space.
- Physical parameterization:
- End-effector: position in $\mathbb{R}^3$ (world frame), orientation as axis–angle in $\mathbb{R}^3$.
- Joint angles: absolute radians per revolute joint.
- Gripper/finger: linear width or articulation in SI units.
- Raw magnitudes are preserved; only outlier clipping is used for scale management.
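The slot-based unified vector can be sketched as below; the slot names mirror the subspaces listed above, but the sizes, ordering, and total dimension are assumptions for illustration:

```python
import numpy as np

# Illustrative slot layout; sizes and ordering are assumptions,
# not the dataset's actual schema.
SLOTS = {
    "ee_delta_pos": 3,        # end-effector delta position (world frame)
    "ee_axis_angle": 3,       # orientation change as axis-angle
    "gripper_width": 1,       # parallel-jaw opening, metres
    "finger_articulation": 12,  # dexterous-hand joint values
    "base_velocity": 3,       # mobile-base command
}
OFFSETS, DIM = {}, 0
for name, size in SLOTS.items():
    OFFSETS[name] = (DIM, DIM + size)
    DIM += size

def to_unified(parts):
    """Project embodiment-specific signals into the fixed-length vector;
    slots the embodiment does not use stay zero-padded."""
    u = np.zeros(DIM)
    for name, values in parts.items():
        lo, hi = OFFSETS[name]
        u[lo:hi] = values
    return u
```

A single-arm gripper platform fills only the end-effector and gripper slots; a humanoid with dexterous hands additionally fills the finger and base slots, so morphologically different robots share one action interface.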
6. Statistics, Serialization, and Losses
Summarized source and robot hours:
| Source | Hours | Tokens (B) | % Hours |
|---|---|---|---|
| Human video (MANO-aligned) | 16,000 | 25.6 | 46% |
| Robot manipulation | 14,000 | 45.7 | 40% |
| Vision–language corpora | 5,000 | 50.2 | 14% |
| Total | 35,000 | 121.5 | 100% |
An excerpt for robot embodiment hours:
| Category | Platform | Views | Hours |
|---|---|---|---|
| Single-arm | Franka A1/A2 | ego, 3rd×2, wrist | 2,196.4 |
| Single-arm | Google Robot | ego, 3rd | 1,195.2 |
| Dual-arm | RMC Aida L | high, 2×wrist | 325.7 |
| Portable/Edu | BeingBeyond D1 | ego | 100.0 |
| Half-humanoid | PND Adam-U | ego | 200.0 |
| Humanoid | Unitree G1edu-u3 | ego, 2×wrist | 135.7 |
Multimodal samples are serialized into a single token sequence under a unified schema, with non-text modalities wrapped in dedicated <state>…</state> or <action>…</action> delimiter tokens.
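A minimal sketch of such delimiter-token serialization; the exact token vocabulary and number formatting are assumptions:

```python
def serialize(segments):
    """Flatten (kind, payload) segments into one token string.

    Non-text payloads are wrapped in <state>...</state> or
    <action>...</action> delimiter tokens; the token names and the
    3-decimal formatting are illustrative, not the dataset's spec.
    """
    out = []
    for kind, payload in segments:
        if kind == "text":
            out.append(payload)
        elif kind in ("state", "action"):
            body = " ".join(f"{x:.3f}" for x in payload)
            out.append(f"<{kind}>{body}</{kind}>")
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return " ".join(out)
```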
The action-chunk loss combines a continuous forward-model (FM) loss and a discrete mask loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{FM}} + \mathcal{L}_{\mathrm{mask}}$$

where:
- $\mathcal{L}_{\mathrm{FM}}$ regresses the predicted continuous vector $\hat{v}$ against the target action chunk $a$,
- $\mathcal{L}_{\mathrm{mask}}$ is a classification loss over the discrete label $c$.
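A minimal sketch of a combined continuous-plus-discrete objective of this shape, assuming an MSE stand-in for the FM regression term, cross-entropy for the mask term, and a unit weight between them (all three are assumptions, not the paper's exact losses):

```python
import numpy as np

def combined_loss(v_pred, chunk_target, logits, label, lam=1.0):
    """Continuous FM-style regression plus discrete mask cross-entropy.

    v_pred, chunk_target: flattened continuous action-chunk vectors.
    logits: unnormalized scores over discrete labels; label: target index.
    The MSE stand-in and the weight lam are illustrative assumptions.
    """
    fm = np.mean((v_pred - chunk_target) ** 2)
    # numerically stable log-softmax cross-entropy
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    mask = -log_probs[label]
    return fm + lam * mask
```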
7. Significance and Accessibility
UniHand-2.0 establishes a new baseline for integrated, human-centric robotics datasets by balancing rich human hand-motion priors, heterogeneous robot trajectories, and vision–language understanding within a unified, time-aligned framework. By constraining simulated data to 26% and emphasizing hand pose and semantic alignment, the dataset minimizes the sim-to-real gap and supports robust generalization across morphologically divergent platforms. All dataset resources, including weights and training code, are openly available at https://research.beingbeyond.com/being-h05 (Luo et al., 19 Jan 2026).