
UniHand-2.0: Multimodal Robotics Dataset

Updated 14 April 2026
  • The UniHand-2.0 dataset provides a framework integrating 35,000 hours of human, robot, and vision-language data to set new standards in cross-embodiment skill transfer.
  • It features rich multimodal inputs including RGB, depth, proprioception, hand pose, and task-language data collected from 30 diverse robotic platforms.
  • The unified action space mapping and synchronized sensor modalities enhance simulation-to-real transfer and support robust task planning in robotics.

UniHand-2.0 is a large-scale multimodal dataset designed for embodied robot learning, providing a comprehensive corpus to support robust cross-embodiment generalization across diverse robotic platforms. The dataset integrates 35,000 hours of human and robot demonstration videos, trajectories, and vision–language supervision, unified under a consistent action and annotation schema. UniHand-2.0 serves as the foundation for the Being-H0.5 Vision-Language-Action (VLA) model, establishing new standards for cross-domain skill transfer, sample diversity, and task coverage in human-centric robotics (Luo et al., 19 Jan 2026).

1. Scale, Modalities, and Robotic Embodiments

UniHand-2.0 encompasses approximately 400 million samples and over 120 billion multimodal tokens, making it the largest embodied pre-training dataset to date. The data composition comprises three core sources:

  • Egocentric human video data: 16,000 hours, approximately 134 million clips, with hand pose annotations aligned to MANO parameters.
  • Robot manipulation trajectories: approximately 14,000 hours (13,817 hours of real and simulated data, with simulation capped at 26%), comprising ~1.5 billion frames.
  • General vision–language corpora: 5,000 hours equivalent, including visual question answering (VQA), spatial grounding, and task planning.

Thirty robotic embodiments are included, categorized as follows:

| Category | Number of Platforms | Example Platforms |
| --- | --- | --- |
| Single-arm | 13 | Franka, Kuka-iiwa, UR5E, WidowX, PR2 |
| Dual-arm | 5 | RMC Aida L, Galaxea R1 Lite, Agilex ALOHA |
| Portable/education arms | 2 | BeingBeyond D1, LeRobot SO-101 |
| Half-humanoid & Humanoid | 10 | PND Adam-U, Agibot-G1, Unitree G1 |

Sensor modalities include RGB and depth video, proprioceptive measures, end-effector and joint state recordings, and comprehensive task-language data.

2. Sensor Modalities and Data Acquisition

Human demonstrations utilize the UniCraftor system, collecting:

  • RGB: Intel RealSense D435 at 640×480, 30 Hz.
  • Depth: Active infrared-stereo at 30 Hz, direct from hardware (no learned estimation).
  • Camera pose: Computed via five AprilTags and PnP for ground-truth extrinsics (a minimal PnP sketch follows this list).
  • Hand pose: Estimated as MANO parameters (via HaWoR with multi-view refinement).
  • Foot-pedal events: Timestamped, indicating contact/release actions.
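
The following is a minimal, hedged sketch of the AprilTag-plus-PnP extrinsics step described above, written with OpenCV. The tag geometry, intrinsics, and function name are illustrative assumptions, not the dataset's actual calibration code.

```python
# Hedged sketch: recovering camera extrinsics from detected AprilTag corners via PnP.
# Tag layout, intrinsics, and names below are illustrative, not UniHand-2.0's pipeline.
import cv2
import numpy as np

def estimate_extrinsics(tag_corners_px, tag_corners_world, K, dist_coeffs):
    """Solve for the camera pose from 2D tag corners and their known 3D world positions.

    tag_corners_px:    (N, 2) detected corner pixels across the five AprilTags
    tag_corners_world: (N, 3) corresponding 3D corner coordinates in the world frame
    K:                 (3, 3) camera intrinsic matrix
    dist_coeffs:       lens distortion coefficients
    """
    ok, rvec, tvec = cv2.solvePnP(
        tag_corners_world.astype(np.float64),
        tag_corners_px.astype(np.float64),
        K, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    if not ok:
        raise RuntimeError("PnP failed; check tag detections")
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec              # world-to-camera rotation and translation
```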

Robot data acquisition features multi-view RGB (ego, third-person, wrist, top), depth (co-registered), proprioception (joint angles/velocities at 50–100 Hz), and control signals (end-effector pose, gripper/finger positions at 10–50 Hz, platform-specific). Planned extensions include force/torque and tactile arrays.

Vision–language tasks rely on images (stored at original resolution, resized to at most 224×224 for model input) and text including instructions, VQA pairs, referring expressions, and affordance labels.

3. Annotation Schema and Task Coverage

Manipulation tasks are provided for both human and robot traces:

  • Human data: In-the-wild, egocentric activities (cooking, tool use, chores), with 43 curated tabletop tasks in the UniCraftor subset (200+ hours).
  • Robot data: Pick-and-place, open/close, stacking, hand-over, packaging, wiping, scanning, long-horizon and bimanual procedures, sourced from at least 15 public corpora (e.g., OpenX-Embodiment, AgiBot-World, RoboMIND) and new data for embodiments such as PND Adam-U and BeingBeyond D1.
  • Vision–language: General VQA (LLaVA-v1.5, LLaVA-Video), 2D spatial grounding (RefCOCO, RoboPoint, PixMo-Points), and task planning (ShareRobot, EO1.5M-QA).

Labeling conventions:

  • Human videos: Per-second fine-grained instructions and 10-s chunk-level intents; paraphrased via Gemini-2.5 LLM to reduce template bias.
  • Robot data: Raw state/action sequences with action chunks of length T, aligned with vision frames.
  • Temporal alignment: All modalities timestamped by a central clock, synchronizing video, depth, hand pose, proprioception, and control signals.

Task-segment boundaries are defined by synchronized pedal events (human) or log markers (robot teleoperation).
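
Since all modalities share a central clock, lower-rate streams (e.g. 10–50 Hz control signals) can be aligned to the 30 Hz video frames by nearest-timestamp lookup. The sketch below assumes NumPy arrays and an illustrative gap tolerance; it is not the dataset's actual alignment code.

```python
# Hedged sketch: nearest-timestamp alignment of a sensor stream to video frames.
import numpy as np

def align_to_frames(frame_ts, signal_ts, signal_vals, max_gap_s=0.05):
    """Pick, for each frame timestamp, the nearest signal sample on the shared clock.

    frame_ts:    (F,) frame timestamps in seconds
    signal_ts:   (S,) sorted signal timestamps in seconds
    signal_vals: (S, D) signal values (e.g. joint angles or end-effector pose)
    Returns an (F, D) array; frames with no sample within max_gap_s become NaN rows.
    """
    idx = np.clip(np.searchsorted(signal_ts, frame_ts), 1, len(signal_ts) - 1)
    prev, nxt = idx - 1, idx
    nearest = np.where(frame_ts - signal_ts[prev] <= signal_ts[nxt] - frame_ts, prev, nxt)
    aligned = signal_vals[nearest].astype(float)
    gap = np.abs(signal_ts[nearest] - frame_ts)
    aligned[gap > max_gap_s] = np.nan  # mark frames without a sufficiently close sample
    return aligned
```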

4. Data Collection, Processing, and File Structure

Acquisition pipelines involve comprehensive calibration and post-processing:

  • Calibration: Camera intrinsics per device; extrinsics via AprilTags and PnP; hardware depth to minimize learning artifacts.
  • Synchronization: Centralized timestamping across all modalities and sensor channels; pedal I/O synchronized.
  • Post-processing steps:
  1. Inpainting of AprilTag regions in human videos with Grounded-SAM2 and DiffuEraser.
  2. Filtering of hand pose estimates by confidence and DBA error; jitter removal.
  3. Semantic screening with Gemini-2.5 to omit non-manipulative segments.
  4. Left–right mirroring applied to human data to debias handedness (a simplified mirroring sketch follows this list).
  5. Robot data deduplicated and frame-downsampled to 30% for increased diversity.
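
The mirroring step can be pictured at the keypoint level as a horizontal image flip plus negation of the camera-frame x coordinate, with the hand-side label swapped. This is a simplified, assumed sketch; mirroring the stored MANO parameters additionally requires swapping the left/right MANO hand models, which is omitted here.

```python
# Hedged sketch: left-right mirroring of an egocentric frame and 3D hand keypoints.
import numpy as np

def mirror_sample(rgb, keypoints_cam, hand_side):
    """Mirror an RGB frame and camera-frame keypoints about the vertical axis.

    rgb:           (H, W, 3) image
    keypoints_cam: (J, 3) hand keypoints in the camera frame (x right, y down, z forward)
    hand_side:     "left" or "right"
    """
    rgb_flipped = rgb[:, ::-1].copy()  # horizontal image flip
    kp = keypoints_cam.copy()
    kp[:, 0] *= -1.0                   # negate x: mirrors geometry consistently with the image flip
    new_side = "right" if hand_side == "left" else "left"
    return rgb_flipped, kp, new_side
```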

File and directory formats are standardized. For example, the structure contains separate subdirectories for human/robot/VLM/metadata, with detailed breakdowns for video, depth, action/state, calibration, instructions, events, and aggregate statistics.
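
As an illustration only, a layout consistent with that description might look like the sketch below; the directory names are hypothetical placeholders, not the dataset's actual paths.

```python
# Hedged sketch of a standardized top-level layout; names are illustrative placeholders.
from pathlib import Path

EXPECTED_LAYOUT = {
    "human":    ["video", "depth", "hand_pose", "calibration", "instructions", "events"],
    "robot":    ["video", "depth", "action_state", "calibration", "instructions"],
    "vlm":      ["images", "annotations"],
    "metadata": ["statistics"],
}

def missing_dirs(root):
    """Return the expected subdirectories that are absent under a dataset root."""
    return [f"{top}/{sub}"
            for top, subs in EXPECTED_LAYOUT.items()
            for sub in subs
            if not (Path(root) / top / sub).is_dir()]
```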

5. Unified Action Space Mapping

To harmonize heterogeneous control schemes, UniHand-2.0 introduces a fixed-length unified action and state vector composed of semantic slots. The mapping is as follows:

  • State/action projection:

$$\mathbf{s} = \Phi_{e}\left(\mathbf{s}^{(e)}\right), \qquad \mathbf{a} = \Phi_{e}\left(\mathbf{a}^{(e)}\right)$$

where $\Phi_{e}$ projects embodiment-specific data into semantically consistent subspaces (end-effector delta position, gripper width, finger articulation, base velocity), with zero-padding of unused slots; a minimal projection sketch follows the list below.

  • Action chunking: Chunks of $n$ commands $u^{(1)}, \dots, u^{(n)}$ mapped to the unified space.
  • Physical parameterization:
    • End-effector: $\Delta x, \Delta y, \Delta z$ (world frame), axis–angle rotation in $SO(3)$.
    • Joint angles: absolute radians per revolute joint.
    • Gripper/finger: linear width or articulation in SI units.
    • Raw magnitudes are preserved; only outlier clipping is used for scale management.
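
A minimal sketch of the projection $\Phi_{e}$ follows, assuming an illustrative slot layout and dimensionality; the paper's exact slot sizes and ordering are not reproduced here.

```python
# Hedged sketch: packing an embodiment-specific action into a fixed-length unified
# vector with semantic slots and zero-padding of unused slots. Slot sizes are assumed.
import numpy as np

SLOTS = {
    "ee_delta_pos":  slice(0, 3),    # delta x, y, z in the world frame
    "ee_delta_rot":  slice(3, 6),    # axis-angle rotation
    "gripper_width": slice(6, 7),    # linear width in SI units
    "finger_joints": slice(7, 19),   # finger articulation in radians
    "base_velocity": slice(19, 22),  # mobile-base velocity
}
UNIFIED_DIM = 22  # illustrative, not the paper's actual dimensionality

def phi_e(embodiment_action):
    """Project an embodiment-specific action dict into the unified vector."""
    a = np.zeros(UNIFIED_DIM, dtype=np.float32)  # unused slots stay zero-padded
    for name, values in embodiment_action.items():
        a[SLOTS[name]] = values                  # raw magnitudes preserved, no rescaling
    return a

# A chunk of n commands maps by applying phi_e to each command:
chunk = np.stack([phi_e({"ee_delta_pos": [0.01, 0.0, -0.02], "gripper_width": [0.04]})
                  for _ in range(8)])
```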

6. Statistics, Serialization, and Losses

Summarized source and robot hours:

| Source | Hours | Tokens (B) | % Hours |
| --- | --- | --- | --- |
| Human video (MANO-aligned) | 16,000 | 25.6 | 46% |
| Robot manipulation | 14,000 | 45.7 | 40% |
| Vision–language corpora | 5,000 | 50.2 | 14% |
| Total | 35,000 | 121.5 | 100% |

An excerpt for robot embodiment hours:

| Category | Platform | Views | Hours |
| --- | --- | --- | --- |
| Single-arm | Franka A1/A2 | ego, 3rd×2, wrist | 2,196.4 |
| Single-arm | Google Robot | ego, 3rd | 1,195.2 |
| Dual-arm | RMC Aida L | high, 2×wrist | 325.7 |
| Portable/Edu | BeingBeyond D1 | ego | 100.0 |
| Half-humanoid | PND Adam-U | ego | 200.0 |
| Humanoid | Unitree G1 (edu-u3) | ego, 2×wrist | 135.7 |

Multimodal sample serialization uses the schema:

$$\mathcal{S} = \left[\, (\mathrm{m}_1, C_1),\ (\mathrm{m}_2, C_2),\ \dots,\ (\mathrm{m}_K, C_K) \,\right]$$

where each modality marker $\mathrm{m}_k$ is paired with its content block $C_k$, and non-text content is wrapped in dedicated state or action tokens.
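
Conceptually, one sample can be serialized as an ordered list of (modality, content) pairs, as in the hedged sketch below; the tag names are illustrative and not the dataset's actual token vocabulary.

```python
# Hedged sketch: ordering one multimodal sample as (modality, content) pairs,
# following the schema S = [(m_1, C_1), ..., (m_K, C_K)] above.
def serialize_sample(image, instruction, state_vec, action_chunk):
    """Return the ordered (modality, content) sequence for one training sample."""
    return [
        ("image",  image),         # RGB observation (resized for model input)
        ("text",   instruction),   # natural-language instruction
        ("state",  state_vec),     # unified proprioceptive state vector
        ("action", action_chunk),  # unified action chunk of n commands
    ]
```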

The action-chunk objective combines a continuous forward-model (FM) loss, computed between the predicted unified action vector and the ground-truth action chunk, with a discrete mask loss over the corresponding discrete labels.
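
As a hedged illustration of such a combined objective (the exact form of the FM term and its weighting are not reproduced here; the MSE and cross-entropy choices below are assumptions):

```python
# Hedged sketch: continuous forward-model (FM) regression term plus a discrete
# mask (cross-entropy) term over action tokens. Loss forms and weighting are assumed.
import torch
import torch.nn.functional as F

def action_chunk_loss(pred_actions, target_chunk, pred_logits, target_labels, lam=1.0):
    """pred_actions, target_chunk: (B, n, D) continuous unified action chunks.
    pred_logits: (B, n, V) logits over discrete action labels; target_labels: (B, n).
    """
    loss_fm = F.mse_loss(pred_actions, target_chunk)        # continuous FM term (assumed MSE)
    loss_mask = F.cross_entropy(pred_logits.flatten(0, 1),   # discrete mask term
                                target_labels.flatten())
    return loss_fm + lam * loss_mask
```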

7. Significance and Accessibility

UniHand-2.0 establishes a new baseline for integrated, human-centric robotics datasets by balancing rich human hand-motion priors, heterogeneous robot trajectories, and vision–language understanding within a unified, time-aligned framework. By constraining simulated data to 26% and emphasizing hand pose and semantic alignment, the dataset minimizes the sim-to-real gap and supports robust generalization across morphologically divergent platforms. All resources, including the dataset, model weights, and training code, are openly available at https://research.beingbeyond.com/being-h05 (Luo et al., 19 Jan 2026).
