HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation (2508.20085v3)

Published 27 Aug 2025 in cs.RO

Abstract: Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks. Project Page: https://gemcollector.github.io/HERMES/.


Summary

  • The paper introduces a unified RL framework that integrates teleoperation, mocap, and video data to teach mobile bimanual robots versatile manipulation skills.
  • It employs a hybrid sim2real control strategy with closed-loop PnP localization, achieving a mean success rate of 67.8% and a +54% improvement over baselines.
  • Key insights include high sample efficiency, effective DAgger-based vision policy distillation, and robust generalization to varied object geometries and environments.

HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation

Introduction and Motivation

HERMES presents a unified framework for mobile bimanual dexterous manipulation, leveraging heterogeneous human motion data sources—teleoperation, mocap, and raw video—to impart versatile manipulation skills to robots equipped with multi-fingered hands. The system addresses the embodiment gap and sim2real transfer challenges by integrating reinforcement learning (RL), vision-based policy distillation, and a closed-loop navigation-localization pipeline. The framework is designed to generalize across diverse, unstructured environments and complex manipulation tasks, with a focus on high sample efficiency and robust real-world deployment.

Figure 1: HERMES demonstrates a broad spectrum of mobile bimanual dexterous manipulation skills in real-world scenarios, learned from one-shot human motion.

System Architecture

HERMES comprises a mobile base, dual 6-DoF arms, and two 6-DoF dexterous hands, with high-fidelity simulation models constructed in MuJoCo and MJX. The simulation accurately models passive joints and collision dynamics using equality constraints and primitive shape approximations, facilitating stable and realistic training. The hardware setup includes RGBD and fisheye cameras for manipulation and navigation, respectively, and is controlled via ROS on an RTX 4090-equipped laptop.

Figure 2: Unified mobile bimanual robot setup in simulation and real world, enabling sim2real transfer for complex tasks.
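
To make the equality-constraint modeling concrete, here is a hedged sketch of how a passive joint can be coupled to an actuated one in MuJoCo; the MJCF fragment and all names/coefficients are illustrative, not the paper's actual robot model.

```python
import mujoco

# Hypothetical MJCF fragment: a passive joint coupled to an active joint via
# an equality constraint, in the spirit of the modeling described above.
MJCF = """
<mujoco>
  <worldbody>
    <body name="lid" pos="0 0 0.1">
      <joint name="hinge_active" type="hinge" axis="0 1 0"/>
      <geom type="box" size="0.05 0.05 0.01"/>
      <body name="latch" pos="0.05 0 0">
        <joint name="hinge_passive" type="hinge" axis="0 1 0"/>
        <geom type="capsule" size="0.01 0.03"/>
      </body>
    </body>
  </worldbody>
  <equality>
    <!-- Passive joint tracks half the active joint's angle: q2 = 0.5 * q1 -->
    <joint joint1="hinge_passive" joint2="hinge_active" polycoef="0 0.5 0 0 0"/>
  </equality>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(MJCF)
data = mujoco.MjData(model)
mujoco.mj_step(model, data)  # one physics step with the constraint active
```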

Learning from Multi-Source Human Motion

HERMES supports three modalities for human motion acquisition:

  • Teleoperation in simulation: Direct control via Apple Vision Pro, capturing hand/arm poses at 75 Hz.
  • Mocap data: Retargeted from OakInk2, with RL compensating for embodiment discrepancies.
  • Video extraction: WiLoR for hand pose estimation and FoundationPose for object trajectories, with PnP-based alignment to the robot frame.

    Figure 3: FoundationPose and WiLoR extract object and hand trajectories from raw video for robot learning.

    Figure 4: WiLoR and PnP precisely transform estimated hand poses into the robot’s frame.

Trajectory augmentation is performed by randomizing object positions/orientations, enabling spatial generalization from a single demonstration. DexPilot is used for initial retargeting, followed by RL refinement.
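
A minimal sketch of this augmentation step, assuming the reference is stored as position trajectories; the rigid-transform sampling ranges and function names are assumptions for illustration.

```python
import numpy as np

def augment_trajectory(obj_poses, hand_poses, pos_noise=0.05, yaw_noise=0.3):
    """Sample a random planar rigid transform of the object's start pose and
    apply it to the whole reference trajectory (object and hand), yielding
    spatially varied goals from a single demonstration."""
    dx, dy = np.random.uniform(-pos_noise, pos_noise, size=2)
    yaw = np.random.uniform(-yaw_noise, yaw_noise)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    t = np.array([dx, dy, 0.0])

    def transform(poses):  # poses: (T, 3) positions; orientations omitted here
        return poses @ R.T + t

    return transform(obj_poses), transform(hand_poses)
```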

Reinforcement Learning and Reward Design

Tasks are formulated as goal-conditioned MDPs, with the state s including proprioception and the goal state from the reference trajectory. The reward function is unified across tasks and comprises:

  • Object-centric distance chain: Temporal variation of the vectors between the object center and the fingertips/palm, activated only once sufficient contact points are established.

    Figure 5: Object-centric distance chain reward tracks spatial relationships between object and hand keypoints.

  • Object trajectory tracking: Penalizes deviation in position and orientation from the reference.
  • Power penalty: Reduces actuation jitter.
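
A hedged sketch of how these three terms might combine; the weights, contact threshold, and quaternion-distance proxy are assumptions, not the paper's exact formulation.

```python
import numpy as np

def unified_reward(obj_pos, obj_quat, ref_pos, ref_quat, fingertip_pos,
                   prev_dists, qvel, torque, contact_count,
                   min_contacts=2, w_chain=1.0, w_track=2.0, w_power=1e-3):
    """Illustrative unified reward: distance chain + tracking + power penalty."""
    # Object-centric distance chain: reward the temporal reduction of
    # fingertip-to-object distances, gated on sufficient contact.
    dists = np.linalg.norm(fingertip_pos - obj_pos, axis=-1)
    r_chain = np.sum(prev_dists - dists) if contact_count >= min_contacts else 0.0
    # Object trajectory tracking: penalize position/orientation deviation.
    pos_err = np.linalg.norm(obj_pos - ref_pos)
    rot_err = 1.0 - abs(np.dot(obj_quat, ref_quat))  # quaternion distance proxy
    r_track = -(pos_err + rot_err)
    # Power penalty: discourage actuation jitter.
    r_power = -np.sum(np.abs(torque * qvel))
    return w_chain * r_chain + w_track * r_track + w_power * r_power, dists
```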

Residual action learning is employed: arm actions are decomposed into coarse (from human trajectory) and fine (network-predicted) components, while hand actions are fully network-driven. Early termination and collision disabling are used to improve exploration efficiency.
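
A small sketch of the residual decomposition; the residual bound and scale are assumptions.

```python
import numpy as np

def compose_actions(ref_arm_qpos, arm_residual, hand_action, residual_scale=0.05):
    """Arm command = retargeted human trajectory (coarse) + bounded,
    network-predicted residual (fine); hand command is fully network-driven."""
    arm_cmd = ref_arm_qpos + residual_scale * np.clip(arm_residual, -1.0, 1.0)
    return arm_cmd, hand_action
```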

Both DrM (an off-policy algorithm leveraging the dormant ratio) and PPO (run on MJX for massively parallel training) are implemented, achieving high sample efficiency and demonstrating that the approach generalizes across RL algorithms.

Vision-Based Sim2Real Transfer

State-based RL policies are distilled into vision-based policies via DAgger, using depth images as input. Depth augmentation includes clipping, Gaussian noise/blur, and mixup with the NYU Depth dataset, achieving semantic and distributional alignment between simulated and real-world depth maps.

Figure 6: Depth image comparison shows strong semantic correspondence between simulated and real-world representations after preprocessing.

Figure 7: Depth intensity distributions from simulation and real-world images are closely aligned.
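
A minimal sketch of the depth augmentation pipeline described above; the clip range, noise level, and mixup coefficient are assumptions, not the paper's settings.

```python
import numpy as np
import cv2

def augment_depth(depth, real_depth_pool, clip_range=(0.3, 1.5),
                  noise_std=0.01, mix_alpha=0.2):
    """Clip, blur, add Gaussian noise, then mix with a real depth map
    (e.g., from the NYU Depth dataset) for distributional alignment."""
    d = np.clip(depth, *clip_range).astype(np.float32)   # clip to workspace range
    d = cv2.GaussianBlur(d, (5, 5), sigmaX=1.0)          # Gaussian blur
    d = d + np.random.normal(0.0, noise_std, d.shape)    # additive Gaussian noise
    real = real_depth_pool[np.random.randint(len(real_depth_pool))]
    lam = np.random.beta(mix_alpha, mix_alpha)           # mixup coefficient
    d = lam * d + (1.0 - lam) * np.clip(real, *clip_range)
    return d.astype(np.float32)
```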

DAgger distillation uses stacked depth frames and a ResNet-18 encoder (with GroupNorm), with a rollout scheduler annealing expert/student policy usage. L1/L2 action losses and proprioception noise injection further improve generalization.
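
An illustrative DAgger loop with the annealed expert/student rollout scheduler; the env/policy interfaces are hypothetical placeholders, not the paper's code.

```python
import numpy as np

def dagger_distill(env, expert, student, buffer, iters=50, beta0=1.0, decay=0.95):
    """Distill a state-based expert into a depth-based student via DAgger."""
    beta = beta0
    for _ in range(iters):
        depth_obs, state = env.reset()  # stacked depth frames + privileged state
        done = False
        while not done:
            expert_action = expert.act(state)         # state-based RL expert
            if np.random.rand() < beta:               # annealed rollout scheduler
                action = expert_action
            else:
                action = student.act(depth_obs)       # depth-based student
            buffer.add(depth_obs, expert_action)      # always label with expert
            (depth_obs, state), done = env.step(action)
        student.train_on(buffer)                      # L1/L2 action regression
        beta *= decay                                 # shift rollouts to student
    return student
```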

Hybrid Sim2Real Control

A hybrid control strategy is adopted: actions are inferred from real-world observations, executed in simulation to compute target joint values, and then mapped to the real robot. This maintains dynamic consistency and mitigates sim2real discrepancies.

Figure 8: Hybrid sim2real control leverages real-world observations for action inference and simulation for joint computation.
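
A minimal sketch of this real-to-sim-to-real loop; the robot/simulator interfaces are illustrative placeholders.

```python
def hybrid_control_step(real_robot, sim, policy):
    """One step of hybrid sim2real control: infer from real observations,
    integrate dynamics in simulation, then command the resulting joints."""
    obs = real_robot.get_depth_and_proprio()   # real-world observation
    action = policy.act(obs)                   # vision-based policy inference
    sim.apply_action(action)                   # execute the action in simulation
    sim.step()                                 # dynamics integration (MuJoCo)
    q_target = sim.get_joint_positions()       # resulting joint values
    real_robot.command_joints(q_target)        # mirror onto the real robot
```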

Navigation and Closed-Loop PnP Localization

ViNT is used for image-goal navigation, supporting long-horizon, zero-shot generalization. However, ViNT alone does not guarantee precise pose alignment, so HERMES introduces a closed-loop PnP localization step:

  • Efficient LoFTR extracts dense feature correspondences between current and goal images.
  • PnP (RANSAC followed by refinement) estimates the relative pose, with PID controllers sequentially adjusting x, y, and yaw to minimize the error (a minimal sketch follows below).

    Figure 9: Closed-loop PnP localization pipeline iteratively refines robot pose for precise alignment.
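
A hedged sketch of the PnP step using OpenCV's solvePnPRansac and solvePnPRefineLM; keypoint matching (e.g., by Efficient LoFTR) is assumed to happen upstream, and the depth back-projection details are illustrative.

```python
import cv2
import numpy as np

def pnp_relative_pose(kpts_goal, kpts_cur, depth_goal, K):
    """Lift matched goal-image keypoints to 3D with the goal depth map, then
    estimate the relative camera pose via PnP + RANSAC and LM refinement."""
    u, v = kpts_goal[:, 0].astype(int), kpts_goal[:, 1].astype(int)
    z = depth_goal[v, u]
    valid = z > 0                                    # keep pixels with depth
    x = (kpts_goal[valid, 0] - K[0, 2]) * z[valid] / K[0, 0]
    y = (kpts_goal[valid, 1] - K[1, 2]) * z[valid] / K[1, 1]
    pts3d = np.stack([x, y, z[valid]], axis=-1).astype(np.float32)
    pts2d = kpts_cur[valid].astype(np.float32)
    ok, rvec, tvec, _ = cv2.solvePnPRansac(pts3d, pts2d, K.astype(np.float64), None)
    if ok:
        rvec, tvec = cv2.solvePnPRefineLM(pts3d, pts2d, K.astype(np.float64),
                                          None, rvec, tvec)
    return ok, rvec, tvec  # errors drive sequential PID control of x, y, yaw
```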

Experimental Results

Sample Efficiency and Generalization

HERMES demonstrates high sample efficiency across seven tasks, outperforming ObjDex in both single-object and multi-object scenarios. RL-based policies significantly outperform kinematic retargeting and direct replay, especially under randomized object poses.

Figure 10: Training curves show HERMES achieves superior sample efficiency and task completion across diverse human motion sources.

Figure 11: HERMES learns nuanced object interactions beyond kinematic retargeting.

Figure 12: Simulation visualizations of diverse training tasks from single reference trajectories.

Figure 13: Parallel training in MJX yields high wall-time efficiency and strong asymptotic performance.

Real-World Manipulation and Sim2Real Transfer

Zero-shot transfer of DAgger-trained policies achieves a mean success rate of 67.8% across six real-world bimanual tasks, outperforming raw depth baselines by +54.5%. Fine-tuning is only required for tasks with high visual noise or transparent objects.

Closed-loop PnP reduces localization errors to 1.3–3.2 cm (translation) and 0.57–2.06° (orientation), outperforming ViNT and RGB-D SLAM (RTAB-Map), especially in textureless scenarios.

Figure 14: HERMES achieves close alignment of terminal and target point clouds, outperforming ViNT in localization accuracy.

Figure 15: HERMES maintains precise localization in environments with sparse visual features.

Mobile Manipulation

End-to-end evaluation shows HERMES achieves a +54% improvement in manipulation success rate over ViNT-only localization, demonstrating the necessity of closed-loop PnP for bridging navigation and manipulation.

Figure 16: HERMES outperforms ViNT-only baseline in real-world mobile bimanual manipulation tasks.

Instance Generalization

Object geometry randomization during training enables zero-shot generalization to novel object shapes; a minimal sketch follows the figure below.

Figure 17: Policy adapts to randomized object geometries for robust manipulation.
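
A small sketch of per-episode geometry randomization; the scale range and the MuJoCo-style geom_size interface are assumptions for illustration.

```python
import numpy as np

def randomize_object_geometry(model, geom_id, base_size, scale_range=(0.8, 1.2)):
    """Rescale the object's geom relative to its nominal size each episode so
    the policy trains against varied shapes (e.g., on a mujoco.MjModel)."""
    scale = np.random.uniform(*scale_range, size=3)
    model.geom_size[geom_id] = base_size * scale
```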

Closed-Loop PnP and Feature Matching

Iterative closed-loop PnP achieves high-precision pose alignment across scenarios.

Figure 18: Closed-loop PnP visualization shows iterative refinement of robot pose.

Efficient LoFTR provides dense, high-frequency feature correspondences for robust PnP estimation.

Figure 19: Efficient LoFTR establishes dense correspondences for PnP pose estimation.

DAgger Training Efficiency

HERMES attains high sample efficiency and strong asymptotic performance across task types, outperforming pure imitation learning and student-only training.

Figure 20: DAgger training curves demonstrate sample efficiency and robust performance.

Hybrid Control Consistency

Hybrid control maintains consistent joint dynamics between simulation and the real robot, reducing sim2real gaps.

Figure 21: Hybrid control aligns simulated and real joint trajectories, minimizing dynamic discrepancies.

Depth-Anything Comparison

Depth-Anything achieves semantic alignment but exhibits quantitative distributional gaps, resulting in lower success rates compared to HERMES's direct depth augmentation.

Figure 22: Depth-Anything generates semantically aligned depth images between simulation and real world.

Figure 23: Depth-Anything depth maps reveal pronounced quantitative gaps between domains.

Implications and Future Directions

HERMES demonstrates that unified RL-based frameworks, leveraging diverse human motion sources and robust sim2real transfer, can endow mobile bimanual robots with generalizable dexterous manipulation skills. The closed-loop PnP localization is critical for bridging navigation and manipulation, especially in unstructured and textureless environments. The hybrid control strategy and depth-based vision pipeline are effective for mitigating sim2real gaps.

Limitations include reliance on quasi-static tasks, manual tuning of collision parameters, and hardware-simulation calibration mismatches. Future work should address dynamic tasks, automate simulation setup, and improve hardware robustness.

Conclusion

HERMES establishes a comprehensive pipeline for mobile bimanual dexterous manipulation, integrating multi-source human motion learning, RL-based policy refinement, vision-based sim2real transfer, and closed-loop navigation-localization. The framework achieves high sample efficiency, robust sim2real transfer, and strong generalization in real-world scenarios, providing a solid foundation for future research in embodied robot learning and mobile manipulation.
