HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval (2505.20455v3)

Published 26 May 2025 in cs.RO

Abstract: We hand the community HAND, a simple and time-efficient method for teaching robots new manipulation tasks through human hand demonstrations. Instead of relying on task-specific robot demonstrations collected via teleoperation, HAND uses easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. Using a visual tracking pipeline, HAND extracts the motion of the human hand from the hand demonstration and retrieves robot sub-trajectories in two stages: first filtering by visual similarity, then retrieving trajectories with similar behaviors to the hand. Fine-tuning a policy on the retrieved data enables real-time learning of tasks in under four minutes, without requiring calibrated cameras or detailed hand pose estimation. Experiments also show that HAND outperforms retrieval baselines by over 2x in average task success rates on real robots. Videos can be found at our project website: https://liralab.usc.edu/handretrieval/.

Summary

  • The paper introduces HAND, a novel method for fast robot adaptation to new tasks using simple human hand demonstrations and retrieving relevant behaviors from task-agnostic robot play data.
  • HAND extracts 2D motion paths from hand demos (CoTracker3) and retrieves matching robot sub-trajectories from play data via S-DTW, enabling adaptation in under four minutes and doubling success rates over baselines.
  • Evaluated in simulation and real-world tasks, HAND shows robust retrieval and policy adaptation across varied scenes and objects, reducing reliance on expert data for practical robot deployment.

Overview of "HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval"

The paper "HAND Me the Data: Fast Robot Adaptation via Hand Path Retrieval" presents a novel approach to facilitate rapid robot adaptation for executing new manipulation tasks via human hand demonstrations. This method, referred to as HAND, leverages easy-to-provide hand demonstrations to retrieve relevant behaviors from task-agnostic robot play data. The paper proposes a framework that effectively bridges the gap between human demonstrations and robotic actions, enabling efficient training of robot policies without requiring extensive teleoperation data.

Methodology

HAND utilizes a two-stage pipeline for behavior retrieval from play data:

  1. Path Extraction: The technique starts by extracting 2D motion paths from human hand demonstrations using a visual point-tracking model, CoTracker3, which tracks the hand's movement across video frames and produces a motion path expressed as relative 2D coordinates.
  2. Sub-Trajectory Retrieval: The framework first filters the robot's play data by visual similarity, then retrieves sub-trajectories whose motion patterns resemble the human demonstration. Subsequence Dynamic Time Warping (S-DTW) is used to identify and align the robot's movement patterns with the demonstrated hand path, as shown in the sketch after this list.
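
To make the second stage concrete, below is a minimal sketch of subsequence dynamic time warping (S-DTW) over 2D motion paths in Python/NumPy. The helper names, the displacement-based normalization, and the Euclidean step cost are illustrative assumptions rather than the paper's released implementation, and the visual-similarity filter from the first stage is assumed to have already narrowed the candidate play segments.

```python
import numpy as np

def relative_path(points):
    """Convert absolute 2D track points (T, 2) into per-step displacements,
    normalized by total path length so retrieval is insensitive to absolute
    image position and scale. (Illustrative normalization, not necessarily
    the one used in the paper.)"""
    deltas = np.diff(points, axis=0)
    scale = np.linalg.norm(deltas, axis=-1).sum() + 1e-8
    return deltas / scale

def subsequence_dtw(query, reference):
    """Find the sub-trajectory of `reference` best matching `query` under a
    DTW alignment with free start and end points on the reference side.

    query:     (m, d) array, e.g. relative 2D hand displacements
    reference: (n, d) array, e.g. relative 2D end-effector displacements
    Returns (cost, start, end) such that reference[start:end] is the match.
    """
    m, n = len(query), len(reference)
    # Pairwise Euclidean cost between every query step and reference step.
    cost = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)

    # Accumulated cost matrix; row 0 is zero so a match may start anywhere.
    D = np.full((m + 1, n + 1), np.inf)
    D[0, :] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],      # step in query
                                               D[i, j - 1],      # step in reference
                                               D[i - 1, j - 1])  # step in both
    # Cheapest cell in the last row marks the end of the best match.
    end = int(np.argmin(D[m, 1:])) + 1
    best_cost = float(D[m, end])

    # Backtrack to recover where the matched subsequence starts.
    i, j = m, end
    while i > 1:
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return best_cost, j - 1, end
```

Each candidate play segment can then be ranked by `best_cost`, with the lowest-cost sub-trajectories passed on to the fine-tuning stage described next.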

A significant claim of the paper is that a pre-trained play policy can be adapted to a new task in under four minutes by fine-tuning on the retrieved sub-trajectories. This rapid adaptation comes with average task success rates that outperform retrieval baselines by over 2x on real robots.
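
To illustrate what fine-tuning on retrieved sub-trajectories can look like, here is a short behavior-cloning-style sketch in PyTorch. The policy interface, MSE loss, and hyperparameters are placeholder assumptions for a simple observation-to-action mapping; the paper's actual architecture and training recipe may differ.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def finetune_policy(policy, retrieved_obs, retrieved_actions,
                    lr=1e-4, epochs=5, batch_size=64, device="cpu"):
    """Behavior-cloning fine-tune of a pre-trained play policy on retrieved
    (observation, action) pairs. `policy` is any nn.Module mapping a batch of
    observations to predicted actions; all settings here are illustrative."""
    dataset = TensorDataset(retrieved_obs, retrieved_actions)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    policy.to(device).train()
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, act in loader:
            obs, act = obs.to(device), act.to(device)
            loss = torch.nn.functional.mse_loss(policy(obs), act)  # simple BC loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```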

Experimental Results

Experiments were conducted both in simulation environments and on a physical WidowX robot arm. HAND was evaluated on eight diverse tasks:

  • Simulation: The CALVIN benchmark was employed to test HAND’s performance across new scenes and task setups. The paper reports a 16% average improvement in success rate compared to prior methods, demonstrating effective retrieval and policy adaptation without requiring task labels for the initial training data.
  • Real-World Deployment: Tasks were tested in a kitchen manipulation environment, achieving notable success in reaching, pushing, closing, and long-horizon tasks like positioning a K-Cup in a coffee machine. HAND demonstrated robust retrieval even when used with hand demos collected from entirely different scenes.

Impact and Future Directions

The implications of HAND’s methodology are substantial for practical robotics applications, particularly in settings where users need to quickly deploy robots for varied tasks without specialized knowledge. The robust retrieval mechanism based on motion instead of visual appearance facilitates the generalization of learned policies across different object types and scenarios.

On the research side, the paper opens avenues for enhancing retrieval by incorporating depth perception to estimate hand trajectories in three dimensions, potentially increasing accuracy in task execution. Additionally, integrating multi-modal features to handle more complex manipulation involving dexterous or deformable objects could further extend HAND’s utility in robotics.

Conclusion

Overall, HAND presents an efficient and scalable framework for robot adaptation to new tasks using intuitive human demonstrations. This method reduces the dependency on task-specific, expert-generated data, offering a practical solution for dynamic and human-centric robotic environments. As AI and robotic systems continue to evolve, methods like HAND will be pivotal in advancing seamless human-robot collaboration.