
R+X: Retrieval and Execution from Everyday Human Videos (2407.12957v2)

Published 17 Jul 2024 in cs.RO and cs.LG

Abstract: We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.

Citations (7)

Summary

  • The paper introduces a two-stage framework that extracts relevant video segments from unlabelled first-person footage to enable real-time robotic task execution.
  • The retrieval phase employs a Vision Language Model and 3D keypoint extraction to transform raw video into an actionable format.
  • The execution phase uses in-context imitation learning with keypoint action tokens to achieve superior household task performance over baselines.

R+X: Retrieval and Execution from Everyday Human Videos

The paper R+X: Retrieval and Execution from Everyday Human Videos presents a novel framework for learning robotic skills from long, unlabelled, first-person videos of humans carrying out daily tasks. The work was carried out by Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, and Edward Johns of the Robot Learning Lab at Imperial College London. The primary aim of R+X is to enable robots to execute tasks immediately from language commands, without requiring pre-training on or manual labelling of the video data.

Framework Overview

R+X operates in two stages: Retrieval and Execution. In the Retrieval phase, a Vision Language Model (VLM) retrieves short video clips from a long, unlabelled first-person video based on a provided language command. The Execution phase leverages in-context imitation learning to condition robot actions on the retrieved video clips. This approach circumvents the need for explicit policy training and allows robots to execute tasks in real time.
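
To make the two-stage flow concrete, the sketch below shows how a single retrieve-then-execute step might be wired together in Python. Every object and method name here (vlm.retrieve_clips, kat.predict, and so on) is a hypothetical placeholder for illustration, not the authors' released code.

```python
# Minimal sketch of the two-stage R+X loop. All object and method names are
# hypothetical placeholders for illustration, not the authors' released API.

def r_plus_x_step(command, long_video, vlm, perception, kat, robot):
    """Retrieve relevant clips for a language command, then execute in-context."""
    # Stage 1: Retrieval -- the VLM localises short clips of the commanded
    # behaviour within the long, unlabelled first-person video.
    clips = vlm.retrieve_clips(long_video, prompt=command)

    # Each clip is compressed into 3D scene keypoints plus the human hand trajectory.
    demos = [(perception.scene_keypoints(c), perception.hand_trajectory(c))
             for c in clips]

    # Stage 2: Execution -- condition the in-context imitation learner (KAT) on
    # the retrieved demonstrations and the robot's live observation, then act.
    live = perception.scene_keypoints(robot.get_observation())
    for hand_pose in kat.predict(context=demos, query=live):
        robot.move_to(perception.hand_to_gripper(hand_pose))
```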

Retrieval Phase

The Retrieval phase employs a Vision Language Model (Gemini 1.5 Flash) to identify video segments that match the task described by the language prompt. No human intervention is required after deployment, since the video is captured naturally during everyday human activity. The retrieved clips are then compressed into a lower-dimensional 3D representation by extracting visual keypoints with DINO-based models and human hand trajectories with the HaMeR hand-pose model. These transformations abstract the complexity of raw video into a compact format that is more manageable for robotic execution.
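
As an illustration of the retrieval idea, the sketch below queries a VLM over sliding windows of the long video and keeps the segments it judges relevant. The window size, stride, prompt wording, and the query_vlm helper are assumptions made for this example; the paper's own prompting and clip-localisation strategy may differ.

```python
# Hedged sketch of the retrieval step: query a VLM over sliding windows of a long
# first-person video and keep the segments it judges relevant to the command.
# Window size, stride, prompt wording, and `query_vlm` are assumptions for
# illustration, not the paper's exact setup.

def retrieve_clips(video_frames, command, query_vlm, window=64, stride=32):
    """Return (start, end) frame indices of windows the VLM deems relevant."""
    relevant = []
    for start in range(0, max(len(video_frames) - window + 1, 1), stride):
        segment = video_frames[start:start + window]
        prompt = (f"Does this first-person clip show a person performing the task "
                  f"'{command}'? Answer yes or no.")
        if "yes" in query_vlm(frames=segment, prompt=prompt).lower():
            relevant.append((start, start + window))
    return relevant
```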

Execution Phase

The Execution phase uses Keypoint Action Tokens (KAT) for few-shot, in-context imitation learning. Given the robot's live visual observations, the model predicts a sequence of 3D hand-joint actions for the robot to follow. In this way the robot executes the commanded task conditioned on the visual keypoints and hand trajectories from the retrieved clips. The paper also describes the heuristics used to map human hand joints to robot gripper poses, which is crucial for accurate task execution.
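
One plausible way to implement such a hand-to-gripper mapping is sketched below: the gripper position is taken as the midpoint between thumb and index fingertips, the approach axis points from the wrist towards that midpoint, and the gripper closes when the fingertips come within a threshold distance. This is an assumed heuristic for illustration only; the paper's exact heuristics are not reproduced here.

```python
# Hedged sketch of one plausible hand-to-gripper heuristic; the paper's exact
# mapping from hand joints to gripper poses may differ.
import numpy as np

def hand_to_gripper(joints: np.ndarray, close_threshold: float = 0.05):
    """joints: (21, 3) 3D hand keypoints; here we assume index 0 is the wrist,
    4 the thumb tip, and 8 the index fingertip (a common 21-keypoint layout)."""
    wrist, thumb_tip, index_tip = joints[0], joints[4], joints[8]

    position = (thumb_tip + index_tip) / 2.0      # gripper centre between fingertips
    approach = position - wrist                   # approach axis: wrist -> fingertips
    approach /= np.linalg.norm(approach) + 1e-8
    closing = index_tip - thumb_tip               # closing axis across the pinch
    closing /= np.linalg.norm(closing) + 1e-8

    # Binary open/close signal from the thumb-index distance (metres).
    closed = bool(np.linalg.norm(index_tip - thumb_tip) < close_threshold)
    return position, approach, closing, closed
```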

Experimental Validation

The experiments evaluate the robustness of R+X on twelve everyday household tasks. The paper compares R+X against two baseline methods, R3M-DiffLang and Octo, both of which are monolithic language-conditioned policies. In contrast to these baselines, R+X demonstrated superior performance in both success rate and generalisation to new objects and scenes.

Task Performance

The experimental results indicated that R+X outperformed the baseline methods across all tested tasks. The retrieval and execution design enabled R+X to maintain high performance even under various spatial and language generalization challenges. For example, in tasks like "grasp a can" and "grasp a beer," R+X successfully handled hard spatial generalizations where objects were placed in previously unseen positions.

Scalability and Learning Efficiency

One of the salient advantages of R+X is its scalability and efficiency in learning new tasks. Unlike monolithic policies that suffer from catastrophic forgetting and require re-training with increased data, R+X continuously improves by simply expanding the video dataset. This feature is particularly beneficial for long-term deployment where robots encounter new tasks that were not part of their initial training set.
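
The sketch below illustrates this property: adding a skill amounts to appending footage to a video library and letting retrieval find the relevant clips at command time, with no gradient updates and hence no forgetting. The SkillLibrary class and its methods are hypothetical and not part of the paper's codebase.

```python
# Illustrative sketch of why R+X scales without retraining: new experience is
# simply appended to the video library and retrieval handles the rest. The
# `SkillLibrary` class and its methods are hypothetical, not the paper's code.

class SkillLibrary:
    def __init__(self, retrieve_fn):
        self.videos = []            # long, unlabelled first-person recordings
        self.retrieve_fn = retrieve_fn

    def add_recording(self, video):
        # "Learning" a new task is just storing more footage: no gradient updates,
        # so previously acquired skills cannot be catastrophically forgotten.
        self.videos.append(video)

    def demonstrations_for(self, command):
        # At command time, relevant clips are retrieved across the whole library.
        return [clip for video in self.videos
                for clip in self.retrieve_fn(video, command)]
```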

Implications and Future Directions

The implications of R+X are manifold, encompassing practical applications and theoretical advancements. Practically, this framework allows for rapid deployment of robotic systems in dynamic environments without extensive pre-training. Theoretically, R+X challenges the reliance on heavily curated datasets and extensive training periods, pushing the boundaries towards more autonomous and adaptive robotic learning systems.

Future research could focus on refining the stabilisation techniques for first-person videos, enhancing the precision of hand pose prediction models, and optimizing the keypoint extraction process in increasingly complex environments. Additionally, exploring the integration of more dynamic and adaptive keypoint configurations may further bolster the framework’s robustness and efficiency.

Conclusion

R+X represents a significant step forward in robotic learning from unlabelled human activity videos. By combining the retrieval and execution stages with the power of VLMs and in-context imitation learning, the framework removes the need for extensive pre-training and demonstrates strong task generalisation and immediate execution. As Vision Language Models continue to evolve, the potential for frameworks like R+X to revolutionise autonomous robotic systems becomes increasingly tangible.
