Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Introduction
This paper introduces "Vid2Robot," an approach that uses video demonstrations for end-to-end policy learning in robotics. The framework aims to bridge the gap between human demonstrations and robotic execution without requiring explicit task descriptions. By extracting task semantics directly from videos, Vid2Robot enables robots to learn new skills from visual demonstrations alone, broadening their applicability in real-world settings.
Approach
Dataset Creation
Creating a robust dataset is fundamental to training the Vid2Robot model. The dataset comprises paired instances of a demonstration video and corresponding robot actions executing the same task. The demonstrations include human and robot participants, labeled according to three main data sources: Robot-Robot, Hindsight Human-Robot, and Co-located Human-Robot pairs. This diversity in the dataset aims to capture a wide range of tasks and variances in task execution, essential for training a versatile and adaptable policy.
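To make the pairing concrete, here is a minimal sketch of what a single training example could look like. The field names, array shapes, and source labels below are illustrative assumptions based on the description above, not the paper's actual data schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Vid2RobotExample:
    """One paired training example (hypothetical schema, for illustration only)."""
    prompt_video: np.ndarray   # (T_prompt, H, W, 3) frames of the demonstration (prompt) video
    robot_video: np.ndarray    # (T_robot, H, W, 3) frames of the robot performing the same task
    robot_actions: np.ndarray  # (T_robot, action_dim) robot actions at each timestep
    task_text: str             # natural-language task label, used only by the auxiliary text loss
    source: str                # "robot_robot", "hindsight_human_robot", or "colocated_human_robot"
```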
Model Architecture
The architecture of Vid2Robot consists of four key components:
- Prompt Video Encoder and Robot State Encoder, both utilizing Transformer-based models, encode the demonstration video and current robot observation into a uniform representation.
- State-Prompt Encoder fuses the encoded state and prompt information, enabling the model to understand the task's context and required actions.
- Robot Action Decoder predicts a sequence of robot actions that replicates the task demonstrated in the video.

Across these components, cross-attention mechanisms play a crucial role, letting the model focus on the features of the prompt video and the robot's current state that matter for accurate action prediction; a minimal sketch of this layout follows.
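The PyTorch sketch below illustrates the four-module layout under assumed dimensions and layer counts; `Vid2RobotSketch`, its hyperparameters, and the tokenized inputs `prompt_tokens` and `state_tokens` (taken here to be precomputed frame/patch embeddings) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn


class Vid2RobotSketch(nn.Module):
    """Minimal sketch of the four components; sizes and layer counts are illustrative."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4, action_dim=7, horizon=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # 1) Prompt video encoder and 2) robot state encoder: Transformer encoders over tokens.
        self.prompt_encoder = nn.TransformerEncoder(layer, n_layers)
        self.state_encoder = nn.TransformerEncoder(layer, n_layers)
        # 3) State-prompt encoder: robot-state tokens cross-attend to prompt-video tokens.
        self.state_prompt_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 4) Robot action decoder: learned action queries cross-attend to the fused features.
        self.action_queries = nn.Parameter(torch.randn(horizon, d_model))
        self.action_xattn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, prompt_tokens, state_tokens):
        # prompt_tokens: (B, T_p, d_model) embeddings of the prompt-video frames/patches
        # state_tokens:  (B, T_s, d_model) embeddings of the current robot observation
        prompt = self.prompt_encoder(prompt_tokens)
        state = self.state_encoder(state_tokens)
        # Fuse: each state token attends over the prompt to pick up task context.
        fused, _ = self.state_prompt_xattn(query=state, key=prompt, value=prompt)
        queries = self.action_queries.unsqueeze(0).expand(prompt.size(0), -1, -1)
        # Decode a short sequence of future actions from the fused representation.
        decoded, _ = self.action_xattn(query=queries, key=fused, value=fused)
        return self.action_head(decoded)  # (B, horizon, action_dim)
```

Keeping the prompt and robot state in a shared token space and fusing them with cross-attention lets a variable-length demonstration condition a fixed-length block of predicted actions.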
Training Procedure
Vid2Robot's training methodology combines direct action prediction from demonstrations with three auxiliary losses:
- Temporal Video Alignment ensures temporal consistency between the demonstration and robot-executed videos.
- Prompt-Robot Video Contrastive Loss aligns the representation of the prompt video with that of the corresponding robot-execution video.
- Video-Text Contrastive Loss aligns video representations with text descriptions of the tasks.

These auxiliary losses are designed to improve the quality of the learned video representations, which is crucial for understanding and accurately replicating human demonstrations; a sketch of how the terms might be combined appears below.
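As a rough illustration, the sketch below combines a placeholder action loss with the three auxiliary terms, using a standard symmetric InfoNCE formulation for the two contrastive losses. The function names, loss weights, regression-style action loss, and the externally supplied `alignment_loss` are assumptions; the paper's exact objectives may differ.

```python
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over paired embeddings a[i] <-> b[i] (a common contrastive loss;
    the paper's exact contrastive formulation may differ)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def total_loss(pred_actions, gt_actions, prompt_emb, robot_emb, text_emb,
               alignment_loss, weights=(1.0, 0.1, 0.1, 0.1)):
    """Weighted sum of the main action-prediction loss and the three auxiliary terms."""
    w_act, w_align, w_pv, w_vt = weights
    action_loss = F.mse_loss(pred_actions, gt_actions)  # placeholder behavior-cloning term
    prompt_robot = info_nce(prompt_emb, robot_emb)      # prompt-robot video contrastive loss
    video_text = info_nce(robot_emb, text_emb)          # video-text contrastive loss
    return (w_act * action_loss + w_align * alignment_loss
            + w_pv * prompt_robot + w_vt * video_text)
```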
Experiments and Results
Vid2Robot was evaluated on real-world robot setups, demonstrating a 20% improvement in performance over existing video-conditioned policies. Notably, the model showed emergent capabilities, such as transferring actions observed in the demonstrations to novel objects and executing long-horizon tasks. These results underscore the effectiveness of Vid2Robot's approach to learning from video demonstrations.
Implications and Future Work
Vid2Robot opens up new avenues for robotic learning, significantly reducing the reliance on detailed task descriptions. The ability of robots to learn directly from videos paves the way for more natural and versatile human-robot interactions. Future work may explore scaling this approach to more complex, longer-horizon tasks, further narrowing the gap between human capabilities and robotic execution.
Conclusion
"Vid2Robot" represents a significant advancement in robot policy learning, demonstrating the feasibility and effectiveness of directly translating visual demonstrations into robotic actions. With potential applications across diverse real-world scenarios, this approach moves us closer to the goal of creating truly adaptable and versatile robots capable of learning new tasks in a more human-like manner.