Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation
The paper "Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation" introduces a novel approach to robot manipulation, focusing on generalization to novel tasks, unseen object types, and new motions.
Overview
The core challenge addressed in this work is the generalization of robot manipulation policies to previously unseen tasks and object types, which is particularly crucial for practical deployment in diverse real-world scenarios. Existing methods often struggle because of the high cost and impracticality of collecting vast amounts of robot interaction data covering every possible scenario. This paper proposes an innovative solution leveraging pre-trained human video generation models trained on web data, effectively sidestepping the need for extensive robot-specific datasets.
Methodology
The proposed approach involves two primary steps:
- Zero-shot Human Video Generation: A pre-trained video generation model (VideoPoet) produces a human manipulation video from a task description and an image of the initial scene. Unlike prior works that fine-tune generation models for robot-specific contexts, this step relies on the zero-shot capabilities of state-of-the-art video models trained on expansive web data.
- Conditioned Policy Execution: A closed-loop robot policy is then conditioned on the generated video, translating the visual and motion cues it contains into robot actions, so that novel tasks do not require collecting new robot demonstrations (a minimal sketch of this two-step pipeline follows this list).
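The sketch below illustrates this two-step pipeline under stated assumptions: the `VideoGenerator`, `VideoConditionedPolicy`, and `robot` interfaces are hypothetical placeholders standing in for the paper's actual components, not the authors' API.

```python
# Minimal sketch of the two-step Gen2Act pipeline described above.
# All interfaces here are illustrative placeholders.

import numpy as np


class VideoGenerator:
    """Wraps a pre-trained text- and image-conditioned video model (e.g. VideoPoet)."""

    def generate(self, first_frame: np.ndarray, task_text: str) -> np.ndarray:
        # Returns a generated human video as an array of frames [T, H, W, 3].
        raise NotImplementedError


class VideoConditionedPolicy:
    """Closed-loop policy conditioned on a generated human video."""

    def act(self, human_video: np.ndarray, observation: np.ndarray) -> np.ndarray:
        # Maps (generated human video, current robot observation) -> robot action.
        raise NotImplementedError


def run_episode(robot, video_model: VideoGenerator, policy: VideoConditionedPolicy,
                task_text: str, max_steps: int = 200):
    # Step 1: zero-shot human video generation from the initial scene image.
    first_frame = robot.get_camera_image()
    human_video = video_model.generate(first_frame, task_text)

    # Step 2: closed-loop execution conditioned on the generated video.
    for _ in range(max_steps):
        obs = robot.get_camera_image()
        action = policy.act(human_video, obs)
        robot.apply_action(action)
        if robot.task_done():
            break
```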
To capture the temporal structure of the generated videos and better exploit their motion cues, the model incorporates point track prediction as an auxiliary task during training. The policy combines visual features from the videos with the extracted track information to optimize a behavior cloning objective.
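As a rough illustration of this training setup, the PyTorch-style sketch below combines a behavior cloning loss with an auxiliary point track prediction loss. The module interface, tensor shapes, and the weight `track_loss_weight` are assumptions made for illustration, not the paper's exact architecture or hyperparameters.

```python
# Sketch of the training objective: behavior cloning plus an auxiliary
# point track prediction loss, following the description above.

import torch.nn.functional as F


def training_step(policy, batch, track_loss_weight: float = 0.1):
    # Assumed batch contents (shapes are illustrative):
    #   batch["video"]:     generated human video frames   [B, T, C, H, W]
    #   batch["obs"]:       current robot observations     [B, C, H, W]
    #   batch["actions"]:   expert robot actions           [B, action_dim]
    #   batch["gt_tracks"]: 2D point tracks from the video [B, N, T, 2]
    pred_actions, pred_tracks = policy(batch["video"], batch["obs"])

    # Behavior cloning: regress the expert action.
    bc_loss = F.mse_loss(pred_actions, batch["actions"])

    # Auxiliary task: predict the point tracks extracted from the video,
    # encouraging the policy to pick up on motion cues.
    track_loss = F.mse_loss(pred_tracks, batch["gt_tracks"])

    return bc_loss + track_loss_weight * track_loss
```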
Results and Analysis
Generalization Performance
The paper evaluates the system across various levels of generalization:
- Mild Generalization (MG): Unseen configurations of seen objects in familiar scenes.
- Standard Generalization (G): Unseen object instances in both seen and unseen scenes.
- Object-Type Generalization (OTG): Completely new object types in novel scenarios.
- Motion-Type Generalization (MTG): New motion types not encountered during training.
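For concreteness, these evaluation levels could be encoded as a simple enumeration when tagging evaluation episodes; the snippet below is purely illustrative and not part of the paper's benchmark tooling.

```python
from enum import Enum


class GeneralizationLevel(Enum):
    MG = "mild: unseen configurations of seen objects in familiar scenes"
    G = "standard: unseen object instances in seen and unseen scenes"
    OTG = "object-type: completely new object types in novel scenarios"
    MTG = "motion-type: new motion types not encountered during training"
```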
The results exhibit significant improvements over baselines:
- For OTG and MTG, the approach yields a substantial average absolute success rate improvement of approximately 30% over the most competitive baseline.
- Qualitative results further validate the capacity to generate realistic, plausible human manipulations in diverse environments, aligning well with the robot’s task goals.
Long-Horizon Manipulation
The utility of the approach for sequential tasks (e.g., "making coffee" or "cleaning a table") is demonstrated by chaining the video generation and policy execution steps. An LLM (Gemini) decomposes the activity into a sequence of subtasks, and the system performs each one in turn, further showcasing the robustness and flexibility of the methodology.
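A minimal sketch of this chaining logic is given below; `decompose_with_llm` and `run_episode` (from the earlier pipeline sketch) are hypothetical helpers standing in for the paper's Gemini-based decomposition and execution code.

```python
# Sketch of long-horizon chaining: an LLM decomposes a high-level activity
# into subtasks, and video generation + policy execution run for each one.


def decompose_with_llm(activity: str) -> list[str]:
    # e.g. "make coffee" -> ["pick up the mug", "place the mug under the dispenser", ...]
    # In the paper this decomposition is obtained from an LLM such as Gemini.
    raise NotImplementedError


def run_long_horizon(robot, video_model, policy, activity: str):
    for subtask in decompose_with_llm(activity):
        # Re-generate a human video from the current scene for each subtask,
        # then execute the video-conditioned policy for that step.
        run_episode(robot, video_model, policy, task_text=subtask)
```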
Implications
The practical implications are noteworthy:
- Scalability: By utilizing human video generation, the approach circumvents the bottleneck of robot data collection, making it scalable to a wider array of tasks without additional data curation.
- Adaptability: The ability to generalize across unseen objects and motions implies significant advancements towards fully autonomous robots capable of adapting to new environments and tasks with minimal human intervention.
Future Directions
The paper hints at several promising future developments:
- Enhanced Motion Extraction: Future work could explore denser motion information, such as object meshes, improving performance on complex, dexterous tasks.
- Long-Horizon Activities: Integrating recovery policies and more advanced long-horizon planning could further extend the approach’s utility in real-world applications.
Conclusion
In summary, the Gen2Act framework presents a novel paradigm for generalizable robot manipulation by effectively leveraging pre-trained human video models from web data. Its design and results illustrate the potential for enhanced scalability and adaptability in robotic systems, providing a valuable stepping stone towards more versatile and independent robotic assistants.