Language-Gesture Controlled Video Generation for Robot Planning
The research paper introduces "This&That," a robot learning framework that integrates language and gesture control for video generation in robot planning. The framework leverages video generative models to address three challenges: communicating tasks unambiguously, generating controllable video, and translating visual plans into robot actions. By combining language-gesture conditioned video generation with behavioral cloning for execution, the paper reports state-of-the-art performance on planning tasks in complex environments.
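To make the two-stage design concrete, the sketch below shows how such a pipeline could be wired together: a video diffusion planner turns the current observation plus language and gesture inputs into a visual plan, and a video-conditioned policy turns that plan into actions. All class names, method signatures, and the environment interface here are hypothetical placeholders, not the authors' actual API.

```python
# Hypothetical two-stage pipeline in the spirit of This&That; names and
# interfaces are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
import numpy as np


@dataclass
class Gesture:
    """Deictic gesture: 2D pixel coordinates pointed at in the first frame."""
    points: np.ndarray  # shape (num_points, 2), e.g. the "this" and "that" locations


class LanguageGesturePlanner:
    """Stand-in for the language-gesture conditioned video diffusion model."""

    def generate_plan(self, image: np.ndarray, text: str, gesture: Gesture) -> np.ndarray:
        # Returns a predicted video plan of shape (num_frames, H, W, 3).
        raise NotImplementedError


class VideoConditionedPolicy:
    """Stand-in for the video-conditioned behavioral cloning policy (DiVA)."""

    def act(self, observation: np.ndarray, video_plan: np.ndarray) -> np.ndarray:
        # Returns a low-level action, e.g. an end-effector delta plus gripper command.
        raise NotImplementedError


def run_episode(planner, policy, env, instruction: str, gesture: Gesture, horizon: int = 50):
    """Generate a visual plan once, then roll the policy out against it."""
    obs = env.reset()  # assumed to return a dict with an "image" key
    plan = planner.generate_plan(obs["image"], instruction, gesture)
    for _ in range(horizon):
        obs, done = env.step(policy.act(obs["image"], plan))
        if done:
            break
```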
The core of "This&That" is its use of a video diffusion model (VDM) adapted from Stable Video Diffusion (SVD), a large-scale video diffusion model pre-trained on extensive internet data. The key modification is a language-gesture conditioning scheme that is clearer and more precise than language-only conditioning, especially in cluttered or ambiguous scenes. This allows the model to generate video sequences closely aligned with human intent from simple deictic expressions such as "this" and "that" paired with pointing gestures.
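One plausible way to realize such conditioning, sketched below, is to rasterize the gesture points into Gaussian heatmap channels and concatenate them with the conditioning image, while the text prompt is encoded separately and injected through the diffusion backbone's cross-attention. This is a minimal sketch under those assumptions, not necessarily the paper's exact conditioning design.

```python
# Sketch of gesture conditioning for an image-conditioned video diffusion model.
# Assumption: gesture points become Gaussian heatmap channels appended to the
# first-frame image; this is illustrative, not the confirmed This&That design.
import torch


def gesture_heatmap(points: torch.Tensor, height: int, width: int,
                    sigma: float = 8.0) -> torch.Tensor:
    """Rasterize (N, 2) pixel coordinates (x, y) into N Gaussian heatmap channels."""
    ys = torch.arange(height).view(1, height, 1)
    xs = torch.arange(width).view(1, 1, width)
    px = points[:, 0].view(-1, 1, 1)
    py = points[:, 1].view(-1, 1, 1)
    dist2 = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))  # (N, H, W)


def build_condition(image: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """Concatenate the first-frame image (3, H, W) with gesture heatmap channels."""
    _, h, w = image.shape
    heat = gesture_heatmap(points, h, w)
    return torch.cat([image, heat], dim=0)  # (3 + N, H, W), fed to the VDM's conditioning path
```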
The paper's experimental results, obtained on the Bridge dataset and in IsaacGym simulation, demonstrate the framework's effectiveness. The VDM, fine-tuned for robotics, achieved higher fidelity and better alignment with user intent than existing methods, including AVDC, StreamingT2V, and DragAnything. Notably, pairing gestures with natural language commands enables more precise interaction in spatially complex tasks such as pick-and-place and stacking.
The proposed behavioral cloning model, DiVA, conditions on the video frames generated by the VDM, feeding them into a Transformer-based architecture that converts video plans into robot actions. DiVA's adaptability was tested in synthetic environments, where it remained robust in out-of-distribution scenarios in which language alone is ambiguous. These results point toward multi-task policy learning and mark a notable contribution at the intersection of generative models and robotics.
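A minimal sketch of what such a video-conditioned behavioral cloning policy could look like is given below; the convolutional tokenizer, layer counts, and action dimensionality are illustrative assumptions, not DiVA's published architecture.

```python
# Sketch of a Transformer-based policy conditioned on generated plan frames.
# Architecture details are assumptions made for illustration only.
import torch
import torch.nn as nn


class VideoConditionedBC(nn.Module):
    def __init__(self, action_dim: int = 7, embed_dim: int = 256):
        super().__init__()
        # Simple convolutional tokenizer shared by the observation and plan frames.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, obs: torch.Tensor, plan: torch.Tensor) -> torch.Tensor:
        # obs: (B, 3, H, W) current image; plan: (B, T, 3, H, W) frames from the video plan.
        b, t = plan.shape[:2]
        obs_tok = self.encoder(obs).unsqueeze(1)                    # (B, 1, D)
        plan_tok = self.encoder(plan.flatten(0, 1)).view(b, t, -1)  # (B, T, D)
        tokens = torch.cat([obs_tok, plan_tok], dim=1)              # (B, 1 + T, D)
        feats = self.transformer(tokens)
        return self.action_head(feats[:, 0])                        # (B, action_dim)
```

In this sketch the policy reads the action from the observation token after it has attended to the plan-frame tokens, one simple way to fuse a visual plan with the current state.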
The methodological contribution and empirical validation presented in this research have substantial implications for future AI systems. By refining how humans communicate intent to machines and by improving task flexibility, "This&That" could meaningfully impact real-world robot planning and execution. Future work might extend the framework to long-horizon tasks involving more intricate sequences of actions.
In conclusion, "This&That" presents a compelling advance in robot learning via language-gesture conditioned video generation, offering insights and contributions that can drive forward the field of human-robot interaction.