- The paper introduces AVID, a framework that converts human demonstration videos into robot-compatible guidance using pixel-level CycleGAN translation.
- The methodology combines CycleGAN translation with a structured latent variable model, decomposing tasks into stages that guide model-based reinforcement learning.
- Experimental results show that AVID achieves data-efficient, high-autonomy performance on tasks like operating a coffee machine with minimal manual oversight.
Insights on AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos
The paper from Berkeley Artificial Intelligence Research presents AVID (Automated Visual Instruction-following with Demonstrations), a robotic learning framework that enables robots to autonomously learn multi-stage tasks from human demonstration videos translated into robot-compatible guidance via CycleGAN. This approach reduces the burden of manually specifying task structure and crafting reward functions, a significant obstacle to applying reinforcement learning (RL) to long-horizon robotic tasks.
AVID addresses a crucial challenge in robotic imitation learning: the physical and perceptual discrepancies between human demonstrators and robot embodiment. Traditional approaches rely on labor-intensive data collection such as teleoperation, kinesthetic teaching, or motion-capture setups, which limits scalability and flexibility. AVID sidesteps these obstacles through pixel-level translation, converting human demonstration videos into frames that look as though the robot itself performed the task and removing the need for manually specified correspondences between the two embodiments.
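To make the translation idea concrete, the following is a minimal PyTorch sketch of CycleGAN-style training on unpaired human and robot frames: two generators map between the domains, two discriminators judge realism, and a cycle-consistency loss ties them together. The network sizes, loss weights, and the `human_frames`/`robot_frames` tensors are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

def conv_net(out_activation):
    # Tiny encoder-decoder stand-in for a CycleGAN generator.
    return nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), out_activation,
    )

class PatchDiscriminator(nn.Module):
    # Tiny patch-level real/fake critic.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

G_h2r = conv_net(nn.Tanh())   # human-domain frame -> robot-domain frame
G_r2h = conv_net(nn.Tanh())   # robot-domain frame -> human-domain frame
D_r, D_h = PatchDiscriminator(), PatchDiscriminator()

gan_loss, cyc_loss = nn.MSELoss(), nn.L1Loss()   # LSGAN-style adversarial loss
g_opt = torch.optim.Adam(list(G_h2r.parameters()) + list(G_r2h.parameters()), lr=2e-4)
d_opt = torch.optim.Adam(list(D_r.parameters()) + list(D_h.parameters()), lr=2e-4)

def train_step(human_frames, robot_frames, lambda_cyc=10.0):
    """One update on unpaired batches of human and robot frames in [-1, 1]."""
    # Generators: fool the discriminators while preserving cycle consistency.
    fake_robot, fake_human = G_h2r(human_frames), G_r2h(robot_frames)
    pred_r, pred_h = D_r(fake_robot), D_h(fake_human)
    adv = gan_loss(pred_r, torch.ones_like(pred_r)) + \
          gan_loss(pred_h, torch.ones_like(pred_h))
    cyc = cyc_loss(G_r2h(fake_robot), human_frames) + \
          cyc_loss(G_h2r(fake_human), robot_frames)
    g_opt.zero_grad(); (adv + lambda_cyc * cyc).backward(); g_opt.step()

    # Discriminators: real frames -> 1, translated frames -> 0.
    pred_fr, pred_fh = D_r(fake_robot.detach()), D_h(fake_human.detach())
    pred_rr, pred_rh = D_r(robot_frames), D_h(human_frames)
    d_loss = gan_loss(pred_rr, torch.ones_like(pred_rr)) + \
             gan_loss(pred_fr, torch.zeros_like(pred_fr)) + \
             gan_loss(pred_rh, torch.ones_like(pred_rh)) + \
             gan_loss(pred_fh, torch.zeros_like(pred_fh))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Example: one step on random 64x64 stand-in frames.
train_step(torch.rand(4, 3, 64, 64) * 2 - 1, torch.rand(4, 3, 64, 64) * 2 - 1)
```

Once trained, only the human-to-robot generator is needed at learning time: every frame of a human demonstration is passed through it to produce the robot-domain video from which instruction images are drawn.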
The method first trains a CycleGAN to translate human demonstration videos into robot-domain videos, then learns a structured latent variable model over these images, yielding a compact latent space suited to data-efficient prediction and planning. Crucially, AVID adopts a stage-wise formulation: the task is split into phases, each anchored by an instruction image extracted from the translated demonstrations at a stage boundary. The robot uses this staged decomposition to plan and refine its actions within each phase via model-based RL, with a human in the loop who confirms stage success or requests resets when necessary.
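The sketch below illustrates this stage-wise loop in Python. Everything here is an illustrative stand-in rather than the paper's implementation: `encode`, `dynamics`, `cem_plan`, and `query_human` are hypothetical placeholders, and the planning objective is simply latent-space distance to the encoded instruction image, standing in for AVID's learned success signals and human feedback.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 8, 4, 10

# Stand-ins for learned components: encode() would correspond to the latent
# variable model's encoder, dynamics() to its learned latent transition model.
A = rng.normal(size=(LATENT_DIM, ACTION_DIM)) * 0.1

def encode(image):
    return image.reshape(-1)[:LATENT_DIM]          # dummy image encoder

def dynamics(latent, action):
    return latent + A @ action                     # dummy latent dynamics

def cem_plan(latent, goal, iters=5, samples=64, elites=8):
    """Cross-entropy method: pick an action sequence whose predicted rollout
    ends near the goal latent, and return its first action."""
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))
    for _ in range(iters):
        seqs = mean + std * rng.normal(size=(samples, HORIZON, ACTION_DIM))
        costs = []
        for seq in seqs:
            z = latent
            for a in seq:
                z = dynamics(z, a)
            costs.append(np.linalg.norm(z - goal))
        best = seqs[np.argsort(costs)[:elites]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mean[0]

def query_human(stage):
    # In AVID a human confirms stage success or requests a reset; here we
    # simply accept every stage so the sketch runs end to end.
    print(f"stage {stage} proposed as complete - accepting")
    return True

# Goal latents come from encoding CycleGAN-translated instruction images
# (random images stand in for them here).
instruction_images = [rng.random((16, 16, 3)) for _ in range(3)]
goals = [encode(img) for img in instruction_images]

z = encode(rng.random((16, 16, 3)))               # encoded initial observation
for stage, goal in enumerate(goals):
    for _ in range(20):                           # act until close to the stage goal
        action = cem_plan(z, goal)
        z = dynamics(z, action)                   # real system: execute on the robot,
                                                  # observe a new image, re-encode it
        if np.linalg.norm(z - goal) < 0.5:
            break
    if not query_human(stage):
        break                                     # a rejected stage would trigger a reset
```

The design point this illustrates is that each stage only needs to reach its own instruction image before the human check, so planning horizons stay short even when the overall task is long.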
AVID was evaluated on a Sawyer robotic arm on tasks such as operating a coffee machine and retrieving a cup from a drawer. The system completed these tasks with a high degree of autonomy, requiring minimal human intervention and outperforming the baselines and ablations reported in the paper. The full-video imitation and pixel-space ablations in particular underscore the importance of latent-space planning and stage-wise learning for executing complex tasks. Rooted in translating human videos, AVID stands out for its data efficiency and its minimal need for specialized instrumentation or large amounts of training data.
Instruction-based task learning offers several benefits over whole-demonstration imitation, most notably reducing compounding errors by exploiting the task's natural decomposability. Whereas methods such as Behavioral Cloning from Observation (BCO) struggle to generalize from human demonstrations to multi-stage tasks, AVID combines pixel-level translation with latent-space encoding to meet complex task requirements, demonstrating clear advantages over direct imitation from raw demonstrations.
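The compounding-error argument can be illustrated with a toy simulation (not from the paper): a 1-D rollout accumulates per-step imitation error, and periodically "reanchoring" progress, standing in for checking against an instruction image at each stage boundary, keeps the drift bounded. The noise scale and correction factor below are arbitrary assumptions chosen only to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(1)
T, STAGES, NOISE = 60, 3, 0.05

def rollout(reanchor_every=None):
    """Toy 1-D rollout: each step injects noise; without correction the error
    compounds, while reanchoring at stage boundaries keeps it bounded."""
    error, history = 0.0, []
    for t in range(1, T + 1):
        error += rng.normal(scale=NOISE)            # per-step imitation error
        if reanchor_every and t % reanchor_every == 0:
            error *= 0.1                            # stage check corrects most drift
        history.append(abs(error))
    return max(history)

whole_demo = np.mean([rollout() for _ in range(200)])
stage_wise = np.mean([rollout(reanchor_every=T // STAGES) for _ in range(200)])
print(f"mean worst-case drift, whole-demo imitation: {whole_demo:.3f}")
print(f"mean worst-case drift, stage-wise anchoring: {stage_wise:.3f}")
```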
This work carries implications for both the theory and practice of robotics. Theoretically, it underscores the potential of combining visual translation with RL to simplify reward design and task specification in robot learning systems. Practically, by automating resets and structuring feedback so that human supervision is kept to a minimum, AVID points toward more adaptive and autonomously capable robotic systems. As future work, the authors suggest extending the method to multiple tasks trained with a single CycleGAN, which would improve the versatility and general applicability of the approach. A unified translation model that supports a range of tasks without retraining remains a rich avenue for exploration, with the goal of broader, more dynamic deployment of robots in everyday human environments.
AVID exemplifies a significant step in leveraging human-like learning paradigms for robotic agents and sets the stage for further refining and scaling autonomous task-learning methodologies, bringing robots closer to seamlessly integrating into everyday human tasks with minimal setup and intervention.