Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation (2209.05451v2)

Published 12 Sep 2022 in cs.RO, cs.AI, cs.CL, cs.CV, and cs.LG

Abstract: Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.

Authors (3)
  1. Mohit Shridhar (14 papers)
  2. Lucas Manuelli (10 papers)
  3. Dieter Fox (201 papers)
Citations (393)

Summary

An Expert Overview of "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation"

The paper "Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation" introduces PerAct, a language-conditioned behavior-cloning agent that leverages a Perceiver Transformer framework for multi-task 6-DoF robotic manipulation. The authors address the challenge of applying Transformers, typically data-intensive models, to robotic domains where data collection is costly and labor-intensive. By formulating the task as a 'next-best-action' problem within a voxelized 3D space, the researchers tap into the structural efficiency of 3D observations, achieving substantial gains over conventional image-to-action methodologies.

Technical Contributions:

  1. Voxelized Action and Observation Space: By using voxel grids instead of 2D image pixels, the paper proposes a more structured prior that supports a natural fusion of multi-view observations and 3D action learning. This approach contrasts significantly with prior attempts that often relied on unstructured image inputs.
  2. Perceiver Transformer Application: PerAct employs a Perceiver Transformer, which handles the very large voxelized input by cross-attending it into a small set of latent vectors, making the high-dimensional 3D input space tractable (a minimal sketch of this latent bottleneck appears after this list).
  3. Language Conditioning: The agent incorporates language goals, enabling it to handle a variety of task objectives specified via natural language instructions. CLIP's language encoder is used to embed the goal text, providing robust semantic grounding (see the encoding sketch below).
  4. Benchmarked Performance: PerAct was evaluated on 18 RLBench-simulated tasks (with 249 variations) and 7 real-world tasks, demonstrating superior performance, with significant improvements over 3D ConvNet baselines on most tasks. This showcases the agent's effectiveness in multi-task learning.
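To see why the Perceiver matters here (item 2 above): a 100³ voxel grid flattened into tokens is far too long for standard self-attention, whose cost grows quadratically with sequence length. The minimal PyTorch sketch below, an illustration rather than the authors' implementation, shows the core mechanism: a small set of learned latents cross-attends to the long input sequence, so cost scales with the latent count instead. All dimensions are assumptions, and normalization/MLP layers of a full Perceiver block are omitted.

```python
import torch
import torch.nn as nn

class PerceiverEncoderSketch(nn.Module):
    """Compress many input tokens into a small latent set via
    cross-attention, then refine the latents with self-attention."""
    def __init__(self, dim=128, n_latents=512, n_heads=8, depth=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(depth))

    def forward(self, tokens):                 # tokens: (B, N, dim), N large
        B = tokens.shape[0]
        lat = self.latents.expand(B, -1, -1)   # (B, n_latents, dim)
        # Latents query the full voxel+language sequence:
        # cost is O(n_latents * N) rather than O(N^2).
        lat, _ = self.cross_attn(lat, tokens, tokens)
        for attn in self.self_attn:
            out, _ = attn(lat, lat, lat)
            lat = lat + out                    # residual self-attention
        return lat                             # compact latent summary

# Usage: voxel patches and language tokens concatenated into one sequence.
enc = PerceiverEncoderSketch()
seq = torch.randn(2, 10_000, 128)
print(enc(seq).shape)                          # torch.Size([2, 512, 128])
```

The latents are then decoded back to per-voxel logits for action detection; that upsampling step is omitted here for brevity.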
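For language conditioning (item 3 above), the goal string is embedded with CLIP's language encoder. Below is a hedged sketch using OpenAI's open-source clip package; the model variant is an assumption, and note that PerAct consumes per-token language embeddings, whereas `encode_text` shown here returns a pooled sentence embedding as a simpler stand-in.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # variant is an assumption

goal = "open the middle drawer"
tokens = clip.tokenize([goal]).to(device)        # (1, 77) token ids
with torch.no_grad():
    text_emb = model.encode_text(tokens)         # (1, 512) pooled embedding
print(text_emb.shape)
```

These language features are fused with the voxel features before the Perceiver encoder, so the same network weights can serve many tasks distinguished only by the instruction.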

Key Findings:

  • PerAct outperformed baseline techniques like image-to-action agents and conventional 3D ConvNet approaches, achieving notable success rates in performing complex manipulation tasks. Specifically, it showed a marked improvement of up to 34× against image-based strategies.
  • The agent's applicability was validated with a Franka Panda robot in real-world manipulation scenarios, obtaining high levels of success from limited demonstrations.
  • The experiments underscored the Perceiver's ability to maintain a global receptive field, which proved instrumental in distinguishing subtle differences across task variations, such as the visually similar sub-tasks of opening the top, middle, or bottom drawer.

Limitations and Future Directions:

While achieving notable success, the research acknowledges limitations in extending PerAct to arbitrary manipulator systems, such as multi-finger hands, and its dependence on a motion planner for task execution. Future research might focus on decoupling PerAct from such dependencies, applying it to dexterous and dynamic manipulation tasks, and optimizing for more complex task hierarchies involving history and partial observability. Furthermore, adapting pre-trained visual features could lead to enhanced generalization to previously unseen objects and environments, potentially expanding PerAct's applicability within the robotics domain.

In summary, the paper presents a significant stride in the implementation of Transformer models in robotic manipulation, suggesting a pathway to more efficient, data-effective multi-task learning systems. Despite some inherent limitations, PerAct's framework establishes a potential paradigm shift in approaching multi-task challenges with advanced machine learning models in robotics.