Introduction
Robots are increasingly being tested for their ability to understand and execute tasks based on human instructions. Key to this functionality is the robot's ability to "imagine" and assess different configurations of a scene that correspond to a user's commands. Robots typically learn this through extensive training on numerous examples of task-specific object arrangements. However, a new framework aims to significantly streamline and enhance this process.
Framework Overview
The framework, known as Dream2Real, enables robots to execute tasks by physically rearranging objects in 3D space according to natural-language instructions. The robot first builds an accurate 3D scene representation using Neural Radiance Fields (NeRF), which lets it imagine various possible arrangements. A vision-language model (VLM) then evaluates these imagined setups, and the robot physically recreates the highest-scoring configuration. A notable feature of the framework is its zero-shot capability: it can execute tasks correctly without being trained on a dataset of example object arrangements.
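To make this pipeline concrete, here is a minimal sketch of the imagine-and-evaluate loop. The helper callables build_nerf, render_with_object_at and vlm_score are hypothetical placeholders supplied by the caller; they are not names from the Dream2Real codebase.

```python
# Minimal sketch of the imagine-and-evaluate loop, under stated assumptions.
# build_nerf, render_with_object_at and vlm_score are hypothetical callables
# supplied by the caller; they are not part of the Dream2Real API.
import numpy as np

def choose_goal_pose(scene_images, instruction, candidate_poses,
                     build_nerf, render_with_object_at, vlm_score):
    """Imagine each candidate arrangement and return the pose the VLM scores highest."""
    nerf = build_nerf(scene_images)                      # 3D scene model from multi-view images
    best_pose, best_score = None, -np.inf
    for pose in candidate_poses:                         # candidate 6-DoF poses for the moved object
        imagined = render_with_object_at(nerf, pose)     # "dream": render the rearranged scene
        score = vlm_score(imagined, instruction)         # e.g. CLIP image-text similarity
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose                                     # the robot then recreates this pose physically
```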
Methodology and Innovations
Dream2Real poses and addresses two significant questions: how a robot can visualize new scene configurations, and how it can determine which of these configurations matches a given language instruction. By combining NeRF-based 3D scene reconstruction with a VLM that scores how well a 2D image matches a text prompt (such as CLIP), the robot can handle a wide variety of tasks and objects it has never encountered before. Additional technical contributions sharpen this process: distraction-avoidance techniques, "normalizing captions" that help the VLM focus on the relevant objects, aggregation of scores across multiple camera views, and collision meshes used to check the physical feasibility of imagined arrangements.
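To illustrate how such a VLM evaluation might look, the sketch below scores rendered views of an imagined arrangement with CLIP (via the Hugging Face transformers API), divides by the score of a "normalizing caption", and averages across views. The division-based normalization and the plain mean over views are simplifying assumptions, not necessarily the exact Dream2Real formulation.

```python
# Hedged sketch: CLIP-based scoring of imagined arrangements.
# The goal/normalizing score ratio and the mean over views are assumptions,
# not necessarily the paper's exact aggregation scheme.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between one rendered view and a caption."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def score_arrangement(views, goal_caption,
                      normalizing_caption="a photo of objects on a table"):
    """Average the goal-to-normalizing similarity ratio over rendered camera views."""
    ratios = [clip_similarity(v, goal_caption) / clip_similarity(v, normalizing_caption)
              for v in views]                   # views: PIL images rendered from the NeRF
    return sum(ratios) / len(ratios)
```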
Applicability and Strengths
Through real-world experiments, Dream2Real demonstrates robustness to distractor objects and an understanding of complex spatial relationships between multiple objects. This underlines its adaptability to varied 3D real-world environments, such as tabletops or shelves. The framework exhibits several strengths: it is zero-shot, eliminating the need for curated training datasets; it performs full six degrees of freedom (6-DoF) rearrangements; and it shows that using a VLM to evaluate imagined goal images works better than asking a model to predict the goal outright.
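As one way to make the 6-DoF aspect concrete, the sketch below samples candidate object poses with random orientations and workspace positions, and rejects any that would collide with the rest of the scene using trimesh's collision manager (which requires python-fcl). The workspace bounds and sample count are arbitrary illustrative values, not parameters from the paper.

```python
# Illustrative sketch: sampling full 6-DoF candidate poses and filtering out
# physically infeasible ones with a collision check. Workspace bounds and the
# number of samples are arbitrary assumptions.
import numpy as np
import trimesh
from scipy.spatial.transform import Rotation

def sample_feasible_poses(object_mesh, scene_mesh, n_samples=512,
                          workspace=((-0.3, 0.3), (-0.3, 0.3), (0.0, 0.2))):
    """Return 4x4 pose matrices that place object_mesh without hitting scene_mesh."""
    manager = trimesh.collision.CollisionManager()
    manager.add_object("scene", scene_mesh)
    feasible = []
    for _ in range(n_samples):
        pose = np.eye(4)
        pose[:3, :3] = Rotation.random().as_matrix()                       # random orientation (3 DoF)
        pose[:3, 3] = [np.random.uniform(lo, hi) for lo, hi in workspace]  # random position (3 DoF)
        if not manager.in_collision_single(object_mesh, transform=pose):
            feasible.append(pose)
    return feasible
```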
Challenges and Future Directions
Despite these compelling features, Dream2Real does have limitations. Tasks with tight tolerances, such as fitting pieces together, remain challenging, because evaluating candidate poses at the required precision is computationally demanding. The time the robot needs to scan the environment and process an instruction also calls for refinement. Moreover, current VLMs still misinterpret some spatial relationships expressed in language, and improving their spatial understanding remains an open direction for reducing such errors.
In conclusion, Dream2Real represents an innovative step in robotic manipulation: it leverages the web-scale visual reasoning of VLMs, enables zero-shot operation, and offers practical solutions to real-world object rearrangement challenges. Future work may focus on improving the pose-sampling strategy, reducing computation time, and extending the framework to more complex, multi-step tasks.