Overview of "VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision LLM"
This paper investigates the use of Vision LLMs (VLMs) to interpret human demonstration videos for robotic task planning. VLMs have previously been applied to textual and visual inputs to support tasks such as motion planning and language parsing. Here, the authors present a novel approach named "SeeDo," designed to enable VLMs to interpret long-horizon, pick-and-place human demonstration videos and generate actionable plans for robotic execution.
Methodology
SeeDo integrates several key components into a cohesive pipeline that enhances the decision-making capability of VLMs when analyzing video data:
- Keyframe Selection: The pipeline identifies pivotal frames within a video using hand speed as a heuristic, aiming to capture the most critical moments in the task sequence and condense the video into a manageable set of frames for subsequent VLM analysis (a minimal sketch of this heuristic follows the list).
- Visual Perception Module: This module enhances the VLM's visual capabilities by incorporating object detection and tracking. It leverages grounding and segmentation tools to improve the understanding of object dynamics and spatial relations within the keyframes.
- VLM Reasoning: Using a state-of-the-art model, GPT-4o, this module interprets the keyframes with chain-of-thought prompting and compiles a task plan. The output serves as intermediate LLM programs (LMPs) for task execution on robotic systems (an illustrative prompting sketch also follows the list).
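The keyframe-selection heuristic can be illustrated with a short sketch. The snippet below is a minimal, hypothetical implementation, assuming per-frame hand positions have already been extracted by some hand detector; the local-minimum-of-speed rule, the `min_gap_s` spacing parameter, and the function name are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def select_keyframes(hand_xy, fps, min_gap_s=0.5):
    """Pick frames where hand speed hits a local minimum.

    hand_xy: (T, 2) array of per-frame hand positions in pixels.
    fps: video frame rate.
    min_gap_s: minimum spacing between selected keyframes (assumed value).
    Intuition: the hand slows down when grasping or releasing an object,
    so low-speed frames tend to mark pick and place moments.
    """
    hand_xy = np.asarray(hand_xy, dtype=float)
    # Per-frame speed (pixels/second), padded so speed has length T.
    vel = np.diff(hand_xy, axis=0) * fps
    speed = np.linalg.norm(vel, axis=1)
    speed = np.concatenate([[speed[0]], speed])

    # Local minima of the speed curve are keyframe candidates.
    candidates = [
        t for t in range(1, len(speed) - 1)
        if speed[t] <= speed[t - 1] and speed[t] <= speed[t + 1]
    ]

    # Greedily keep the slowest candidates, enforcing a minimum gap
    # so one slow phase contributes only one keyframe.
    min_gap = int(min_gap_s * fps)
    keyframes = []
    for t in sorted(candidates, key=lambda i: speed[i]):
        if all(abs(t - k) >= min_gap for k in keyframes):
            keyframes.append(t)
    return sorted(keyframes)
```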
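The VLM reasoning step can likewise be sketched in outline. The snippet below shows one assumed prompting pattern, not the paper's actual prompts: it sends the selected keyframes to GPT-4o via the OpenAI chat completions API with a chain-of-thought style instruction and asks for a pick-and-place plan. The instruction text, the base64 image helper, and the output format are illustrative assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path):
    """Read a keyframe image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def plan_from_keyframes(keyframe_paths):
    """Ask GPT-4o to reason over ordered keyframes and emit a step-by-step plan."""
    content = [{
        "type": "text",
        "text": ("These images are keyframes from a human demonstration, in order. "
                 "Think step by step about which object is picked and where it is "
                 "placed between consecutive keyframes, then output the plan as "
                 "lines of the form 'pick(<object>); place(<object>, <target>)'."),
    }]
    content += [{"type": "image_url", "image_url": {"url": encode_image(p)}}
                for p in keyframe_paths]
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

The returned plan text would then be parsed into the intermediate LMPs mentioned above; that parsing step is omitted here.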
Experimental Design
The authors developed a benchmark involving human demonstration videos across three distinct categories: vegetable organization, garment organization, and wooden block stacking. These tasks were chosen for their inherent temporal and spatial complexity, posing a substantial challenge for both robotic planning and execution.
A new set of evaluation metrics was introduced to assess the SeeDo pipeline's efficacy: Task Success Rate (TSR), Final-state Success Rate (FSR), and Step Success Rate (SSR), each highlighting a different facet of task-plan execution fidelity.
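Since the paper's formal metric definitions are not reproduced in this overview, the sketch below encodes one plausible reading: SSR as the fraction of ground-truth steps reproduced at the correct position, FSR as an exact match on the final object arrangement, and TSR as an exact match on the entire plan. The function names and definitions are assumptions for illustration.

```python
def step_success_rate(pred_steps, gt_steps):
    """SSR: fraction of ground-truth steps matched at the same position (assumed definition)."""
    matches = sum(p == g for p, g in zip(pred_steps, gt_steps))
    return matches / len(gt_steps) if gt_steps else 0.0

def final_state_success(pred_final_state, gt_final_state):
    """FSR: 1 if the predicted final arrangement equals the ground truth, else 0 (assumed definition)."""
    return float(pred_final_state == gt_final_state)

def task_success(pred_steps, gt_steps):
    """TSR: 1 only if every step of the plan matches the demonstration (assumed definition)."""
    return float(pred_steps == gt_steps)
```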
Results
The experimental comparison covers both open-source and closed-source VLMs. SeeDo outperforms the baseline methods, including top-ranked commercial VLMs, across all defined metrics. Its robustness to visual ambiguities, as seen in the wooden block stacking tasks, also underscores the benefit of the visual perception enhancements.
Implications and Future Prospects
The primary contribution of SeeDo lies in its ability to close the domain gap between human demonstration videos and robotic task planning. It represents a significant stride towards practical multimodal learning models capable of understanding long-horizon tasks.
While SeeDo has shown promising results, there are notable challenges and opportunities for future research:
- Action Space Expansion: Current experiments are limited to pick-and-place actions. Expanding the action repertoire remains an open area.
- Spatial Intelligence: Despite advances in visual perception, further enhancements in understanding spatial relations are necessary.
- Precision in Spatial Positioning: Future enhancements could involve extracting more precise spatial positioning for tasks requiring fine manipulation.
The SeeDo pipeline is a compelling framework for integrating advanced VLMs in robotic applications, bridging human demonstration with robotic execution, and enabling new possibilities in autonomous systems.