NaturalVLM: Leveraging Fine-grained Natural Language for Affordance-Guided Visual Manipulation (2403.08355v1)
Abstract: Enabling home-assistant robots to perceive and manipulate a diverse range of 3D objects based on human language instructions is a pivotal challenge. Prior research has predominantly focused on simplistic, task-oriented instructions such as "Slide the top drawer open". However, many real-world tasks demand intricate multi-step reasoning, and without fine-grained human instructions they become extremely difficult for robots to execute. To address these challenges, we introduce a comprehensive benchmark, NrVLM, comprising 15 distinct manipulation tasks and over 4,500 episodes meticulously annotated with fine-grained language instructions. We split each long-horizon task into several steps, each paired with a natural language instruction. Moreover, we propose a novel learning framework that completes the manipulation task step by step according to the fine-grained instructions. Specifically, we first identify the instruction to execute, taking into account visual observations and the end-effector's current state. Subsequently, our approach facilitates explicit learning through action-prompts and perception-prompts to promote manipulation-aware cross-modality alignment. Leveraging both visual observations and linguistic guidance, our model outputs a sequence of actionable predictions for manipulation, including contact points and end-effector poses. We evaluate our method and baselines on the proposed benchmark NrVLM, and the experimental results demonstrate the effectiveness of our approach. For additional details, please refer to https://sites.google.com/view/naturalvlm.
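
A minimal sketch of the step-by-step inference loop the abstract describes: at each step the framework identifies which fine-grained instruction to execute from the current observation and end-effector state, predicts an actionable command (contact point plus end-effector pose), and executes it. All names here (`selector.select`, `policy.predict`, `env.step`, etc.) are hypothetical placeholders for illustration, not the authors' actual API.

```python
def run_episode(instructions, selector, policy, env):
    """Execute a long-horizon task by following fine-grained instructions step by step.

    instructions : list[str]  fine-grained step instructions for the episode
    selector     : object with .select(instructions, obs, ee_state) -> int
    policy       : object with .predict(instruction, obs, ee_state) -> (contact_point, ee_pose)
    env          : object with .reset() -> (obs, ee_state) and
                   .step(contact_point, ee_pose) -> (obs, ee_state, done)
    """
    obs, ee_state = env.reset()
    done = False
    while not done:
        # 1) Identify which instruction to execute, given the current visual
        #    observation and the end-effector's state.
        idx = selector.select(instructions, obs, ee_state)
        # 2) Predict an actionable command (contact point + end-effector pose)
        #    from the selected instruction and the observation.
        contact_point, ee_pose = policy.predict(instructions[idx], obs, ee_state)
        # 3) Execute the predicted action and observe the resulting state.
        obs, ee_state, done = env.step(contact_point, ee_pose)
```

The loop structure only illustrates the ordering of the three stages (instruction identification, cross-modal action prediction, execution); the actual modules, prompt design, and action parameterization are specified in the paper and on the project page.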