- The paper introduces LGX, a novel language-guided algorithm that leverages LLMs and vision-language grounding to improve zero-shot object navigation.
- The methodology combines egocentric scene understanding, intelligent LLM-driven exploration, and goal detection with GLIP for reliable object localization.
- Experimental results demonstrate a 27% improvement in success rate on RoboTHOR and enhanced SPL scores compared to previous state-of-the-art methods.
Language-Guided Exploration for Zero-Shot Object Navigation
The paper addresses a significant problem in robotics and artificial intelligence: enabling embodied agents to navigate complex environments and locate arbitrarily described objects in a zero-shot manner, without prior knowledge of the environment or the target objects. This capability is crucial for real-world applications where robots must interact seamlessly with dynamic, unconstrained household environments and adapt to human instructions given in natural language.
Overview of LGX Algorithm
The authors present LGX (Language-guided Exploration), a novel algorithm tailored for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON). LGX uses an LLM for commonsense reasoning and navigation decision-making, and in parallel employs a vision-language grounding model, GLIP (Grounded Language-Image Pre-training), to detect and localize the uniquely described target object, leveraging the strengths of both families of foundation models.
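The interplay between the LLM planner and the GLIP grounder can be pictured as a perceive-prompt-act loop. The sketch below is a minimal, hypothetical Python rendering of that loop under the pipeline the paper describes; `rotate_and_capture`, `glip_ground`, `build_prompt`, `query_llm`, `move_toward`, `navigate_to_box`, and the threshold value are all placeholder assumptions, not the authors' released code.

```python
# Minimal sketch of an LGX-style perceive-prompt-act loop.
# Every helper function and constant here is a hypothetical placeholder.

CONFIDENCE_THRESHOLD = 0.6  # assumed value; the paper tunes this empirically


def lgx_navigate(agent, target_description: str, max_steps: int = 200):
    for step in range(max_steps):
        # 1. Scene understanding: rotate in place and collect egocentric RGB-D views.
        views = rotate_and_capture(agent)            # list of (rgb, depth) pairs

        # 2. Goal detection: ask the vision-language grounder whether the target
        #    is visible in any view and how confident it is.
        best_box, best_score = None, 0.0
        for rgb, _depth in views:
            box, score = glip_ground(rgb, target_description)
            if score > best_score:
                best_box, best_score = box, score
        if best_score >= CONFIDENCE_THRESHOLD:
            return navigate_to_box(agent, best_box)  # stop exploring, go to the object

        # 3. Intelligent exploration: summarize the scene into a prompt and let the
        #    LLM pick the most promising direction or sub-goal.
        prompt = build_prompt(views, target_description)
        sub_goal = query_llm(prompt)                 # e.g. "go toward the sofa"
        move_toward(agent, sub_goal)

    return None  # target not found within the step budget
```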
Methodology
- Scene Understanding: LGX begins by rotating in place to collect egocentric RGB and depth images of the surroundings. These images are processed through object detection and captioning models to derive context-rich prompts for the LLM.
- Intelligent Exploration: LGX synthesizes a prompt from the detected scene elements or generated captions and feeds it to the LLM. The LLM applies its inherent commonsense reasoning to provide navigational guidance, identifying sub-goals or directions closely associated with the target object (see the prompt-construction sketch after this list).
- Goal Detection and Motion Planning: GLIP grounds the specified target object in the agent's current view, returning a confidence score for its presence in the environment. Once this score exceeds a confidence threshold, the agent stops exploring and navigates directly to the detected object.
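To make the exploration step concrete, here is one plausible way the context-rich prompt could be assembled from detected objects and generated captions. The `detect_objects` and `caption_image` helpers, the assumption of four 90° views, and the prompt template itself are illustrative assumptions, not the paper's verbatim prompts.

```python
def build_prompt(views, target_description: str) -> str:
    """Turn egocentric observations into an LLM prompt asking for a sub-goal.

    `detect_objects` and `caption_image` stand in for off-the-shelf object
    detection and image captioning models; the wording below is illustrative.
    Assumes `views` holds four (rgb, depth) pairs from 90-degree rotations.
    """
    lines = []
    for direction, (rgb, _depth) in zip(["ahead", "left", "behind", "right"], views):
        objects = detect_objects(rgb)     # e.g. ["sofa", "tv stand", "lamp"]
        caption = caption_image(rgb)      # e.g. "a living room with a couch"
        lines.append(
            f"Looking {direction}: {caption}. Visible objects: {', '.join(objects)}."
        )

    observations = "\n".join(lines)
    return (
        f"You are a robot searching for a {target_description}.\n"
        f"{observations}\n"
        "Which direction or visible object should the robot move toward next "
        "to find the target? Answer with a single direction or object name."
    )
```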
Numerical Results and Performance
LGX demonstrates superior performance in zero-shot object navigation, improving success rate on RoboTHOR by over 27% relative to previous state-of-the-art zero-shot baselines such as CLIP on Wheels (CoW) and its OWL-ViT variant (OWL CoW). Its success weighted by path length (SPL) further corroborates its efficacy, highlighting robust and efficient navigation of complex, unseen environments.
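For reference, SPL is the standard navigation metric that weights each successful episode by the ratio of the shortest-path length to the path the agent actually took, so inefficient detours are penalized even when the goal is reached. A minimal computation looks like this (the function and argument names are my own):

```python
def spl(successes, shortest_lengths, actual_lengths) -> float:
    """Success weighted by Path Length (SPL) for a set of navigation episodes.

    successes:        0/1 success flag per episode
    shortest_lengths: geodesic shortest-path length from start to goal
    actual_lengths:   length of the path the agent actually traversed
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, actual_lengths):
        total += s * l / max(p, l)   # efficiency term is 1.0 only for optimal paths
    return total / len(successes)
```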
Implications and Future Directions
The research delineates a path forward for integrating large-scale LLMs into robotic planning and navigation, emphasizing the value of semantic understanding and commonsense reasoning. The method has practical implications for autonomous robotics, making robots more adaptable and intelligent when interacting with human-centric environments through natural language interfaces.
Future research might explore optimizing prompt strategies for LLMs, enhancing real-world applicability, and integrating additional sensory modalities such as sound and touch. Collaborative setups that combine several foundation models offer intriguing possibilities for richer, more nuanced interaction capabilities in AI systems operating in dynamic environments.
Overall, LGX signifies a step towards more sophisticated, adaptable autonomous systems capable of executing complex tasks described by unconstrained language, affirming its relevance in the ongoing evolution of embodied AI and navigation technologies.