Overview of "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action"
This paper presents LM-Nav, a novel approach to robotic navigation that combines large pre-trained models of language, vision, and action to follow natural language instructions. The authors sidestep the need for language-annotated navigation data, a significant bottleneck for scalability, by building LM-Nav entirely from models pre-trained on large general-purpose datasets: ViNG for visual navigation, CLIP for associating images with language, and GPT-3 for language modeling.
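To make the instruction-parsing step concrete, the sketch below shows how a few-shot prompt to a text-completion model could extract an ordered landmark list from a free-form instruction. The prompt wording and the `complete` callable are hypothetical stand-ins for illustration, not the authors' exact prompt or API.

```python
# Minimal sketch of LLM-based landmark extraction.
# The prompt and the `complete` helper are illustrative assumptions,
# not the paper's exact prompt or API.
import re

FEW_SHOT_PROMPT = """\
Extract the landmarks mentioned in the instruction, in order.

Instruction: Go past the fire hydrant, then turn left at the stop sign.
Landmarks: 1. fire hydrant 2. stop sign

Instruction: {instruction}
Landmarks:"""


def extract_landmarks(instruction: str, complete) -> list[str]:
    """`complete` is any text-completion callable (e.g. a GPT-3-style API wrapper)."""
    raw = complete(FEW_SHOT_PROMPT.format(instruction=instruction))
    # Split a numbered completion like "1. fire hydrant 2. stop sign" into a list.
    parts = re.split(r"\s*\d+\.\s*", raw)
    return [p.strip() for p in parts if p.strip()]
```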
LM-Nav operates in a modular fashion. A large language model (LLM) parses the user's instruction into a sequence of textual landmarks. These landmarks are then grounded in the robot's visual observations by a vision-and-language model (VLM), which scores how likely each landmark is to appear in each image. A visual navigation model (VNM) constructs a topological map from the robot's prior experience, enabling path planning through potentially complex and unstructured real-world environments. Notably, the system plans and executes navigation tasks without any fine-tuning, relying solely on the generalization capabilities of the pre-trained models.
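The following sketch illustrates the grounding step with the open-source CLIP package, scoring each graph-node image against each extracted landmark phrase. The model choice, prompt template, and softmax-over-images normalization are assumptions made for illustration rather than the paper's exact formulation.

```python
# Sketch: score each node image in the topological graph against each
# landmark phrase with CLIP (assumes the open-source `clip`, `torch`,
# and `Pillow` packages; normalization choice is illustrative).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def landmark_image_scores(image_paths, landmark_phrases):
    """Return a (num_landmarks, num_images) matrix of matching probabilities."""
    images = torch.stack(
        [preprocess(Image.open(p)) for p in image_paths]
    ).to(device)
    texts = clip.tokenize(
        [f"a photo of a {phrase}" for phrase in landmark_phrases]
    ).to(device)

    with torch.no_grad():
        image_feats = model.encode_image(images)
        text_feats = model.encode_text(texts)

    # Cosine similarity between every (landmark, image) pair.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = text_feats @ image_feats.T

    # Normalize over images for each landmark (one plausible choice).
    return sims.softmax(dim=-1)
```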
Strong Numerical Results and System Performance
Quantitative evaluation shows that LM-Nav executes instructions with an 85% success rate across a varied set of queries, demonstrating notable adaptability to the complexity of suburban environments. The system's efficacy is demonstrated on practical tasks covering distances on the order of hundreds of meters, with high planning efficiency and minimal human intervention (one intervention over 6.4 km of navigation). This robustness to non-ideal real-world conditions underscores the potential of large pre-trained models for goal-directed robotic navigation.
Evaluation and Ablation Studies
Ablation studies reveal the significance of each pre-trained component. For instance, replacing the ViNG-based navigation model with a naïve GPS-based method increased failure rates because it cannot account for obstacle traversability. Similarly, using alternative LLMs such as GPT-J-6B and fairseq-13B produced inferior landmark extraction compared to GPT-3. These results indicate that GPT-3's parsing ability and the coupling of the LLM with VNM-based planning are both crucial for consistent navigation performance.
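To illustrate why graph-based planning over VNM-estimated connectivity outperforms a straight-line GPS heuristic, here is a minimal Dijkstra sketch over a topological graph whose edge weights stand in for traversal costs. The graph representation and cost semantics are simplified assumptions, not the paper's exact search objective.

```python
# Sketch: shortest-path search over a topological graph whose edges carry
# traversal costs (e.g. derived from VNM distance estimates). A GPS-only
# baseline would ignore these costs and head straight toward the landmark.
import heapq


def plan_path(graph, start, goal):
    """graph: {node: [(neighbor, traversal_cost), ...]}; returns (path, cost)."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + edge_cost, neighbor, path + [neighbor]))
    return None, float("inf")
```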
Implications and Future Directions
LM-Nav exemplifies a move towards more generalized and instruction-aware robotic systems that do not rely on environment-specific training. By utilizing sophisticated language parsing and contextual visual grounding, the paper provides a proof-of-concept for executing complex, instruction-conditioned tasks in robotics.
From a theoretical standpoint, LM-Nav challenges the conventional separation between natural language processing and robotic control, presenting a cohesive system that interprets high-level, human-like communication and acts on it in the world. Practically, it opens a path toward robots that can be deployed in unstructured or unfamiliar settings and directed with high-level language.
Looking ahead, open questions remain about extending such systems to integrate finer-grained motor control, move beyond static landmark recognition, and handle dynamic environments more effectively. An exciting line of research would be a unified navigation model capable of supporting a diverse set of robotic platforms in a plug-and-play manner, removing dependence on task-specific pre-training altogether.
In conclusion, "LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action" critically advances our understanding of multi-modal robotic autonomy, effectively utilizing the capabilities of cutting-edge pre-trained models for adaptive and generalizable task execution.