
Lana: A Language-Capable Navigator for Instruction Following and Generation (2303.08409v1)

Published 15 Mar 2023 in cs.CV, cs.MM, and cs.RO

Abstract: Recently, visual-language navigation (VLN) -- entailing robot agents to follow navigation instructions -- has shown great advance. However, existing literature put most emphasis on interpreting instructions into actions, only delivering "dumb" wayfinding agents. In this article, we devise LANA, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with only one single model. More specifically, two encoders, respectively for route and language encoding, are built and shared by two decoders, respectively, for action prediction and instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performances on both instruction following and route description, with nearly half complexity. In addition, endowed with language generation capability, LANA can explain to humans its behaviors and assist human's wayfinding. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots.

Authors (4)
  1. Xiaohan Wang (91 papers)
  2. Wenguan Wang (103 papers)
  3. Jiayi Shao (5 papers)
  4. Yi Yang (856 papers)
Citations (29)

Summary

An Overview of Lana: A Language-Capable Navigator for Instruction Following and Generation

Visual-language navigation (VLN) has emerged as a significant research area in AI, focusing on enabling robots to interpret and follow human language instructions within an environment. The paper introduces a novel approach with Lana, a language-capable navigation agent designed to not only obey navigational commands but also generate descriptions of its routes, enhancing the interaction between humans and robots. This dual capability is realized using a single, integrated model.

Key Contributions and Methodology

Lana distinguishes itself from existing solutions by simultaneously addressing instruction following and language generation within a unified framework. This is accomplished through a multi-task Transformer architecture featuring two shared encoders for route and language inputs and two decoders for corresponding outputs. This design facilitates cross-task knowledge sharing and captures task-specific characteristics.
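As a minimal sketch of this layout (all names here are illustrative stand-ins, not the paper's code, and the toy "encoders" below replace what are actually Transformer blocks), the two shared encoders feed both task-specific decoders:

```python
# Hedged sketch of LANA's dual-task layout: a route encoder and a
# language encoder are shared by two decoders, one for action
# prediction (instruction following) and one for instruction
# generation. Real components are Transformer layers; these toy
# functions only show how inputs are routed.

def route_encoder(route):
    # Stand-in: tag each visited observation as a route feature.
    return [("route", obs) for obs in route]

def language_encoder(tokens):
    # Stand-in: tag each instruction token as a language feature.
    return [("lang", tok) for tok in tokens]

def action_decoder(route_feats, lang_feats):
    # Instruction following: route history + instruction -> next action.
    return f"action_at_step_{len(route_feats)}"

def instruction_decoder(route_feats):
    # Instruction generation: traversed route -> description tokens.
    return ["go"] + [obs for _, obs in route_feats]

def follow(instruction_tokens, route_so_far):
    """Predict the next action given an instruction and route history."""
    return action_decoder(route_encoder(route_so_far),
                          language_encoder(instruction_tokens))

def describe(route):
    """Generate a route description from a traversed route."""
    return instruction_decoder(route_encoder(route))
```

Because both decoders read from the same encoders, gradients from either task update the shared representations, which is how cross-task knowledge transfer arises in this design.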

The paper outlines the process of jointly pretraining and fine-tuning the agent on both instruction following and generation tasks using prominent VLN datasets, including R2R, R4R, and REVERIE. Remarkably, Lana achieves superior performance compared to specialized, task-specific models, while maintaining nearly half the model complexity.
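Since both tasks are optimization objectives throughout pretraining and fine-tuning, the total training loss can be sketched as a weighted sum of the two task losses (the weights here are an illustrative assumption, not values from the paper):

```python
def joint_loss(follow_loss, gen_loss, w_follow=1.0, w_gen=1.0):
    """Combine instruction-following and instruction-generation losses.

    The equal default weights are an assumption for illustration;
    the key point is that one model is optimized for both tasks.
    """
    return w_follow * follow_loss + w_gen * gen_loss
```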

Experimental Results

The empirical evaluations on R2R, R4R, and REVERIE datasets demonstrate Lana's effectiveness. For instance, on the R2R test unseen split, Lana achieves a 65% Success Rate (SR) and a 60% Success weighted by Path Length (SPL), surpassing recent models such as VLN↺BERT and HAMT. Similarly, on the R4R dataset, Lana achieves a 59.7% Coverage weighted by Length Score (CLS), again outperforming competitors.
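For reference, SR and SPL are standard VLN metrics: SR is the fraction of episodes ending within the success threshold of the goal, and SPL weights each success by the ratio of the shortest-path length to the length actually traveled. A small implementation of both (following the standard definitions, not code from the paper):

```python
def success_rate(successes):
    """SR: fraction of episodes marked successful (1) vs. failed (0)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lens, taken_lens):
    """SPL: success weighted by path-length efficiency.

    For each episode i: S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length to the goal and p_i the length traveled.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lens, taken_lens):
        total += s * l / max(p, l)
    return total / len(successes)
```

A successful episode that takes twice the shortest path contributes only half a success to SPL, so SPL is always at most SR.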

In addition to instruction following, Lana's ability to generate route descriptions was assessed on textual metrics such as SPICE and CIDEr. The generated instructions were also subject to human evaluations, in which Lana consistently outperformed baselines like BT-speaker and EDrop-speaker, though it does not yet match the quality of human-written instructions.

Implications and Future Directions

Lana's approach represents a substantial advancement in creating more socially intelligent robots capable of not only executing tasks based on human instructions but also engaging in two-way communication through language generation. This capability is crucial for applications in navigation assistance for impaired individuals, public guidance robots, and more complex, interactive scenarios.

The paper suggests several avenues for future work, including the integration of larger pre-trained models to enhance Lana's linguistic and navigational capabilities further. Continued exploration in this domain is poised to align computer vision and natural language processing advancements, fostering the development of robots that can seamlessly interact with humans in naturalistic environments.