TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation (2403.08833v1)
Abstract: Zero-shot navigation is a critical challenge in Vision-Language Navigation (VLN) tasks, where the ability to adapt to unfamiliar instructions and to act in unknown environments is essential. Existing supervised models, trained on annotated data with reinforcement learning, exhibit limited generalization. Large language models (LLMs), with their extensive knowledge and emergent reasoning abilities, offer a potential pathway to zero-shot navigation. This paper presents an LLM-based VLN agent and explores approaches to the zero-shot navigation problem. To compensate for the shortcomings of LLMs in environmental perception, we propose the Thinking, Interacting, and Action (TINA) framework. TINA enables the agent to scrutinize perceptual information and to autonomously query key clues in the environment through an introduced question-answering module, thereby aligning the instruction with concrete perceptual evidence. The TINA framework enhances the agent's perceptual abilities, while the explicit thought and query processes also improve the explainability and transparency of the navigation procedure. We evaluate our method on the Room-to-Room dataset. The experimental results indicate that our approach improves the navigation performance of LLM-based agents and also outperforms some supervised methods, highlighting its efficacy for zero-shot navigation.
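The abstract describes a think-interact-act loop in which the agent reasons about the instruction, queries a question-answering module for key perceptual clues, and only then commits to a move. Below is a minimal sketch of how such a loop could be wired up. All names, prompts, and interfaces (`llm`, `vqa`, `Observation`, candidate indices) are illustrative assumptions for exposition, not the paper's released code.

```python
# Hypothetical sketch of a TINA-style think -> interact -> act step.
# The llm/vqa callables and prompt wording are assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Perceptual summary of the navigable candidates at the current node."""
    candidate_captions: List[str]  # e.g. one caption per navigable direction


class TINAAgent:
    def __init__(self,
                 llm: Callable[[str], str],
                 vqa: Callable[[str, int], str]):
        self.llm = llm  # language model used for thinking and action selection
        self.vqa = vqa  # question-answering module grounded in the candidate views

    def step(self, instruction: str, obs: Observation) -> int:
        # 1. Think: decide which clue from the instruction still needs verifying.
        thought = self.llm(
            f"Instruction: {instruction}\n"
            f"Candidate views: {obs.candidate_captions}\n"
            "What key clue should be verified before moving?"
        )
        # 2. Interact: query the QA module about each candidate view so the
        #    instruction is aligned with concrete perceptual evidence.
        evidence = [
            self.vqa(f"Does this view contain: {thought}?", i)
            for i in range(len(obs.candidate_captions))
        ]
        # 3. Act: pick the candidate index best supported by the evidence.
        decision = self.llm(
            f"Instruction: {instruction}\nEvidence: {evidence}\n"
            "Reply with only the index of the best candidate."
        )
        return int(decision.strip())
```

In this sketch the explicit `thought` string and per-candidate `evidence` answers are what make the decision traceable, mirroring the explainability claim in the abstract.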
- Dingbang Li
- Wenzhou Chen
- Xin Lin