
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View (2307.06082v2)

Published 12 Jul 2023 in cs.AI, cs.CL, and cs.CV

Abstract: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, how best to connect them with an interactive visual environment remains an open problem. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as a contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
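
The verbalization idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the prompt wording, the `Step` class, the `visible_landmarks` and `verbalize` helpers, and the 0.5 threshold are all hypothetical, and the landmark scores are hard-coded placeholders standing in for whatever image-text similarity a model such as CLIP would produce for the current panorama.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One past step: the action taken and the landmarks judged visible (hypothetical structure)."""
    action: str
    visible_landmarks: List[str] = field(default_factory=list)


def visible_landmarks(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep landmarks whose score reaches the threshold.

    The scores are assumed to come from an image-text model; the threshold is illustrative.
    """
    return [name for name, score in scores.items() if score >= threshold]


def verbalize(instructions: str, steps: List[Step], current_view: List[str]) -> str:
    """Render the instructions, the trajectory so far, and the current observation as text."""
    lines = [f"Navigation instructions: {instructions}", "Trajectory so far:"]
    for number, step in enumerate(steps, start=1):
        seen = ", ".join(step.visible_landmarks) or "no landmark"
        lines.append(f"{number}. saw {seen}; action: {step.action}")
    seen_now = ", ".join(current_view) or "no landmark"
    # The final line is left open so a language model can complete the next action.
    lines.append(f"{len(steps) + 1}. saw {seen_now}; action:")
    return "\n".join(lines)


# Placeholder scores for landmarks extracted from the instruction text.
scores = {"a red awning": 0.81, "a bike rack": 0.22}
history = [Step("forward", ["a bank"]), Step("left")]
prompt = verbalize(
    "Turn left at the bank, then stop at the red awning.",
    history,
    visible_landmarks(scores),
)
print(prompt)
```

The printed prompt ends with an unfinished "action:" line, which is where an LLM would be asked to continue. In a few-shot setting, complete example trajectories in the same textual format could be placed before this prompt.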

