
VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

(2307.06082)
Published Jul 12, 2023 in cs.AI, cs.CL, and cs.CV

Abstract

Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN), which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it remains an open problem how best to connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human-written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
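The landmark-visibility step described in the abstract lends itself to a short illustration. Below is a minimal sketch, not the authors' implementation: it assumes a Hugging Face CLIP checkpoint (`openai/clip-vit-base-patch32`), a simple "a photo of a ..." prompt template, and an arbitrary cosine-similarity threshold, none of which are taken from the paper. It merely shows the general idea of scoring instruction landmarks against the current panorama view and verbalizing the visible ones for the LLM prompt.

```python
"""Sketch of CLIP-based landmark visibility and verbalization (illustrative only)."""
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper only states that CLIP is used.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def visible_landmarks(view: Image.Image, landmarks: list[str],
                      threshold: float = 0.25) -> list[str]:
    """Return the landmarks whose CLIP similarity to the view exceeds `threshold`."""
    texts = [f"a photo of a {lm}" for lm in landmarks]       # assumed prompt template
    inputs = processor(text=texts, images=view, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    # Normalize embeddings and take cosine similarity: one score per landmark.
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (txt @ img.T).squeeze(-1)
    return [lm for lm, s in zip(landmarks, sims.tolist()) if s > threshold]

def verbalize(visible: list[str]) -> str:
    """Turn the visibility result into a text observation for the agent's prompt."""
    if not visible:
        return "There are no landmarks in sight."
    return "You see: " + ", ".join(visible) + "."
```

In VELMA, observations verbalized in this style are appended to the trajectory text and fed to the LLM as the contextual prompt from which the next action is generated; the exact prompt format and thresholding scheme are described in the paper itself.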


