Semantic Map-based Generation of Navigation Instructions (2403.19603v1)

Published 28 Mar 2024 in cs.CL, cs.AI, and cs.CV

Abstract: We are interested in the generation of navigation instructions, either in their own right or as training material for robotic navigation tasks. In this paper, we propose a new approach to navigation instruction generation by framing the problem as an image captioning task using semantic maps as visual input. Conventional approaches employ a sequence of panorama images to generate navigation instructions. Semantic maps abstract away from visual details and fuse the information in multiple panorama images into a single top-down representation, thereby reducing the computational complexity of processing the input. We present a benchmark dataset for instruction generation using semantic maps, propose an initial model, and ask human subjects to manually assess the quality of generated instructions. Our initial investigations show promise in using semantic maps for instruction generation instead of a sequence of panorama images, but there is vast scope for improvement. We release the code for data preparation and model training at https://github.com/chengzu-li/VLGen.
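To make the framing concrete, the sketch below casts instruction generation as image captioning over a single top-down semantic map rendered as an image, in place of a sequence of panoramas. The choice of captioning model (BLIP via Hugging Face transformers), the checkpoint name, and the input file name are illustrative assumptions, not the paper's actual setup; the authors' implementation is in the VLGen repository linked above.

```python
# Minimal sketch: navigation instruction generation framed as image captioning
# over a semantic map. Model and file names are placeholders for illustration.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# A top-down semantic map rendered as an RGB image; this single image replaces
# the sequence of panorama images used by conventional approaches.
semantic_map = Image.open("semantic_map.png").convert("RGB")

inputs = processor(images=semantic_map, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True))
```

In practice the captioning model would be fine-tuned on map-instruction pairs from the benchmark dataset the paper introduces, rather than used zero-shot as above.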
