Vision-Language Navigation with Embodied Intelligence: A Survey (2402.14304v2)
Abstract: As a long-term vision in artificial intelligence, embodied intelligence aims to improve agents' ability to perceive, understand, and interact with their environment. Vision-language navigation (VLN), a critical research path toward embodied intelligence, studies how agents communicate effectively with humans in natural language, receive and understand instructions, and ultimately rely on visual observations to navigate accurately. VLN draws on artificial intelligence, natural language processing, computer vision, and robotics. The field faces substantial technical challenges but shows strong potential for applications such as human-computer interaction. Because the pipeline from language understanding to action execution is complex, VLN must contend with aligning visual observations with language instructions, improving generalization to unseen environments, and many other open problems. This survey systematically reviews research progress in VLN and details its research directions within embodied intelligence. After summarizing the typical system architecture, representative methods, and commonly used benchmark datasets, we analyze the problems and challenges facing current research and explore future directions for the field, aiming to provide a practical reference for researchers.
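To make the vision-language alignment problem mentioned in the abstract concrete, the sketch below shows one hypothetical VLN decision step: an instruction is encoded, each navigable candidate view is scored against the instruction, and the agent picks the best candidate or STOP. This is a minimal illustration under assumed names and shapes (`encode_instruction`, `decision_step`, random stand-in features), not the architecture of any specific method covered by the survey; real systems use learned text encoders (LSTM/Transformer) and CNN/ViT visual features with trained cross-modal attention.

```python
# Minimal, illustrative sketch of one VLN decision step (not from the survey).
# All names, shapes, and features are hypothetical stand-ins.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def encode_instruction(tokens, dim=128, rng=None):
    """Stand-in text encoder: mean of (random) token embeddings."""
    rng = rng or np.random.default_rng(0)
    token_embeddings = rng.normal(size=(len(tokens), dim))
    return token_embeddings.mean(axis=0)

def decision_step(instruction_vec, candidate_feats, stop_bias=0.0):
    """Score each navigable candidate view against the instruction and
    return the next action: an index into the candidates, or 'STOP'."""
    scores = candidate_feats @ instruction_vec          # cross-modal similarity
    scores = np.append(scores, stop_bias)               # last slot represents STOP
    probs = softmax(scores)
    choice = int(np.argmax(probs))
    return ("STOP" if choice == len(candidate_feats) else choice), probs

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    instr = encode_instruction("walk past the sofa and stop at the door".split(), rng=rng)
    candidates = rng.normal(size=(4, 128))              # 4 navigable viewpoints
    action, probs = decision_step(instr, candidates)
    print("chosen action:", action, "probabilities:", np.round(probs, 3))
```

In practice this step is repeated at every viewpoint until STOP is chosen, and the dot-product scoring above stands in for a learned cross-modal attention module trained with imitation or reinforcement learning.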
Authors: Peng Gao, Peng Wang, Feng Gao, Fei Wang, Ruyue Yuan