Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning (2403.05770v1)
Abstract: Vision-and-language navigation (VLN) asks an agent to follow a given language instruction to navigate through a real 3D environment. Despite significant advances, conventional VLN agents are typically trained in disturbance-free environments and may easily fail in real-world scenarios, since they do not know how to handle the various possible disturbances, such as sudden obstacles or human interruptions, that commonly occur in practice and often cause unexpected route deviations. In this paper, we present a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), that enhances the generalization ability of existing VLN agents by requiring them to learn deviation-robust navigation. Specifically, a simple yet effective path perturbation scheme is introduced to simulate route deviations, under which the agent is still required to navigate successfully by following the original instruction. Since directly forcing the agent to learn from perturbed trajectories may lead to inefficient training, a progressively perturbed trajectory augmentation strategy is designed, in which the agent self-adaptively learns to navigate under perturbation as its navigation performance on each specific trajectory improves. To encourage the agent to capture the differences introduced by perturbation, a perturbation-aware contrastive learning mechanism is further developed that contrasts perturbation-free trajectory encodings with their perturbation-based counterparts. Extensive experiments on R2R show that PROPER benefits multiple VLN baselines even in perturbation-free scenarios. We further collect perturbed path data to construct an introspection subset of R2R, called Path-Perturbed R2R (PP-R2R). Results on PP-R2R reveal the unsatisfactory robustness of popular VLN agents and demonstrate PROPER's ability to improve navigation robustness.
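To make the contrastive component of the abstract concrete, the snippet below is a minimal, hypothetical PyTorch sketch of an InfoNCE-style objective over perturbation-free and perturbed trajectory encodings. The function name, tensor shapes, temperature, and the choice of which pairs are treated as positives are illustrative assumptions for a generic contrastive scaffold, not the paper's exact PROPER formulation.

```python
# Sketch of a contrastive loss between perturbation-free and perturbed
# trajectory encodings (assumed shapes and hyperparameters; not the
# authors' exact objective).
import torch
import torch.nn.functional as F

def perturbation_aware_contrastive_loss(clean_enc: torch.Tensor,
                                         perturbed_enc: torch.Tensor,
                                         temperature: float = 0.07) -> torch.Tensor:
    """clean_enc, perturbed_enc: (batch, dim) trajectory encodings produced by
    the same agent on the original path and on its perturbed counterpart."""
    # Normalize so that dot products are cosine similarities.
    clean = F.normalize(clean_enc, dim=-1)
    pert = F.normalize(perturbed_enc, dim=-1)

    # Similarity matrix: entry (i, j) compares clean sample i with perturbed
    # sample j; diagonal entries are the matched clean/perturbed pairs.
    logits = clean @ pert.t() / temperature

    # Symmetric InfoNCE: each encoding is asked to identify its matched
    # counterpart among all other samples in the batch.
    targets = torch.arange(clean.size(0), device=clean.device)
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    return loss
```

In this generic form, the matched clean/perturbed pair of the same trajectory serves as the contrasted pair and all other batch samples serve as negatives; whether PROPER pulls such pairs together or pushes them apart to emphasize the difference introduced by perturbation is a design detail of the paper that this sketch does not commit to.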