Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation (2403.03405v1)

Published 6 Mar 2024 in cs.CV

Abstract: Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios. However, existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments. In this paper, we tackle this challenge by proposing a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations. Specifically, we establish reasonable assumptions about confounders for vision and language in VLN using the structured causal model (SCM). Building upon this, we propose an iterative backdoor-based representation learning (IBRL) method that allows for the adaptive and effective intervention on confounders. Furthermore, we introduce the visual and linguistic backdoor causal encoders to enable unbiased feature expression for multi-modalities during training and validation, enhancing the agent's capability to generalize across different environments. Experiments on three VLN datasets (R2R, RxR, and REVERIE) showcase the superiority of our proposed method over previous state-of-the-art approaches. Moreover, detailed visualization analysis demonstrates the effectiveness of CausalVLN in significantly narrowing down the performance gap between seen and unseen environments, underscoring its strong generalization capability.
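The abstract describes backdoor-based intervention on visual and linguistic confounders but does not spell out the implementation. Below is a minimal PyTorch sketch of one common way a "backdoor causal encoder" is approximated in the deconfounding literature: attend over a fixed dictionary of confounder prototypes and mix the expected confounder back into the input feature. The class name BackdoorCausalEncoder, the bank of 32 prototypes, the implicit uniform prior, and the linear fusion are illustrative assumptions, not the paper's exact IBRL module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdoorCausalEncoder(nn.Module):
    """Illustrative sketch of a backdoor-adjusted feature encoder.

    Approximates P(Y | do(X)) = sum_z P(Y | X, z) P(z) by attending over a
    fixed dictionary of confounder prototypes (e.g., clustered room/object
    features for vision, or aggregated phrase embeddings for language) and
    mixing the expected confounder back into the input representation.
    The dictionary, prior, and fusion scheme are assumptions for
    illustration, not the paper's exact IBRL module.
    """

    def __init__(self, feat_dim: int, confounder_bank: torch.Tensor):
        super().__init__()
        # confounder_bank: (num_confounders, feat_dim) prototype features;
        # the prior P(z) is implicitly uniform in this sketch.
        self.register_buffer("bank", confounder_bank)
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) raw visual or linguistic features.
        q = self.query(x)                                   # (B, D)
        k = self.key(self.bank)                             # (N, D)
        attn = F.softmax(q @ k.t() / k.shape[-1] ** 0.5, dim=-1)  # (B, N)
        # Expectation over the confounder dictionary.
        z_hat = attn @ self.bank                            # (B, D)
        # Fuse the deconfounded context back into the original feature.
        return self.fuse(torch.cat([x, z_hat], dim=-1))

# Usage sketch: a random bank stands in for clustered environment features.
bank = torch.randn(32, 512)
encoder = BackdoorCausalEncoder(feat_dim=512, confounder_bank=bank)
features = torch.randn(4, 512)
deconfounded = encoder(features)   # (4, 512)
```

In a real setting the confounder bank would be built from the training environments rather than sampled at random, and the paper's iterative scheme would update the intervention during training; the random bank here only demonstrates the shapes and the attention-weighted expectation.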
