Vision-and-Language Navigation via Causal Learning (2404.10241v1)
Abstract: In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT.
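For context, the BACL and FACL modules take their names from the back-door and front-door adjustments of causal inference (Pearl). Below is a minimal restatement of those textbook identities, with a generic confounder Z and mediator M as placeholders; the abstract does not specify GOAT's exact parameterization, so this is background rather than the paper's formulation:

\[ P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z=z)\, P(Z=z) \quad \text{(back-door: requires the confounder } Z \text{ to be observed)} \]

\[ P(Y \mid do(X)) = \sum_{m} P(M=m \mid X) \sum_{x'} P(Y \mid X=x', M=m)\, P(X=x') \quad \text{(front-door: handles unobserved confounders via a mediator } M\text{)} \]

This mirrors the split described in the abstract: observable confounders in vision, language, and history lend themselves to back-door adjustment, while unobservable ones are addressed through the front-door route.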