DELAN: Dual-Level Alignment for Vision-and-Language Navigation by Cross-Modal Contrastive Learning (2404.01994v1)
Abstract: Vision-and-Language navigation (VLN) requires an agent to navigate in unseen environment by following natural language instruction. For task completion, the agent needs to align and integrate various navigation modalities, including instruction, observation and navigation history. Existing works primarily concentrate on cross-modal attention at the fusion stage to achieve this objective. Nevertheless, modality features generated by disparate uni-encoders reside in their own spaces, leading to a decline in the quality of cross-modal fusion and decision. To address this problem, we propose a Dual-levEL AligNment (DELAN) framework by cross-modal contrastive learning. This framework is designed to align various navigation-related modalities before fusion, thereby enhancing cross-modal interaction and action decision-making. Specifically, we divide the pre-fusion alignment into dual levels: instruction-history level and landmark-observation level according to their semantic correlations. We also reconstruct a dual-level instruction for adaptation to the dual-level alignment. As the training signals for pre-fusion alignment are extremely limited, self-supervised contrastive learning strategies are employed to enforce the matching between different modalities. Our approach seamlessly integrates with the majority of existing models, resulting in improved navigation performance on various VLN benchmarks, including R2R, R4R, RxR and CVDN.
- On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757.
- Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683.
- History aware multimodal transformer for vision-and-language navigation. Advances in Neural Information Processing Systems, 34:5834–5847.
- Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16537–16547.
- Learning disentanglement with decoupled labels for vision-language navigation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 309–329. Springer.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
- On uni-modal feature learning in supervised multi-modal learning. arXiv preprint arXiv:2305.01233.
- Speaker-follower models for vision-and-language navigation. Advances in Neural Information Processing Systems, 31.
- Vision-and-language navigation: A survey of tasks, methods, and future directions. arXiv preprint arXiv:2203.12667.
- Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13137–13146.
- Language and visual entity relationship graph for agent navigation. Advances in Neural Information Processing Systems, 33:7685–7696.
- Sub-instruction aware vision-and-language navigation. arXiv preprint arXiv:2004.02707.
- VLN BERT: A recurrent vision-and-language BERT for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1643–1653.
- General evaluation for instruction conditioned navigation using dynamic time warping. arXiv preprint arXiv:1907.05446.
- Stay on the path: Instruction fidelity in vision-and-language navigation. arXiv preprint arXiv:1905.12255.
- Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. arXiv preprint arXiv:2010.07954.
- Envedit: Environment editing for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15407–15417.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.
- MVPTR: Multi-level semantic alignment for vision-language pre-training via multi-stage learning. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4395–4405.
- Contrastive instruction-trajectory learning for vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1592–1600.
- Multimodal transformer with variable-length memory for vision-and-language navigation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 380–397. Springer.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32.
- Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035.
- The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6732–6740.
- X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 638–647.
- Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937. PMLR.
- Object-and-action aware model for visual language navigation. In European Conference on Computer Vision, pages 303–317. Springer.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
- VISITRON: Visual semantics-aligned interactively trained object-navigator. arXiv preprint arXiv:2105.11589.
- Learning to navigate unseen environments: Back translation with environmental dropout. arXiv preprint arXiv:1904.04195.
- LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
- Vision-and-dialog navigation. In Conference on Robot Learning, pages 394–406. PMLR.
- Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8455–8464.
- Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6629–6638.
- Environment-agnostic multitask learning for natural language grounded navigation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 413–430. Springer.
- Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.
- TACO: Token-aware cascade contrastive learning for video-text alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11562–11572.
- Behavioral analysis of vision-and-language navigation agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2574–2582.
- FILIP: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783.
- Curriculum learning for vision-and-language navigation. In Advances in Neural Information Processing Systems.
- LOViS: Learning orientation and visual signals for vision and language navigation. arXiv preprint arXiv:2209.12723.
- Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10012–10022.
Authors: Mengfei Du, Binhao Wu, Jiwen Zhang, Zhihao Fan, Zejun Li, Ruipu Luo, Xuanjing Huang, Zhongyu Wei