DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation (2311.17812v4)
Abstract: Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. Owing to their strong representation capabilities, pretrained vision-and-language models are widely used in VLN. However, most of them are trained on web-crawled, general-purpose datasets, which incurs a considerable domain gap when they are applied to VLN tasks. To address this problem, we propose a novel and model-agnostic domain-aware prompt learning (DAP) framework. To equip pretrained models with the object-level and scene-level cross-modal alignment required by VLN tasks, DAP adopts a low-cost prompt-tuning paradigm that learns soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. We then introduce soft visual prompts into the input space of the visual encoder of a pretrained model, so that DAP injects in-domain visual knowledge into the visual encoder efficiently. Experimental results on both R2R and REVERIE demonstrate the superiority of DAP over existing state-of-the-art methods.
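To make the prompt-tuning idea concrete, the following is a minimal PyTorch sketch of soft visual prompts in the input space of a frozen visual encoder, in the spirit of the description above. The class and parameter names (`PromptedVisualEncoder`, `num_prompts`) and the assumption of a ViT-style backbone that consumes a token sequence are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PromptedVisualEncoder(nn.Module):
    """Prepends learnable soft visual prompts to the patch tokens of a frozen encoder.

    Illustrative sketch only: `backbone` is assumed to be the transformer-block stack
    of a ViT-style visual encoder that accepts a (batch, tokens, embed_dim) sequence.
    """

    def __init__(self, backbone: nn.Module, embed_dim: int, num_prompts: int = 10):
        super().__init__()
        self.backbone = backbone
        # Freeze the pretrained weights; only the prompts will be trained.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Soft visual prompts: learnable vectors living in the encoder's input space.
        self.prompts = nn.Parameter(torch.empty(num_prompts, embed_dim))
        nn.init.normal_(self.prompts, std=0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the patch-embedding layer.
        batch = patch_tokens.size(0)
        prompt_tokens = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Concatenate prompts before the image patches and run the frozen encoder.
        tokens = torch.cat([prompt_tokens, patch_tokens], dim=1)
        return self.backbone(tokens)
```

Under this setup only the prompts receive gradients, e.g. `torch.optim.AdamW([encoder.prompts], lr=1e-3)`, which is what makes the tuning low-cost: the pretrained weights stay fixed, and the learned prompts carry the in-domain knowledge distilled from the CLIP-generated image-text pairs.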