VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation (2402.03561v2)
Abstract: Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple cities in the U.S., augmented with automatically generated navigation instructions and actions, to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques: template infilling generates grounded navigation instructions, and an image-rotation-similarity-based navigation action predictor derives actions, together yielding VLN-style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, to learn temporally aware and visually aligned instruction representations. The learned instruction representation is adapted to a state-of-the-art navigator when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.
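To make the data-generation pipeline concrete, here is a minimal sketch of template infilling for instruction generation. The template strings, slot names, and sample inputs below are hypothetical illustrations, not the paper's actual templates; in practice the landmark would come from an object detector run on the video frames.

```python
import random

# Hypothetical instruction templates with {landmark} and {direction} slots;
# the paper derives its templates from real Touchdown instructions.
TEMPLATES = [
    "Go straight until you see {landmark}, then turn {direction}.",
    "Turn {direction} at {landmark} and continue forward.",
    "Walk past {landmark} and stop at the next intersection.",
]

def infill_instruction(landmark: str, direction: str) -> str:
    """Fill a randomly chosen template with a detected landmark and an
    action inferred from the video (see the action-predictor sketch below)."""
    template = random.choice(TEMPLATES)
    return template.format(landmark=landmark, direction=direction)

# Example usage with an illustrative detection and predicted action.
print(infill_instruction("a red fire hydrant", "left"))
# -> e.g. "Turn left at a red fire hydrant and continue forward."
```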
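The action predictor rests on a simple geometric intuition: if rotating the current frame makes it look most like the next frame, the camera (and hence the agent) most likely turned in that direction. Below is a minimal sketch of that idea, assuming same-size frames, a fixed horizontal pixel shift as a stand-in for rotation, and a plain negative-MSE similarity; the paper's actual image representation, shift size, and similarity measure may differ.

```python
import numpy as np

def predict_action(frame_t: np.ndarray, frame_t1: np.ndarray,
                   shift: int = 60) -> str:
    """Infer a discrete navigation action between two consecutive frames by
    horizontally shifting frame_t (approximating a camera rotation) and
    checking which candidate view best matches frame_t1.

    Frames are assumed to be HxWxC arrays of identical shape.
    """
    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Negative mean squared error: higher means more similar.
        diff = a.astype(np.float32) - b.astype(np.float32)
        return -float(np.mean(diff ** 2))

    candidates = {
        "left": np.roll(frame_t, shift, axis=1),     # view rotated left
        "forward": frame_t,                          # no rotation
        "right": np.roll(frame_t, -shift, axis=1),   # view rotated right
    }
    scores = {action: similarity(img, frame_t1)
              for action, img in candidates.items()}
    return max(scores, key=scores.get)
```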
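The three proxy tasks are standard multimodal pretraining objectives, each reducible to a cross-entropy loss over model outputs. The sketch below shows one plausible way the per-task losses could be combined; the uniform weighting and the -100 ignore-index convention for unmasked tokens are assumptions following common MLM practice, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits: torch.Tensor, mlm_labels: torch.Tensor,
                     match_logits: torch.Tensor, match_labels: torch.Tensor,
                     action_logits: torch.Tensor, action_labels: torch.Tensor,
                     weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Combine the three proxy-task losses.

    - Masked Language Modeling: predict masked instruction tokens;
      positions labeled -100 are ignored.
    - Instruction and Trajectory Matching: classify whether an
      instruction matches a trajectory.
    - Next Action Prediction: classify the next navigation action.
    """
    mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                          mlm_labels.reshape(-1), ignore_index=-100)
    match = F.cross_entropy(match_logits, match_labels)
    action = F.cross_entropy(action_logits, action_labels)
    w_mlm, w_match, w_action = weights
    return w_mlm * mlm + w_match * match + w_action * action
```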
Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal