Volumetric Environment Representation for Vision-Language Navigation (2403.14158v1)
Abstract: Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. The pivotal factor for successful navigation is comprehensive scene understanding. Previous VLN agents employ monocular frameworks that extract 2D features directly from perspective views. Though straightforward, these struggle to capture 3D geometry and semantics, yielding a partial and incomplete environment representation. To achieve a comprehensive 3D representation with fine-grained details, we introduce the Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. For each cell, VER aggregates multi-view 2D features into this unified 3D space via 2D-3D sampling. Through coarse-to-fine feature extraction and multi-task learning on VER, our agent jointly predicts 3D occupancy, 3D room layout, and 3D bounding boxes. Based on VERs collected online, the agent performs volume state estimation and builds episodic memory to predict the next step. Experiments show that the environment representations learned via multi-task learning yield clear performance gains on VLN, and our model achieves state-of-the-art results across VLN benchmarks (R2R, REVERIE, and R4R).
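The central operation in the abstract is the 2D-3D sampling that lifts multi-view image features into a voxel grid. Below is a minimal sketch of one way such lifting can work: project each voxel center into every camera view, bilinearly sample the 2D feature maps, and average over the views that see the cell. The function name `aggregate_voxel_features`, the pinhole-projection setup, and the simple view-averaging are illustrative assumptions, not the authors' exact method.

```python
import torch
import torch.nn.functional as F

def aggregate_voxel_features(feat_maps, intrinsics, extrinsics, grid_centers):
    """Sketch of 2D-3D sampling: lift multi-view 2D features to voxels.

    feat_maps:    (V, C, H, W) per-view 2D feature maps
    intrinsics:   (V, 3, 3) camera intrinsic matrices
    extrinsics:   (V, 4, 4) world-to-camera transforms
    grid_centers: (N, 3) voxel-center coordinates in world space
    returns:      (N, C) aggregated per-voxel features
    """
    V, C, H, W = feat_maps.shape
    N = grid_centers.shape[0]
    homo = torch.cat([grid_centers, grid_centers.new_ones(N, 1)], dim=-1)  # (N, 4)
    feats = grid_centers.new_zeros(N, C)
    hits = grid_centers.new_zeros(N, 1)
    for v in range(V):
        cam = (extrinsics[v] @ homo.T).T[:, :3]           # world -> camera coords, (N, 3)
        in_front = cam[:, 2] > 1e-3                       # keep points ahead of the camera
        pix = (intrinsics[v] @ cam.T).T                   # project to the image plane
        pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-3)    # perspective divide -> pixels
        # normalize pixel coordinates to [-1, 1] for grid_sample
        u = 2 * pix[:, 0] / (W - 1) - 1
        w = 2 * pix[:, 1] / (H - 1) - 1
        grid = torch.stack([u, w], dim=-1).view(1, N, 1, 2)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, N, 1)
        sampled = sampled.squeeze(0).squeeze(-1).T        # (N, C)
        valid = (in_front & (u.abs() <= 1) & (w.abs() <= 1)).float().unsqueeze(-1)
        feats = feats + sampled * valid
        hits = hits + valid
    return feats / hits.clamp(min=1.0)                    # average over views that see each cell
```

Under the coarse-to-fine scheme the abstract describes, a sampler like this would run at several grid resolutions, with the resulting voxel features feeding the 3D occupancy, room-layout, and bounding-box heads; the paper's actual sampling and aggregation details may differ from this sketch.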
Authors: Rui Liu, Wenguan Wang, Yi Yang