End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon (2309.16634v1)
Abstract: Most recent work in goal-oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations that generalize to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is given not as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy that requires solving an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception: extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion, is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.