
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon (2309.16634v1)

Published 28 Sep 2023 in cs.CV

Abstract: Most recent work in goal-oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations that generalize to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy that requires solving an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottlenecks in perception: extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion, is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and state-of-the-art (SOTA) performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.

