RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation (2405.05792v1)

Published 9 May 2024 in cs.RO, cs.AI, cs.CV, cs.HC, and cs.LG

Abstract: Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric, which require precise geometry-based optimization, to purely topological, where image-as-node based graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This unveils a "continuous sense of a place", defined by inter-image persistence of segments along with their intra-image neighbours. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials on segment-level "hopping" based zero-shot real-world navigation. Project page with supplementary details: oravus.github.io/RoboHop/
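To make the graph-construction idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of a segment-based topological map: nodes hold a segment descriptor and a pixel centroid, intra-image edges link segments with nearby centroids, inter-image edges link best-matching descriptors across consecutive frames, descriptors are refreshed by simple mean neighborhood aggregation (a crude stand-in for the graph convolution layers mentioned above), and a navigation plan is a shortest path of segment "hops". All function names, thresholds, and the toy data are illustrative assumptions.

```python
import numpy as np
import networkx as nx

def build_segment_graph(frames, match_thresh=0.8, k_neighbors=2):
    """frames: list (one entry per image) of segment dicts {'desc': ndarray, 'centroid': ndarray}."""
    G = nx.Graph()
    # One node per segment, keyed by (frame index, segment index).
    for t, segs in enumerate(frames):
        for i, seg in enumerate(segs):
            G.add_node((t, i), desc=seg["desc"], centroid=seg["centroid"])

    # a) Intra-image edges: connect each segment to its nearest centroids.
    for t, segs in enumerate(frames):
        cents = np.array([s["centroid"] for s in segs], dtype=float)
        for i in range(len(segs)):
            dists = np.linalg.norm(cents - cents[i], axis=1)
            for j in np.argsort(dists)[1:k_neighbors + 1]:  # index 0 is the segment itself
                G.add_edge((t, i), (t, int(j)), kind="intra")

    # b) Inter-image edges: best cosine match of descriptors across consecutive frames.
    for t in range(len(frames) - 1):
        A = np.array([s["desc"] for s in frames[t]], dtype=float)
        B = np.array([s["desc"] for s in frames[t + 1]], dtype=float)
        A /= np.linalg.norm(A, axis=1, keepdims=True)
        B /= np.linalg.norm(B, axis=1, keepdims=True)
        sim = A @ B.T
        for i in range(sim.shape[0]):
            j = int(np.argmax(sim[i]))
            if sim[i, j] > match_thresh:
                G.add_edge((t, i), (t + 1, j), kind="inter")
    return G

def aggregate_descriptors(G):
    """One round of mean neighborhood aggregation (a rough stand-in for a GCN layer)."""
    updated = {
        n: np.mean([G.nodes[m]["desc"] for m in G.neighbors(n)] + [G.nodes[n]["desc"]], axis=0)
        for n in G.nodes
    }
    nx.set_node_attributes(G, updated, "desc")

# Toy data: 3 frames with 3 random segments each; plan a path of segment "hops"
# from the first segment of frame 0 to the first segment of frame 2.
rng = np.random.default_rng(0)
frames = [[{"desc": rng.normal(size=16), "centroid": rng.uniform(0, 100, size=2)}
           for _ in range(3)] for _ in range(3)]
G = build_segment_graph(frames, match_thresh=-1.0)  # permissive threshold for random toy data
aggregate_descriptors(G)
if nx.has_path(G, (0, 0), (2, 0)):
    print("segment hops:", nx.shortest_path(G, (0, 0), (2, 0)))
```

In a real pipeline the segments would come from an image segmenter and the descriptors from a vision(-language) model; the random vectors above exist only to keep the sketch self-contained and runnable.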
