Online Embedding Multi-Scale CLIP Features into 3D Maps (2403.18178v1)

Published 27 Mar 2024 in cs.RO and cs.CV

Abstract: This study introduces a novel approach to online embedding of multi-scale CLIP (Contrastive Language-Image Pre-Training) features into 3D maps. By harnessing CLIP, this methodology surpasses the constraints of conventional vocabulary-limited methods and enables the incorporation of semantic information into the resultant maps. While recent approaches have explored the embedding of multi-modal features in maps, they often impose significant computational costs, lacking practicality for exploring unfamiliar environments in real time. Our approach tackles these challenges by efficiently computing and embedding multi-scale CLIP features, thereby facilitating the exploration of unfamiliar environments through real-time map generation. Moreover, embedding the CLIP features into the resultant maps makes offline retrieval via linguistic queries feasible. In essence, our approach simultaneously achieves real-time object search and mapping of unfamiliar environments. Additionally, we propose a zero-shot object-goal navigation system based on our mapping approach, and we validate its efficacy through object-goal navigation, offline object retrieval, and multi-object-goal navigation in both simulated environments and real robot experiments. The findings demonstrate that our method not only exhibits swifter performance than state-of-the-art mapping methods but also surpasses them in terms of the success rate of object-goal navigation tasks.
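
The abstract describes the pipeline only at a high level: CLIP features are computed at multiple scales per frame, fused online into a 3D map, and later retrieved with free-form language queries. The sketch below is not the paper's implementation; it is a minimal illustration of that general idea under stated assumptions: the OpenAI `clip` package, per-pixel 3D points supplied by an external RGB-D SLAM front end, multi-scale crops provided by the caller, a simple running-average voxel fusion, and illustrative names such as `integrate_frame` and `query`.

```python
# Minimal sketch (not the authors' implementation): fuse CLIP features from
# multi-scale image crops into a sparse voxel map, then retrieve voxels by a
# language query. Assumes the OpenAI CLIP package (github.com/openai/CLIP)
# and that each RGB frame comes with per-pixel 3D points in the map frame.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

VOXEL = 0.25  # voxel size in metres (illustrative value)
voxel_feats: dict[tuple, torch.Tensor] = {}  # voxel index -> mean CLIP feature
voxel_count: dict[tuple, int] = {}

def encode_crops(rgb: Image.Image, boxes):
    """Encode image crops (multi-scale boxes, (left, upper, right, lower)) with CLIP."""
    batch = torch.stack([preprocess(rgb.crop(b)) for b in boxes]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1)

def integrate_frame(rgb: Image.Image, points_xyz: np.ndarray, boxes, box_centers_uv):
    """Fuse each crop's feature into the voxel under the crop centre (running average)."""
    feats = encode_crops(rgb, boxes)
    for f, (u, v) in zip(feats, box_centers_uv):
        p = points_xyz[v, u]                           # (H, W, 3) array, map frame
        key = tuple(np.floor(p / VOXEL).astype(int))   # voxel index
        n = voxel_count.get(key, 0)
        prev = voxel_feats.get(key, torch.zeros_like(f))
        voxel_feats[key] = (prev * n + f) / (n + 1)
        voxel_count[key] = n + 1

def query(text: str, top_k: int = 5):
    """Offline retrieval: rank voxels by cosine similarity to a text query."""
    if not voxel_feats:
        return []
    with torch.no_grad():
        t = model.encode_text(clip.tokenize([text]).to(device))
    t = torch.nn.functional.normalize(t, dim=-1)[0]
    keys = list(voxel_feats)
    sims = torch.stack([voxel_feats[k] @ t for k in keys])
    best = sims.topk(min(top_k, len(keys))).indices.tolist()
    return [(keys[i], float(sims[i])) for i in best]
```

Ranking voxels by cosine similarity against a CLIP text embedding is the standard open-vocabulary retrieval recipe; the paper's actual multi-scale crop selection, fusion scheme, and navigation stack should be taken from the paper itself.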

