Do Visual-Language Maps Capture Latent Semantics? (2403.10117v1)
Abstract: Visual-language models (VLMs) have recently been introduced in robotic mapping by using the latent representations, i.e., embeddings, of the VLMs to represent natural language semantics in the map. The main benefit is moving beyond a small set of human-created labels toward open-vocabulary scene understanding. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, a rigorous analysis of the quality of maps built from these embeddings is lacking. We investigate two critical properties of map quality: queryability and consistency. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate two aspects of consistency: intra-map consistency and inter-map consistency. Intra-map consistency captures the ability of the embeddings to represent abstract semantic classes, and inter-map consistency captures the generalization properties of the representation. In this paper, we propose a way to analyze the quality of maps created using VLMs, which forms an open-source benchmark to be used when proposing new open-vocabulary map representations. We demonstrate the benchmark by evaluating the maps created by two state-of-the-art methods, VLMaps and OpenScene, with two encoders, LSeg and OpenSeg, on real-world data from the Matterport3D dataset. We find that OpenScene outperforms VLMaps with both encoders, and LSeg outperforms OpenSeg with both methods.
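In maps of this kind, querying typically reduces to nearest-neighbour retrieval in the shared vision-language embedding space: each map cell (or 3D point) stores a VLM embedding, each natural-language query is embedded with the matching text encoder, and the cell is assigned the query with the highest cosine similarity. The sketch below illustrates only that retrieval step; the function name, array shapes, and random placeholder data are illustrative assumptions, not taken from the paper or from the VLMaps/OpenScene code.

```python
import numpy as np

def assign_labels(map_embeddings: np.ndarray, text_embeddings: np.ndarray) -> np.ndarray:
    """Assign each map cell the query whose text embedding is most similar.

    map_embeddings:  (N, D) array, one VLM embedding per map cell/point.
    text_embeddings: (K, D) array, one embedding per natural-language query,
                     produced by the matching text encoder (e.g., CLIP's).
    Returns an (N,) array of query indices in [0, K).
    """
    # Normalize both sides so the dot product equals cosine similarity.
    m = map_embeddings / np.linalg.norm(map_embeddings, axis=1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    similarity = m @ t.T              # (N, K) cosine similarities
    return similarity.argmax(axis=1)  # best-matching query per cell

if __name__ == "__main__":
    # Random data stands in for real map and text embeddings.
    rng = np.random.default_rng(0)
    cells = rng.normal(size=(1000, 512))   # hypothetical per-cell embeddings
    queries = rng.normal(size=(5, 512))    # hypothetical embeddings of 5 text queries
    labels = assign_labels(cells, queries)
    print(labels.shape, labels.min(), labels.max())
```

Normalizing both sides makes the dot product equal to cosine similarity, the standard scoring rule for CLIP-style embeddings; queryability and consistency metrics can then be computed on top of such per-cell assignments.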
- S. Garg et al., “Semantics for Robotic Mapping, Perception and Interaction: A Survey,” Foundations and Trends® in Robotics, vol. 8, no. 1–2, pp. 1–224, 2020, arXiv:2101.00443 [cs].
- A. Bendale and T. E. Boult, “Towards Open Set Deep Networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 1563–1572.
- C. Geng, S.-J. Huang, and S. Chen, “Recent Advances in Open Set Recognition: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3614–3631, Oct. 2021.
- A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- C. Huang, O. Mees, A. Zeng, and W. Burgard, “Audio visual language maps for robot navigation,” 2023, arXiv:2303.07522 [cs].
- M. Tenorth and M. Beetz, “KNOWROB - knowledge processing for autonomous personal robots,” in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. St. Louis, MO, USA: IEEE, Oct. 2009, pp. 4261–4266.
- C. Huang, O. Mees, A. Zeng, and W. Burgard, “Visual Language Maps for Robot Navigation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). London, United Kingdom: IEEE, May 2023, pp. 10608–10615.
- S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “CoWs on Pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, Jun. 2023, pp. 23171–23181.
- D. Shah, B. Osiński, S. Levine et al., “LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action,” in Conference on Robot Learning. Atlanta, GA, USA: PMLR, Nov. 2023, pp. 492–504.
- S. Peng et al., “OpenScene: 3D scene understanding with open vocabularies,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, Jun. 2023, pp. 815–824.
- A. Chang et al., “Matterport3D: Learning from RGB-D Data in Indoor Environments,” Sep. 2017, arXiv:1709.06158 [cs].
- B. Chen et al., “Open-vocabulary queryable scene representations for real world planning,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). London, UK: IEEE, May 2023, pp. 11509–11522.
- Y. Yuan and A. Nüchter, “Uni-Fusion: Universal continuous mapping,” IEEE Transactions on Robotics, vol. 40, pp. 1373–1392, Jan. 2024.
- B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, and R. Ranftl, “Language-driven Semantic Segmentation,” Apr. 2022, arXiv:2201.03546 [cs].
- G. Ghiasi, X. Gu, Y. Cui, and T.-Y. Lin, “Scaling open-vocabulary image segmentation with image-level labels,” in European Conference on Computer Vision. Tel Aviv, Israel: Springer, Oct. 2022, pp. 540–557.
- H. Zhao, X. Puig, B. Zhou, S. Fidler, and A. Torralba, “Open Vocabulary Scene Parsing,” in 2017 IEEE International Conference on Computer Vision (ICCV). Venice: IEEE, Oct. 2017, pp. 2021–2029.
- M. A. Bravo, S. Mittal, S. Ging, and T. Brox, “Open-vocabulary Attribute Detection,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Vancouver, BC, Canada: IEEE, Jun. 2023, pp. 7041–7050.
- A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-Shot Object Detection,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., vol. 11205. Cham: Springer International Publishing, Sep. 2018, pp. 397–414, Series Title: Lecture Notes in Computer Science.
- A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-Vocabulary Object Detection Using Captions,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA: IEEE, Jun. 2021, pp. 14388–14397.
- X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary Object Detection via Vision and Language Knowledge Distillation,” May 2022, arXiv:2104.13921 [cs].
- C. Feng et al., “PromptDet: Towards open-vocabulary detection using uncurated images,” in European Conference on Computer Vision. Tel Aviv, Israel: Springer, Oct. 2022, pp. 701–717.
- A. I. Wagan, A. Godil, and X. Li, “Map quality assessment,” in Proceedings of the 8th Workshop on Performance Metrics for Intelligent Systems. Gaithersburg Maryland: ACM, Aug. 2008, pp. 278–282.
- T. P. Kucner, M. Luperto, S. Lowry, M. Magnusson, and A. J. Lilienthal, “Robust Frequency-Based Structure Extraction,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). Xi’an, China: IEEE, May 2021, pp. 1715–1721.
- S. Aravecchia, M. Clausel, and C. Pradalier, “Comparing metrics for evaluating 3D map quality in natural environments,” Robotics and Autonomous Systems, vol. 173, p. 104617, Mar. 2024.
- R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, Jun. 2013, pp. 1352–1359.
- J. Mccormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volumetric Object-Level SLAM,” in 2018 International Conference on 3D Vision (3DV). Verona: IEEE, Sep. 2018, pp. 32–41.
- M. Runz, M. Buffier, and L. Agapito, “MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects,” in 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). Munich, Germany: IEEE, Oct. 2018, pp. 10–20.
- N. Sunderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful maps with object-oriented semantic mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vancouver, BC: IEEE, Sep. 2017, pp. 5079–5085.
- W. Chen, S. Hu, R. Talak, and L. Carlone, “Leveraging Large (Visual) Language Models for Robot 3D Scene Understanding,” Nov. 2023, arXiv:2209.05629 [cs].
- G. Narita, T. Seno, T. Ishikawa, and Y. Kaji, “PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Macau, China: IEEE, Nov. 2019, pp. 4205–4212.
- R. Adams and L. Bischof, “Seeded region growing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 641–647, Jun. 1994.
- C. Huang, O. Mees, A. Zeng, and W. Burgard, “VLMaps.” [Online]. Available: https://github.com/vlmaps/vlmaps
- M. Savva et al., “Habitat: A platform for embodied AI research,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, Oct. 2019, pp. 9338–9346.
- A. Szot et al., “Habitat 2.0: Training home assistants to rearrange their habitat,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 251–266.
- X. Puig et al., “Habitat 3.0: A co-habitat for humans, avatars and robots,” 2023, arXiv:2310.13724 [cs].
- W. H. Kruskal and W. A. Wallis, “Use of ranks in one-criterion variance analysis,” Journal of the American Statistical Association, vol. 47, no. 260, pp. 583–621, Apr. 1952.
Authors: Matti Pekkanen, Tsvetomila Mihaylova, Francesco Verdoja, Ville Kyrki