QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding (2404.06442v2)
Abstract: Robotic tasks such as planning and navigation require a hierarchical semantic understanding of a scene, which may span multiple floors and rooms. Current methods primarily focus on object segmentation for 3D scene understanding. However, such methods struggle to segment out topological regions like a "kitchen" in the scene. In this work, we introduce a two-step pipeline to solve this problem. First, we extract a topological map, i.e., the floorplan of the indoor scene, using a novel multi-channel occupancy representation. Then, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains, using a self-attention transformer. Our language-topology alignment supports natural language querying, e.g., a query for a "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and on room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding. Project Page: quest-maps.github.io
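The querying step described above can be illustrated with a minimal sketch. It assumes room instances already carry CLIP-aligned embeddings (in the paper these come from the self-attention transformer over each room's objects); here the room names and random placeholder features are purely hypothetical stand-ins, and the matching rule (cosine similarity against a CLIP text embedding) is the standard technique for CLIP-aligned features, not the authors' released code.

```python
# Minimal sketch of natural-language room querying over CLIP-aligned
# room embeddings. Placeholder data; not the authors' implementation.
# Requires: pip install torch; pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical room instances. In the paper, these embeddings would be
# produced by the transformer from the objects each room contains;
# random vectors stand in for them here.
room_names = ["kitchen", "bedroom", "bathroom", "living room"]
room_feats = torch.randn(len(room_names), 512, device=device)
room_feats = room_feats / room_feats.norm(dim=-1, keepdim=True)

def query_room(text_query: str) -> str:
    """Return the room whose embedding best matches the text query."""
    tokens = clip.tokenize([text_query]).to(device)
    with torch.no_grad():
        q = model.encode_text(tokens).float()
    q = q / q.norm(dim=-1, keepdim=True)
    sims = room_feats @ q.T  # cosine similarities, shape (num_rooms, 1)
    return room_names[sims.argmax().item()]

print(query_room("a place to cook"))  # with real embeddings: "kitchen"
```

With genuinely CLIP-aligned room features, this nearest-neighbor lookup is what lets an open-ended query such as "a place to cook" resolve to the "kitchen" instance without that phrase ever appearing as a label.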