Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies (2312.11713v2)

Published 18 Dec 2023 in cs.RO and cs.AI

Abstract: This paper proposes an approach to build 3D scene graphs in arbitrary indoor and outdoor environments. Such extension is challenging; the hierarchy of concepts that describe an outdoor environment is more complex than for indoors, and manually defining such hierarchy is time-consuming and does not scale. Furthermore, the lack of training data prevents the straightforward application of learning-based tools used in indoor settings. To address these challenges, we propose two novel extensions. First, we develop methods to build a spatial ontology defining concepts and relations relevant for indoor and outdoor robot operation. In particular, we use a LLM to build such an ontology, thus largely reducing the amount of manual effort required. Second, we leverage the spatial ontology for 3D scene graph construction using Logic Tensor Networks (LTN) to add logical rules, or axioms (e.g., "a beach contains sand"), which provide additional supervisory signals at training time thus reducing the need for labelled data, providing better predictions, and even allowing predicting concepts unseen at training time. We test our approach in a variety of datasets, including indoor, rural, and coastal environments, and show that it leads to a significant increase in the quality of the 3D scene graph generation with sparsely annotated data.


Summary

  • The paper presents a novel integration of language-enabled spatial ontologies and Logic Tensor Networks (LTNs), boosting indoor accuracy from 12.3% to 25.1% and outdoor accuracy from 29.0% to 37.2% with as little as 0.1% labeled data.
  • It leverages LLMs to automatically generate hierarchical spatial rules, streamlining the transition from indoor to complex outdoor 3D scene graph generation.
  • The approach offers actionable insights for robotics, enhancing scene understanding to improve navigation and path planning in diverse environments.

Essay on "Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies"

The paper "Indoor and Outdoor 3D Scene Graph Generation via Language-Enabled Spatial Ontologies" explores a methodical approach to creating 3D scene graphs applicable in both indoor and outdoor environments. This work addresses the complexities involved in expanding 3D scene graph generation from predominantly indoor settings to arbitrary environments including outdoor scenes. The researchers introduce two pivotal solutions: the use of language-enabled spatial ontologies and the utilization of Logic Tensor Networks (LTNs) to achieve this expansion.

Context and Methodology

3D scene graphs offer a hierarchical representation of environments, providing a structural understanding that connects spatial concepts through a graph-based model. Current methodologies excel indoors, where concept hierarchies are well established; extending them to outdoor environments is nontrivial due to the increased complexity and diversity of outdoor spatial hierarchies. The lack of annotated training data for outdoor scenes further exacerbates the challenge.
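
To make the layered structure concrete, the following is a minimal sketch, not taken from the paper's implementation, of how such a hierarchy might be represented in code; the class name, fields, and layer labels are illustrative assumptions.

```python
# A toy layered 3D scene graph: objects link to places, places to regions.
# Class and field names are illustrative, not the paper's data structures.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    layer: str            # e.g. "object", "place", "region"
    label: str            # semantic concept, e.g. "sand", "shore", "beach"
    position: tuple       # (x, y, z) centroid in the map frame
    children: list = field(default_factory=list)

# A small outdoor hierarchy: a region node groups the places it contains,
# and each place groups the objects observed there.
sand = Node(1, "object", "sand", (4.0, 2.0, 0.0))
shore_place = Node(2, "place", "shore", (4.5, 2.5, 0.0), children=[sand])
beach_region = Node(3, "region", "beach", (5.0, 3.0, 0.0), children=[shore_place])
```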

To mitigate these issues, the researchers leverage LLMs to automatically generate spatial ontologies, reducing the manual effort traditionally required. These spatial ontologies facilitate the hierarchical categorization of spatial concepts relevant for both indoor and outdoor scenes. Additionally, LTNs are employed to incorporate logical rules, ensuring that the predictions align with common-sense spatial hierarchies. This integration allows the system to function effectively with minimal labeled data and to generalize beyond the data it was initially trained on.
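
To illustrate the mechanism, below is a minimal, hypothetical sketch of how an axiom such as "a beach contains sand" can become a differentiable training signal in an LTN-style framework; the predicate networks, feature dimension, and choice of fuzzy operators (Reichenbach implication, mean aggregation) are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: turning a logical axiom into a loss term (assumed setup, PyTorch).
import torch
import torch.nn as nn

class Predicate(nn.Module):
    """Maps node features to a fuzzy truth value in [0, 1]."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def implies(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Reichenbach fuzzy implication: 1 - a + a * b
    return 1.0 - a + a * b

def forall(truth: torch.Tensor) -> torch.Tensor:
    # Smooth universal quantifier: mean aggregation over all groundings
    return truth.mean()

# Hypothetical predicates over place-node embeddings (feature size assumed)
is_beach = Predicate(in_dim=64)
contains_sand = Predicate(in_dim=64)

def axiom_loss(place_features: torch.Tensor) -> torch.Tensor:
    """Penalise violations of 'forall p: Beach(p) -> ContainsSand(p)'."""
    sat = forall(implies(is_beach(place_features), contains_sand(place_features)))
    return 1.0 - sat  # maximise satisfaction by minimising its complement
```

In this style of training, the axiom loss is added to the ordinary supervised loss, so even unlabeled nodes receive a gradient from the ontology's rules, which is how such rules can substitute for annotations.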

Key Results

The paper reports substantial improvements in 3D scene graph generation using the proposed methodology. Experiments conducted across varied setups, including indoor (e.g., the MP3D dataset) and outdoor environments (e.g., rural and coastal areas), demonstrated increased accuracy in scene comprehension. Notably, incorporating LTNs improved accuracy from 12.3% to 25.1% on indoor scenes and from 29.0% to 37.2% on outdoor scenes with only 0.1% of the training data labeled. These results underscore the effectiveness of spatial ontologies and neuro-symbolic models in compensating for sparse training data.

Implications and Future Directions

The introduction of language-enabled spatial ontologies and LTNs offers a robust pathway for 3D scene graph generation across diverse environments, emphasizing the potential for more generalized and scalable AI systems in robotics and beyond. The implications are significant; improved scene understanding aids in tasks like robotic navigation and path planning, enabling machines to interpret real-world environments more intuitively and accurately.

Looking forward, this paper sets the stage for further investigation into richer high-level scene graph layers beyond the current object and place layers. Future work could also integrate relation types beyond inclusion into the ontology, or explore dynamic scene adaptation using real-time data.

Conclusion

This research presents an innovative stride in spatial perception for robotics, adeptly addressing the gap in outdoor 3D scene graph construction. By intertwining LLM-generated ontologies with LTNs, the approach exemplifies a sophisticated blend of symbolic and statistical AI, marking a step forward in comprehensive and adaptable scene understanding methodologies. This foundational work is not only a technical achievement but also expands the horizons for practical deployment in multifaceted environments, paving the way for future innovations in AI-driven spatial understanding.
