
Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs (2101.06894v3)

Published 18 Jan 2021 in cs.RO and cs.CV

Abstract: Humans are able to form a complex mental model of the environment they move in. This mental model captures geometric and semantic aspects of the scene, describes the environment at multiple levels of abstractions (e.g., objects, rooms, buildings), includes static and dynamic entities and their relations (e.g., a person is in a room at a given time). In contrast, current robots' internal representations still provide a partial and fragmented understanding of the environment, either in the form of a sparse or dense set of geometric primitives (e.g., points, lines, planes, voxels) or as a collection of objects. This paper attempts to reduce the gap between robot and human perception by introducing a novel representation, a 3D Dynamic Scene Graph (DSG), that seamlessly captures metric and semantic aspects of a dynamic environment. A DSG is a layered graph where nodes represent spatial concepts at different levels of abstraction, and edges represent spatio-temporal relations among nodes. Our second contribution is Kimera, the first fully automatic method to build a DSG from visual-inertial data. Kimera includes state-of-the-art techniques for visual-inertial SLAM, metric-semantic 3D reconstruction, object localization, human pose and shape estimation, and scene parsing. Our third contribution is a comprehensive evaluation of Kimera in real-life datasets and photo-realistic simulations, including a newly released dataset, uHumans2, which simulates a collection of crowded indoor and outdoor scenes. Our evaluation shows that Kimera achieves state-of-the-art performance in visual-inertial SLAM, estimates an accurate 3D metric-semantic mesh model in real-time, and builds a DSG of a complex indoor environment with tens of objects and humans in minutes. Our final contribution shows how to use a DSG for real-time hierarchical semantic path-planning. The core modules in Kimera are open-source.

Authors (8)
  1. Antoni Rosinol (10 papers)
  2. Andrew Violette (1 paper)
  3. Marcus Abate (7 papers)
  4. Nathan Hughes (13 papers)
  5. Yun Chang (43 papers)
  6. Jingnan Shi (15 papers)
  7. Arjun Gupta (24 papers)
  8. Luca Carlone (109 papers)
Citations (200)

Summary

  • The paper's main contribution is the development of Kimera, which unifies SLAM, visual-inertial odometry, and semantic mapping in a hierarchical 3D dynamic scene graph.
  • Kimera-Core generates globally consistent 3D metric-semantic meshes while Kimera-DSG constructs detailed scene graphs by segmenting and tracking both static and dynamic entities.
  • Evaluations on EuRoC and uHumans datasets demonstrate that Kimera efficiently processes dynamic scenes, enabling scalable and actionable spatial perception for robotics.

An Overview of "Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs"

The paper "Kimera: from SLAM to Spatial Perception with 3D Dynamic Scene Graphs," authored by Rosinol et al., presents an integrated perception framework that aims to advance robotic perception by bridging the gap between human and robot environmental understanding. Robotics has long struggled to build systems that cope with the complexity of dynamic, real-world scenes. The proposed solution, Kimera, produces a multi-layer representation of space, termed a 3D Dynamic Scene Graph (DSG), designed to capture both the metric and semantic dimensions of a scene, including static and dynamic entities, in real time.

Key Contributions and Architecture

Kimera's primary contribution is the development and deployment of a DSG as a versatile spatial representation, where nodes represent spatial concepts and edges capture spatio-temporal relationships. The DSG is structured into hierarchical layers, from raw metric-semantic mesh data to higher abstraction levels involving constructs such as rooms, objects, and agents. The depth of detail allows for the integration of semantic understanding at multiple granularities, thereby providing actionable insights for robotic navigation and interaction tasks.
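To make the layered structure concrete, here is a minimal sketch of such a scene graph in Python. The class names, layer names, and toy labels are illustrative assumptions, not Kimera's actual (C++) data structures; the point is the pattern the paper describes: nodes tagged with an abstraction layer, and edges carrying spatio-temporal relations.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A spatial concept at some abstraction layer (e.g. object, place, room)."""
    node_id: str
    layer: str              # e.g. "mesh", "objects", "places", "rooms", "agents"
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    """A layered graph: nodes per layer, edges encode spatio-temporal relations."""
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, dst_id, relation)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, dst, relation):
        self.edges.append((src, dst, relation))

    def layer_nodes(self, layer):
        """All nodes at one abstraction level, e.g. every room in the graph."""
        return [n for n in self.nodes.values() if n.layer == layer]

# Toy graph: a room containing a static object and a (dynamic) human agent.
g = SceneGraph()
g.add_node(Node("room_0", "rooms", {"label": "kitchen"}))
g.add_node(Node("obj_3", "objects", {"label": "chair"}))
g.add_node(Node("human_1", "agents", {"last_seen_t": 12.4}))
g.add_edge("room_0", "obj_3", "contains")
g.add_edge("room_0", "human_1", "contains_at_t")
```

Querying by layer is what makes the representation actionable: a planner can reason over a handful of room nodes instead of millions of mesh vertices.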

The architecture of Kimera is divided into two main modules: Kimera-Core and Kimera-DSG. Kimera-Core is responsible for generating a globally consistent 3D metric-semantic mesh using visual-inertial data. It includes real-time visual-inertial odometry, 3D mesh construction, and semantic annotation techniques that collectively enable precise environment mapping even in dynamic settings. Conversely, Kimera-DSG leverages this comprehensive mesh to build detailed scene graphs by identifying and tracking dynamic entities, segmenting objects, and parsing spatial elements into actionable constructs such as places and rooms.

Evaluation and Performance

The paper offers a comprehensive evaluation across various benchmarks, including the EuRoC datasets for assessing VIO accuracy and the simulated uHumans datasets for testing dynamic scene handling. Kimera demonstrates competitive performance against state-of-the-art VIO systems and showcases robust SLAM capabilities even in crowded environments. Furthermore, dynamic masking, combined with the 3D scene graph representation, allows moving elements such as humans to be detected, tracked, and kept out of the static map rather than corrupting it.
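The dynamic-masking idea can be sketched very simply: before depth measurements are integrated into the static mesh, pixels belonging to dynamic semantic classes (e.g. "person") are invalidated. The label id and the zero-depth convention below are assumptions for illustration, not Kimera's actual configuration.

```python
import numpy as np

DYNAMIC_CLASSES = {12}  # hypothetical semantic label id for "person"

def mask_dynamic(depth, semantic_labels):
    """Invalidate depth at pixels whose semantic class is dynamic, so the
    static mesh integration never absorbs moving humans (0.0 = invalid)."""
    masked = depth.copy()
    masked[np.isin(semantic_labels, list(DYNAMIC_CLASSES))] = 0.0
    return masked

# Toy 2x2 frame: two pixels belong to a person (label 12), two to a wall (5).
depth = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
labels = np.array([[12, 5],
                   [5, 12]])
print(mask_dynamic(depth, labels))  # person pixels zeroed, wall pixels kept
```

The masked-out humans are not discarded: as the summary notes, they are tracked separately and re-enter the representation as agent nodes in the scene graph.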

Implications and Future Directions

The research presents significant implications for both theoretical exploration and practical applications. The hierarchical and semantically rich DSGs facilitate scalable robotic decision-making, enabling robots to execute high-level navigational queries like "navigate to the kitchen" or "find a person in the hallway." This work underscores the feasibility of more sophisticated human-robot interaction paradigms by grounding language in rich, spatial-semantic representations. Furthermore, it highlights potential advancements in long-term autonomy, allowing for compact, memory-efficient storage and retrieval of salient environment details.
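A query like "navigate to the kitchen" can be served coarse-to-fine: plan first over the small rooms layer, then refine each room-to-room step over the denser places layer. The sketch below shows only the coarse stage, with a hypothetical hand-built room adjacency map; it is a plain breadth-first search, not the paper's actual planner.

```python
from collections import deque

# Hypothetical rooms-layer adjacency extracted from a scene graph.
room_adjacency = {
    "hall": ["kitchen", "office"],
    "kitchen": ["hall"],
    "office": ["hall"],
}

def plan_rooms(start, goal):
    """Breadth-first search on the coarse rooms layer; a finer planner would
    then expand each hop into a path over the places layer."""
    frontier, parent = deque([start]), {start: None}
    while frontier:
        room = frontier.popleft()
        if room == goal:
            path = []
            while room is not None:
                path.append(room)
                room = parent[room]
            return path[::-1]
        for nxt in room_adjacency[room]:
            if nxt not in parent:
                parent[nxt] = room
                frontier.append(nxt)
    return None  # goal unreachable

print(plan_rooms("office", "kitchen"))  # ['office', 'hall', 'kitchen']
```

Because the rooms layer has only a handful of nodes, this coarse search is essentially free, which is what makes hierarchical semantic planning scale.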

Future work on robotic spatial perception may incorporate a more diverse array of sensing inputs, collaboration across multi-robot systems, and more efficient real-time DSG construction. Extending DSGs with detailed physical properties such as object materials and affordances is another exciting direction for more sophisticated spatial understanding and interaction.

The introduction of Kimera marks an essential step towards enhancing robotic perception to levels akin to human cognitive mapping, promising a more nuanced and actionable environmental understanding.