3D Scene Graphs: Multi-Level Environment Mapping
- 3D Scene Graphs (3DSGs) are hierarchical, graph-based representations that integrate geometric, semantic, and topological relationships within 3D environments.
- They employ multi-modal sensor integration, segmentation, and relational inference techniques to construct detailed and scalable scene models.
- 3DSGs are pivotal for advanced robotic perception, scene reasoning, and planning, enabling efficient navigation and context-aware interactions.
3D Scene Graphs (3DSGs) constitute a formally structured, multi-level graph-based representation of three-dimensional environments, integrating geometric, semantic, and topological relationships between entities. Over the last several years, the 3DSG paradigm has become foundational for scene understanding, supporting advanced robotic perception, action planning, and scene reasoning by capturing not only spatial configurations but also the semantic and relational properties necessary for context-aware tasks in both indoor and outdoor domains.
1. Definition, Structure, and Hierarchical Organization
3DSGs are defined as hierarchical graphs, where nodes represent entities at various abstraction levels—such as points or voxels, objects, places, rooms, regions, and buildings—while edges encode relations (e.g., spatial adjacency, containment, semantic or functional links) (Rosinol et al., 2020, Samuelson et al., 23 Sep 2025). This multi-level hierarchy typically includes:
| Layer | Node Type | Attributes |
|---|---|---|
| 1. Geometric/Metric | Mesh points or sparse voxels | 3D position, color, semantics |
| 2. Objects | Clusters (static/dynamic/parts) | Pose, class label, descriptors |
| 3. Places/Terrain | Traversable areas/terrains | 3D region, free-space bounds |
| 4. Rooms/Regions | Rooms, corridors, regions | Bounding box, functional label |
| 5. Building/Map | Full scene or building abstraction | Aggregated scene attributes |
This structure enables representations from millimeter-level geometry to building-wide abstractions (Cheng et al., 19 Mar 2025).
Edges may denote physical (e.g., “on,” “adjacent to”), topological (e.g., containment or adjacency in the environment), or semantic/functional relationships (e.g., “affords opening,” “belongs to kitchen”). Dynamic scene graphs (DSGs) further extend this to include agents and time-varying elements, encoding temporal relationships (Rosinol et al., 2020).
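To make this layered node/edge organization concrete, the following is a minimal Python sketch of a 3DSG container; the class names, fields, and labels are illustrative assumptions, not the schema of any of the cited systems.

```python
# Minimal 3DSG sketch; all names and fields are illustrative, not the
# schema of any cited system.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    layer: str        # e.g. "object", "place", "room", "building"
    label: str        # semantic class or functional label
    attributes: dict = field(default_factory=dict)  # pose, bbox, descriptors, ...

@dataclass
class Edge:
    source: str
    target: str
    relation: str     # e.g. "adjacent_to", "contained_in", "affords_opening"

class SceneGraph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, relation: str) -> None:
        self.edges.append(Edge(source, target, relation))

# Example: a mug on a table inside a kitchen.
g = SceneGraph()
g.add_node(Node("mug_1", "object", "mug", {"pose": (0.4, 1.2, 0.9)}))
g.add_node(Node("table_1", "object", "table"))
g.add_node(Node("kitchen_1", "room", "kitchen"))
g.add_edge("mug_1", "table_1", "on")
g.add_edge("table_1", "kitchen_1", "contained_in")
```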
2. Key Methodologies for 3DSG Construction
The construction of a 3DSG involves integrating multi-modal sensory data (RGB, depth/LiDAR, IMU) with learned and engineered models for segmentation, recognition, and relational inference. Key steps include:
- Metric-Semantic Mapping: Dense or sparse 3D point clouds are generated using SLAM or voxel-based mapping; semantic features are associated with each point using segmentation models (e.g., YOLOv11, CLIP embeddings, class-agnostic masks) (Samuelson et al., 23 Sep 2025, Samuelson et al., 6 Jun 2025).
- Instance and Region Segmentation: Instance segmentation differentiates objects or terrain segments. Techniques such as PointNet, FastSAM, or Control-Follow clustering (for 3D Gaussian Splatting) form object clusters or terrain nodes (Wang et al., 6 Mar 2025, Wald et al., 2020).
- Graph Construction and Relation Inference: Nodes are instantiated for each object/region, while edges are created based on geometric, spatial, or functional reasoning. Deterministic approaches (e.g., geometric proximity, convex hull intersection) as well as machine learning (e.g., Graph Convolutional Network message passing, relational graph convolution, or knowledge-guided approaches) are used to assign and filter relationships (Naanaa et al., 2023, Qiu et al., 2023); a minimal proximity-based sketch follows this list.
- Semantic Attribute and Affordance Augmentation: Affordance attributes, visual descriptors, and higher-order functionalities (e.g., has-part, enables, supports) are attached to nodes/edges. Functionality-aware pipelines segment affordance-relevant parts (e.g., handles, buttons) and link them in the graph for fine-grained interaction (Rotondi et al., 10 Mar 2025).
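As a concrete illustration of the deterministic relation-inference strategy above, the sketch below derives "on"/"adjacent to" edges from object centroids; the distance threshold and the vertical-offset heuristic are illustrative assumptions rather than parameters from the cited papers.

```python
# Proximity-based edge inference sketch; thresholds are illustrative.
import numpy as np

def infer_proximity_edges(centroids: dict[str, np.ndarray],
                          near_thresh: float = 1.0) -> list[tuple]:
    """Emit (subject, object, relation) triples for pairs whose centroids
    lie within `near_thresh` meters; a simple vertical test upgrades
    "adjacent_to" to "on"."""
    ids = list(centroids)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            offset = centroids[b] - centroids[a]
            if np.linalg.norm(offset) < near_thresh:
                horizontal = np.linalg.norm(offset[:2])
                if offset[2] > 0 and horizontal < 0.2:
                    edges.append((b, a, "on"))      # b sits on top of a
                else:
                    edges.append((a, b, "adjacent_to"))
    return edges

centroids = {"table_1": np.array([0.0, 0.0, 0.7]),
             "mug_1": np.array([0.05, 0.0, 0.85])}
print(infer_proximity_edges(centroids))  # [('mug_1', 'table_1', 'on')]
```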
For dynamic environments, real-time attribute clustering, relocalization, and instance matching are implemented to enable efficient updates and object tracking under scene change (Nguyen et al., 5 Mar 2025).
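One common recipe for the instance-matching step is a gated Hungarian assignment over node descriptor distances, sketched below; the cosine-distance cost and the gate threshold are illustrative assumptions, not the specific pipeline of the cited work.

```python
# Instance matching sketch: gated Hungarian assignment over descriptors.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(graph_desc: np.ndarray, det_desc: np.ndarray,
                    max_cost: float = 0.5) -> list[tuple[int, int]]:
    """graph_desc: (N, D) existing node descriptors; det_desc: (M, D)
    fresh detections. Returns (graph_idx, detection_idx) matches."""
    gn = graph_desc / np.linalg.norm(graph_desc, axis=1, keepdims=True)
    dn = det_desc / np.linalg.norm(det_desc, axis=1, keepdims=True)
    cost = 1.0 - gn @ dn.T                      # cosine distance
    rows, cols = linear_sum_assignment(cost)    # optimal one-to-one pairing
    # Reject pairings above the gate; unmatched detections spawn new nodes.
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```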
3. Integration with Reasoning/Planning and Data Efficiency
The 3DSG abstraction serves as a unified backbone for downstream reasoning by both learning-based (LLM, VLM) and classical planners. Recent work addresses the scaling challenge by decoupling visual grounding from symbolic planning and reducing the effective size of the graph context for LLMs:
- Hierarchical Abstraction and Subgraph Extraction: Hierarchies (scene → floor → room → place → object/terrain) allow for efficient high-level reasoning and subgraph extraction. SCRUB and SEEK sparsification algorithms yield minimal, task-relevant subgraphs, drastically improving planning time without sacrificing optimality (Agia et al., 2022); a simplified subgraph-extraction sketch follows this list.
- Retrieval-Augmented Generation (RAG): Instead of serializing the full graph as LLM context, 3DSG content is indexed in a database (e.g., Neo4j). LLMs are equipped with structured query interfaces (e.g., Cypher) to retrieve only relevant nodes/edges, supporting scalable instruction-following and question-answering tasks even in million-node graphs (Booker et al., 31 Oct 2024, Ray et al., 18 Oct 2025); a minimal query sketch appears below.
- Semantic Search and Plan Grounding: LLMs exploit the multi-scale 3DSG hierarchy by semantic “zoom-in/out” operations or API calls (e.g., expand/contract nodes), focusing reasoning on a relevant subgraph. Classical planners or pathfinding modules handle geometric or navigation-specific subtasks, with LLMs generating high-level plans and corrections (Rana et al., 2023).
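The flavor of task-relevant sparsification can be conveyed with a plain k-hop traversal from task-mentioned seed nodes, as in the simplified sketch below; this is a stand-in under stated assumptions, not the actual SCRUB or SEEK algorithms.

```python
# Simplified task-relevant subgraph extraction: k-hop BFS from seeds.
from collections import deque

def extract_subgraph(adjacency: dict[str, list[str]],
                     seeds: list[str], hops: int = 2) -> set[str]:
    """Keep every node within `hops` edges of a task-relevant seed node."""
    keep = set(seeds)
    frontier = deque((n, 0) for n in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in adjacency.get(node, []):
            if neighbor not in keep:
                keep.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return keep

# e.g. extract_subgraph(adj, seeds=["kitchen_1", "mug_1"], hops=2)
```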
Together, these strategies yield substantial reductions in token count (up to an order of magnitude) and planning latency (up to 70% reduction), with improved success rates in both simulation and real-world task execution (Booker et al., 31 Oct 2024).
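A minimal sketch of the retrieval-augmented pattern follows, using the official neo4j Python driver; the node labels (:Object, :Room), the IN relationship, and the property names form a hypothetical schema, and the URI and credentials are placeholders.

```python
# RAG-style scene-graph retrieval sketch; schema and credentials are
# hypothetical placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# Hypothetical schema: (:Object {id, label, centroid})-[:IN]->(:Room {name}).
QUERY = """
MATCH (o:Object)-[:IN]->(r:Room {name: $room})
WHERE o.label = $label
RETURN o.id AS id, o.centroid AS centroid
"""

def retrieve(room: str, label: str) -> list[dict]:
    """Fetch only the nodes relevant to the current instruction, rather
    than serializing the entire graph into the LLM context window."""
    with driver.session() as session:
        return [rec.data() for rec in session.run(QUERY, room=room, label=label)]

# e.g. retrieve("kitchen", "mug") -> [{"id": "mug_1", "centroid": [...]}]
```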
4. Extensions: Semantics, Open Vocabulary, and Functional Granularity
The field is rapidly evolving toward more semantically flexible, richly annotated, and functionally detailed 3DSGs:
- Open Vocabulary/Set Scene Graphs: Systems such as Open3DSG forgo closed label sets, leveraging open-vocabulary vision-language models (e.g., CLIP, InstructBLIP) for class and relationship assignment via zero-shot or generative LLM querying (Koch et al., 19 Feb 2024). This enables querying for arbitrary object/relationship types at inference; a zero-shot labeling sketch follows this list.
- Language-Aligned and Contrastive Pretraining: Contrastive alignment with language (e.g., via CLIP) produces “language-aligned” graph features, supporting zero-shot scene querying and cross-modal room/attribute inference without re-training (Koch et al., 2023).
- Functional-Element and Affordance Graphs: FunGraph and similar approaches augment 3DSGs with fine-grained functional elements (e.g., knobs, handles) and intra-object relationship edges, facilitating affordance grounding and precise language-driven scene interaction (Rotondi et al., 10 Mar 2025).
- Terrain Awareness and Region Abstraction: Outdoor-specific 3DSGs incorporate terrain-aware node/region layers (using generalized Voronoi diagrams (GVDs) and custom segmentation) to reflect traversability and regional boundaries essential for navigation and outdoor tasking (Samuelson et al., 23 Sep 2025, Samuelson et al., 6 Jun 2025).
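The zero-shot labeling pattern behind open-vocabulary assignment can be sketched with OpenAI's clip package as below; the prompt template and candidate labels are illustrative, and the image crop is assumed to come from the upstream mapping pipeline.

```python
# Open-vocabulary node labeling sketch with CLIP; labels and prompt
# template are illustrative.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_node(crop: Image.Image, candidates: list[str]) -> str:
    """Assign the best-matching free-form label to one object node."""
    image = preprocess(crop).unsqueeze(0).to(device)
    text = clip.tokenize([f"a photo of a {c}" for c in candidates]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)
    return candidates[int(sims.argmax())]

# Because the candidate list is free-form text, arbitrary queries can be
# posed at inference time, e.g. label_node(crop, ["mug", "kettle", "toaster"]).
```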
5. Evaluations, Performance Metrics, and Use Cases
3DSG evaluation protocols—across indoor and outdoor datasets—measure object/relationship recall, region classification accuracy, planning latency, and memory efficiency (a minimal recall@k sketch follows this list):
- Indoor: Benchmarks such as 3DSSG, VRD, and custom simulated environments (Wald et al., 2020, Raboh et al., 2019).
- Outdoor: Hierarchical object/region retrieval, region monitoring F1, and navigation path quality (Samuelson et al., 23 Sep 2025).
- Task Performance: Gains in robotic planning, embodied QA, and real-time relocalization. Memory efficiency is highlighted as essential for scaling to multi-kilometer scenes (Samuelson et al., 23 Sep 2025).
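Relationship recall@k, as commonly reported on benchmarks like 3DSSG, can be computed as in this minimal sketch; the triple format is an assumption.

```python
# Relationship recall@k sketch; triple format is an assumption.
def recall_at_k(predicted: list[tuple], ground_truth: set[tuple], k: int) -> float:
    """predicted: (subject, relation, object) triples sorted by confidence,
    highest first. Returns the fraction of GT triples in the top k."""
    top_k = set(predicted[:k])
    return len(top_k & ground_truth) / max(len(ground_truth), 1)
```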
Across benchmarks, recent 3DSG systems match or surpass classical approaches in object retrieval, while improving region classification accuracy, execution latency, and semantic search quality.
6. Limitations, Ongoing Challenges, and Future Directions
Principal challenges and research directions, as highlighted in the literature, include:
- Robustness and Scalability: Ensuring semantic and spatial accuracy in the face of sensor noise, occlusion, dynamic change, and region ambiguity—especially in outdoor or cluttered settings (Samuelson et al., 6 Jun 2025, Rosinol et al., 2020).
- Online and Real-time Updating: Achieving efficient updates, re-localization, and identity matching for scene graph nodes as scenes evolve (Nguyen et al., 5 Mar 2025).
- Integration with Open-Vocabulary and Functional Attributes: Extending annotation to open classes, affordances, and functionally relevant sub-objects using data-driven and LLM-facilitated approaches (Koch et al., 19 Feb 2024, Rotondi et al., 10 Mar 2025, Cheng et al., 19 Mar 2025).
- Expressive Querying and Reasoning: Advancing structured interfaces, such as RAG with Cypher, for compositional, multi-hop, and quantitative scene reasoning over large graphs (Ray et al., 18 Oct 2025).
- Dataset Expansion: Developing richer, cross-modal, and larger-scale datasets—paired with scene graphs—that include outdoor, dynamic, and functionally annotated scenes (Liu et al., 10 Mar 2025, Wang et al., 6 Mar 2025).
- Closing the Gap in Full Autonomy: Further integrating 3DSGs with decision-making frameworks and multimodal models to enhance explainability, interaction, and adaptability in real-world environments (Samuelson et al., 6 Jun 2025, Saxena et al., 19 Dec 2024).
Advancements continue to shape 3DSGs as central, adaptive abstractions for embodied autonomy, multimodal reasoning, and complex environment modeling across robotics and computer vision.