Hierarchical Scene Graph Construction
- Hierarchical scene graph construction is a method to represent complex scenes using multi-level graphs that connect low-level features with high-level semantic data.
- It involves a modular pipeline—segmentation, node instantiation, edge construction, and recursive assembly—to ensure spatial, semantic, and functional consistency.
- Applications span 3D scene generation, robotics, and video understanding, improving performance metrics such as Recall@K and enhancing real-world task efficiency.
Hierarchical scene graph construction refers to the process of building structured, multi-level graph representations that capture the semantic, spatial, and/or functional organization of complex scenes. These hierarchical graphs are foundational in 2D/3D scene understanding, generative modeling, spatial reasoning, and robotics, providing a scaffold on which both low-level data (pixels, points) and high-level semantics (objects, regions, tasks) are mapped, manipulated, and interpreted.
1. Mathematical Foundations of Hierarchical Scene Graphs
Hierarchical scene graphs are generally formalized as multi-level, attributed, directed (or undirected) graphs $G = (V, E, A)$, where $V = \bigcup_{\ell} V_\ell$ is a set of nodes partitioned by hierarchy level $\ell$ (e.g., floors, rooms, objects, object parts), $E \subseteq V \times V$ is a set of edges representing spatial, semantic, or functional relations, and $A$ encodes node and edge attributes such as geometry, semantics, or functionality.
A prototypical structure appears in HiGS's Progressive Hierarchical Spatial–Semantic Graph (PHiSSG), which defines

$$G = (V, E_{\text{sem}}, E_{\text{spa}})$$

with:
- $V = \{v_i\}$, each node associated 1:1 with a unique scene entity (geometry, pose, semantic label, etc.);
- $E_{\text{sem}} \subseteq V \times V$, semantic dependency edges (e.g., "lamp depends on table");
- $E_{\text{spa}} \subseteq V \times V$, spatial relation edges (e.g., "left-of," "on," "inside"); and each edge labeled by a relation $r \in \mathcal{R}$.
Hierarchical levels may represent building > floor > room > object > functional element as in KeySG (Werby et al., 1 Oct 2025), or region > place > object, or other specialized strata such as part graphs for fine geometry.
The hierarchy is often enforced by acyclic parent–child edges (e.g., "is-part-of"), and sibling or hypergraph structures encode intra-level relations (adjacency, symmetry, etc.) (Gao et al., 2023). Recursive message passing and graph neural operations propagate information up and down the hierarchy to ensure cross-level coherence.
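To make the formalization concrete, the following is a minimal Python sketch of a multi-level attributed graph with parent–child edges and a bottom-up feature-aggregation pass; the `Node`/`SceneGraph` containers and the mean-pooling aggregation are illustrative assumptions, not the design of any cited system.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    level: int                                       # e.g., 0=building, 1=room, 2=object, 3=part
    attributes: dict = field(default_factory=dict)   # geometry, pose, semantic label, ...
    feature: list = field(default_factory=list)      # numeric embedding (set externally for leaves)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)        # node_id -> Node
    children: dict = field(default_factory=dict)     # parent_id -> [child_id] ("is-part-of" edges)
    relations: list = field(default_factory=list)    # (src_id, dst_id, label) intra-level edges

    def add_node(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.children.setdefault(parent_id, []).append(node.node_id)

    def aggregate_bottom_up(self):
        """Propagate child features into parents by mean pooling (one illustrative choice)."""
        for level in sorted({n.level for n in self.nodes.values()}, reverse=True):
            for parent in (n for n in self.nodes.values() if n.level == level):
                feats = [self.nodes[c].feature for c in self.children.get(parent.node_id, [])
                         if self.nodes[c].feature]
                if feats:
                    dim = len(feats[0])
                    parent.feature = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]

# Illustrative usage: a room node summarizing two object embeddings
g = SceneGraph()
g.add_node(Node("room_0", level=1))
g.add_node(Node("chair_0", level=2, feature=[1.0, 0.0]), parent_id="room_0")
g.add_node(Node("table_0", level=2, feature=[0.0, 1.0]), parent_id="room_0")
g.aggregate_bottom_up()
print(g.nodes["room_0"].feature)   # [0.5, 0.5]
```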
2. Construction Algorithms and Pipelines
Graph construction pipelines are modular, typically consisting of the following stages:
- Low-level scene analysis and segmentation
- 3D: RGB-D stream integration (TSDF/ESDF), point cloud segmentation (e.g., via RANSAC, clustering, open-vocabulary detectors, or neural segmentation)
- 2D: Object detection (DETR, Faster R-CNN), region proposal, saliency estimation
- Node instantiation at multiple levels
- Macro-level nodes: floors, rooms, regions detected by plane fitting, connected components, Voronoi diagrams, or community detection in graphs (Hughes et al., 2022, Samuelson et al., 23 Sep 2025)
- Meso-level: subregions (functional areas), object clustering (DBSCAN in KeySG and SceneHGN), grid-based locations (Linok et al., 16 Jul 2025)
- Micro-level: individual objects, semantic parts, or features (fine-grained segmentation, CLIP embeddings, part detectors)
- Edge construction
- Semantic and spatial relations determined via geometric thresholds (e.g., overlap, co-planarity, distance, orientation); a minimal sketch follows this list
- Functional or support relations inferred via combinatorial or energy-based support inference (Ma et al., 22 Apr 2024)
- Belonging, adjacency, and inclusion edges assigned by spatial containment or learned classifiers
- Hierarchical assembly
- Parent–child relations assigned top-down (by spatial inclusion, semantic category, or clustering), often recursively
- Cross-level and sibling/hyperedges created where spatial/semantic conditions are met
- Optimization and consistency
- Composite layout losses (e.g., spatial, semantic, adversarial) as in HiGS (Hong et al., 31 Oct 2025)
- Explicit regularizers on hierarchy (e.g., across room–location edges in OVIGo-3DHSG (Linok et al., 16 Jul 2025))
- Embedded deformation or pose graph optimization for global consistency in large-scale environments (Hughes et al., 2022, Chang et al., 2023)
- Iterative, user-driven, or LLM-augmented loops
- Anchor selection and expansion (HiGS)
- Programmatic, in-context graph specification (GraphCanvas3D (Liu et al., 27 Nov 2024))
- Hierarchical Retrieval-Augmented Generation (KeySG)
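As referenced in the edge-construction stage above, here is a minimal sketch of geometric-threshold relation inference over axis-aligned bounding boxes; the function name, predicates ("on," "inside," "near"), and threshold values are illustrative assumptions rather than the rules of any cited system.

```python
def infer_spatial_relations(boxes, near_dist=0.5, support_gap=0.05):
    """boxes: dict name -> (xmin, ymin, zmin, xmax, ymax, zmax), with z pointing up.
    Returns a list of (subject, relation, object) triples."""
    def overlap_xy(a, b):
        ox = min(a[3], b[3]) - max(a[0], b[0])
        oy = min(a[4], b[4]) - max(a[1], b[1])
        return ox > 0 and oy > 0

    def center(b):
        return [(b[0] + b[3]) / 2, (b[1] + b[4]) / 2, (b[2] + b[5]) / 2]

    relations = []
    for na, a in boxes.items():
        for nb, b in boxes.items():
            if na == nb:
                continue
            # "inside": a's box fully contained in b's box
            if all(a[i] >= b[i] for i in range(3)) and all(a[i] <= b[i] for i in range(3, 6)):
                relations.append((na, "inside", nb))
            # "on": horizontal overlap and a's bottom rests near b's top
            elif overlap_xy(a, b) and 0 <= a[2] - b[5] <= support_gap:
                relations.append((na, "on", nb))
            # "near": centroid distance below threshold
            else:
                ca, cb = center(a), center(b)
                if sum((ca[i] - cb[i]) ** 2 for i in range(3)) ** 0.5 < near_dist:
                    relations.append((na, "near", nb))
    return relations

# Illustrative usage: a lamp resting on a table
boxes = {"table": (0, 0, 0, 1, 1, 0.8), "lamp": (0.3, 0.3, 0.8, 0.5, 0.5, 1.2)}
print(infer_spatial_relations(boxes))   # [('lamp', 'on', 'table')]
```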
A common pseudocode pattern for hierarchical insertion is:
```
for hierarchy_level in levels:
    segment/cluster detected entities at current level
    for each parent node:
        assign child nodes via spatial/semantic proximity
        add cross-edges if relation criteria met
    propagate features/messages/topological changes along hierarchy
```
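A runnable rendering of this pattern, under the simplifying assumptions that each entity is summarized by a centroid and that children attach to the nearest parent; the `Entity` container and distance-based assignment are illustrative, not the mechanism of any cited pipeline.

```python
import math
from collections import defaultdict

class Entity:
    def __init__(self, name, level, centroid):
        self.name, self.level, self.centroid = name, level, centroid

def build_hierarchy(entities_by_level):
    """entities_by_level: {level: [Entity, ...]}, lower numbers = coarser strata.
    Returns (children, parent_of) dictionaries describing the hierarchy."""
    children, parent_of = defaultdict(list), {}
    levels = sorted(entities_by_level)
    for upper, lower in zip(levels, levels[1:]):           # e.g., rooms -> objects
        for child in entities_by_level[lower]:
            parent = min(entities_by_level[upper],
                         key=lambda p: math.dist(p.centroid, child.centroid))
            children[parent.name].append(child.name)        # spatial-proximity assignment
            parent_of[child.name] = parent.name
    return children, parent_of

# Illustrative usage: one room level and one object level
rooms = [Entity("kitchen", 1, (0.0, 0.0)), Entity("office", 1, (5.0, 0.0))]
objects = [Entity("table", 2, (0.5, 0.3)), Entity("desk", 2, (4.8, 0.2))]
children, parent_of = build_hierarchy({1: rooms, 2: objects})
print(dict(children))   # {'kitchen': ['table'], 'office': ['desk']}
```

In practice, the nearest-centroid rule would be replaced by spatial containment tests, learned classifiers, or semantic-category constraints, as described in the pipeline stages above.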
3. Hierarchical Labeling, Knowledge Graphs, and Semantic Taxonomies
Hierarchical scene graph models consistently improve performance in both generation and understanding tasks by leveraging coarse-to-fine category organization (Jiang et al., 2023). Benefits include:
- Reducing the search space for fine-grained relation prediction (softmax over smaller sets at each hierarchy stage; see the sketch at the end of this section)
- Providing robustness to noise, adversarial perturbation, and zero-shot composition (HiKER-SGG (Zhang et al., 18 Mar 2024))
- Enabling knowledge transfer from superclass predictions (e.g., "animal" → "dog") under partial observability
Formal taxonomies (e.g., geometric/possessive/semantic relations (Jiang et al., 2023)) are enforced via hierarchical Bayesian or contrastive loss objectives. Scene graphs can be augmented with external commonsense knowledge (e.g., GloVe/CLIP embeddings, parent–child links from an external knowledge base), bridging vision and language domains with multi-level message passing (Zhang et al., 18 Mar 2024).
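To illustrate the reduced search space noted above, here is a minimal sketch of coarse-to-fine relation classification: a softmax over superclasses first, then a softmax restricted to the winning superclass's members. The taxonomy, scores, and function names are hypothetical placeholders rather than the prediction heads of the cited methods.

```python
import math

def softmax(scores):
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

# Hypothetical two-level relation taxonomy
TAXONOMY = {
    "geometric":  ["left-of", "above", "near"],
    "possessive": ["has", "part-of"],
    "semantic":   ["riding", "holding", "eating"],
}

def coarse_to_fine_predict(super_scores, fine_scores):
    """super_scores: logits per superclass; fine_scores: logits per fine relation.
    The fine-grained softmax runs only over the predicted superclass's members."""
    super_probs = softmax(super_scores)
    best_super = max(super_probs, key=super_probs.get)
    fine_probs = softmax({r: fine_scores[r] for r in TAXONOMY[best_super]})
    best_fine = max(fine_probs, key=fine_probs.get)
    return best_super, best_fine, super_probs[best_super] * fine_probs[best_fine]  # joint confidence

# Illustrative usage with made-up logits for a person/horse pair
supers = {"geometric": 0.2, "possessive": -0.5, "semantic": 1.3}
fines = {"left-of": 0.1, "above": 0.0, "near": 0.2,
         "has": 0.0, "part-of": -1.0, "riding": 2.1, "holding": 0.3, "eating": -0.2}
print(coarse_to_fine_predict(supers, fines))   # ('semantic', 'riding', ...)
```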
4. Generative and User-Controllable Scene Hierarchies
Hierarchical generative models exploit scene graph levels to enable multi-step, user-guided, or programmatic composition:
- HiGS/PHiSSG (Hong et al., 31 Oct 2025): Iteratively expands a scene by anchoring at key objects, using a directed graph with spatial and semantic edges; supports recursive local generation conditioned on the current graph and anchor, with layout optimization enforcing spatial plausibility and functional grouping.
- SceneGraphGen (Garg et al., 2021): Auto-regressive, hierarchical RNN architecture for unconditional scene graph generation, producing semantically consistent graphs with flexible ordering, capturing global-to-local dependencies with hierarchical GRUs.
- GraphCanvas3D (Liu et al., 27 Nov 2024): Enables LLM-driven, graph-editable, and temporally dynamic (4D) scene assembly without retraining.
User interaction is performed at discrete graph levels, enabling control at different semantic abstraction scales (global style, object addition/removal, local region modifications).
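As a generic illustration of level-wise control (not the interface of any particular system), the sketch below exposes edit operations that target nodes at chosen hierarchy levels, using a plain-dictionary graph.

```python
def add_node(graph, node_id, level, parent_id=None, **attrs):
    """Add an entity at a given hierarchy level (e.g., level 2 = object)."""
    graph["nodes"][node_id] = {"level": level, **attrs}
    if parent_id is not None:
        graph["children"].setdefault(parent_id, []).append(node_id)

def remove_node(graph, node_id):
    """Remove an entity and detach it from its parent."""
    graph["nodes"].pop(node_id, None)
    for kids in graph["children"].values():
        if node_id in kids:
            kids.remove(node_id)

def update_node(graph, node_id, **attrs):
    """Modify attributes in place, e.g., a global style on a room-level node."""
    graph["nodes"][node_id].update(attrs)

# Example edits at different abstraction scales
g = {"nodes": {"room_0": {"level": 1}}, "children": {}}
add_node(g, "sofa_0", level=2, parent_id="room_0", category="sofa")   # local object addition
update_node(g, "room_0", style="scandinavian")                        # global style change
remove_node(g, "sofa_0")                                              # object removal
```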
5. Application Domains and Evaluation
Hierarchical scene graph construction is foundational in:
- 3D Scene Generation: Synthesizing controllable, semantically plausible layouts, fine-grained geometry, and time-varying scenes (HiGS, SceneHGN (Gao et al., 2023), GraphCanvas3D)
- Robotics and Autonomous Navigation: Memory- and computation-efficient world models for large-scale outdoor/indoor mapping (Samuelson et al., 23 Sep 2025, Hughes et al., 2022), collaborative multi-agent graph construction (Chang et al., 2023), robust task planning via hierarchical retrieval and modular expansion (Viswanathan et al., 27 Dec 2024)
- Scene Graph Generation & Reasoning: Improving recall and mean recall (especially for rare relations or missing annotations), robustness to visual corruptions (HiKER-SGG), and commonsense validation (HIERCOM (Jiang et al., 2023, Zhang et al., 18 Mar 2024))
- Video/4D Scene Understanding: Tracking inter-object and inter-subject relations over time via hierarchical temporal aggregation (Nguyen et al., 2023, Hou et al., 30 May 2025)
Performance is measured via Recall@K, mean Recall@K, segmentation IoU, clustering F1, and memory/compute benchmarks. Hierarchical methods have consistently demonstrated improvements, e.g., +7 absolute points on PredCls R@50 in (Jiang et al., 2023), +5–6 points in open-vocabulary dynamic relation recall in (Hou et al., 30 May 2025), and mR@50 gains under corruption in (Zhang et al., 18 Mar 2024).
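For reference, a minimal sketch of Recall@K over relation triplets, where predictions are ranked by confidence and a hit is an exact (subject, predicate, object) match against ground truth; this simplification ignores the box-localization matching that full scene graph detection benchmarks additionally require.

```python
def recall_at_k(predictions, ground_truth, k=50):
    """predictions: list of (score, (subj, pred, obj)); ground_truth: set of (subj, pred, obj).
    Returns the fraction of ground-truth triplets recovered among the top-k predictions."""
    if not ground_truth:
        return 0.0
    top_k = sorted(predictions, key=lambda p: p[0], reverse=True)[:k]
    hits = {triplet for _, triplet in top_k if triplet in ground_truth}
    return len(hits) / len(ground_truth)

# Illustrative usage
gt = {("person", "riding", "horse"), ("horse", "on", "grass")}
preds = [(0.9, ("person", "riding", "horse")), (0.4, ("person", "near", "horse")),
         (0.3, ("horse", "on", "grass"))]
print(recall_at_k(preds, gt, k=2))   # 0.5: only one of the two GT triplets is in the top-2
```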
6. Trade-offs, Scalability, and Future Directions
Hierarchical scene graph construction introduces trade-offs between expressiveness, computational complexity, and scalability:
- Implicit vs. explicit relations: KeySG avoids explicit relation edges, relying on multi-modal context and hierarchical RAG to scale to large environments and query complexity (Werby et al., 1 Oct 2025)
- Depth vs. breadth: Deep hierarchies can more closely mimic human cognitive abstraction but risk overfitting or exceeding context windows; mid-level regions or superclasses help maintain tractability (SceneHGN, OVIGo-3DHSG)
- Real-time constraints: Online, incremental operations (Hydra, Hydra-Multi) are feasible by restricting per-layer node count and applying windowed optimization or graph sparsification (a toy budget-enforcement sketch follows this list)
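As a toy illustration of per-layer node budgeting, a generic sparsification heuristic (not the scheduling used by Hydra or Hydra-Multi) that keeps only the highest-confidence nodes in any layer exceeding its budget:

```python
def enforce_layer_budget(layers, max_nodes_per_layer=500):
    """layers: {level: [(node_id, confidence), ...]}.
    Drops the lowest-confidence nodes in any layer that exceeds the budget."""
    pruned = {}
    for level, nodes in layers.items():
        if len(nodes) > max_nodes_per_layer:
            nodes = sorted(nodes, key=lambda n: n[1], reverse=True)[:max_nodes_per_layer]
        pruned[level] = nodes
    return pruned
```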
Future work targets:
- Dynamic, incremental graph adaptation for evolving or partially observed scenes
- Tighter integration of LLM/VLMs for closed-loop, multi-modal reasoning
- End-to-end joint learning of label taxonomies and structural optimizations
- Commonsense and global scene-level constraints via differentiable reasoning atop hierarchical graphs
7. Comparative Table of Approaches
| System / Paper | Node Hierarchy | Edge Types | Notable Features |
|---|---|---|---|
| HiGS/PHiSSG (Hong et al., 31 Oct 2025) | Flat objects, compositional via user-anchoring | Spatial, semantic | Multi-step user-driven expansion, recursive layout optimization |
| Hydra (Hughes et al., 2022) | Mesh → Object → Place → Room → Building | Inclusion, adjacency, odometry | Real-time, multi-threaded, loop-closure optimized |
| SceneHGN (Gao et al., 2023) | Room → Functional Region → Object → Part | Vertical, horizontal, hyper-edges | Part-level geometry, recursive cVAE |
| KeySG (Werby et al., 1 Oct 2025) | Building → Floor → Room → Object → Function | Parent–child/implicit | Keyframe-based, retrieval-augmented, scalable to large scenes |
| Terra (Samuelson et al., 23 Sep 2025) | Place → Region (multi-level) | Adjacency, semantic-inclusion | Terrain-aware, sparse, lightweight for outdoor robots |
| HIERCOM (Jiang et al., 2023) | Flat or taxonomic label hierarchy | Geometric, possessive, semantic | Plug-in hierarchical relation head, commonsense validation |
| HiKER-SGG (Zhang et al., 18 Mar 2024) | Scene entities, predicates, superclasses | Knowledge-graph, scene-graph | Robust to corruption, multi-level message passing |
This table summarizes representative architectures, their node/edge stratification, and distinguishing operational characteristics.
Hierarchical scene graph construction thus constitutes a robust, general, and extensible paradigm for structured scene representation, generative modeling, and embodied reasoning across static, dynamic, and multi-modal scenarios. The hierarchical models not only encode the multi-scale structure of the world but also provide a foundation for user control, robotic action, and scalable scene understanding.