Hierarchical Plane-Enhanced Scene Graph
- The paper presents a hierarchical model that uses dominant planar anchors and multi-scale graphs to integrate geometric and semantic features for robust 3D scene understanding.
- It employs RANSAC-based plane extraction, DBSCAN clustering, and combinatorial optimization to accurately segment objects and infer physical support relations.
- The hierarchical graph design enhances spatial reasoning and semantic query efficiency, resulting in significant performance improvements in 3D scene analysis metrics.
A hierarchical plane-enhanced scene graph (HPSG) is a multi-scale, geometric-semantic data structure designed for rigorous 3D scene understanding. It leverages dominant planar structures as spatial anchors, interlinks physical contact and support relations, and organizes objects and spaces into an interpretable layered graph. HPSGs have demonstrated improved performance in both bottom-up methods based on dense 3D point clouds and training-free approaches operating on sparse RGB views, enabling accurate and efficient physical reasoning and supporting open-vocabulary semantic queries.
1. Formal Definition and Architecture
In HPSG frameworks, the scene is represented as a graph $G = (V, E)$ whose node set $V$ and edge set $E$ are stratified by semantic level and physical connectivity. In the three-level model (Feng et al., 11 Nov 2025):
- Level 0: Global scene type (e.g., "office").
- Level 1: Dominant planar anchors (e.g., walls, floors).
- Level 2: Object instances $\{o_i\}$.
Edges reflect both hierarchy and physical interaction:
- Scene–plane edges link the scene type to its planar anchors.
- Plane–object edges link planes to the objects they contain or support.
- Object–object edges encode pairwise spatial relations among objects (e.g., "on", "next_to").
Each node is annotated with a feature vector including semantic text embeddings and geometric parameters; relations are derived both from explicit geometry and LLM reasoning.
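The three-level layout above can be sketched as a small data structure; the class names, edge-kind strings, and example labels below are illustrative, not the papers' API.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    level: int                                     # 0 = scene, 1 = plane anchor, 2 = object
    label: str                                     # e.g. "office", "floor", "desk"
    embedding: list = field(default_factory=list)  # semantic text embedding
    geometry: dict = field(default_factory=dict)   # normals, offsets, bounding box

@dataclass
class HPSG:
    nodes: dict = field(default_factory=dict)      # node id -> Node
    edges: list = field(default_factory=list)      # (src, dst, kind) triples, with kind in
                                                   # {"scene-plane", "plane-object", "object-object"}

    def add_node(self, nid, node):
        self.nodes[nid] = node

    def add_edge(self, src, dst, kind):
        self.edges.append((src, dst, kind))

    def neighbors(self, nid, kind=None):
        return [d for s, d, k in self.edges
                if s == nid and (kind is None or k == kind)]

# Tiny example scene: office -> floor -> desk
g = HPSG()
g.add_node("scene", Node(0, "office"))
g.add_node("floor", Node(1, "floor"))
g.add_node("desk", Node(2, "desk"))
g.add_edge("scene", "floor", "scene-plane")
g.add_edge("floor", "desk", "plane-object")
```

Keeping edge kinds as explicit labels lets later stages traverse only one stratum at a time (e.g., plane-to-object edges during support inference).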
2. Plane Primitive Extraction and Representation
Dominant planes are the fundamental geometric units in HPSG. Extraction is typically performed via RANSAC plane fitting on point clouds derived either from registered RGB-D sweeps (Ma et al., 22 Apr 2024) or from sparse RGB views lifted through monocular depth estimation (Feng et al., 11 Nov 2025). The process includes:
- Iterative fit-and-remove loops: selecting minimal point sets, fitting a candidate plane $\pi$, and collecting inliers within a distance threshold $\epsilon$.
- For each plane $\pi$, calculating the normal vector $\mathbf{n}$ and offset $d$ of the plane equation $\mathbf{n}^{\top}\mathbf{x} + d = 0$.
- Computing the 2D convex hull of the inlier points to define the planar region.
Planes are further clustered using DBSCAN in the $(\mathbf{n}, d)$ parameter space, with region-growing expansion for multi-view plane unification in sparse-view cases. Semantic labeling is performed by comparing plane normals to the gravity vector $\mathbf{g}$: floors and ceilings are planes whose normals are sufficiently parallel or anti-parallel to gravity, while walls are detected by near-orthogonal normals and multi-view support.
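The fit-and-remove loop can be sketched as follows; the thresholds, iteration counts, and function names are illustrative choices, not the settings used in the cited papers.

```python
import numpy as np

def fit_plane(p0, p1, p2):
    """Fit a plane n.x + d = 0 through three points; None if degenerate."""
    n = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(n)
    if norm < 1e-9:
        return None
    n = n / norm
    return n, -np.dot(n, p0)

def ransac_planes(points, eps=0.02, min_inliers=100, iters=500, rng=None):
    """Iterative fit-and-remove RANSAC for dominant-plane extraction."""
    if rng is None:
        rng = np.random.default_rng(0)
    pts, planes = points.copy(), []
    while len(pts) >= min_inliers:
        best = None
        for _ in range(iters):
            idx = rng.choice(len(pts), 3, replace=False)   # minimal point set
            plane = fit_plane(*pts[idx])
            if plane is None:
                continue
            n, d = plane
            mask = np.abs(pts @ n + d) < eps               # inliers within eps
            if best is None or mask.sum() > best[2].sum():
                best = (n, d, mask)
        if best is None or best[2].sum() < min_inliers:
            break
        n, d, mask = best
        planes.append((n, d, pts[mask]))                   # keep (n, d, inlier region)
        pts = pts[~mask]                                   # remove inliers, continue
    return planes
```

In practice the inlier set of each returned plane would then feed the convex-hull and DBSCAN steps described above.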
3. Spatial Configuration, Connectivity, and Pairwise Relations
Adjacency and contact between primitives are formalized via spatial configuration graphs. Geometric adjacency is determined by the minimum convex-hull region distance: primitives with convex-hull regions $R_i$ and $R_j$ are adjacent when $\min_{\mathbf{x} \in R_i,\, \mathbf{y} \in R_j} \lVert \mathbf{x} - \mathbf{y} \rVert$ falls below a distance threshold $\tau$.
Connections between primitives are categorized into:
- Local Support Connection (LSC): contacts across different object clusters, encoding physical support.
- Local Inner Connection (LIC): inter-primitive connections within the same object.
Eight structural patterns are defined (see Fig. 5 (Ma et al., 22 Apr 2024)) to distinguish support types, contact geometry, and watertightness. A watertightness ratio $R_{TO}$ quantifies how fully the contact boundary between two primitives is closed, with high values indicating "watertight" (LIC) connectivity.
Spatial and semantic relations among objects in HPSG are either computed via geometric projection (e.g., an "on" relation when one object's footprint projects onto another's supporting surface) or inferred through LLM-based calls that synthesize position and caption data.
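A minimal sketch of the geometric adjacency test, using brute-force minimum point-set distance as a stand-in for the convex-hull region distance; the threshold `tau`, function names, and point sets are illustrative.

```python
import numpy as np

def min_region_distance(A, B):
    """Minimum pairwise distance between two (N, 3) point sets (brute force)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min()

def adjacent(A, B, tau=0.05):
    """Geometric adjacency: minimum inter-region distance below tau."""
    return min_region_distance(A, B) < tau
```

For large inlier sets, a KD-tree or hull-to-hull distance would replace the quadratic brute-force comparison.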
4. Primitive and Object Classification via Optimization
Object segmentation and classification within HPSG are performed using combinatorial optimization formulated as a binary quadratic program (Ma et al., 22 Apr 2024). The binary label assignment $\mathbf{x}$ is optimized to maximize an objective of the form
$$E(\mathbf{x}) = \sum_i D_i x_i + \sum_{i,j} S_{ij} x_i x_j,$$
where $D_i$ is a data term penalizing or favoring a label assignment based on contact pattern and watertightness, and $S_{ij}$ is a smoothness term that encourages merging inner-connected pairs (LIC) and discourages merging support-connected pairs (LSC). The resulting labels partition planes into object clusters, with each object composed of its constituent primitives.
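The optimization can be illustrated with a brute-force toy solver over binary merge labels; the data vector `D` and smoothness matrix `S` here are hypothetical stand-ins for the paper's terms, and a real scene would require a proper QP or graph-cut solver rather than exhaustive search.

```python
import itertools
import numpy as np

def best_merge_labels(pairs, D, S):
    """Exhaustively maximize D.x + x.S.x over binary labels x.

    x[k] = 1 merges the primitive pair pairs[k] into one object.
    D rewards/penalizes merges from contact pattern and watertightness;
    S rewards inner-connected (LIC) pairs and penalizes support-connected
    (LSC) pairs.  Feasible only for small pair counts.
    """
    best_x, best_score = None, -np.inf
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        x = np.array(bits)
        score = D @ x + x @ S @ x
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score
```

Connected components of the chosen merge edges then give the object clusters.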
5. Hierarchical Graph Construction and Support Relation Inference
HPSG integrates geometry and support semantics through a two- (or three-) level graph structure (Ma et al., 22 Apr 2024, Feng et al., 11 Nov 2025). In the object-level hierarchy, a directed support edge indicates that cluster $C_i$ supports $C_j$, as determined by local physical contact (LSC) or, in the absence of direct contact, by identifying under-pinning planes for bounding boxes. The construction algorithm is:
```python
def build_support_graph(K, contains_ground, local_support, global_support):
    """Directed support edges (supporter, supportee); "root" is the virtual root.

    contains_ground, local_support, and global_support are the contact
    predicates named in the text, supplied here as callables.
    """
    supp = set()
    for i in range(K):
        if contains_ground(i):              # ground clusters hang off the root
            supp.add(("root", i))
            continue
        for j in range(i + 1, K):
            if local_support(i, j):         # LSC: C_i supports C_j
                supp.add((i, j))
            elif local_support(j, i):
                supp.add((j, i))
    for i in range(K):                      # non-ground clusters without a local
        if contains_ground(i):              # supporter fall back to global
            continue                        # under-pinning planes
        if not any(t == i for _, t in supp):
            for j in range(K):
                if global_support(j, i):
                    supp.add((j, i))
    for i in range(K):                      # remaining orphans attach to the root
        if not any(t == i for _, t in supp):
            supp.add(("root", i))
    return supp
```
At the graph level, the support relations form a directed acyclic graph rooted at the virtual root node, ensuring an invertible normalized Laplacian for further spectral analysis (Ma et al., 22 Apr 2024). This structure yields two-level semantics: primitive-level (contact geometry) and object-level (support relations and hierarchy).
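The spectral setup can be checked numerically on a toy support chain: symmetrize the directed support matrix, form the normalized Laplacian, and inspect its spectrum. The 4-node chain (root → floor → desk → lamp) and helper below are illustrative.

```python
import numpy as np

def normalized_laplacian(supp):
    """Normalized Laplacian I - D^{-1/2} A D^{-1/2} of a symmetrized support matrix."""
    A = np.maximum(supp, supp.T)          # symmetrize directed support edges
    deg = A.sum(axis=1)                   # root connectivity keeps all degrees > 0
    d = 1.0 / np.sqrt(deg)
    return np.eye(len(A)) - (A * d[:, None]) * d[None, :]

# root -> floor -> desk -> lamp support chain
supp = np.zeros((4, 4))
supp[0, 1] = supp[1, 2] = supp[2, 3] = 1.0
L = normalized_laplacian(supp)
eig = np.sort(np.linalg.eigvalsh(L))      # one zero eigenvalue; positive Fiedler value
```

A connected (rooted) graph has exactly one zero eigenvalue and a strictly positive second-smallest eigenvalue, which is what the Cheeger-bound comparisons rely on.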
6. Representation, Reasoning, and Evaluation
Node and edge attributes in HPSG fuse geometric descriptors (normals, offsets, convex hull area, bounding box position) with semantic text embeddings (captioning via VLM and LLM refinement) (Feng et al., 11 Nov 2025). For scene reasoning, spatial relation edges are constructed either by geometric analysis or by querying pretrained LLMs on feature-enriched prompts.
Open-vocabulary capability is a notable attribute: object nodes carry unconstrained natural language descriptions, supporting free-form semantic queries. The multi-level graph architecture shortens reasoning chains, reducing contextual noise for LLMs (e.g., to locate items "on the wooden desk in this office", only the local subgraph neighbors are traversed).
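A toy illustration of the shortened reasoning chain, with hypothetical labels: answering a query about one anchor touches only that anchor's direct children rather than the full object list.

```python
# Toy HPSG adjacency: scene -> plane anchors -> objects -> supported objects.
scene = {
    "office":      ["floor", "wall_1"],
    "floor":       ["wooden desk", "chair"],
    "wooden desk": ["laptop", "mug"],
}

def items_on(graph, anchor):
    """Return the direct children of an anchor node in the hierarchy."""
    return graph.get(anchor, [])
```

An LLM prompt built from `items_on(scene, "wooden desk")` carries two candidate objects instead of the whole scene, which is the contextual-noise reduction described above.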
Evaluation is performed with metrics such as Exact Match@1 (EM@1) for query accuracy and per-query LLM runtime for efficiency. On Space3D-Bench, HPSG achieves a 28.7% EM@1 improvement and a 78.2% speedup over flat ConceptGraphs (Feng et al., 11 Nov 2025). Additional metrics used in support graph construction include spectral errors based on normalized Laplacian eigenvalues and Cheeger bounds, comparing generated graphs to ground-truth via symmetrization.
7. Scalability, Extensions, and Future Prospects
HPSGs are scalable: the root node in the support graph ensures topological connectivity, guaranteeing invertibility of the Laplacian and facilitating robust graph-theoretic computation (Ma et al., 22 Apr 2024). Extension beyond planar primitives is possible, pending definition of adjacency rules and contact pattern dictionaries for other surface types (e.g., cylinders, spheres).
Task-adaptive subgraph extraction, as implemented in Sparse3DPR, dynamically filters query-irrelevant context, further improving inference efficiency (Feng et al., 11 Nov 2025). A plausible implication is that HPSGs are well suited to deployment in open-ended real-world reasoning systems, especially where generalization and robustness from sparse data are required.
In total, hierarchical plane-enhanced scene graphs represent an intersection of high-fidelity geometric modeling and flexible semantic grounding, yielding interpretable, efficient, and accurate representations for advanced 3D scene understanding and autonomous reasoning.