
Hierarchical Plane-Enhanced Scene Graph

Updated 13 November 2025
  • The paper presents a hierarchical model that uses dominant planar anchors and multi-scale graphs to integrate geometric and semantic features for robust 3D scene understanding.
  • It employs RANSAC-based plane extraction, DBSCAN clustering, and combinatorial optimization to accurately segment objects and infer physical support relations.
  • The hierarchical graph design enhances spatial reasoning and semantic query efficiency, resulting in significant performance improvements in 3D scene analysis metrics.

A hierarchical plane-enhanced scene graph (HPSG) is a multi-scale, geometric-semantic data structure designed for rigorous 3D scene understanding. It leverages dominant planar structures as spatial anchors, interlinks physical contact and support relations, and organizes objects and spaces into an interpretable layered graph. HPSGs have demonstrated improved performance in both bottom-up methods based on dense 3D point clouds and training-free approaches from sparse RGB views, enabling accurate and efficient physical reasoning and supporting open-vocabulary semantic queries.

1. Formal Definition and Architecture

In HPSG frameworks, the scene is represented as a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ whose node set and edge set are stratified by semantic level and physical connectivity. In the three-level model (Feng et al., 11 Nov 2025):

  • Level 0: Global scene type $\mathcal{V}_0 = \{s\}$ (e.g., "office").
  • Level 1: Dominant planar anchors $\mathcal{V}_1 = \{\pi_k\}_{k=1}^{M}$ (e.g., walls, floors).
  • Level 2: Object instances $\mathcal{V}_2 = \{o_i\}_{i=1}^{N}$.

Edges reflect both hierarchy and physical interaction:

  • $\mathcal{E}_0$ links the scene type to planes.
  • $\mathcal{E}_1$ links planes to the objects they contain or support.
  • $\mathcal{E}_2$ encodes pairwise spatial relations among objects (e.g., "on", "next_to").

Each node $v$ is annotated with a feature vector $\mathbf{f}_v = [\mathbf{c}_v \,\|\, \mathbf{p}_v]$ including semantic text embeddings and geometric parameters; relations are derived both from explicit geometry and LLM reasoning.
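As a minimal illustration of the three-level structure, the stratified node and edge sets can be sketched with plain Python dictionaries; the node names, attribute keys, and edge-type labels below are assumptions for illustration, not the paper's API:

```python
# Minimal sketch of an HPSG: level 0 = scene type, level 1 = planar anchors,
# level 2 = object instances; edges carry their stratum (E0/E1/E2).
nodes = {
    "scene:office": {"level": 0},
    "plane:floor":  {"level": 1},
    "plane:wall_0": {"level": 1},
    "obj:desk":     {"level": 2},
    "obj:laptop":   {"level": 2},
}
edges = [
    ("scene:office", "plane:floor",  {"etype": "E0"}),               # scene -> plane
    ("scene:office", "plane:wall_0", {"etype": "E0"}),
    ("plane:floor",  "obj:desk",     {"etype": "E1"}),               # plane contains/supports object
    ("plane:floor",  "obj:laptop",   {"etype": "E1"}),
    ("obj:desk",     "obj:laptop",   {"etype": "E2", "rel": "on"}),  # pairwise object relation
]

def children(u, etype):
    """All targets reachable from node u along edges of one stratum."""
    return [v for (a, v, d) in edges if a == u and d["etype"] == etype]

print(children("scene:office", "E0"))  # ['plane:floor', 'plane:wall_0']
```

Traversing one stratum at a time is what lets later sections restrict reasoning to a local subgraph instead of the whole scene.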

2. Plane Primitive Extraction and Representation

Dominant planes are the fundamental geometric units in HPSG. Extraction is typically performed via RANSAC plane fitting on point clouds $P$ derived either from registered RGB-D sweeps (Ma et al., 22 Apr 2024) or from sparse RGB views lifted through monocular depth estimation (Feng et al., 11 Nov 2025). The process includes:

  • Iterative fit-and-remove loops: selecting minimal point sets, fitting a plane $Ax + By + Cz + D = 0$, and collecting inliers within distance $\varepsilon_{\rm RANSAC}$.
  • For each plane $\Pi_i$, computing the unit normal $\mathbf{n}_i = (A, B, C)/\|(A, B, C)\|$ and offset $d_i = D/\|(A, B, C)\|$.
  • Computing the 2D convex hull of the inlier points to define the planar region $R_i$.

Planes are further clustered using DBSCAN in the parameter space $(\mathbf{n}, d)$, with region-growing expansion for multi-view plane unification in sparse-view cases. Semantic labeling is performed by comparing plane normals to the gravity vector $\mathbf{g}$: floors and ceilings are those with normals sufficiently parallel or anti-parallel to gravity, and walls are detected by near-orthogonal normals and multi-view support.
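The fit-and-remove loop above can be sketched as follows; the thresholds, iteration count, and synthetic test scene are assumptions for illustration rather than the papers' settings:

```python
import numpy as np

# Sketch of iterative RANSAC plane extraction: sample minimal point sets,
# fit a plane, keep the fit with the most inliers, remove them, repeat.
rng = np.random.default_rng(0)

def fit_plane(p3):
    """Plane (n, d) with n·x + d = 0 through three points, unit normal."""
    n = np.cross(p3[1] - p3[0], p3[2] - p3[0])
    n = n / np.linalg.norm(n)
    return n, -float(n @ p3[0])

def ransac_planes(P, eps=0.02, iters=200, min_inliers=50):
    """Extract dominant planes from point set P via fit-and-remove loops."""
    planes, pts = [], P.copy()
    while len(pts) >= min_inliers:
        best = None
        for _ in range(iters):
            n, d = fit_plane(pts[rng.choice(len(pts), 3, replace=False)])
            inl = np.abs(pts @ n + d) < eps          # inlier mask for this fit
            if best is None or inl.sum() > best[2].sum():
                best = (n, d, inl)
        n, d, inl = best
        if inl.sum() < min_inliers:
            break
        planes.append((n, d))
        pts = pts[~inl]                              # remove inliers, continue
    return planes

# Synthetic scene: a dense floor patch at z = 0 plus scattered noise points.
floor = np.c_[rng.uniform(-1, 1, (300, 2)), np.zeros(300)]
noise = rng.uniform(-1, 1, (30, 3))
planes = ransac_planes(np.vstack([floor, noise]))
n, d = planes[0]
print(len(planes), round(abs(n[2]), 3))  # 1 1.0 — one plane, normal along z
```

The recovered normal and offset are exactly the $(\mathbf{n}_i, d_i)$ parameters that the DBSCAN clustering step then groups across views.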

3. Spatial Configuration, Connectivity, and Pairwise Relations

Adjacency and contact between primitives are formalized via spatial configuration graphs $G_{\rm adj} = (V, E)$. Geometric adjacency is determined by minimum convex-hull region distance:

\operatorname{dist}(\Pi_i, \Pi_j) = \min_{p \in R_i,\, q \in R_j} \|p - q\| \leq \theta_{\rm adj}
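A brute-force sketch of this adjacency test follows; the threshold value is assumed, and a full implementation would measure the distance between convex-hull regions rather than raw inlier points:

```python
import numpy as np

# Sketch of the region-distance adjacency test between two planar regions,
# approximated here by the minimum pairwise distance of their point samples.
def region_distance(R_i, R_j):
    """Minimum pairwise Euclidean distance between two point sets."""
    diff = R_i[:, None, :] - R_j[None, :, :]
    return float(np.sqrt((diff ** 2).sum(-1)).min())

def adjacent(R_i, R_j, theta_adj=0.05):
    """True if the two regions come within theta_adj of each other."""
    return region_distance(R_i, R_j) <= theta_adj

R1 = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
R2 = np.array([[1.02, 0.0, 0.0], [2.0, 0.0, 0.0]])
print(adjacent(R1, R2))  # True — the closest sample points are 0.02 apart
```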

Primitives are categorized into:

  • Local Support Connection (LSC): contacts across different object clusters, encoding physical support.
  • Local Inner Connection (LIC): inter-primitive connections within the same object.

Eight structural patterns are defined (see Fig. 5 (Ma et al., 22 Apr 2024)) to distinguish support types, contact geometry, and watertightness. The watertightness ratio RTO,

\mathrm{RTO} = \min\left( \frac{p_1 q_2}{p_1 p_2}, \frac{p_1 q_2}{q_1 q_2} \right)

with $\mathrm{RTO} \geq \tau = 0.86$ indicating "watertight" (LIC) connectivity.
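Assuming $[p_1, p_2]$ and $[q_1, q_2]$ denote the two projected contact intervals along a shared line and the numerator $p_1 q_2$ their overlap length (an interpretation, not stated explicitly above), the watertightness test can be sketched as:

```python
# Sketch of the watertightness ratio RTO for two contact intervals on a
# common line: the overlap length divided by each interval's own length.
def rto(p1, p2, q1, q2):
    """RTO = min(overlap/|p1p2|, overlap/|q1q2|); 0 if the intervals miss."""
    overlap = min(p2, q2) - max(p1, q1)   # length of [p1,p2] ∩ [q1,q2]
    if overlap <= 0:
        return 0.0
    return min(overlap / (p2 - p1), overlap / (q2 - q1))

TAU = 0.86
print(rto(0.0, 1.0, 0.05, 1.0) >= TAU)  # True — near-complete overlap, LIC
print(rto(0.0, 1.0, 0.6, 2.0) >= TAU)   # False — partial contact, not watertight
```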

Spatial and semantic relations among objects in HPSG are either computed via geometric projection (e.g., $A_{k,i}^{(1)} = 1$ if $o_i$ projects onto $\pi_k$) or inferred through LLM-based calls that synthesize position and caption data.

4. Primitive and Object Classification via Optimization

Object segmentation and classification within HPSG are performed using combinatorial optimization formulated as a binary quadratic program (Ma et al., 22 Apr 2024). The label assignment $x_{v_i \leftarrow l_j}$ is optimized to maximize:

E(X) = \sum_{i,j} x_{v_i \leftarrow l_j}\, D(v_i, l_j) + \tfrac{1}{2} \sum_{j=1}^{N} \sum_{m=1}^{N} \sum_{n=1}^{N} x_{v_m \leftarrow l_j}\, x_{v_n \leftarrow l_j}\, F(v_m, v_n)

where $D(v_i, l_j)$ is a data term penalizing or favoring label assignment based on contact pattern and watertightness, and $F(v_m, v_n)$ is a smoothness term that encourages merging inner-connected pairs (LIC) and discourages merging support-connected pairs (LSC). The resulting labels partition planes into object clusters, with each object composed of its constituent primitives.
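A tiny brute-force sketch of this energy follows; all $D$ and $F$ values are invented for illustration, and a real system would use a dedicated solver rather than enumerating labelings:

```python
import numpy as np
from itertools import product

# Sketch of the labeling energy E(X): D[i, j] scores assigning primitive v_i
# the label l_j; F[m, n] rewards co-labeling an inner-connected (LIC) pair
# and penalizes co-labeling a support-connected (LSC) pair. Values invented.
D = np.array([[1.0, 0.0],    # v0 prefers label 0
              [0.8, 0.2],    # v1 prefers label 0
              [0.0, 1.0]])   # v2 prefers label 1
F = np.array([[ 0.0, 2.0, -2.0],   # (v0, v1) LIC -> merge; (v0, v2) LSC -> split
              [ 2.0, 0.0,  0.0],
              [-2.0, 0.0,  0.0]])

def energy(labels):
    """E(X) = sum of unary data terms plus 1/2 the same-label pairwise terms."""
    X = np.zeros_like(D)
    X[np.arange(len(labels)), list(labels)] = 1.0
    pairwise = sum(0.5 * F[m, n]
                   for m in range(3) for n in range(3)
                   if labels[m] == labels[n])
    return float((X * D).sum()) + pairwise

best = max(product(range(2), repeat=3), key=energy)
print(best)  # (0, 0, 1) — the LIC pair is merged, the LSC pair separated
```

The maximizer keeps the inner-connected pair under one object label while assigning the support-connected primitive its own label, which is exactly the partition behavior the smoothness term is designed to produce.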

5. Hierarchical Graph Construction and Support Relation Inference

HPSG integrates geometry and support semantics through a two- (or three-) level graph structure (Ma et al., 22 Apr 2024, Feng et al., 11 Nov 2025). In the object-level hierarchy, directed support edges $\mathrm{SUPP}(C_i, C_j) = 1$ indicate that $C_i$ supports $C_j$, as determined by local physical contact (LSC) or, in the absence of direct contact, by identifying underpinning planes for bounding boxes. The construction algorithm is:

for i=1…K:
    if C_i contains the ground-plane:
        SUPP(root, C_i)=1
        continue
    for j=i+1…K:
        if LocalSupport(C_i, C_j)==1:
            SUPP(C_i, C_j)=1
        elif LocalSupport(C_j, C_i)==1:
            SUPP(C_j, C_i)=1
for each non-ground C_i:
    if C_i has no local supporter:
        for all j:
            if GlobalSupport(C_j, C_i)==1:
                SUPP(C_j, C_i)=1
for each C_i with no SUPP(*, C_i):
    SUPP(root, C_i)=1
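The procedure above can be made runnable as follows; the predicates `local_support` and `global_support` stand in for the LocalSupport/GlobalSupport tests and are supplied here as toy oracles (an assumed interface, not the paper's):

```python
# Runnable sketch of the support-graph construction: ground clusters attach
# to the root, local contact yields direct support edges, bounding-box-based
# global support catches floating clusters, and leftovers attach to the root.
def build_support_graph(clusters, has_ground, local_support, global_support):
    """Return the directed support edges SUPP as a set of (supporter, supportee)."""
    supp = set()
    for c in clusters:                       # ground-touching clusters hang off the root
        if has_ground(c):
            supp.add(("root", c))
    for a in range(len(clusters)):           # pairwise local (contact-based) support
        for b in range(a + 1, len(clusters)):
            i, j = clusters[a], clusters[b]
            if local_support(i, j):
                supp.add((i, j))
            elif local_support(j, i):
                supp.add((j, i))
    for i in clusters:                       # global fallback for unsupported clusters
        if has_ground(i) or any(t == i for _, t in supp):
            continue
        for j in clusters:
            if j != i and global_support(j, i):
                supp.add((j, i))
    for i in clusters:                       # anything still unsupported -> root
        if not any(t == i for _, t in supp):
            supp.add(("root", i))
    return supp

# Toy scene: floor cluster C0, a desk C1 resting on it, and a lamp C2 whose
# support is only recoverable from bounding-box analysis (global support).
supp = build_support_graph(
    ["C0", "C1", "C2"],
    has_ground=lambda c: c == "C0",
    local_support=lambda s, t: (s, t) == ("C0", "C1"),
    global_support=lambda s, t: (s, t) == ("C1", "C2"),
)
print(sorted(supp))  # [('C0', 'C1'), ('C1', 'C2'), ('root', 'C0')]
```

Because every cluster ends up with a supporter (the root if nothing else), the result is the connected, rooted DAG described next.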

At the graph level, $\mathcal{G}$ forms a directed acyclic graph rooted at a virtual root node, ensuring that the normalized Laplacian is invertible for further spectral analysis (Ma et al., 22 Apr 2024). This structure enables two-level semantics: the primitive level (contact geometry) and the object level (support relations and hierarchy).

6. Representation, Reasoning, and Evaluation

Node and edge attributes in HPSG fuse geometric descriptors (normals, offsets, convex hull area, bounding box position) with semantic text embeddings (captioning via VLM and LLM refinement) (Feng et al., 11 Nov 2025). For scene reasoning, spatial relation edges are constructed either by geometric analysis or by querying pretrained LLMs on feature-enriched prompts.

Open-vocabulary capability is a notable attribute: object nodes carry unconstrained natural language descriptions, supporting free-form semantic queries. The multi-level graph architecture shortens reasoning chains, reducing contextual noise for LLMs (e.g., to locate items "on the wooden desk in this office", only the local subgraph neighbors are traversed).
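As a toy illustration of this localized traversal (the edge data and node names are invented for illustration), resolving the anchor object first lets a query touch only one-hop neighbors instead of the flat object list:

```python
# Sketch of localized query answering on the hierarchy: answer "items on the
# desk" by traversing only the anchor's outgoing relation edges (E2).
edges = [
    ("plane:floor", "obj:desk",   "E1"),
    ("plane:floor", "obj:chair",  "E1"),
    ("obj:desk",    "obj:laptop", "E2"),   # laptop on desk
    ("obj:desk",    "obj:mug",    "E2"),   # mug on desk
    ("obj:chair",   "obj:jacket", "E2"),
]

def local_subgraph(anchor):
    """Nodes one relation hop away from the anchor node."""
    return [v for (u, v, _) in edges if u == anchor]

print(local_subgraph("obj:desk"))  # ['obj:laptop', 'obj:mug']
```

Only the two desk-related objects reach the LLM prompt, which is the mechanism behind the shortened reasoning chains described above.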

Evaluation is performed with metrics such as Exact Match@1 (EM@1) for query accuracy and per-query LLM runtime for efficiency. On Space3D-Bench, HPSG achieves a 28.7% EM@1 improvement and a 78.2% speedup over flat ConceptGraphs (Feng et al., 11 Nov 2025). Additional metrics used in support graph construction include spectral errors based on normalized Laplacian eigenvalues and Cheeger bounds, comparing generated graphs to ground-truth via symmetrization.

7. Scalability, Extensions, and Future Prospects

HPSGs are scalable: the root node in the support graph ensures topological connectivity, guaranteeing invertibility of the Laplacian and facilitating robust graph-theoretic computation (Ma et al., 22 Apr 2024). Extension beyond planar primitives is possible, pending definition of adjacency rules and contact pattern dictionaries for other surface types (e.g., cylinders, spheres).

Task-adaptive subgraph extraction, as implemented in Sparse3DPR, dynamically filters irrelevant context in response to queries, further improving inference efficiency (Feng et al., 11 Nov 2025). A plausible implication is HPSG's suitability for deployment in open-ended real-world reasoning systems, especially where generalization and robustness from sparse data are required.

In total, hierarchical plane-enhanced scene graphs represent an intersection of high-fidelity geometric modeling and flexible semantic grounding, yielding interpretable, efficient, and accurate representations for advanced 3D scene understanding and autonomous reasoning.
