
Hierarchical Open-Vocabulary Scene Graphs

Updated 15 April 2026
  • HOV-SG is a structured, multi-level graph representation that models complex visual scenes through open-vocabulary semantics and spatial relations.
  • The framework integrates advanced vision-language models with hierarchical segmentation and clustering, constructing nodes and edges from objects to environments.
  • It enables robust applications in navigation, scene synthesis, and mapping by leveraging LLMs for zero-shot predicate alignment and hierarchical relation modeling.

A Hierarchical Open-Vocabulary Scene Graph (HOV-SG) is a structured, multi-level graph-based representation designed to model the semantic and relational structure of complex visual environments—spanning both 2D images and 3D spaces—without restricting semantics or relationships to a fixed, closed vocabulary. HOV-SG integrates foundation vision-language models (VLMs), large language models (LLMs), and hierarchical abstraction, yielding a formalism that enables zero-shot category extension, interpretable spatial structure, efficient scene reasoning, and robust downstream application to navigation, grounding, and scene synthesis. This paradigm has driven significant advances in open-world perception, robotic scene understanding, and language-driven reasoning.

1. Formal Structure and Semantics of HOV-SG

At its core, HOV-SG represents a scene as a multi-layer graph $\mathcal{G} = (V, E, A, L)$, where the nodes $V$ are partitioned into abstraction levels, e.g., objects, rooms, floors, and the building (or, in outdoor settings, objects, road segments, intersections, and environment). Edges $E$ include intra-level (e.g., adjacency), inter-level (e.g., parent-child), semantic (e.g., "on top of," "contains"), and spatial/proximity relations. Each node and edge is associated with feature attributes $A_V$, $A_E$, typically open-vocabulary CLIP- or VLM-derived embeddings, and an abstraction-level label $L$ (Werby et al., 2024, Puigjaner et al., 2 Feb 2026, Sun et al., 15 Feb 2025, Xu et al., 2024, Devarakonda et al., 2024, Deng et al., 2024, Steinke et al., 11 Mar 2025).

For 3D or embodied scenes, typical node levels are:

  • $V^{(1)}$: Object instances with open-vocabulary semantic descriptors
  • $V^{(2)}$: Place/room segmentation (often via clustering or geometric partitioning)
  • $V^{(3)}$: Floors (or, for outdoor scenes, road segments and lane structures)
  • $V^{(4)}$: Root or environment node

This hierarchical construction enforces spatial containment, enables fast retrieval (object → room → floor), and supports reasoning over both fine and coarse semantic levels.
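
A minimal Python sketch of this structure is shown below, assuming CLIP-style embedding vectors as node and edge features; all names are illustrative rather than taken from any reference implementation:

```python
from __future__ import annotations

from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    node_id: int
    level: int                         # 1 = object, 2 = room, 3 = floor, 4 = root
    feature: np.ndarray                # open-vocabulary embedding (A_V)
    children: list[int] = field(default_factory=list)


@dataclass
class Edge:
    src: int
    dst: int
    kind: str                          # "parent-child", "adjacent", "on top of", ...
    feature: np.ndarray | None = None  # optional relation embedding (A_E)


@dataclass
class SceneGraph:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_child(self, parent_id: int, child: Node) -> None:
        """Attach a node one level below its parent via a parent-child edge."""
        assert child.level == self.nodes[parent_id].level - 1
        self.nodes[child.node_id] = child
        self.nodes[parent_id].children.append(child.node_id)
        self.edges.append(Edge(parent_id, child.node_id, "parent-child"))
```

Keeping parent-child edges explicit is what makes top-down retrieval (floor → room → object) a sequence of small searches rather than one global one.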

2. Construction Methodologies and Pipeline Components

2.1. Hierarchical Scene Graph Construction

Construction proceeds from raw sensor data (RGB(-D), LiDAR, or point clouds) using a series of geometric, topological, and semantic partitioning steps:

  • Semantic Segmentation and Mask Projection: Advanced segmentation (SAM, Grounding DINO, TAP) produces 2D or 3D masks, which are then back-projected into the global frame using odometry or extrinsics (a minimal back-projection sketch follows this list) (Werby et al., 2024, Steinke et al., 11 Mar 2025, Deng et al., 2024).
  • Instance/Node Formation and Semantic Labeling: Masked regions are encoded with vision-language or text encoders (CLIP, BLIP, Sentence-BERT, or Uni3D), yielding open-vocabulary, language-aligned feature vectors for each object or region (Liu et al., 2024, Xu et al., 2024).
  • Hierarchical Partitioning: Floor, room, and segment boundaries are defined using clustering in the height dimension (for floors), watershed on BEV-projected points (for rooms), or geometric/DBSCAN approaches (for outdoor segments or urban lane-graphs) (Werby et al., 2024, Deng et al., 2024, Steinke et al., 11 Mar 2025).
  • Edge Construction: Hierarchical parent–child edges, adjacency and spatial relations (e.g., “left-of” via centroid delta), and (if available) semantic relationship edges via vision–language relation heads (Xu et al., 2024, Liu et al., 2024).
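
The mask-projection step reduces to lifting masked pixels through a pinhole camera model. A hedged sketch, where the intrinsics `K`, the camera-to-world pose `T_wc`, and the mask/depth inputs are assumptions (real pipelines obtain them from calibration and odometry/SLAM):

```python
import numpy as np


def backproject_mask(mask: np.ndarray, depth: np.ndarray,
                     K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Lift masked pixels (H, W) with metric depth into an (N, 3) world point cloud."""
    v, u = np.nonzero(mask & (depth > 0))     # pixel coordinates of valid mask entries
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]           # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]           # Y = (v - cy) * Z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (T_wc @ pts_cam.T).T[:, :3]        # homogeneous camera -> world transform
```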

A representative scene graph instantiation:

| Level | Node Type | Edges | Feature Attribute |
|-----------|--------------------|-----------------------------------|----------------------------|
| $V^{(1)}$ | Object instances | intra-object relations, to room | CLIP/VLM object embedding |
| $V^{(2)}$ | Room/segment | to contained objects, adjacency | Room-level CLIP cluster |
| $V^{(3)}$ | Floor/road segment | to rooms below, to other segments | Floor label embedding |
| $V^{(4)}$ | Root/environment | to all floors | Environment descriptor |

2.2. Open-Vocabulary Semantic Mapping

Open-vocabulary assignment leverages joint vision–language embedding spaces, generally freezing the CLIP/BLIP/VLM backbone and ranking visual features against arbitrary text queries via cosine similarity. Multi-crop fusion, k-means over view clusters, or Sentence-BERT encodings are common techniques for robust per-node semantics (Werby et al., 2024, Deng et al., 2024, Sun et al., 15 Feb 2025).
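
The ranking itself is a cosine-similarity lookup. A sketch, assuming per-view image embeddings and text-query embeddings produced offline by a frozen CLIP-style encoder (the encoders themselves are not shown); the mean fusion here is one option, with k-means over view clusters as a drop-in alternative:

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def label_node(view_feats: np.ndarray, text_feats: np.ndarray,
               labels: list[str]) -> str:
    """Rank arbitrary text queries against a node's fused visual feature."""
    node_feat = l2_normalize(view_feats.mean(axis=0))  # fuse per-view embeddings
    sims = l2_normalize(text_feats) @ node_feat        # cosine similarity per query
    return labels[int(np.argmax(sims))]
```

Because the label set enters only through `text_feats`, the vocabulary can be changed at query time without touching the map.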

For relationships, open-vocabulary predicate heads (Bayesian or hierarchical) and entity-aware/region-aware hierarchical prompts (with LLM mining) enable the extension to zero-shot predicates while maintaining semantic structure and strong performance on novel relations (Jiang et al., 2023, Liu et al., 2024).

2.3. Graph Optimization and Update Algorithms

Dynamic scene graphs in multi-agent or dynamic environments employ incremental association, fusion, and relabeling schemes: keyframe pose graph optimization, DBSCAN-based association, IoU or embedding-similarity-based object fusion, and periodic semantic updates (Steinke et al., 11 Mar 2025, Devarakonda et al., 2024).
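
An illustrative association test for fusing a new detection into an existing object node, combining geometric overlap (approximated here with axis-aligned 3D boxes) and embedding similarity; both thresholds are assumptions rather than values from the cited systems:

```python
import numpy as np


def aabb_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned boxes given as (2, 3) arrays of (min_xyz, max_xyz)."""
    lo, hi = np.maximum(a[0], b[0]), np.minimum(a[1], b[1])
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))
    vol_a = float(np.prod(a[1] - a[0]))
    vol_b = float(np.prod(b[1] - b[0]))
    return inter / (vol_a + vol_b - inter + 1e-9)


def should_fuse(box_new: np.ndarray, feat_new: np.ndarray,
                box_old: np.ndarray, feat_old: np.ndarray,
                iou_thr: float = 0.25, sim_thr: float = 0.75) -> bool:
    """Fuse when the boxes overlap enough AND the open-vocab features agree."""
    cos = float(feat_new @ feat_old
                / (np.linalg.norm(feat_new) * np.linalg.norm(feat_old)))
    return aabb_iou(box_new, box_old) > iou_thr and cos > sim_thr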

Memory efficiency is achieved by storing only per-segment/room CLIP features, leading to a ≈ 75% reduction in storage relative to dense voxel-based alternatives, while retaining global retrieval capabilities (Werby et al., 2024, Deng et al., 2024).

3. Integration with Vision-Language Models and LLM Reasoning

HOV-SG frameworks exploit foundation models at multiple stages:

  • Vision-language models (VLMs) provide dense, open-vocabulary feature representations at node and relation levels, underpinning zero-shot identification and retrieval (Deng et al., 2024, Werby et al., 2024).
  • Hierarchical Prompting and LLM Mining: Hierarchical relation and entity clustering uses LLMs to mine fine-grained, region-aware prompts. This two-level prompt structure (entity-aware + region-aware) boosts novel predicate alignment and model robustness (Liu et al., 2024).
  • Task and Query Reasoning via LLMs: For downstream query parsing, room-type designation, and multi-step task planning, LLMs process graph signatures or node/edge features, decompose commands (chain-of-thought), and coordinate plan generation in natural language (Puigjaner et al., 2 Feb 2026, Devarakonda et al., 2024, Sun et al., 15 Feb 2025).

This tight coupling yields high-level scene understanding, supports spatial referencing across hierarchical layers (room to object, object to object), and enables robust grounding of natural language queries in complex environments (Linok et al., 16 Jul 2025, Puigjaner et al., 2 Feb 2026, Sun et al., 15 Feb 2025).
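
A toy top-down grounding of an LLM-decomposed query (e.g., "floor 2" → "kitchen" → "mug"), reusing the `Node`/`SceneGraph` sketch from Section 1; it assumes the LLM has already split the command into one embedded phrase per hierarchy level:

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def ground_query(graph: "SceneGraph", root: "Node",
                 level_query_feats: list[np.ndarray]) -> "Node":
    """Descend the hierarchy, restricting each search to the chosen parent's children."""
    node = root
    for qf in level_query_feats:          # e.g., [floor_feat, room_feat, object_feat]
        children = [graph.nodes[i] for i in node.children]
        node = max(children, key=lambda c: cosine(qf, c.feature))
    return node                           # grounded target node
```

Restricting each level's search to the chosen parent's children is what makes retrieval cost scale with the branching factor rather than with the total number of objects.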

4. Quantitative Results and Evaluation Protocols

Experiments consistently demonstrate that HOV-SG methods outperform closed-set or flat-vocabulary baselines in zero-shot segmentation, open-vocabulary retrieval, semantic alignment, and physical feasibility:

  • Object/Room Segmentation and Retrieval: On HM3DSem, HOV-SG achieves an AUC of 84.9% (ConceptGraphs 84.1%, VLMaps 56.2%) (Werby et al., 2024).
  • Open-Vocabulary Segmentation: On SemanticKITTI, OpenGraph mIoU (seq 03) 0.6051, F1 0.7302, outperforming RangeNet++ and DeepLab V3 (Deng et al., 2024).
  • Physical Feasibility Metrics: HOV-SG yields 0% OOB and overlap, KL-divergence 0.09, compared to ATISS (OOB 0.48%) and DiffuScene (OOB 0.77%) (Sun et al., 15 Feb 2025).
  • SGG Benchmarks: On Visual Genome, hierarchical relation heads (Bayesian or prompt-based) increase mR@50 by >6 points versus flat heads and yield a zero-shot PredCLS R@50 of 20.4 (vs baseline 3.6–15.1) (Jiang et al., 2023, Liu et al., 2024).
  • Navigation Success: In real-world trials, hierarchical open-vocabulary scene graphs enable robot navigation with 56.1–100% task completion for object, room, and floor-level goals (Werby et al., 2024, Devarakonda et al., 2024).
| Setting | Key Metric | HOV-SG Result | Baseline |
|---------------------------------|-------------------------|---------------------|-----------------------|
| HM3DSem (object retrieval) | AUC | 84.9% | 84.1% / 56.2% |
| SemanticKITTI (open-vocab seg.) | mIoU / F1 (seq 03) | 0.6051 / 0.7302 | 0.4780 / 0.6115 |
| Scene synthesis | OOB / overlap / KL | 0.0% / 0.0% / 0.09 | 0.48% / 0.18% / 0.19 |
| Visual Genome | zero-shot PredCLS R@50 | 20.4 | 3.6–15.1 |

5. Open-Vocabulary Relation Modeling and Hierarchical Prompting

HOV-SG advances open-vocabulary relation modeling by hierarchically structuring both the relation label space (using super-categories, e.g., geometric, possessive, semantic (Jiang et al., 2023)) and the textual representation pool (super-entities and region-level prompts (Liu et al., 2024)):

  • Bayesian Relation Heads: Factorize relation prediction into a super-category distribution and a conditional predicate distribution, $p(\text{pred}) = p(\text{super}) \cdot p(\text{pred} \mid \text{super})$, supporting seamless insertion of novel predicates and zero-shot retrieval (Jiang et al., 2023); a numeric sketch follows this list.
  • Hierarchical Prompt Pools: RAHP constructs two-level prompt libraries: (1) entity-aware: subject–object super-pair + predicate, (2) region-aware: LLM-mined spatially or functionally specific prompts (Liu et al., 2024). Dynamic selection mechanisms efficiently prune noise.
  • Clustering and Prompt Mining: Entities are clustered via text encoder and k-means, with LLMs generating human-interpretable super-entity names and region descriptions, balancing diversity and computational cost.
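
A numeric sketch of the Bayesian factorization $p(\text{pred}) = p(\text{super}) \cdot p(\text{pred} \mid \text{super})$; this is an interpretation of the hierarchical relation head in (Jiang et al., 2023), not their exact implementation, and the logit/head shapes are assumptions:

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def predicate_probs(super_logits: np.ndarray,
                    cond_logits: list[np.ndarray]) -> np.ndarray:
    """Flatten p(super) * p(pred | super) into one distribution over all predicates."""
    p_super = softmax(super_logits)            # e.g., geometric / possessive / semantic
    return np.concatenate([p * softmax(l)      # one conditional head per super-category
                           for p, l in zip(p_super, cond_logits)])
```

Under this factorization, a novel predicate is added by extending a single conditional head, leaving the super-category distribution and the other heads untouched.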

RAHP and similar frameworks show a 4–8 point mR@100 improvement on Visual Genome and Open Images v6, consistently increasing recall on novel predicate types (Liu et al., 2024).

6. Real-World Applications and System-Level Integration

HOV-SG underpins robust scene understanding pipelines in diverse embodied and virtual domains:

  • Language-Conditioned Navigation: HOV-SG supplies hierarchical grounding and planning for robots, parsing multi-level spatial queries (object, room, floor), integrating Voronoi-based motion graphs, and achieving real-world navigation in multi-floor or dynamic environments (Werby et al., 2024, Devarakonda et al., 2024).
  • Scene Synthesis: Hierarchical graph representations coupled with LLMs and hierarchy-aware graph nets enable scene generation pipelines that enforce both physical constraints and semantic fidelity, outperforming flat LLM methods in both physical feasibility and user-alignment (Sun et al., 15 Feb 2025).
  • Collaborative Mapping: Multi-agent systems such as CURB-OSG and OpenGraph scale HOV-SG principles to large-scale outdoor or urban settings, with dynamic fusion, zero-shot semantic labeling, and hierarchical memory-efficient map structures (Steinke et al., 11 Mar 2025, Deng et al., 2024).
  • Task Reasoning and Symbolic Querying: HOV-SG enables LLM-driven parsing of tasks and subgoal reasoning over structured semantic graphs, supporting robust agent interaction in realistic settings (Puigjaner et al., 2 Feb 2026, Liu et al., 2024).

7. Limitations, Open Challenges, and Future Research

Significant open issues remain:

  • Hierarchy/Cluster Design: Manual super-category taxonomies or static entity clustering risk misrepresenting semantic nuance; adaptive, data-driven clustering remains an area for improvement (Jiang et al., 2023, Liu et al., 2024).
  • Relation Diversity and Grounding: Region-aware prompt mining via LLMs introduces factual or diversity limits, especially in long-tail compositional scenarios (Liu et al., 2024).
  • Implicit Bias and Noise Propagation: Biases and noise inherited from foundation models can propagate through the graph hierarchy; hallucinations or category under-specification at one level compound at coarser levels (Deng et al., 2024).
  • Learning-Based Refinement: Many current HOV-SG frameworks use frozen vision–language backbones; end-to-end fine-tuning, contrastive objectives on open-vocab relation/attribute heads, and GNN-based graph refinement are promising directions (Jiang et al., 2023, Puigjaner et al., 2 Feb 2026, Deng et al., 2024).
  • Real-Time Constraints and Planner Integration: LLM-based planning logic is frequently offboard and incurs latency; on-device quantized LLM integration and closed-loop fusion are critical for scaling to high-frequency adaptive interaction (Devarakonda et al., 2024).

Ongoing work targets graph-neural extensions, richer affordance and functional edges, continual learning across agents, and extension to 4D spatiotemporal HOV-SG structures.


In summary, Hierarchical Open-Vocabulary Scene Graphs constitute a unifying formalism with demonstrated superiority in open-world vision-language tasks, embodied reasoning, and scalable robot scene understanding. By integrating multi-level abstraction, open-vocabulary semantics, and recent advances in LLMs and VLMs, HOV-SG frameworks are foundational components for the next generation of interpretable, robust, and generalizable perception and reasoning systems (Werby et al., 2024, Puigjaner et al., 2 Feb 2026, Liu et al., 2024, Sun et al., 15 Feb 2025, Xu et al., 2024, Devarakonda et al., 2024, Jiang et al., 2023, Deng et al., 2024, Steinke et al., 11 Mar 2025, Dai et al., 17 Apr 2025).
