Open-Vocabulary 3D Scene Graph
- Open-Vocabulary 3D Scene Graph is a framework for building semantic, relational 3D representations that leverage pretrained vision and language models for zero-shot recognition.
- It integrates multi-view 3D data with language-grounded features to enable robust scene understanding and dynamic, open-set object categorization.
- Practical applications include embodied robotics, teleoperation, and scalable semantic mapping in complex indoor and outdoor environments.
Open-Vocabulary 3D Scene Graph (OVSG) is a paradigm for constructing semantic, relational representations of 3D environments that support zero-shot generalization to novel object categories and relationships by leveraging pretrained vision–language models (VLMs) and large language models (LLMs). OVSG architectures replace fixed class vocabularies with open-ended, language-grounded symbolic nodes and edges, enabling embodied agents to comprehend, traverse, and manipulate novel, dynamic, and unstructured scenes.
1. Formal Definition and Problem Space
An Open-Vocabulary 3D Scene Graph is typically formulated as a graph $G = (V, E)$, where $V$ is a set of nodes that are 3D-anchored entities—objects, rooms, or agents—each with open-vocabulary semantic, geometric, and visual descriptors, and $E$ consists of edges representing explicit semantic or spatial relationships predicted without restriction to a closed set. Each node $v_i \in V$ stores:
- A 3D bounding box or geometric anchor (bounding box $b_i$, centroid $c_i$)
- One or more language-derived semantic embeddings (e.g., a vision–language feature $f_i$ and a caption embedding $t_i$)
- Open-text labels or captions
Edges encode relationships, which may be spatial (“on,” “next to”), functional, or context-dependent, with relationship features often produced via VLM or LLM inference, and can carry rich, open-vocabulary natural language descriptions or joint embeddings (Gu et al., 2023, Yan et al., 2024, Saxena et al., 24 Oct 2025, Yu et al., 8 Nov 2025, Koch et al., 2024).
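These definitions translate directly into simple data structures. The following minimal Python sketch makes the node and edge contents concrete; the field names (`bbox`, `embedding`, `captions`, etc.) are illustrative assumptions, not the schema of any cited system:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class OVSGNode:
    """An open-vocabulary node: a 3D anchor plus language-grounded descriptors."""
    node_id: int
    bbox: np.ndarray       # (2, 3) axis-aligned min/max corners (b_i)
    centroid: np.ndarray   # (3,) geometric anchor (c_i)
    embedding: np.ndarray  # (d,) fused vision-language feature (f_i)
    captions: list[str] = field(default_factory=list)  # open-text labels

@dataclass
class OVSGEdge:
    """An open-vocabulary edge between two nodes."""
    subject: int                          # node_id of the subject
    target: int                           # node_id of the object
    relation: str                         # free-form predicate, e.g., "on"
    offset: np.ndarray | None = None      # (3,) spatial offset of centroids
    embedding: np.ndarray | None = None   # optional joint relation embedding

@dataclass
class OVSG:
    nodes: dict[int, OVSGNode] = field(default_factory=dict)
    edges: list[OVSGEdge] = field(default_factory=list)
```

Because labels and predicates are stored as free text or embeddings rather than class indices, the vocabulary remains open by construction.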
The core problem tackled by OVSG is open-world, symbolically-grounded 3D scene representation: given 3D sensor data (RGB-D video, LiDAR, or point clouds), the system must detect, localize, classify, and relate entities using vocabularies unrestricted by the training corpus, operating in dynamic, partially observed, or highly novel environments.
2. Core Methodological Principles
2.1 Open-Vocabulary Perception via Foundation Models
OVSG systems rely fundamentally on foundation VLMs (e.g., CLIP, Grounding DINO, OpenSeg, BLIP2, LLaVA, Uni3D) to produce joint visual-linguistic feature spaces supporting zero-shot recognition. Detected regions or 2D/3D masks are scored by cosine similarity with arbitrary user- or context-specified text prompts, enabling free-form category prediction with

$$p(c \mid v_i) = \frac{\exp\big(\cos(f_i, t_c)/\tau\big)}{\sum_{c'} \exp\big(\cos(f_i, t_{c'})/\tau\big)},$$

where $f_i$ is the node embedding and $t_c$ is the embedding of candidate label $c$ (Gu et al., 2023, Kassab et al., 2024, Wang et al., 2024, Steinke et al., 11 Mar 2025, Saxena et al., 24 Oct 2025, Koch et al., 2024). This open-set scoring is extensible to relationships, with predicate embeddings derived from LLM or VLM inference over textually defined relation templates.
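A minimal sketch of this scoring, assuming node and label embeddings come from a paired encoder (e.g., CLIP image/text towers); the temperature value and function name are illustrative:

```python
import numpy as np

def open_vocab_scores(node_emb: np.ndarray,
                      label_embs: np.ndarray,
                      tau: float = 0.07) -> np.ndarray:
    """Softmax over cosine similarities between one node embedding (d,)
    and k candidate label embeddings (k, d); labels can be arbitrary text."""
    node = node_emb / np.linalg.norm(node_emb)
    labels = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    logits = (labels @ node) / tau   # (k,) scaled cosine similarities
    logits -= logits.max()           # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Since the candidate set is supplied at query time, the same node can be scored against "sofa", "couch", or "place to sit" without any retraining.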
2.2 Multi-View and 3D Association
Core to OVSG is the aggregation of object observations across multiple RGB(-D) views or 3D fragments. Masks are associated and fused by comparing feature similarity (CLIP/DINO embeddings), geometric overlap (IoU of 3D bounding boxes), or hybrid affinity scores (Gu et al., 2023, Wang et al., 2024, Xu et al., 2024). To associate a view or fragment to an existing node, systems use attention-weighted or exponential moving-average updates:
$$f_i \leftarrow (1 - \alpha)\, f_i + \alpha\, f_v,$$

with $f_v$ being a view embedding and $\alpha$ the update weight; node creation and merging thresholds are crucial (Kassab et al., 2024, Linok et al., 16 Jul 2025).
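A sketch of hybrid association with the EMA fusion step, reusing the `OVSGNode` layout from the Section 1 sketch; the affinity weights and threshold are illustrative assumptions:

```python
import numpy as np

def iou_3d(a_min, a_max, b_min, b_max) -> float:
    """IoU of two axis-aligned 3D boxes, each given as (3,) min/max corners."""
    def vol(lo, hi):
        return np.clip(hi - lo, 0, None).prod()
    inter = vol(np.maximum(a_min, b_min), np.minimum(a_max, b_max))
    union = vol(a_min, a_max) + vol(b_min, b_max) - inter
    return float(inter / union) if union > 0 else 0.0

def associate_and_fuse(node, obs_emb, obs_min, obs_max,
                       w_feat=0.5, w_geom=0.5, thresh=0.6, alpha=0.2):
    """Hybrid affinity = feature similarity + 3D IoU; on a match, fuse the
    view by EMA: f_i <- (1 - alpha) * f_i + alpha * f_v."""
    feat_sim = float(node.embedding @ obs_emb /
                     (np.linalg.norm(node.embedding) * np.linalg.norm(obs_emb)))
    geom = iou_3d(node.bbox[0], node.bbox[1], obs_min, obs_max)
    if w_feat * feat_sim + w_geom * geom >= thresh:
        node.embedding = (1 - alpha) * node.embedding + alpha * obs_emb
        return True   # observation fused into the existing node
    return False      # caller should instantiate a new node instead
```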
2.3 Scene Graph Construction and Edge Prediction
Graph construction involves instantiating a node per cluster of multi-view, multi-modal aligned observations, each with aggregated descriptors. Edges are either inferred with geometric heuristics (e.g., based on spatial proximity, occlusion, or support) or by LLM/VLM reasoning given node captions and context (prompt templates: "What is the relationship between [subject] and [object]?") (Koch et al., 2024, Yu et al., 8 Nov 2025). Edge features often include spatial offsets, direction vectors, or explicit semantic labels.
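The two edge-prediction routes can be sketched as follows; the geometric thresholds are illustrative assumptions, and the prompt template is the one quoted above:

```python
import numpy as np

def geometric_relation(subj, obj, near_thresh=1.0, support_eps=0.05):
    """Toy heuristic predicates from OVSGNode geometry (units: meters);
    thresholds are illustrative and would be tuned per dataset."""
    # "on": the subject's bottom face rests near the object's top face,
    # and the subject's centroid lies within the object's xy footprint.
    if (abs(subj.bbox[0][2] - obj.bbox[1][2]) < support_eps
            and np.all(subj.centroid[:2] >= obj.bbox[0][:2])
            and np.all(subj.centroid[:2] <= obj.bbox[1][:2])):
        return "on"
    # "next to": centroids fall within a proximity radius.
    if np.linalg.norm(subj.centroid - obj.centroid) < near_thresh:
        return "next to"
    return None  # no heuristic fires; defer to LLM/VLM reasoning

def relation_prompt(subj_caption: str, obj_caption: str) -> str:
    """The prompt template quoted above, filled with node captions."""
    return f"What is the relationship between {subj_caption} and {obj_caption}?"
```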
2.4 Incremental, Hierarchical, and Dynamic Graph Updates
Leading work emphasizes efficient, incremental OVSG construction, supporting online updates, dynamic entity tracking, and local subgraph adjustments. Techniques include:
- Node matching via a combined feature + spatial cost to enable merging, splitting, or updating nodes as new input arrives (see the matching sketch after this list) (Saxena et al., 24 Oct 2025, Yu et al., 8 Nov 2025, Yan et al., 2024, Ferraina et al., 6 Jan 2026)
- Hierarchical scene graph representations, decomposing environments into multi-level (floor/room/object) structured graphs, supporting both intra- and inter-layer reasoning (Linok et al., 16 Jul 2025, Steinke et al., 11 Mar 2025)
- Temporal tracking and latency-tagged graphs, which enrich OVSGs for teleoperation and dynamic scenes using temporal matching costs (e.g., centroid distance and embedding similarity) and latent state tags (Wang et al., 27 Sep 2025, Ferraina et al., 6 Jan 2026)
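The node-matching item above can be realized as a bipartite assignment over a combined feature + spatial cost matrix; a minimal sketch, where the weights and the gating cost `max_cost` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_nodes(graph_embs, graph_pos, obs_embs, obs_pos,
                w_feat=1.0, w_pos=0.5, max_cost=1.2):
    """Match incoming observations to existing nodes via a combined
    feature + spatial cost, solved as bipartite assignment.
    graph_embs: (n, d), graph_pos: (n, 3); obs_embs: (m, d), obs_pos: (m, 3).
    Returns (matches, unmatched_obs): matched pairs trigger node updates;
    unmatched observations spawn new nodes."""
    g = graph_embs / np.linalg.norm(graph_embs, axis=1, keepdims=True)
    o = obs_embs / np.linalg.norm(obs_embs, axis=1, keepdims=True)
    feat_cost = 1.0 - g @ o.T                               # (n, m)
    pos_cost = np.linalg.norm(graph_pos[:, None] - obs_pos[None], axis=-1)
    cost = w_feat * feat_cost + w_pos * pos_cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_obs = {c for _, c in matches}
    unmatched = [j for j in range(obs_embs.shape[0]) if j not in matched_obs]
    return matches, unmatched
```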
3. Canonical Pipelines and System Architectures
| System | Input | Perception Backbone | 3D Graph Nodes | Edge Semantics | Update Strategy |
|---|---|---|---|---|---|
| ConceptGraphs | RGB-D | DINO, SAM, CLIP | Fused object view | LLM-inferred, geometric | Incremental, view aggregation |
| ZING-3D | RGB-D | VLM + Grounded-SAM | Open class, centroid, embedding | Edge = relation + metric distance | Incremental, semantic+spatial |
| Open3DSG | Point cloud + images | PointNet, OpenSeg, BLIP | 3D GNN node, CLIP proj | LLM-generated open-set (QV) relations | Batch, distillation-guided |
| OGScene3D | RGB-D | Gaussian Splatting + CLIP | Clustered Gaussians | Edge via LLM prompts, proximity | Progressive, semantic confidence |
| Point2Graph | Point cloud | RoomFormer, Uni3D | Room/object | “contains,” plus post-hoc spatial | Batch, room-object hierarchy |
| LOST-3DSG | RGB-D | Word2Vec, Sentence-BERT | Semantic attrs | Hierarchical (“support,” “belongs-to”) | Lightweight semantic criteria |
| BBQ | RGB-D | MobileSAMv2, DINO, LLaVA | Object-centric | Fully-connected metric edges, optional semantic | Deductive pruning via LLM |
OVSG systems integrate:
- A perception backbone that produces 2D/3D object segmentation and multi-modal embeddings.
- Association mechanisms for fusing object evidence over time and viewpoint.
- Open vocabulary recognition for both nodes and relationship prediction.
- Scene graph construction with dynamic, incremental or hierarchical updates.
- Optionally, progressive LLM/VLM-driven entity, edge, or subgraph reasoning, as sketched in the update loop below.
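A schematic update loop tying these components together; `perception.segment_and_embed`, `graph.add_node`, `graph.candidate_pairs`, `graph.add_edge`, and `llm.complete` are placeholder interfaces (not the API of any cited system), and the helpers are the sketches from Section 2:

```python
def update_scene_graph(graph, rgbd_frame, perception, llm=None):
    """One incremental OVSG update step mirroring the component list above."""
    # 1. Perception backbone: 2D/3D masks plus multi-modal embeddings.
    detections = perception.segment_and_embed(rgbd_frame)
    # 2. Association: fuse evidence into the first matching node, else create one.
    for det in detections:
        fused = any(associate_and_fuse(n, det.embedding,
                                       det.bbox_min, det.bbox_max)
                    for n in graph.nodes.values())
        if not fused:
            graph.add_node(det)
    # 3. Open-vocabulary edges: geometric heuristics first, LLM fallback.
    for subj, obj in graph.candidate_pairs():
        rel = geometric_relation(subj, obj)
        if rel is None and llm is not None:
            rel = llm.complete(relation_prompt(subj.captions[0],
                                               obj.captions[0]))
        if rel:
            graph.add_edge(subj.node_id, obj.node_id, rel)
    return graph
```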
4. Evaluation Protocols and Empirical Results
Standard OVSG benchmarks employ:
- Replica, ScanNet, HM3D, 3DSSG for semantic/instance segmentation and open-vocabulary retrieval
- Sr3D/Nr3D/ScanRefer for language-grounded object localization tasks
- Real-world robot deployments for manipulation, navigation, and teleoperation (Zhu et al., 17 Mar 2026, Chang et al., 2023, Saxena et al., 24 Oct 2025, Xu et al., 2024, Yan et al., 2024)
Quantitative results consistently show:
- Node/edge precision reaching 0.97 (nodes) and 0.96–0.98 (edges) for ZING-3D on Replica/HM3D (Saxena et al., 24 Oct 2025)
- Triplet (subject–predicate–object) recall approaching 0.66–0.78 in zero-shot settings (Yu et al., 8 Nov 2025, Koch et al., 2024)
- Robust performance on novel classes and relationships (tail object accuracy >0.4 in Open3DSG) (Koch et al., 2024)
- OVSG methods outperform closed-vocabulary or image-fusion baselines on context-sensitive language tasks (Chang et al., 2023, Linok et al., 2024)
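The triplet recall figures above follow the standard recall@K protocol over predicted (subject, predicate, object) triples; a minimal sketch:

```python
def triplet_recall_at_k(pred_triplets, gt_triplets, k=50):
    """Recall@K over (subject, predicate, object) triples: the fraction of
    ground-truth triples recovered among the top-K scored predictions.
    `pred_triplets` is a list of ((subj, pred, obj), score) pairs."""
    ranked = sorted(pred_triplets, key=lambda t: t[1], reverse=True)
    top_k = {triple for triple, _ in ranked[:k]}
    gt = set(gt_triplets)
    return len(gt & top_k) / len(gt) if gt else 0.0
```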
Selected metrics from recent work:
| System | Node Prec. | Edge Prec. | Open-vocab R@5 (Obj) | Triplet R@50 | Sr3D Acc@0.25 | Realtime (fps) |
|---|---|---|---|---|---|---|
| ZING-3D | 0.97 | 0.96–0.98 | – | – | – | – |
| Open3DSG | – | – | 0.57 | 0.64 | – | – |
| BBQ | – | – | – | – | 0.23 | 1–1.5 |
| ConceptGraphs | 0.71 | 0.88 | – | – | 0.08 | 0.16 |
| LOST-3DSG | – | – | – | – | – | n/a |
Empirical evaluation emphasizes zero-shot generalization, efficient scaling to large or dynamic environments, and capabilities in downstream robotic planning and language-based interaction (Yan et al., 2024, Steinke et al., 11 Mar 2025, Yu et al., 8 Nov 2025, Zhu et al., 17 Mar 2026).
5. Application Domains and System Integration
OVSG has been adopted for:
- Embodied robotic navigation and manipulation in dynamic indoor and outdoor scenes, leveraging open-vocabulary graphs for long-horizon planning, robust context tracking, and language-driven subtask grounding (Yu et al., 8 Nov 2025, Yan et al., 2024, Zhu et al., 17 Mar 2026).
- Mobile multi-agent mapping and fusion: collaborative inference in unaligned coordinate frames using CURB-OSG (Steinke et al., 11 Mar 2025).
- Teleoperation: integration of spatio-temporal OVSGs with latency-aware language planners for robust control under real-world delays (Wang et al., 27 Sep 2025).
- Large-scale semantic mapping: scalable OVSGs employing data- and memory-efficient representations (e.g., adaptive octree graphs with semantic confidence) (Wang et al., 2024).
- Explicit object, region, and agent grounding and context disambiguation for natural-language queries in repetitive or ambiguous environments (Chang et al., 2023, Linok et al., 2024, Saxena et al., 24 Oct 2025).
Deployments typically use ROS environments, fusion with SLAM backends, and distributed computation, with runtime rates of 0.5–5 Hz (scene-graph construction is bottlenecked by 2D segmentation and foundation-model inference).
6. Design Tradeoffs, Limitations, and Open Problems
Current design challenges and limitations include:
- Tradeoffs between semantic fidelity and computational efficiency: cheap geometric heuristics or minimal region-growing for segmentation recover most of the accuracy of far more expensive full 2D/3D segmentation pipelines (Kassab et al., 2024).
- Over-segmentation or under-segmentation in large, flat, or cluttered structures (Wang et al., 2024, Kassab et al., 2024).
- Ambiguity in object labeling due to synonymy (“couch” vs “sofa”) and label drift in VLM/LLM outputs (Kassab et al., 2024).
- Non-adaptive hard thresholds for merging, association, or candidate pair filtering, which may degrade under high scene diversity (Yu et al., 8 Nov 2025).
- Limited modeling of dynamic, articulated, or deformable entities (most approaches treat objects as rigid with persistent labels) (Zhu et al., 17 Mar 2026).
- Scaling in real-world or high-entity-count scenes: reduced granularity in spatial or hierarchical reasoning, subgraph selection for LLM-based planning, and bottlenecks in foundation model inference (Wang et al., 27 Sep 2025, Yu et al., 8 Nov 2025).
Proposed research extensions include learning-based adaptive association rules, richer relationship modeling (beyond spatial), online segmentation with semantic feedback, and integration of richer affordance/functional attributes (Steinke et al., 11 Mar 2025, Wang et al., 2024, Kassab et al., 2024, Yan et al., 2024, Ferraina et al., 6 Jan 2026, Zhu et al., 17 Mar 2026).
7. Outlook and Future Directions
OVSG stands as a key enabler for scalable semantic reasoning in open environments. Future research is positioned to exploit:
- End-to-end differentiable scene graph construction with fully incremental, uncertainty-aware, lifelong adaptation (Zhu et al., 17 Mar 2026).
- Tighter integration of LLM-based multi-step reasoning for complex linguistic and physical context queries.
- Expansion of graph attribute spaces to include dynamic behavior, agent intent, or high-level functional affordances.
- Deployment in outdoor, dynamic, and large-scale real-world multi-agent settings, extending current hierarchical and collaborative OVSG frameworks (Steinke et al., 11 Mar 2025, Yu et al., 8 Nov 2025).
- Unified benchmarking on manipulation, navigation, and human–robot interaction tasks under explicit open-vocabulary and compositional query metrics (Linok et al., 2024, Chang et al., 2023).
The convergence of foundation model-based perception with graph-based scene abstraction defines the technological trajectory of OVSG, supporting robust, interpretable, and adaptive embodied AI reasoning in complex 3D worlds.