
Open-Vocabulary 3D Scene Graph

Updated 30 March 2026
  • Open-Vocabulary 3D Scene Graph is a framework for building semantic, relational 3D representations that leverage pretrained vision and language models for zero-shot recognition.
  • It integrates multi-view 3D data with language-grounded features to enable robust scene understanding and dynamic, open-set object categorization.
  • Practical applications include embodied robotics, teleoperation, and scalable semantic mapping in complex indoor and outdoor environments.

Open-Vocabulary 3D Scene Graph (OVSG) is a paradigm for constructing semantic, relational representations of 3D environments that support zero-shot generalization to novel object categories and relationships by leveraging pretrained vision–language models (VLMs) and LLMs. OVSG architectures replace fixed class vocabularies with open-ended, language-grounded symbolic nodes and edges, enabling embodied agents to comprehend, traverse, and manipulate novel, dynamic, and unstructured scenes.

1. Formal Definition and Problem Space

An Open-Vocabulary 3D Scene Graph is typically formulated as a graph $G = (V, E)$, where nodes $V$ are 3D-anchored entities—objects, rooms, or agents—each with open-vocabulary semantic, geometric, and visual descriptors, and $E$ consists of edges representing explicit semantic or spatial relationships predicted without restriction to a closed set. Each node $v_i \in V$ stores:

  • A 3D bounding box or geometric anchor ($p_i$, $b_i$)
  • One or more language-derived semantic embeddings ($f_i$, $c_i$)
  • Open-text labels or captions

Edges $e_{ij} \in E$ encode relationships, which may be spatial (“on,” “next to”), functional, or context-dependent, with relationship features often produced via VLM or LLM inference, and can carry rich, open-vocabulary natural language descriptions or joint embeddings (Gu et al., 2023, Yan et al., 2024, Saxena et al., 24 Oct 2025, Yu et al., 8 Nov 2025, Koch et al., 2024).

The core problem tackled by OVSG is open-world, symbolically-grounded 3D scene representation: given 3D sensor data (RGB-D video, LiDAR, or point clouds), the system must detect, localize, classify, and relate entities using vocabularies unrestricted by the training corpus, operating in dynamic, partially observed, or highly novel environments.
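The formulation above maps naturally onto a small data structure. The sketch below is a minimal illustration (class and field names are our own, not from any cited system): each node carries a geometric anchor ($p_i$, $b_i$), a language-aligned embedding $f_i$, and an open-text caption; each edge carries an open-vocabulary predicate.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np

@dataclass
class Node:
    """A 3D-anchored entity: anchor (p_i, b_i), embedding f_i, caption."""
    node_id: int
    position: np.ndarray                  # p_i: 3D centroid
    bbox: np.ndarray                      # b_i: (min_xyz, max_xyz), shape (2, 3)
    embedding: np.ndarray                 # f_i: language-aligned feature
    caption: str = ""                     # open-text label or caption

@dataclass
class Edge:
    subj: int                             # subject node id
    obj: int                              # object node id
    relation: str                         # open-vocabulary predicate, e.g. "on"
    embedding: Optional[np.ndarray] = None  # optional joint relation feature

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, subj: int, obj: int, relation: str) -> None:
        self.edges.append(Edge(subj, obj, relation))

g = SceneGraph()
g.add_node(Node(0, np.zeros(3), np.array([[0, 0, 0], [1, 1, 1.0]]),
                np.ones(4), "table"))
g.add_node(Node(1, np.array([0, 0, 1.0]),
                np.array([[0, 0, 1], [0.2, 0.2, 1.2]]), np.ones(4), "cup"))
g.relate(1, 0, "on")                      # the cup is "on" the table
```

Because labels and predicates are free-form strings backed by embeddings rather than class indices, nothing in this structure restricts the vocabulary.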

2. Core Methodological Principles

2.1 Open-Vocabulary Perception via Foundation Models

OVSG systems rely fundamentally on foundation VLMs (e.g., CLIP, Grounding DINO, OpenSeg, BLIP2, LLaVA, Uni3D) to produce joint visual-linguistic feature spaces supporting zero-shot recognition. Detected regions or 2D/3D masks are scored by cosine similarity with arbitrary user- or context-specified text prompts, enabling free-form category prediction with

$$s_i(c) = \cos(\phi_i, t_c) = \frac{\langle \phi_i, t_c \rangle}{\|\phi_i\|\,\|t_c\|}$$

where $\phi_i$ is the node embedding and $t_c$ is the embedding of candidate label $c$ (Gu et al., 2023, Kassab et al., 2024, Wang et al., 2024, Steinke et al., 11 Mar 2025, Saxena et al., 24 Oct 2025, Koch et al., 2024). This open-set scoring is extensible to relationships, with predicate embeddings derived from LLM or VLM inference over textually-defined relation templates.
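The open-set scoring above reduces to normalized dot products. A minimal sketch with toy vectors standing in for CLIP-style embeddings (function and label names are illustrative; real embeddings are ~512-dimensional):

```python
import numpy as np

def open_vocab_scores(phi: np.ndarray, text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity s_i(c) between one node embedding phi, shape (D,),
    and a stack of candidate label embeddings text_embs, shape (C, D)."""
    phi_n = phi / np.linalg.norm(phi)
    t_n = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return t_n @ phi_n            # shape (C,): one score per candidate label

labels = ["chair", "table", "lamp"]
phi = np.array([0.9, 0.1, 0.0])   # toy node embedding
T = np.eye(3)                     # one toy unit "text embedding" per label
scores = open_vocab_scores(phi, T)
best = labels[int(np.argmax(scores))]   # → "chair"
```

Because the label set enters only through `text_embs`, swapping in a new prompt list changes the vocabulary without retraining anything.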

2.2 Multi-View and 3D Association

Core to OVSG is the aggregation of object observations across multiple RGB(-D) views or 3D fragments. Masks are associated and fused by comparing feature similarity (CLIP/DINO embeddings), geometric overlap (IoU of 3D bounding boxes), or hybrid affinity scores (Gu et al., 2023, Wang et al., 2024, Xu et al., 2024). To associate a view or fragment to an existing node, systems use attention-weighted or exponential moving-average updates:

$$\phi_i \leftarrow (1-\lambda)\,\phi_i + \lambda f^*_{2D}$$

with $f^*_{2D}$ being a view embedding; node creation and merging thresholds are crucial (Kassab et al., 2024, Linok et al., 16 Jul 2025).
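A sketch of one association-and-fuse step, combining 3D box IoU with feature similarity and the EMA update above; the thresholds and $\lambda$ here are illustrative, not values from any cited paper:

```python
import numpy as np

def iou_3d(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (min_xyz, max_xyz), shape (2, 3)."""
    lo, hi = np.maximum(a[0], b[0]), np.minimum(a[1], b[1])
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))
    union = float(np.prod(a[1] - a[0]) + np.prod(b[1] - b[0])) - inter
    return inter / union if union > 0 else 0.0

def associate_and_fuse(node_emb, node_box, det_emb, det_box,
                       sim_thresh=0.7, iou_thresh=0.25, lam=0.2):
    """Match a new detection to an existing node by feature similarity AND
    geometric overlap; on a match, fuse embeddings by the EMA rule
    phi <- (1 - lam) * phi + lam * f_2d. Returns (matched, embedding)."""
    sim = float(node_emb @ det_emb /
                (np.linalg.norm(node_emb) * np.linalg.norm(det_emb)))
    if sim >= sim_thresh and iou_3d(node_box, det_box) >= iou_thresh:
        return True, (1 - lam) * node_emb + lam * det_emb
    return False, node_emb      # no match: caller spawns a new node instead
```

In a full system the affinity would be a weighted combination of the two scores rather than a hard conjunction, but the structure is the same.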

2.3 Scene Graph Construction and Edge Prediction

Graph construction involves instantiating a node per cluster of multi-view, multi-modal aligned observations, each with aggregated descriptors. Edges are either inferred with geometric heuristics (e.g., based on spatial proximity, occlusion, or support) or by LLM/VLM reasoning given node captions and context (prompt templates: "What is the relationship between [subject] and [object]?") (Koch et al., 2024, Yu et al., 8 Nov 2025). Edge features often include spatial offsets, direction vectors, or explicit semantic labels.
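As a toy illustration of the two edge-prediction routes, the sketch below pairs a cheap geometric heuristic with the kind of prompt template quoted above; the distance thresholds and the z-up axis convention are assumptions, not from any cited system:

```python
from typing import Optional

import numpy as np

def spatial_relation(p_subj: np.ndarray, p_obj: np.ndarray,
                     near_thresh: float = 1.0) -> Optional[str]:
    """Cheap geometric heuristic for an initial predicate (z-up convention);
    ambiguous pairs can instead be deferred to an LLM via the prompt below."""
    offset = p_subj - p_obj
    if np.linalg.norm(offset) > near_thresh:
        return None                                  # too far apart: no edge
    if offset[2] > 0.05 and np.linalg.norm(offset[:2]) < 0.3:
        return "on"                                  # subject rests above object
    return "next to"

def relation_prompt(subj_caption: str, obj_caption: str) -> str:
    """Fill the template-style query used for LLM-based edge prediction."""
    return (f"What is the relationship between [{subj_caption}] "
            f"and [{obj_caption}]? Answer with a short predicate.")
```

Geometric heuristics are fast but limited to spatial predicates; the LLM route trades latency for open-vocabulary functional and contextual relations.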

2.4 Incremental, Hierarchical, and Dynamic Graph Updates

Leading work emphasizes efficient, incremental OVSG construction, supporting online updates, dynamic entity tracking, and local subgraph adjustments rather than full graph recomputation.

3. Canonical Pipelines and System Architectures

| System | Input | Perception Backbone | 3D Graph Nodes | Edge Semantics | Update Strategy |
|---|---|---|---|---|---|
| ConceptGraphs | RGB-D | DINO, SAM, CLIP | Fused object view | LLM-inferred, geometric | Incremental, view aggregation |
| ZING-3D | RGB-D | VLM + Grounded-SAM | Open class, centroid, embedding | Relation + metric distance | Incremental, semantic + spatial |
| Open3DSG | Point cloud + images | PointNet, OpenSeg, BLIP | 3D GNN node, CLIP proj. | LLM-generated open-set (QV) relations | Batch, distillation-guided |
| OGScene3D | RGB-D | Gaussian Splatting + CLIP | Clustered Gaussians | Edge via LLM prompts, proximity | Progressive, semantic confidence |
| Point2Graph | Point cloud | RoomFormer, Uni3D | Room/object | “contains,” plus post-hoc spatial | Batch, room–object hierarchy |
| LOST-3DSG | RGB-D | Word2Vec, Sentence-BERT | Semantic attrs | Hierarchical (“support,” “belongs-to”) | Lightweight semantic criteria |
| BBQ | RGB-D | MobileSAMv2, DINO, LLaVA | Object-centric | Fully-connected metric edges, optional semantic | Deductive pruning via LLM |

OVSG systems integrate:

  1. A perception backbone that produces 2D/3D object segmentation and multi-modal embeddings.
  2. Association mechanisms for fusing object evidence over time and viewpoint.
  3. Open vocabulary recognition for both nodes and relationship prediction.
  4. Scene graph construction with dynamic, incremental or hierarchical updates.
  5. Optionally, progressive LLM/VLM-driven entity, edge, or subgraph reasoning.
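The five stages can be wired together as a minimal, purely illustrative skeleton; every component interface here is a hypothetical stand-in for the real backbones:

```python
def process_frame(detections, graph, associate_fn, label_fn, relate_fn):
    """One incremental OVSG update: associate each detection to a node
    (step 2), label it open-vocabulary (step 3), then refresh edges (4-5)."""
    for det in detections:
        nid = associate_fn(graph, det)               # fuse into or spawn a node
        graph["nodes"][nid]["label"] = label_fn(det)
    relate_fn(graph)
    return graph

# Trivial stand-ins: association is "already solved" via a track id,
# labeling reads a precomputed best label, and edges are all-pairs "near".
def toy_associate(graph, det):
    graph["nodes"].setdefault(det["track_id"], {})
    return det["track_id"]

def toy_relate(graph):
    ids = sorted(graph["nodes"])
    graph["edges"] = [(a, b, "near") for a in ids for b in ids if a < b]

graph = {"nodes": {}, "edges": []}
dets = [{"track_id": 0, "best_label": "cup"},
        {"track_id": 1, "best_label": "table"}]
process_frame(dets, graph, toy_associate, lambda d: d["best_label"], toy_relate)
```

Real systems differ mainly in where they spend compute: per-frame (incremental), per-batch, or progressively via LLM reasoning over a cached graph.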

4. Evaluation Protocols and Empirical Results

Standard OVSG evaluation covers open-vocabulary node and edge prediction precision, retrieval-style recall over objects and relation triplets, and runtime. Selected metrics from recent work:

| System | Node Prec. | Edge Prec. | Open-vocab R@5 (Obj) | Triplet R@50 | Sr3D [email protected] | Realtime (fps) |
|---|---|---|---|---|---|---|
| ZING-3D | 0.97 | 0.96–0.98 | — | — | — | — |
| Open3DSG | — | — | 0.57 | 0.64 | — | — |
| BBQ | — | — | — | — | 0.23 | 1–1.5 |
| ConceptGraphs | 0.71 | 0.88 | 0.08 | 0.16 | — | — |
| LOST-3DSG | — | — | — | — | — | n/a |

Empirical evaluation emphasizes zero-shot generalization, efficient scaling to large or dynamic environments, and capabilities in downstream robotic planning and language-based interaction (Yan et al., 2024, Steinke et al., 11 Mar 2025, Yu et al., 8 Nov 2025, Zhu et al., 17 Mar 2026).

5. Application Domains and System Integration

OVSG has been adopted for embodied robotics, teleoperation, and scalable semantic mapping in complex indoor and outdoor environments.

Deployments typically use ROS environments, fuse with SLAM backends, and distribute computation, with runtime rates of 0.5–5 Hz (scene-graph construction bottlenecked by 2D segmentation and foundation-model inference).

6. Design Tradeoffs, Limitations, and Open Problems

Current design challenges and limitations include:

  • Tradeoffs between semantic fidelity and computational efficiency: cheap geometric heuristics or minimal region-growing for segmentation can match most of the accuracy of more expensive full 2D/3D segmentation pipelines (Kassab et al., 2024).
  • Over-segmentation or under-segmentation in large, flat, or cluttered structures (Wang et al., 2024, Kassab et al., 2024).
  • Ambiguity in object labeling due to synonymy (“couch” vs “sofa”) and label drift in VLM/LLM outputs (Kassab et al., 2024).
  • Non-adaptive hard thresholds for merging, association, or candidate pair filtering, which may degrade under high scene diversity (Yu et al., 8 Nov 2025).
  • Limited modeling of dynamic, articulated, or deformable entities (most approaches treat objects as rigid with persistent labels) (Zhu et al., 17 Mar 2026).
  • Scaling in real-world or high-entity-count scenes: reduced granularity in spatial or hierarchical reasoning, subgraph selection for LLM-based planning, and bottlenecks in foundation model inference (Wang et al., 27 Sep 2025, Yu et al., 8 Nov 2025).

Proposed research extensions include learning-based adaptive association rules, richer relationship modeling (beyond spatial), online segmentation with semantic feedback, and integration of richer affordance/functional attributes (Steinke et al., 11 Mar 2025, Wang et al., 2024, Kassab et al., 2024, Yan et al., 2024, Ferraina et al., 6 Jan 2026, Zhu et al., 17 Mar 2026).

7. Outlook and Future Directions

OVSG stands as a key enabler for scalable semantic reasoning in open environments. Future research is positioned to exploit:

  • End-to-end differentiable scene graph construction with fully incremental, uncertainty-aware, lifelong adaptation (Zhu et al., 17 Mar 2026).
  • Tighter integration of LLM-based multi-step reasoning for complex linguistic and physical context queries.
  • Expansion of graph attribute spaces to include dynamic behavior, agent intent, or high-level functional affordances.
  • Deployment in outdoor, dynamic, and large-scale real-world multi-agent settings, extending current hierarchical and collaborative OVSG frameworks (Steinke et al., 11 Mar 2025, Yu et al., 8 Nov 2025).
  • Unified benchmarking on manipulation, navigation, and human–robot interaction tasks under explicit open-vocabulary and compositional query metrics (Linok et al., 2024, Chang et al., 2023).

The convergence of foundation model-based perception with graph-based scene abstraction defines the technological trajectory of OVSG, supporting robust, interpretable, and adaptive embodied AI reasoning in complex 3D worlds.
