ConceptGraphs: Open-Vocabulary 3D Scenes

Updated 15 April 2026

ConceptGraphs are open-vocabulary, graph-structured representations for 3D scenes that unify object-level data, semantic embeddings, and spatial relations for enhanced scene understanding.
The framework employs 2D segmentation, CLIP-based semantic encoding, and 3D fusion to create persistent object nodes and dynamically predicted relational edges.
Applications in robotics and planning benefit from real-time LLM-driven reasoning, though current limitations include flat graph structures and potential token inefficiencies.

ConceptGraphs are open-vocabulary, graph-structured representations for 3D scenes. They fuse the outputs of foundation models through multi-view association to encode rich semantic and spatial relations for perception and planning. The framework unifies object-level 3D data association, semantic embedding, and relation extraction using vision-language models (VLMs) and large language models (LLMs), enabling generalization to novel semantic classes without task-specific training or manual annotation (Gu et al., 2023). Originally introduced for robotics and autonomous perception, ConceptGraphs have also influenced interpretable machine learning (e.g., GraphCBMs (Xu et al., 19 Aug 2025)) and catalyzed further development of training-free 3D scene reasoning systems (Feng et al., 11 Nov 2025).

1. Mathematical and Structural Foundations

A ConceptGraph formalizes a scene as a graph $G = (V, E)$ with the following structure:

Nodes $o_j \in V$ represent physical objects, each with:
- A 3D point cloud $p_j \subset \mathbb{R}^3$
- A $d$-dimensional semantic embedding $\phi(o_j) \in \mathbb{R}^d$ (CLIP/DINO feature)

Edges $E \subseteq V \times V$ represent semantic or spatial relations between pairs of objects, with edge features $\psi(u, v) \in \mathbb{R}^k$ denoting the relationship label (e.g., "on top of", "inside"), produced via LLM-based relation prediction or deterministic spatial heuristics.
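The structure $G = (V, E)$ can be pictured as a small container type. The following is a minimal sketch, not the authors' implementation: the class and field names (`ObjectNode`, `ConceptGraph`, `tag`, `relations_of`) are hypothetical, and for simplicity the edge feature $\psi(u, v)$ is stored as a plain text label rather than a vector in $\mathbb{R}^k$.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class ObjectNode:
    """One node o_j: a point cloud p_j plus a d-dimensional embedding phi(o_j)."""
    node_id: int
    points: List[Point]        # p_j, a finite subset of R^3
    embedding: List[float]     # phi(o_j) in R^d, e.g. a CLIP feature
    tag: str = ""              # optional open-vocabulary text label (assumed field)


@dataclass
class ConceptGraph:
    """G = (V, E); each directed edge (u, v) carries a relation label."""
    nodes: Dict[int, ObjectNode] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def add_node(self, node: ObjectNode) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, u: int, v: int, relation: str) -> None:
        # E is a subset of V x V, so both endpoints must already exist.
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must already be nodes in V")
        self.edges[(u, v)] = relation

    def relations_of(self, u: int) -> List[Tuple[int, str]]:
        """All (target node id, relation label) pairs leaving node u."""
        return [(v, r) for (s, v), r in self.edges.items() if s == u]


# Toy scene: a mug resting on a table (2-d embeddings are placeholders).
graph = ConceptGraph()
graph.add_node(ObjectNode(0, [(0.0, 0.0, 0.0)], [1.0, 0.0], "table"))
graph.add_node(ObjectNode(1, [(0.0, 0.0, 0.8)], [0.0, 1.0], "mug"))
graph.add_edge(1, 0, "on top of")
```

Storing edges in a dictionary keyed by node pairs keeps the graph sparse: only relations that were actually predicted occupy memory.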

The construction of ConceptGraphs leverages the pipeline:

2D Region Extraction: Segment every input RGB image using a class-agnostic segmenter (e.g., SAM).
Semantic Encoding: Compute region embeddings with a vision-language model (e.g., CLIP), yielding $f_{t, i} = \mathrm{Embed}_{\text{CLIP}}(\text{crop}(I_t, m_{t,i}))$.
Open-Vocabulary Labeling: Assign text labels to each region by maximizing cosine similarity with candidate class embeddings.
3D Fusion and Association: Project 2D regions into 3D using depth and pose, denoise via clustering, and associate masks to existing nodes based on semantic and geometric proximity.
Edge Instantiation: Prune candidate edges using 3D IoU and establish spatial predicates or invoke an LLM to generate free-form relationship labels.
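The open-vocabulary labeling step reduces to an argmax over cosine similarities. The sketch below illustrates that step only; the function names are hypothetical, and the 3-dimensional vectors are toy stand-ins for real CLIP image and text embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def label_region(region_embedding: List[float],
                 class_embeddings: Dict[str, List[float]]) -> Tuple[str, float]:
    """Return the candidate label whose embedding is most similar to the region."""
    scores = {name: cosine_similarity(region_embedding, emb)
              for name, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy vectors standing in for text embeddings of candidate class names.
candidates = {
    "chair": [1.0, 0.1, 0.0],
    "sofa": [0.7, 0.7, 0.0],
    "lamp": [0.0, 0.1, 1.0],
}
region = [0.6, 0.8, 0.1]  # stand-in for f_{t,i}
label, score = label_region(region, candidates)
print(label)  # sofa
```

Because the candidate set is just a dictionary of text embeddings, new class names can be added at query time without retraining anything.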

This approach maintains scalability by associating 2D detections across views to persistent 3D object nodes, avoiding per-point storage and supporting generalization to open-vocabulary queries (Gu et al., 2023).

2. Open-Vocabulary Semantic Encoding and Relation Modeling

ConceptGraphs use foundation models pre-trained on large collections of image-text pairs to encode semantics without a fixed, predefined label set:

Object Tokens: Each detected object receives a feature vector and text tag via CLIP, enabling arbitrary label queries without re-training.
Relation Tokens: For each potential object pair, relation labels are:
- Derived heuristically (e.g., via bounding box overlaps and predefined spatial rules: "above", "next to", "inside").
- Or, generated through LLM prompts that synthesize spatial and semantic context into natural-language labels.
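The heuristic branch can be illustrated with axis-aligned bounding boxes. This is a sketch under stated assumptions rather than the rules used by the authors: the z axis is taken to point up, and the function names and both tolerances (`contact_tol`, `near_tol`, in metres) are hypothetical choices.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner); z axis assumed to point up


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer` on all three axes."""
    return all(outer[0][i] <= inner[0][i] and inner[1][i] <= outer[1][i]
               for i in range(3))


def overlap(a: Box, b: Box, axis: int) -> float:
    """Length of the overlap of the two boxes along one axis (0.0 if disjoint)."""
    return max(0.0, min(a[1][axis], b[1][axis]) - max(a[0][axis], b[0][axis]))


def horizontal_gap(a: Box, b: Box) -> float:
    """Euclidean gap between the xy footprints of the two boxes."""
    gaps = [max(0.0, a[0][i] - b[1][i], b[0][i] - a[1][i]) for i in (0, 1)]
    return (gaps[0] ** 2 + gaps[1] ** 2) ** 0.5


def spatial_relation(a: Box, b: Box,
                     contact_tol: float = 0.05,
                     near_tol: float = 0.5) -> Optional[str]:
    """Relation of box `a` with respect to box `b`, or None if no rule fires."""
    if contains(b, a):
        return "inside"
    footprints_overlap = overlap(a, b, 0) > 0 and overlap(a, b, 1) > 0
    if footprints_overlap and a[0][2] >= b[1][2] - contact_tol:
        return "above"
    if overlap(a, b, 2) > 0 and horizontal_gap(a, b) <= near_tol:
        return "next to"
    return None


table = ((0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
mug = ((0.4, 0.4, 0.75), (0.5, 0.5, 0.85))
chair = ((1.2, 0.0, 0.0), (1.7, 0.5, 0.9))
drawer = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
spoon = ((0.2, 0.2, 0.2), (0.4, 0.3, 0.25))

print(spatial_relation(mug, table))     # above
print(spatial_relation(chair, table))   # next to
print(spatial_relation(spoon, drawer))  # inside
```

Rules of this kind are cheap and deterministic, which is why they are a natural fallback when an LLM call per object pair would be too costly.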

This dual semantic–spatial encoding allows downstream LLMs to perform concept-grounded reasoning over the entire scene graph without symbolic predefinition of classes or relationships (Gu et al., 2023).

3. Graph Construction Algorithms

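The canonical ConceptGraphs pipeline applies the stages outlined in Section 1 to each posed input frame: class-agnostic 2D segmentation, per-region semantic encoding, projection of regions into 3D, association of each detection with an existing object node (or creation of a new node when no match is found), and edge instantiation between the resulting nodes.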

This algorithm ensures that concept nodes are created or fused across multiple views in a greedy but computationally efficient fashion. Final graphs are sparse, object-centric, and directly exploit both geometric coherence and semantic similarity for association (Gu et al., 2023).
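A greedy create-or-fuse step of this kind might look as follows. This is a simplified sketch, not the published algorithm: the scoring rule (fraction of nearby points plus cosine similarity), the running-average embedding update, and the `radius` and `threshold` values are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Node:
    points: List[Point]
    embedding: List[float]
    n_obs: int = 1  # number of detections fused into this node


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def geometric_overlap(det: List[Point], node: List[Point], radius: float) -> float:
    """Fraction of detection points that have a node point within `radius`."""
    if not det:
        return 0.0
    near = sum(1 for p in det if any(math.dist(p, q) <= radius for q in node))
    return near / len(det)


def associate(nodes: List[Node], det_points: List[Point],
              det_embedding: List[float],
              radius: float = 0.05, threshold: float = 1.1) -> int:
    """Fuse a detection into the best node whose score exceeds `threshold`,
    otherwise create a new node. Returns the index of the node used."""
    best_idx, best_score = -1, threshold
    for idx, node in enumerate(nodes):
        score = (geometric_overlap(det_points, node.points, radius)
                 + cosine(det_embedding, node.embedding))
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx < 0:
        nodes.append(Node(list(det_points), list(det_embedding)))
        return len(nodes) - 1
    node = nodes[best_idx]
    node.points.extend(det_points)
    n = node.n_obs
    # Running average keeps the embedding representative of all fused views.
    node.embedding = [(n * e + d) / (n + 1)
                      for e, d in zip(node.embedding, det_embedding)]
    node.n_obs = n + 1
    return best_idx


nodes: List[Node] = []
a = associate(nodes, [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0)], [1.0, 0.0])  # new node
b = associate(nodes, [(0.02, 0.0, 0.0)], [0.9, 0.1])                   # fused
c = associate(nodes, [(2.0, 0.0, 0.0)], [0.0, 1.0])                    # new node
print(a, b, c)  # 0 0 1
```

Each detection is compared only against existing object nodes, so the cost per frame grows with the number of objects rather than the number of stored points.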

4. Downstream Reasoning and Applications

ConceptGraphs are designed to interface naturally with LLM-driven perception and planning:

Task grounding: The scene graph, serialized as structured JSON or text, becomes the context window for an LLM which interprets abstract commands (e.g., "find an object to sit on other than a chair") and grounds them to spatial predicates and object instances.
Formal query evaluation: Downstream modules can evaluate logical formulas (e.g., $\exists o \in V: \text{tag}(o) \in \{\mathrm{sofa}, \mathrm{bench}\} \wedge \text{next\_to}(o, \mathrm{table})$) over $G$ to identify suitable candidates and return actionable parameters (e.g., the pose of the selected object for navigation).
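Both interfaces can be sketched on a toy scene. The JSON schema below (`objects`, `edges`, `tag`, `center`) and the helper names are hypothetical, not the serialization format used by the authors; edges are treated as directed from subject to object, and the object center stands in for a navigation goal.

```python
import json
from typing import Dict, List, Optional, Set

scene = {
    "objects": [
        {"id": 0, "tag": "table", "center": [1.0, 2.0, 0.4]},
        {"id": 1, "tag": "sofa", "center": [1.5, 3.0, 0.4]},
        {"id": 2, "tag": "chair", "center": [0.2, 2.0, 0.4]},
        {"id": 3, "tag": "bench", "center": [6.0, 1.0, 0.3]},
    ],
    "edges": [
        {"subject": 1, "relation": "next to", "object": 0},
        {"subject": 2, "relation": "next to", "object": 0},
    ],
}


def serialize(graph: Dict) -> str:
    """Text form of the graph, suitable for pasting into an LLM prompt."""
    return json.dumps(graph, indent=2)


def find_candidates(graph: Dict, tags: Set[str],
                    relation: str, anchor_tag: str) -> List[Dict]:
    """Objects o with tag(o) in `tags` and relation(o, anchor) present in E."""
    by_id = {obj["id"]: obj for obj in graph["objects"]}
    hits = []
    for edge in graph["edges"]:
        subj, anchor = by_id[edge["subject"]], by_id[edge["object"]]
        if (subj["tag"] in tags and edge["relation"] == relation
                and anchor["tag"] == anchor_tag):
            hits.append(subj)
    return hits


prompt_context = serialize(scene)
# Exists o in V: tag(o) in {sofa, bench} and next_to(o, table)
matches = find_candidates(scene, {"sofa", "bench"}, "next to", "table")
goal: Optional[List[float]] = matches[0]["center"] if matches else None
print(goal)  # [1.5, 3.0, 0.4]
```

The bench is excluded because no "next to" edge links it to the table, which shows how the relation predicate narrows the tag-based candidates.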

This flexible paradigm supports complex multi-step queries, semantic scene understanding, and general-purpose task planning in robotics and similar domains (Gu et al., 2023).

5. Comparative Developments and Limitations

ConceptGraphs represent a foundational step in training-free, LLM-augmented scene understanding but exhibit important limitations:

Flat Graph Limitation: All nodes are organized in a non-hierarchical (flat) structure. Every object-object pair can potentially be connected, which can induce redundancy and excessive context length in downstream LLM reasoning.
Task-Irrelevant Context: The static, unfiltered graph is serialized in its entirety for every query, often overwhelming LLMs with irrelevant scene details, raising token count, and impairing both accuracy and efficiency (Feng et al., 11 Nov 2025).
No plane-based hierarchy: Spatial context is limited to local object-object relationships, without large-scale organizational structure.
Performance: On benchmarks such as Space3D-Bench, the flat-graph ConceptGraphs approach achieved EM@1 accuracy of 26.94% and an average inference time of 1.47 s per query (Feng et al., 11 Nov 2025).
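To make the flat-graph concern concrete, the number of unordered object pairs grows quadratically with the number of nodes. The short calculation below gives that upper bound on candidate edges; the object counts are illustrative, and pruning (for example by 3D IoU) would reduce the number of edges actually kept.

```python
def candidate_pairs(n_objects: int) -> int:
    """Number of unordered object pairs, n * (n - 1) / 2."""
    return n_objects * (n_objects - 1) // 2


for n in (10, 50, 200):
    print(n, candidate_pairs(n))
# 10 45
# 50 1225
# 200 19900
```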

Successors such as Sparse3DPR address these limitations by introducing hierarchical scene graphs centered on planes (floors, walls) and dynamic, task-adaptive subgraph filtering. This yields significant improvements: 28.7% higher EM@1 and a 78.2% reduction in inference latency compared to ConceptGraphs (Feng et al., 11 Nov 2025).

6. Relation to Broader Graph-Theoretic and Machine Learning Contexts

The term "ConceptGraph" must be distinguished from the category-theoretic notion of "conceptual graphs" (Grphs), as defined in the study of concrete graph categories (McRae et al., 2012). In that formalism, a conceptual graph is a quadruple comprising a set of parts and a set of vertices, with edges and incidence characterized by structure-preserving maps. These foundations are mathematically general and encompass various subcategories of graphs relevant in both algebraic and applied settings, but they are not directly invoked in the ConceptGraphs implementation for 3D scene understanding.

In interpretable deep learning, ConceptGraphs also inspire latent concept graphs as mediators of structured knowledge, as found in Graph Concept Bottleneck Models (GraphCBMs) (Xu et al., 19 Aug 2025). There, a learnable latent adjacency supports concept-to-concept message passing, enhancing both model accuracy and intervention interpretability through explicit graph-based interactions.

7. Significance and Outlook

ConceptGraphs are notable for enabling open-vocabulary, scalable, and interpretable scene understanding without requiring manual 3D annotation or model fine-tuning. By tying advances in foundation vision-language models and prompt-driven LLMs to multi-view 3D graph construction, they provide a template for generalized perception, planning, and reasoning pipelines in robotics and autonomous systems. Current research trajectories suggest that increasingly hierarchical, context-adaptive, and sparse graph designs will further improve efficiency and accuracy (Feng et al., 11 Nov 2025). A plausible implication is that future frameworks will continue to couple zero-shot open-vocabulary perception with compositional, symbolic, or logic-based LLM planners, mediated by more expressive and context-sensitive concept graphs.


References

Gu et al. (2023). ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning.

Xu et al. (2025). Graph Concept Bottleneck Models.

Feng et al. (2025). Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views.

McRae et al. (2012). On the Concrete Categories of Graphs.
