
Neuro-Symbolic Scene Graphs

Updated 4 January 2026
  • Neuro-symbolic scene graphs are structured models that combine neural perception with explicit symbolic reasoning to represent objects, attributes, and relations in visual scenes.
  • They utilize a modular pipeline—including perceptual backbones, relation estimation, and graph construction—to fuse deep learning outputs with ontology-based reasoning for improved interpretability.
  • Applications range from robotics and visual question answering to synthetic data generation and autonomous driving, demonstrating significant gains in accuracy and planning efficiency.

Neuro-symbolic scene graphs are structured representations that combine neural perception with explicit symbolic modeling of objects, attributes, and relations in visual scenes. This paradigm integrates the statistical learning capabilities of deep neural networks with the discrete, logic-based, and often ontology-constrained symbolic reasoning inherited from knowledge representation frameworks. Neuro-symbolic scene graphs have become a central abstraction in a range of tasks, including symbolic image understanding, visual question answering, robotics, synthetic data generation, and assured autonomy. The following sections detail architectural principles, canonical methodologies, benchmark results, and emerging research directions.

1. Foundations: Neuro-Symbolic Scene Graph Formalisms

Neuro-symbolic scene graphs invariably instantiate a directed graph G = (V, E, Λ) encoding objects as nodes, attributed with class labels and spatial/semantic properties, and edges as labeled binary (or higher-arity) relations drawn from a designated predicate set Λ (Kalanat et al., 2022, Savazzi et al., 21 Mar 2025, Zhu et al., 2020, Wickramarachchi et al., 2024, Herzog et al., 9 Apr 2025). This definition admits several variants:

  • Classical scene graphs: G = (V, E), where V is a set of detected object instances and E encodes relations with open or fixed vocabularies (e.g., “on,” “holding,” “left of”).
  • Domain-conditioned scene graphs: G = (V, E, Λ), with Λ and the object-type set D_t given by a target task's domain ontology (e.g., PDDL predicates for planning) (Herzog et al., 9 Apr 2025).
  • Compound scene graphs: Augment standard scene graphs with a domain knowledge graph G_k; joint inference is performed over G_s × G_k to support high-level concept recognition (e.g., identifying scenes as “kitchen” from constituent parts) (Aryan et al., 2024).
  • Belief scene graphs: Extend G with “blind” probabilistic nodes and marginal distributions to handle uncertainty over unobserved objects (Saucedo et al., 5 May 2025).
  • Spatio-temporal scene graphs (STSGs): Augment G with temporal indices or edges for each predicate, as in LASER (Huang et al., 2023) and SGCLIP-based pipelines (Huang et al., 11 Oct 2025).

These formal models serve as the backbone for downstream neural-symbolic task pipelines.
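The graph formalism above can be sketched as a minimal data structure. All names and the example objects are illustrative, not drawn from any cited system; the point is only that the domain-conditioned variant restricts edge labels to a fixed predicate set Λ.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    label: str                                       # object class, e.g. "cup"
    attributes: dict = field(default_factory=dict)   # e.g. {"color": "red"}

@dataclass
class SceneGraph:
    lambda_set: frozenset                      # allowed relation predicates Λ
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (subj_id, predicate, obj_id)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, subj, pred, obj):
        # Domain-conditioned variant: reject predicates outside Λ.
        if pred not in self.lambda_set:
            raise ValueError(f"predicate {pred!r} not in Λ")
        self.edges.append((subj, pred, obj))

g = SceneGraph(lambda_set=frozenset({"on", "holding", "left_of"}))
g.add_node(Node(0, "cup", {"color": "red"}))
g.add_node(Node(1, "table"))
g.add_edge(0, "on", 1)
```

A belief or spatio-temporal variant would extend this by attaching marginal distributions to nodes or time indices to edges, respectively.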

2. Architectural Patterns and Neuro-Symbolic Integration

Neuro-symbolic scene graph pipelines generally comprise the following modular stages:

  1. Perceptual Backbone: Neural networks (often CNNs, ViTs, or object detectors like VinVL, YOLOv3, Florence-VL, CLIP) produce object regions, class logits, and/or low-level attributes (Kalanat et al., 2022, Eiter et al., 2022, Jahangard et al., 30 Oct 2025, Hallyburton et al., 27 May 2025).
  2. Relation and Attribute Estimation: Learned (often MLP-based) heads or geometric-spatial logic modules classify pairwise relations; softmax outputs are discretized for symbolic import (Kalanat et al., 2022, Hallyburton et al., 27 May 2025, Jahangard et al., 30 Oct 2025).
  3. Graph Construction: Nodes and edges are assembled into GG based on detection outputs and relation thresholds. Optional symbolic predicates are imposed through domain-specific knowledge graphs or ontologies (Herzog et al., 9 Apr 2025, Zhu et al., 2020).
  4. Neural Symbolic Embeddings: Graph Convolutional Networks (GCNs) or general GNNs operate over GG, either for pooling node-level features into global scene embeddings or for message passing that fuses geometric and symbolic information (Kalanat et al., 2022, Zhu et al., 2020, Saucedo et al., 5 May 2025).
  5. Fusion and Integration Mechanisms: Scene-level and knowledge-level embeddings can be fused through concatenation, elementwise product, or attention-weighted schemes (e.g., Hadamard fusion and attention in SKG-Sym/ASKG-Sym) (Kalanat et al., 2022).
  6. Symbolic Reasoning and Constraint Satisfaction: Logical modules, ASP solvers, or differentiable symbolic reasoners (e.g., LTL_f checker in LASER, constraint engines in NeuSPaPer) enforce ontology-based, temporal, or commonsense constraints on graph predictions (Huang et al., 2023, Hallyburton et al., 27 May 2025).
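Stages 1–3 can be sketched as follows: detector outputs and pairwise relation logits are discretized into a symbolic graph. The objects, logits, and the 0.5 threshold are illustrative placeholders, not values from any cited pipeline.

```python
import numpy as np

objects = ["cup", "table"]                                    # stage 1: detections
rel_logits = {("cup", "table"): {"on": 2.0, "under": -1.0}}   # stage 2: relation head

def softmax(logit_dict):
    v = np.array(list(logit_dict.values()))
    e = np.exp(v - v.max())                 # numerically stable softmax
    return dict(zip(logit_dict.keys(), e / e.sum()))

# Stage 3: keep the argmax predicate per pair if it clears a threshold,
# turning continuous scores into discrete symbolic edges.
edges = []
for (subj, obj), logits in rel_logits.items():
    probs = softmax(logits)
    pred, p = max(probs.items(), key=lambda kv: kv[1])
    if p > 0.5:
        edges.append((subj, pred, obj))
```

The resulting `edges` list is what downstream GNN embedding and symbolic reasoning stages consume.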

A representative neuro-symbolic workflow for symbolic image detection is detailed in SKG-Sym, which first builds a scene graph (G_sg) and a ConceptNet-derived knowledge graph (G_kg), encodes both using GCNs, fuses their representations via concatenation and attention, and classifies the fused embedding with a lightweight MLP (Kalanat et al., 2022).
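The fusion mechanisms named in stage 5 can be sketched on pooled graph-level embeddings. The dimensions, the random embeddings, and the scalar attention scheme are illustrative assumptions; the actual SKG-Sym fusion layers may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
h_scene = rng.standard_normal(64)   # pooled scene-graph embedding (stand-in)
h_know = rng.standard_normal(64)    # pooled knowledge-graph embedding (stand-in)

def fuse_concat(a, b):
    # Concatenation: preserves both sources, doubles the dimension.
    return np.concatenate([a, b])

def fuse_hadamard(a, b):
    # Elementwise (Hadamard) product: multiplicative interaction, same dimension.
    return a * b

def fuse_attention(a, b):
    # Toy scalar attention: softmax over per-source relevance scores,
    # then a weighted sum of the two embeddings.
    scores = np.array([a @ a, a @ b])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w[0] * a + w[1] * b
```

In a full pipeline the fused vector would feed the lightweight MLP classifier.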

3. Neuro-Symbolic Scene Graphs in Learning and Inference

3.1. Probabilistic and Commonsense Extensions

Belief scene graphs explicitly encode spatial uncertainty through node-localized heatmaps, learned by GCNs operating on both observed and blind nodes. A neuro-symbolic variant (BASE+ONT) instantiates priors derived from LLM-based ontologies, significantly reducing Wasserstein, energy, and Frobenius distances between predicted and true spatial distributions; this establishes a pipeline for commonsense scene composition and robust object localization in partially observed environments (Saucedo et al., 5 May 2025).

3.2. Weakly-Supervised Learning

LASER demonstrates weakly-supervised STSG learning from video captions by converting captions to logical spatio-temporal specifications (ψ) and aligning probabilistic scene graph outputs with ψ through a differentiable symbolic reasoner. Auxiliary losses enforce semantic consistency and temporal alignment, yielding gains of +12.65% unary predicate accuracy and +0.22 binary recall@5 over the best fully supervised baselines on OpenPVSG (Huang et al., 2023).
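The alignment idea can be illustrated with soft semantics for simple temporal operators over per-frame predicate probabilities. This is a toy sketch in the spirit of a differentiable symbolic reasoner; LASER's actual LTL_f semantics and loss formulation are richer, and the probabilities below are made up.

```python
import numpy as np

# P(predicate holds) per frame, as produced by a probabilistic STSG model.
p = np.array([0.1, 0.2, 0.9, 0.8])

def soft_eventually(probs):
    # Soft "eventually φ": probability φ holds in at least one frame,
    # treating frames as independent: 1 - ∏(1 - p_t).
    return float(1.0 - np.prod(1.0 - probs))

def soft_always(probs):
    # Soft "always φ": probability φ holds in every frame: ∏ p_t.
    return float(np.prod(probs))

# If the caption-derived spec ψ says the predicate should eventually hold,
# penalize the model by how far the soft score falls short of 1.
loss = 1.0 - soft_eventually(p)
```

Gradients of such scores flow back into the scene graph predictor, which is what makes weak supervision from captions possible.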

3.3. Promptable Scene Graph Generation

SGCLIP, at the core of ESCA, supports open-vocabulary, prompt-based inference by decoding scene graphs with unary and binary facts scored via CLIP alignment between region and concept embeddings. Self-supervised alignment leverages programmatic specifications compiled from synthetic captions and temporal LTL-like predicates, eliminating the need for human-annotated scene graphs. Gains include +7.4% zero-shot recall@1 on OpenPVSG and a reduction in embodied agent perception error rates from 69% to 30% in navigation (Huang et al., 11 Oct 2025).
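The CLIP-alignment scoring of unary and binary facts can be sketched as cosine similarity between region and concept embeddings. The embedding dimensions, the random features, and the averaged-pair proxy for binary facts are illustrative assumptions, not the exact SGCLIP scoring heads.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
# Stand-ins for vision-encoder region features and text-encoder concept prompts.
region_emb = {r: rng.standard_normal(16) for r in ["r0", "r1"]}
concept_emb = {c: rng.standard_normal(16) for c in ["person", "holding", "cup"]}

def score_unary(region, concept):
    # Score for the unary fact concept(region), e.g. person(r0).
    return cosine(region_emb[region], concept_emb[concept])

def score_binary(subj, pred, obj):
    # Toy proxy: align averaged subject/object features with the predicate
    # prompt embedding (real systems use learned relation heads).
    pair = (region_emb[subj] + region_emb[obj]) / 2.0
    return cosine(pair, concept_emb[pred])
```

Because concepts enter only as text prompts, the same scorer extends to an open vocabulary without retraining.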

4. Application Domains

4.1. Robotics and Planning

Neuro-symbolic scene graphs underpin hierarchical planning and state grounding in robotics. For long-horizon manipulation, dual-layer graphs (geometric and symbolic) enable high-level regression planning and low-level motion generation. Relative to PDDLStream-based planning, GNN-based neuro-symbolic planners achieve a 10,000x speedup in task planning and a 70.6% final success rate on real-robot trials (Zhu et al., 2020). In domain-conditioned scene graphs, predicates and object types are mapped directly to PDDL symbols, producing task representations with near-perfect grounding precision/recall and 3–4x higher planning success rates than multimodal LLM-based baselines (Herzog et al., 9 Apr 2025).
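The mapping from a domain-conditioned scene graph to planner input can be sketched as serializing edges into a PDDL :init section. The object names and the `on` predicate are hypothetical examples, not the cited systems' actual domain files.

```python
# A tiny domain-conditioned scene graph: typed nodes and one symbolic edge.
nodes = {0: "cup", 1: "table"}
edges = [(0, "on", 1)]

def to_pddl_init(nodes, edges):
    # Each edge (subj, pred, obj) becomes a ground PDDL fact; object names
    # are suffixed with their node id to keep instances distinct.
    facts = [f"({pred} {nodes[s]}{s} {nodes[o]}{o})" for s, pred, o in edges]
    return "(:init " + " ".join(facts) + ")"

print(to_pddl_init(nodes, edges))   # (:init (on cup0 table1))
```

Because predicates are drawn from the task's Λ, every emitted fact is guaranteed to be a symbol the planner's domain ontology understands.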

4.2. Synthetic Data Generation

SGAdapter introduces symbolic conditioning into latent diffusion models used for synthetic dataset generation, with scene graphs explicitly controlling object arrangements and predicate compliance. Empirically, SGC-Att + S-Att masking yields up to +2.59% and +2.83% improvements in Recall and No-Graph-Constraint Recall, respectively, over text-only Stable Diffusion baselines in scene graph generation tasks; neuro-symbolic data augmentation mitigates real data scarcity for complex visual reasoning (Savazzi et al., 21 Mar 2025).

4.3. Visual Question Answering (VQA) and Scene Understanding

ASP-based neuro-symbolic pipelines transform neural detections into logic programs for robust VQA in the CLEVR setting. Non-deterministic scene encodings enable answer accuracy to reach 96.9% even with moderately trained object detectors, and answer set solvers run 50–100x faster than general probabilistic reasoning systems (Eiter et al., 2022). In compound scene understanding, jointly searching over scene and knowledge graphs yields 96.3% top-1 accuracy on the ADE20K dataset, matching the best closed-source visual transformer models but with transparent reasoning steps (Aryan et al., 2024).
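The translation from neural detections to a logic program can be sketched as fact generation over detected attributes. The predicate names and the CLEVR-style attributes are illustrative, not the exact encoding of the cited ASP pipeline.

```python
# Deterministic detections (a non-deterministic encoding would emit
# weighted or disjunctive facts per candidate label instead).
detections = [
    {"id": 0, "shape": "cube", "color": "red", "size": "large"},
    {"id": 1, "shape": "sphere", "color": "blue", "size": "small"},
]

def to_asp_facts(detections):
    lines = []
    for d in detections:
        lines.append(f"object({d['id']}).")
        for attr in ("shape", "color", "size"):
            lines.append(f"{attr}({d['id']},{d[attr]}).")
    return "\n".join(lines)
```

An answer set solver would then evaluate question programs (e.g. "count the red cubes") against these facts.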

4.4. Autonomous Driving and Assured Autonomy

Neuro-symbolic scene graphs are operationalized in assured autonomy for cyber-physical systems via multi-sensor fusion, symbolic integrity checks, and graph-based situational awareness. Systems such as NeuSPaPer integrate knowledge graph embeddings, constraint engines, and symbolic message-passing into both per-sensor and cross-sensor pipelines, effecting anomaly detection, resilience to adversarial attacks, and logical consistency across fusion outputs (Hallyburton et al., 27 May 2025). Real-world driving datasets are now organized into knowledge graphs (e.g., DSceneKG), supporting causal reasoning, entity typing, similarity search, and cross-modal retrieval tasks (Wickramarachchi et al., 2024).

5. Evaluation Metrics, Benchmarks, and Empirical Findings

  • Image Symbolism F1: In symbolic image detection, neuro-symbolic methods (SKG-Sym, ASKG-Sym) outperform ResNet-50 and match VGG16's F1 with substantially fewer parameters (14.09–14.86% F1 vs. 12.5–14.9%) while demonstrating that combined scene and knowledge graphs outperform either alone (Kalanat et al., 2022).
  • Recall@K for Scene Graph Generation: NeSy-conditioned synthetic images improve downstream scene graph model recall by up to 2.59% (standard) and 2.83% (NG-R) (Savazzi et al., 21 Mar 2025).
  • Grounding and Planning Success: Domain-aligned scene graphs attain 1.00 precision and recall in state grounding and up to 94% planning success in “Cooking” tasks, in contrast to ≤37% for prior LMM-based approaches (Herzog et al., 9 Apr 2025).
  • Spatial Grounding in Robotics: A 1.3B-parameter neuro-symbolic model improves interpretability and response times, reaching up to 96.7% mAP and 67.8% mIoU on attribute and relation queries and outperforming 2–9B-parameter VLM counterparts (Jahangard et al., 30 Oct 2025).

Key ablation findings include the necessity of both cross-attention and relational self-attention in symbolic adapters, the benefit of ontology-driven priors for spatial predictions, and the value of non-deterministic symbolic encoding for robust VQA under imperfect perception.

6. Limitations and Future Directions

  • Knowledge Graph Scaling: Manual construction of knowledge graphs and ontologies (e.g., for ADE20K or symbolic predicates) is a bottleneck; data-driven KG induction and expansion techniques are needed for open-world coverage (Aryan et al., 2024).
  • Temporal and Dynamic Scene Graphs: Most current models operate on static frames or single time steps; extension to fully spatio-temporal, causal, and dynamic graphs remains active research, particularly for embodied and multi-agent environments (Huang et al., 2023, Huang et al., 11 Oct 2025).
  • Task-Specific and End-to-End Learning: While neuro-symbolic modules are highly modular, joint end-to-end optimization and adaptation to domain-specific requirements (e.g., manipulation affordances, rare relations) are ongoing objectives.
  • Probabilistic and Commonsense Reasoning: Progress on integrating uncertainty (belief graphs) and richer language-based spatial ontologies demonstrates improved commonsense scene composition, but full probabilistic reasoning with partial and noisy data is not yet ubiquitous (Saucedo et al., 5 May 2025).
  • Efficiency at Scale: Real-time robotics and cyber-physical deployment require further optimization of scene graph inference and reasoning modules; lightweight, efficient architectures and hardware-aware design are of increasing concern (Hallyburton et al., 27 May 2025, Jahangard et al., 30 Oct 2025).

Neuro-symbolic scene graphs unify statistical perception and explicit symbolic knowledge, forming an interpretable, compositional, and robust interface for a diverse spectrum of computer vision, robotics, and cognitive AI tasks. Ongoing work targets scalability, spatio-temporal abstraction, and seamless integration into complex real-world and embodied systems.
