GeoSceneGraph: Geometric Scene Graphs

Updated 30 April 2026

GeoSceneGraph is a graph-based representation that integrates geometric and semantic data to model physical scenes, with nodes for spatial entities and edges for explicit spatial relationships.
It employs multi-modal fusion of visual, geometric, and language cues via graph neural networks to enhance scene synthesis, robotic planning, and autonomous navigation.
Recent implementations leverage techniques like geometric algebra and diffusion models to achieve improved accuracy in indoor/outdoor scene understanding and context-aware augmentation.

A GeoSceneGraph is a geometric, semantic graph-based representation of a physical scene in which nodes correspond to spatial entities (objects, object-groups, places, or regions) and edges encode explicit spatial, topological, and sometimes behavioral relationships among them. Unlike purely visual or categorical scene graphs, a GeoSceneGraph grounds all semantic relationships in explicit geometric descriptors—such as 3D pose, size, and occupancy—serving as a foundation for tasks ranging from indoor scene synthesis to large-scale autonomous navigation. GeoSceneGraph architectures fuse multi-modal data (e.g., visual, geometric, and language cues) within graph neural networks or other message-passing schemes for end-to-end learning or generative modeling. Recent work demonstrates GeoSceneGraphs’ utility for scene understanding, robotic planning, text-driven 3D synthesis, and context-aware scene augmentation, in both indoor and outdoor environments.

1. Formal Structures and Core Definitions

GeoSceneGraphs are typically realized as directed, attributed graphs $G = (V, E)$ in which:

Nodes $V$: Each node represents an entity (an object, a group, a terrain patch or "place", or a higher-level region) and carries geometric attributes. For indoor scenes, nodes often encode 3D centroids, ellipsoidal occupancy (e.g., dual quadrics/ellipsoids $Q_i \in \mathbb{R}^{4\times4}$), or full geometric algebra (GA) representations. Outdoor graphs include nodes associated with terrain type, surface descriptors, and semantic CLIP embeddings.
Edges $E$: Edges convey geometric or semantic relationships, e.g., "support," "same-plane," "same-set," or "part-of" in indoor domains (Gay et al., 2018), or "adjacent," "on," and "contains" in outdoor graphs (Samuelson et al., 23 Sep 2025). Edge features often encode relative 3D displacement, planarity, or learned predicate probability vectors.
Hierarchical Structure: Several frameworks (Terra, SceneGen) introduce multi-level organization: object nodes, place nodes (derived from terrain or floorplan skeletons), region nodes (clusters of places), and a possible root map node (Samuelson et al., 23 Sep 2025, Keshavarzi et al., 2020).
Attributes: Node/edge attributes may comprise 3D position, orientation, occupancy, semantic class, object size, terrain roughness/slope, one-hot or CLIP embeddings, and sometimes behavioral or mesh codes (Kamarianakis et al., 2023).

A unifying principle is that all semantics are grounded in spatial structure, and all topology admits explicit geometric quantification.
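The structure above can be illustrated with a minimal sketch of a directed, attributed graph. The class names (`Node`, `Edge`, `SceneGraph`), field names, and example values are hypothetical and chosen only for illustration; none of the cited frameworks is claimed to use this layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    node_id: str
    level: str                      # "object", "place", "region", or "root"
    semantic_class: str
    position: Vec3                  # 3D centroid in the map frame
    size: Optional[Vec3] = None     # bounding extents, if known
    attributes: Dict[str, float] = field(default_factory=dict)  # e.g. slope


@dataclass
class Edge:
    source: str
    target: str
    predicate: str                  # e.g. "support", "adjacent", "contains"
    displacement: Vec3              # target position minus source position


class SceneGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, target: str, predicate: str) -> Edge:
        # Ground the relation geometrically by storing the relative displacement.
        a, b = self.nodes[source].position, self.nodes[target].position
        disp = tuple(q - p for p, q in zip(a, b))
        edge = Edge(source, target, predicate, disp)
        self.edges.append(edge)
        return edge

    def children(self, node_id: str) -> List[str]:
        # Hierarchy is expressed through "contains" edges.
        return [e.target for e in self.edges
                if e.source == node_id and e.predicate == "contains"]


graph = SceneGraph()
graph.add_node(Node("table_1", "object", "table", (1.0, 2.0, 0.4), size=(1.2, 0.8, 0.8)))
graph.add_node(Node("cup_1", "object", "cup", (1.1, 2.1, 0.85)))
graph.add_node(Node("kitchen", "region", "room", (0.0, 0.0, 0.0)))
support = graph.add_edge("table_1", "cup_1", "support")
graph.add_edge("kitchen", "table_1", "contains")
graph.add_edge("kitchen", "cup_1", "contains")
```

Every edge carries a displacement derived from node positions, so a semantic predicate such as "support" always has a geometric quantity attached to it.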

2. Key Methodologies for GeoSceneGraph Construction

Multi-View and Image-Based Indoor Approaches

VGfM (Gay et al., 2018) exemplifies integrated geometric scene graph construction for static indoor environments. The core pipeline includes:

2D Detection and Association: Per-frame object proposals via region detectors (e.g., Faster-RCNN), cross-view association via geometric consistency.
3D Geometry Recovery: For each object, multi-view bounding ellipses are fit with dual conics; a closed-form solver ("LfDC") computes 3D dual quadrics for object occupancy.
Feature Fusion: Visual features (e.g., VGG16-FC7 activations), geometric descriptors (principal axis extrema), and normalized bounding box offsets are fused within a graph message-passing network (GRU-based).
Relation Classification: Edge features are composed from object pairs, with predicates predicted via neural message-passing, yielding labeled edges such as "support" or "same-plane."
Training: The model is trained end-to-end using cross-entropy (object classes) and binary cross-entropy (relations).
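The geometry recovery step rests on a standard projective relation: a dual quadric $Q^*$ projects to a dual conic $C^* = P Q^* P^\top$ under a $3\times4$ camera matrix $P$. The sketch below shows only this forward model for an axis-aligned ellipsoid, not the LfDC solver itself, which works in the opposite direction (from several observed ellipses to one quadric). The function names and camera parameters are assumptions made for the example, and the object is assumed to lie fully in front of the camera.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def ellipsoid_dual_quadric(center, semi_axes):
    """Dual quadric Q* = Z diag(a^2, b^2, c^2, -1) Z^T for an axis-aligned ellipsoid."""
    a, b, c = semi_axes
    tx, ty, tz = center
    z = [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]
    d = [[a * a, 0, 0, 0], [0, b * b, 0, 0], [0, 0, c * c, 0], [0, 0, 0, -1]]
    return matmul(matmul(z, d), transpose(z))


def project_dual_quadric(p, q):
    """Dual conic C* = P Q* P^T seen by a camera with 3x4 projection matrix P."""
    return matmul(matmul(p, q), transpose(p))


def ellipse_from_dual_conic(c):
    """Return (center, semi_axes) of the ellipse encoded by a 3x3 dual conic."""
    s = -c[2][2]                               # normalise so that C*[2][2] == -1
    n = [[v / s for v in row] for row in c]
    cx, cy = -n[0][2], -n[1][2]
    # Shape matrix R diag(a^2, b^2) R^T equals N[0:2, 0:2] + center center^T.
    m00 = n[0][0] + cx * cx
    m01 = n[0][1] + cx * cy
    m11 = n[1][1] + cy * cy
    mean, diff = (m00 + m11) / 2, (m00 - m11) / 2
    root = math.hypot(diff, m01)               # closed-form 2x2 eigenvalues
    return (cx, cy), (math.sqrt(mean + root), math.sqrt(mean - root))


# Hypothetical pinhole camera at the origin: focal length 500 px,
# principal point (320, 240), looking down the +z axis.
P = [[500.0, 0.0, 320.0, 0.0], [0.0, 500.0, 240.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
Q = ellipsoid_dual_quadric(center=(0.0, 0.0, 5.0), semi_axes=(1.0, 1.0, 1.0))
center, axes = ellipse_from_dual_conic(project_dual_quadric(P, Q))
```

For this unit sphere on the optical axis, the projected ellipse is a circle centred on the principal point with radius $f/\sqrt{24}$, which matches the tangent-cone geometry of a sphere of radius 1 at distance 5.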

Outdoor and Terrain-Aware Pipelines

Terra and related methods (Samuelson et al., 23 Sep 2025, Samuelson et al., 6 Jun 2025) adapt GeoSceneGraph logic for large-scale, unstructured outdoor scenes:

Metric-Semantic Mapping: LiDAR and RGB images are fused via SLAM for global registration; masks are segmented for terrain and object classes, CLIP embeddings encode semantics.
Place Node Extraction: Terrain-aware generalized Voronoi Diagrams (GVDs) are computed over rasterized occupancy grids; skeleton points become place nodes, encoding terrain parameters (elevation, slope, radius).
Hierarchical Clustering: Place nodes are recursively clustered into regions via agglomerative or spectral methods using a semantically informed affinity metric.
Edge Construction: Adjacency and containment relations are computed based on geometry and proximity. Traversability cost functions are explicitly defined on edges for planning.
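As a minimal sketch of how traversability costs on edges can drive planning, the following runs A* over a small place graph. The node coordinates, slopes, the `edge_cost` formula, and `slope_weight` are illustrative assumptions, not the cost function used by Terra.

```python
import heapq
import math

# Hypothetical place nodes: id -> (x, y, slope in radians)
places = {
    "a": (0.0, 0.0, 0.0),
    "b": (5.0, 0.0, 0.6),   # steep shortcut
    "c": (5.0, 5.0, 0.0),
    "d": (10.0, 0.0, 0.0),
}
adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}


def distance(u, v):
    return math.hypot(places[u][0] - places[v][0], places[u][1] - places[v][1])


def edge_cost(u, v, slope_weight=4.0):
    """Assumed cost: edge length inflated by the mean slope of its endpoints."""
    mean_slope = (places[u][2] + places[v][2]) / 2
    return distance(u, v) * (1.0 + slope_weight * mean_slope)


def a_star(start, goal):
    # Euclidean distance is an admissible heuristic because cost >= length.
    frontier = [(distance(start, goal), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for nxt in adjacency[node]:
            new_cost = cost + edge_cost(node, nxt)
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + distance(nxt, goal), new_cost, nxt, path + [nxt]),
                )
    return None, math.inf


path, cost = a_star("a", "d")
```

In this toy graph the steep shortcut through `b` is rejected in favour of the longer but flat route through `c`.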

Geometric Algebra and Unified Representation

UniSG^GA (Kamarianakis et al., 2023) encodes all spatial data (translation, rotation, and scale; TRS) and behavioral data as GA multivectors, using motors/rotors in 3D projective or conformal geometric algebra (PGA/CGA) for rigid-body transformations. Nodes are entities, TRS components, or action data; edges carry relative motor features. GraphSAGE-style GNNs operate directly on multivector coefficients, so the same representation feeds both generative and predictive pipelines.
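To make the rotor idea concrete, here is a small sketch of the rotation part only. It uses the unit-quaternion form, to which rotors of 3D geometric algebra are isomorphic, and applies a rotation through the sandwich product $R\,v\,\tilde{R}$. This is a deliberate simplification: PGA/CGA motors, which also encode translation, are not implemented, and the function names are hypothetical.

```python
import math


def rotor(axis, angle):
    """Unit rotor for a rotation of `angle` radians about `axis`, stored as (w, x, y, z)."""
    norm = math.sqrt(sum(a * a for a in axis))
    s = math.sin(angle / 2) / norm
    return (math.cos(angle / 2), axis[0] * s, axis[1] * s, axis[2] * s)


def compose(r2, r1):
    """Rotor equivalent to applying r1 first and r2 second (product r2 * r1)."""
    w2, x2, y2, z2 = r2
    w1, x1, y1, z1 = r1
    return (
        w2 * w1 - x2 * x1 - y2 * y1 - z2 * z1,
        w2 * x1 + x2 * w1 + y2 * z1 - z2 * y1,
        w2 * y1 - x2 * z1 + y2 * w1 + z2 * x1,
        w2 * z1 + x2 * y1 - y2 * x1 + z2 * w1,
    )


def reverse(r):
    """Reverse (conjugate) of a rotor; the inverse for unit rotors."""
    return (r[0], -r[1], -r[2], -r[3])


def apply(r, v):
    """Sandwich product R v R~ acting on a 3D vector."""
    p = compose(compose(r, (0.0, v[0], v[1], v[2])), reverse(r))
    return p[1:]


quarter_turn = rotor((0.0, 0.0, 1.0), math.pi / 2)    # 90 degrees about z
half_turn = compose(quarter_turn, quarter_turn)       # two quarter turns
rotated_once = apply(quarter_turn, (1.0, 0.0, 0.0))
rotated_twice = apply(half_turn, (1.0, 0.0, 0.0))
```

Composition is a single product of four coefficients per rotor, which is one reason such encodings are attractive as compact edge features for relative transforms.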

Generative and Diffusion Models

GeoSceneGraph-based text-to-3D generation leverages equivariant graph diffusion and cross-modal fusion (Ruiz et al., 18 Nov 2025):

Latent Graph Diffusion: Node coordinates and features undergo stochastic diffusion, with an EGNN core predicting denoising steps that are $SE(3)$-equivariant.
Cross-Modal Conditioning: Text prompts (CLIP embeddings) and diffusion time steps are injected into EGNN message passing at every layer using ResNet+Transformer fusions, rather than through early concatenation or single-shot attention, both of which ablations show to be suboptimal.
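The equivariance property can be checked numerically with a toy layer. The sketch below follows the general EGNN pattern (messages built from invariants, coordinate updates along relative position vectors) but replaces the learned MLPs with fixed `tanh` functions and reduces the fused text/time embedding to a single scalar `cond` passed into the layer. All names and constants are assumptions for illustration; this is not the architecture of the cited model.

```python
import math


def egnn_layer(coords, feats, cond):
    """One simplified EGNN-style update on a fully connected graph.

    coords: list of 3D points; feats: list of scalar node features;
    cond: scalar standing in for the fused text/time conditioning.
    """
    n = len(coords)
    new_coords, new_feats = [], []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        agg = 0.0
        for j in range(n):
            if i == j:
                continue
            diff = [a - b for a, b in zip(coords[i], coords[j])]
            sq_dist = sum(d * d for d in diff)
            # The message uses only invariants: features, squared distance, conditioning.
            msg = math.tanh(feats[i] + feats[j] + cond - 0.1 * sq_dist)
            agg += msg
            # Coordinates move along relative vectors, scaled by the invariant message.
            shift = [s + d * msg for s, d in zip(shift, diff)]
        new_coords.append([c + s / (n - 1) for c, s in zip(coords[i], shift)])
        new_feats.append(math.tanh(feats[i] + agg))
    return new_coords, new_feats


def rigid(points, angle, t):
    """Rotate points about the z axis by `angle` and translate by `t`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y + t[0], s * x + c * y + t[1], z + t[2]] for x, y, z in points]


pts = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.5], [0.0, 2.0, -1.0]]
h = [0.1, -0.3, 0.7]
move = (0.9, (1.0, -2.0, 0.5))

# Transform then update, versus update then transform.
coords_a, feats_a = egnn_layer(rigid(pts, *move), h, cond=0.2)
coords_b, feats_b = egnn_layer(pts, h, cond=0.2)
coords_b = rigid(coords_b, *move)
```

Both orders give the same coordinates and the same features, because the conditioning enters only through the invariant message and never touches the coordinates directly.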

3. Spatial and Semantic Relation Encoding

Relations between entities in a GeoSceneGraph are defined via both explicit geometric computations and learned or deterministic postprocessing:

Geometry-Driven Encoding: For each object pair, spatial predicates are derived using 3D ellipsoid overlap, centroid displacement, or GVD connectivity (indoor), or via distance, direction, and contact in 2D or 3D manifolds (outdoor). In SceneGen (Keshavarzi et al., 2020), spatial features such as RoomPosition, AverageDistance, SurroundedBy, and Support are formalized and used as edge attributes in conditional likelihood evaluation.
Orientation and Topology: Orientation-based relations are computed for asymmetric objects (e.g., Facing, NextTo, TowardsCenter) and collected into feature vectors for KDE-based likelihoods.
Graph-Based Neural Updates: Message passing aggregates object features, geometric relations, and visual cues over the graph structure, updating node/edge hidden states (via GRUs, GCNs, or EGNNs).
Hierarchical Edges: Outdoor and mixed scenes structure graphs with region containment and adjacency edges; traversability and navigational cost can be attributed directly to edge weights (Samuelson et al., 23 Sep 2025).
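A geometry-driven encoding of this kind can be sketched for a single object pair. The dictionary keys (`center`, `size`, `yaw`), the axis-aligned box assumption, the tolerance, and the exact definitions of `facing` and `supports` are hypothetical simplifications, not the formulas used in SceneGen or VGfM.

```python
import math


def edge_features(src, dst):
    """Geometric features for a directed object pair.

    Each object is a dict with hypothetical keys: 'center' (x, y, z),
    'size' (dx, dy, dz) of an axis-aligned box, and 'yaw' (heading in radians).
    """
    disp = [b - a for a, b in zip(src["center"], dst["center"])]
    dist = math.sqrt(sum(d * d for d in disp))
    # Facing: cosine between src's heading and the horizontal direction to dst.
    planar = math.hypot(disp[0], disp[1])
    facing = 0.0
    if planar > 0:
        facing = (math.cos(src["yaw"]) * disp[0] + math.sin(src["yaw"]) * disp[1]) / planar
    return {"displacement": disp, "distance": dist, "facing": facing}


def supports(lower, upper, tol=0.05):
    """True if `upper` rests on `lower`: footprints overlap and faces touch within `tol`."""
    top = lower["center"][2] + lower["size"][2] / 2
    bottom = upper["center"][2] - upper["size"][2] / 2
    overlap = all(
        abs(lower["center"][k] - upper["center"][k])
        < (lower["size"][k] + upper["size"][k]) / 2
        for k in (0, 1)
    )
    return overlap and abs(top - bottom) <= tol


table = {"center": (0.0, 0.0, 0.4), "size": (1.2, 0.8, 0.8), "yaw": 0.0}
cup = {"center": (0.2, 0.1, 0.85), "size": (0.1, 0.1, 0.1), "yaw": 0.0}
chair = {"center": (0.0, -1.0, 0.45), "size": (0.5, 0.5, 0.9), "yaw": math.pi / 2}

chair_to_table = edge_features(chair, table)
table_supports_cup = supports(table, cup)
```

Features such as `facing` are asymmetric, so the graph needs directed edges: the chair faces the table, but the table has no meaningful heading toward the chair.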

4. Applications and Use Cases

GeoSceneGraphs provide a common representational substrate for diverse domains:

Scene Understanding and Reasoning: Indoors, explicit geometric grounding enables fine-grained spatial queries and robust judgment of support, part-of, and same-plane relations, outperforming appearance-only scene graphs on relation accuracy (Gay et al., 2018).
Generative Scene Synthesis: In text-guided 3D synthesis, EGNN diffusion models conditioned on text embedding produce coherent indoor configurations, with improved iRecall and FID compared to non-graph or non-equivariant baselines (Ruiz et al., 18 Nov 2025). In SceneGen (Keshavarzi et al., 2020), KDE-based priors over position/orientation support context-aware AR content placement.
Autonomous Navigation and Planning: Outdoor GeoSceneGraphs (Terra, GraphMapper) support path planning, object retrieval, and region-based monitoring in large, open environments. Traversability cost on edges directly informs sampling or A* policies; accumulated scene graphs enable modular and explainable navigation (Samuelson et al., 23 Sep 2025, Seymour et al., 2022).
Scene Augmentation and AR: GeoSceneGraph-based frameworks predict plausible object placements and orientations for AR applications, delivering heatmaps and likelihoods for user-driven contextual augmentation (Keshavarzi et al., 2020).
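The KDE-based placement idea can be sketched in one dimension. The example below fits a Gaussian kernel density estimate to a single hypothetical feature (distance from a sofa to the nearest wall) and scores candidate placements, which is the kind of computation behind a placement heatmap. The sample values, the bandwidth, and the choice of feature are assumptions for illustration, not data or parameters from SceneGen.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1D Gaussian kernel density estimate as a callable."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical training feature: distance (m) from a sofa to the nearest wall.
wall_distances = [0.05, 0.10, 0.08, 0.15, 0.12, 0.30]
likelihood = gaussian_kde(wall_distances, bandwidth=0.05)

# Score candidate placements along a line leaving the wall and pick the best.
candidates = [0.1 * k for k in range(0, 21)]          # 0.0 m to 2.0 m
scores = [likelihood(d) for d in candidates]
best = candidates[scores.index(max(scores))]
```

A full system would combine several such densities (position, orientation, and pairwise features) and evaluate them over a 2D grid of candidate poses rather than a single line.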

5. Quantitative Evaluation and Empirical Results

Metrics and benchmarks vary by context but commonly include:

Task/Domain	Metric	Key Results	Reference
Indoor SGG	Object/relation accuracy	+1–3% for 3D over 2D, up to +4% with fusion	(Gay et al., 2018)
Generative 3D synthesis	FID, iRecall (including a rotation-invariant variant)	2-stage fusion: FID = 111.2, iRecall_rot = 59.5	(Ruiz et al., 18 Nov 2025)
Outdoor object retrieval	IoU, SAcc, F1	F1: 0.194 (Terra, Sparse Full)	(Samuelson et al., 23 Sep 2025)
Region classification (outdoor)	Precision, Recall, F1	F1: 0.471 (agglomerative)	(Samuelson et al., 23 Sep 2025)
SGGen with geometric context	mean-Recall@K (Visual Genome)	+0.1–0.3 overall, +3–6% on “above”, “near”	(Kumar et al., 2021)
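Two of the metrics in the table can be written down compactly. The sketch below computes a simplified, single-scene mean-Recall@K over (subject, predicate, object) triples and the F1 score from precision and recall. Benchmark implementations typically aggregate over many images and may match triples by box overlap rather than exact identity, so this is only an illustration; the example triples are invented.

```python
from collections import defaultdict


def mean_recall_at_k(ground_truth, ranked_predictions, k):
    """Average per-predicate recall over the top-k predicted triples.

    Triples are (subject, predicate, object) tuples; `ranked_predictions`
    is sorted by decreasing confidence.
    """
    top_k = set(ranked_predictions[:k])
    hits, totals = defaultdict(int), defaultdict(int)
    for triple in ground_truth:
        predicate = triple[1]
        totals[predicate] += 1
        if triple in top_k:
            hits[predicate] += 1
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truth = [
    ("cup", "on", "table"),
    ("lamp", "on", "desk"),
    ("chair", "near", "table"),
    ("book", "above", "floor"),
]
predicted = [
    ("cup", "on", "table"),
    ("chair", "near", "table"),
    ("lamp", "near", "desk"),
    ("lamp", "on", "desk"),
]
score = mean_recall_at_k(truth, predicted, k=3)
```

Averaging recall per predicate, rather than over all triples, keeps rare relations such as "above" from being hidden by frequent ones such as "on".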

Rigorous ablations in text-conditioned generative models show that direct per-layer injection of text/time into EGNNs yields superior geometric consistency and lower FID compared to naive concatenation or single cross-attention (Ruiz et al., 18 Nov 2025). For GNNs operating on GA features, convergence is faster and losses lower than for classic matrix-encoded transforms (Kamarianakis et al., 2023).

6. Limitations, Challenges, and Future Prospects

Several open challenges persist across GeoSceneGraph paradigms:

Scalability: Outdoor 3D scene graphs must manage millions of points and regions; efficient graph representations and sparse GA implementations are active areas of research (Samuelson et al., 23 Sep 2025, Kamarianakis et al., 2023).
Semantic Drift and Data Sparsity: Rare predicates, dynamic objects, and seasonal appearance changes degrade relation classification, especially for infrequent classes (“part-of”) or in ambiguous layouts (Gay et al., 2018, Samuelson et al., 6 Jun 2025).
Quality of Geometry Recovery: Indoor quadric fitting remains sensitive to occlusion and small-baseline camera motion; joint end-to-end optimization of geometry and semantics is suggested (Gay et al., 2018).
Generalization and Multimodality: Tighter language-conditioned fusion with geometric features, incorporation of material/behavioral properties, and multi-agent extensions are leading directions (Kamarianakis et al., 2023, Ruiz et al., 18 Nov 2025).
Region Ontologies: Hierarchical region clustering and alignment with human-understandable ontologies remain open, and LLM-driven approaches are proposed for semantic generalization (Samuelson et al., 23 Sep 2025).

A plausible implication is that unified, equivariant graph models—supplemented by open-vocabulary semantic representations—will underpin future advancements in interpretable, scalable, and cross-modal geometric reasoning across both virtual and real environments.

7. Historical Development and Relationship to Broader Scene Graph Research

GeoSceneGraph methodology synthesizes developments from several research threads:

Early scene graphs: Focused on semantic and appearance-based relationships; lacked explicit geometric grounding.
Multi-view and depth-aware methods: Initiated explicit 3D geometric relation encoding (VGfM (Gay et al., 2018)).
Generative modeling: Moved beyond static representations to conditional, symmetry-aware (equivariant), and behaviorally grounded scene synthesis (Ruiz et al., 18 Nov 2025, Kamarianakis et al., 2023).
Outdoor and region-based hierarchies: Leveraged sparse metric-semantic mapping and hierarchical graph construction for large-scale mapping and planning (Samuelson et al., 23 Sep 2025, Samuelson et al., 6 Jun 2025).
Geometric algebra and unified object-relation encoding: Established compact and transformation-consistent feature representation for GNN pipelines (Kamarianakis et al., 2023).

The GeoSceneGraph paradigm thus bridges perception and reasoning, offering a substrate for neural-symbolic fusion, generative modeling, interactive spatial computing, and robust, explainable planning in physically grounded environments.