3D Semantic Occupancy Graphs
- 3D semantic occupancy graphs are data structures that discretize space into nodes annotated with occupancy probabilities and semantic class distributions.
- Recent advances transition from dense voxel grids to sparse, graph-based representations using deep learning, probabilistic models, and transformer architectures.
- They enable precise scene reasoning for applications such as autonomous driving, robotic navigation, and open-vocabulary scene understanding.
A 3D semantic occupancy graph is a scene representation in which spatial regions are discretized volumetrically or hierarchically, annotated with both occupancy and category labels, and encoded in a data structure combining the topological properties of a graph with rich node and edge semantics. These graphs are central to contemporary research in embodied perception, autonomous driving, and scene understanding, enabling precise per-voxel or per-object reasoning while supporting downstream tasks such as navigation, path planning, and retrieval. Recent methodological advances have transitioned from dense voxel grids to adaptive, sparse, and graph-based data structures, incorporating deep learning, probabilistic modeling, and transformers for high fidelity and computational efficiency.
1. Foundations and Definitions
3D semantic occupancy graphs assign to each element of a spatial partition (voxel, Gaussian, octree cell, or instance) both an occupancy probability and a probability distribution over semantic classes. These elements become graph nodes, and are connected by edges reflecting geometric adjacency, semantic affinity, or spatial relationship. Node features typically include position, geometry, semantic representation, and class probabilities; edges may encode distance, direction, or learned relations. Compared to classical point clouds or voxel grids, this representation enables expressive modeling of both free and occupied space, boundary details, and object relationships. Notably, occupancy graphs decouple storage efficiency from spatial resolution by leveraging adaptive or sparse structures, facilitating real-time use in large-scale or open-vocabulary settings (Wang et al., 2024, Song et al., 13 Jun 2025, Tang et al., 2024, Yang et al., 2017).
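The core node abstraction described above can be sketched minimally: a node holding an occupancy probability (stored in log-odds form so measurements fuse by addition) and a categorical class distribution. This is an illustrative sketch; the field names and fusion rule are ours, not taken from any cited system.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class OccupancyNode:
    """One element of the spatial partition (voxel, Gaussian, or octree cell).

    Stores occupancy in log-odds form so repeated measurements can be fused
    by simple addition, plus a categorical distribution over semantic classes.
    Illustrative sketch; names are not taken from any cited system.
    """
    position: np.ndarray                      # 3D center (x, y, z)
    log_odds: float = 0.0                     # occupancy in log-odds space
    class_logits: np.ndarray = field(default_factory=lambda: np.zeros(3))

    @property
    def occupancy(self) -> float:
        # sigmoid maps log-odds back to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-self.log_odds))

    @property
    def class_probs(self) -> np.ndarray:
        # softmax over accumulated per-class evidence
        z = np.exp(self.class_logits - self.class_logits.max())
        return z / z.sum()

    def integrate(self, p_hit: float, class_evidence: np.ndarray) -> None:
        """Fuse one measurement: add its log-odds and class evidence."""
        self.log_odds += np.log(p_hit / (1.0 - p_hit))
        self.class_logits += class_evidence

node = OccupancyNode(position=np.array([1.0, 2.0, 0.5]))
node.integrate(p_hit=0.8, class_evidence=np.array([0.0, 1.0, 0.0]))
node.integrate(p_hit=0.8, class_evidence=np.array([0.0, 1.0, 0.0]))
```

Two agreeing measurements push the occupancy toward 1 and concentrate the class distribution, which is exactly the decoupling of occupancy and semantics that the graph nodes provide.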
2. Structural Representations and Node Construction
The construction of 3D semantic occupancy graphs differs by application and data modality. Four dominant paradigms are observed in recent literature:
- Dense Voxel Grids (+CRF): Grids centered at the agent are indexed in 3D, with per-voxel occupancy and class label probabilities updated incrementally using image priors, and organized into a graph for conditional random field (CRF) optimization (Yang et al., 2017).
- Sparse Voxel Graphs: Sparse COO representations retain only nonzero-occupancy voxels as nodes, each with position, features, occupancy, and semantic probabilities; adjacency is defined via spatial proximity (e.g., 6- or 26-connectivity), and edge weights depend on geometric and occupancy features (Tang et al., 2024).
- Gaussian Splatting Graphs: Learnable Gaussians parameterize local spatial extents, positions, and semantic features. Two graphs are built: a geometric graph via spatially adaptive radii (scaled by covariance), and a semantic graph based on top-M cosine similarities in feature space; graph transformer layers refine the set (Song et al., 13 Jun 2025).
- Adaptive Octree-Graphs: Object instances are represented as root-centered octrees adaptively subdivided to match geometric complexity. Each octree acts as a node, storing a single semantic descriptor derived from fused vision-language features; edges encode spatial adjacency and directional relations (Wang et al., 2024).
The table below summarizes core node types:
| Approach | Node Definition | Node Features |
|---|---|---|
| Voxel-CRF (Yang et al., 2017) | Voxel (x, y, z) | Log-odds, class probs, color |
| SparseOcc (Tang et al., 2024) | Active voxel (pᵢ) | Feature fᵢ, occupancy oᵢ, semantic probs sᵢ |
| GraphGSOcc (Song et al., 13 Jun 2025) | Gaussian (μᵢ, Σᵢ) | Position, extent, opacity, semantic vector, hᶠᵢ |
| Octree-Graph (Wang et al., 2024) | Adaptive octree per object | Octree, center, semantic feature (IFA-aggregated) |
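The sparse-voxel paradigm above can be sketched in a few lines: keep only voxels whose occupancy exceeds a threshold as COO-style coordinate/value pairs, then link them by 6-connectivity. This is a hedged illustration, not the SparseOcc implementation; function names and the threshold are ours.

```python
import numpy as np

def dense_to_sparse_coo(grid: np.ndarray, thresh: float = 0.5):
    """Convert a dense occupancy grid to a sparse COO node set.

    Only voxels with occupancy above `thresh` become graph nodes, in the
    spirit of sparse voxel graphs. Returns integer coordinates (N, 3) and
    the occupancy value of each retained voxel (N,).
    """
    coords = np.argwhere(grid > thresh)          # (N, 3) active voxel indices
    values = grid[tuple(coords.T)]               # (N,) occupancy per node
    return coords, values

def six_connectivity_edges(coords: np.ndarray):
    """Undirected edges between active voxels that are axis-aligned
    immediate neighbors (6-connectivity)."""
    index = {tuple(c): i for i, c in enumerate(coords)}
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
               (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    edges = []
    for i, c in enumerate(coords):
        for d in offsets:
            j = index.get((c[0] + d[0], c[1] + d[1], c[2] + d[2]))
            if j is not None and i < j:          # emit each edge once
                edges.append((i, j))
    return edges

grid = np.zeros((4, 4, 4))
grid[0, 0, 0] = 0.9
grid[0, 0, 1] = 0.8
grid[3, 3, 3] = 0.7   # isolated voxel: becomes a node with no edges
coords, values = dense_to_sparse_coo(grid)
edges = six_connectivity_edges(coords)
```

Storage and downstream computation then scale with the number of active voxels rather than the full grid volume, which is the efficiency argument made throughout this section.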
3. Edge Construction and Graph Topology
Edge definition shapes the expressive power and sparsity of the resulting graph:
- Spatial Adjacency: In voxel approaches, connect each node to immediate neighbors (Chebyshev or Euclidean distance ≤ 1) for local region continuity (Tang et al., 2024, Yang et al., 2017).
- Geometric Proximity: In Gaussian graphs, nodes are connected if Euclidean distance is below radius rᵢ adaptively set by covariance, optionally KNN-constrained (Song et al., 13 Jun 2025).
- Semantic Similarity: For fine semantic relationships, connect nodes if their semantic feature vectors yield top-M cosine similarity (Song et al., 13 Jun 2025).
- Object Relation: In instance- or octree-based graphs, edges link nodes whose centers lie within a threshold radius, and store distance, direction, and qualitative spatial relations (e.g., left, behind) (Wang et al., 2024).
These multi-criteria edge sets enable both local contextual reasoning and long-range semantic unification, as leveraged in dual-graph transformer architectures.
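The two complementary edge criteria can be illustrated side by side: a geometric graph from per-node adaptive radii, and a semantic graph from top-M cosine similarity of feature vectors. A minimal sketch with illustrative names, assuming precomputed positions, radii, and features:

```python
import numpy as np

def geometric_edges(positions, radii):
    """Directed edge i -> j when j lies within i's adaptive radius r_i,
    mirroring the covariance-scaled proximity rule described above."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    mask = (dist < radii[:, None]) & ~np.eye(n, dtype=bool)
    return [(i, j) for i in range(n) for j in range(n) if mask[i, j]]

def semantic_edges(features, top_m=2):
    """Connect each node to its top-M neighbors by cosine similarity
    of semantic feature vectors."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)               # exclude self-similarity
    edges = []
    for i in range(len(features)):
        for j in np.argsort(sim[i])[::-1][:top_m]:
            edges.append((i, int(j)))
    return edges

positions = np.array([[0.0, 0, 0], [0.5, 0, 0], [5.0, 0, 0]])
radii = np.array([1.0, 1.0, 1.0])
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
geo = geometric_edges(positions, radii)
sem = semantic_edges(features, top_m=1)
```

Note that the distant third node has no geometric edges but still receives a semantic edge, which is how long-range semantic unification coexists with local contextual reasoning.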
4. Graph-Based Neural Architectures
Semantic occupancy graphs serve as substrates for advanced neural network models, particularly transformer-based and CRF-based architectures:
- Graph Transformers with Dual Graphs: GraphGSOcc employs layers that alternately attend over geometric and semantic graphs, fusing their updates adaptively. Coarse-grained attention in lower layers captures object-level topology; finer-grained attention in higher layers sharpens boundaries (Song et al., 13 Jun 2025).
  - Dynamic-static decoupling optimizes context propagation separately for movable and static objects, with independent projection weights and adaptive fusion per node.
- Sparse Transformers: SparseOcc decodes only nonzero voxels with a sparse transformer head, using trilinear feature interpolation and efficient attention masking, avoiding computation on empty space (Tang et al., 2024).
- High-order Graphical Models: CRF-based approaches define unary, pairwise, and high-order clique potentials over the voxel graph, with efficient mean-field inference and pairwise terms reflecting both feature and geometric compatibility (Yang et al., 2017).
- Instance-level Aggregation: Octree-Graph aggregates multi-view vision-language features at the instance level using self-attention and clustering, assigning this descriptor to the corresponding adaptive octree node (Wang et al., 2024).
A plausible implication is that graph-based neural modules can exploit rich geometric and semantic structure without incurring dense computation, facilitating higher fidelity at lower memory and compute budgets.
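One round of neighborhood-restricted attention illustrates that implication: each node attends only over its adjacency list (plus itself), so cost scales with edge count rather than all pairs. This is a hedged sketch of the general mechanism, not the GraphGSOcc or SparseOcc attention head; running it once on the geometric graph and once on the semantic graph, then fusing, would approximate the dual-graph pattern.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def graph_attention_step(feats, neighbors):
    """One step of scaled dot-product attention restricted to graph
    neighborhoods: node i attends over itself and its listed neighbors,
    so no computation is spent on the dense all-pairs grid."""
    d = feats.shape[1]
    out = np.empty_like(feats)
    for i, nbrs in enumerate(neighbors):
        idx = [i] + list(nbrs)                   # include self
        keys = feats[idx]                        # (k, d) neighbor features
        scores = keys @ feats[i] / np.sqrt(d)    # scaled dot product
        out[i] = softmax(scores) @ keys          # weighted neighbor average
    return out

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1], [0, 2], [1]]                   # sparse adjacency lists
updated = graph_attention_step(feats, neighbors)
```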
5. Training, Inference, and Decoding
Supervision and inference in 3D semantic occupancy graphs proceed in a manner closely tied to the underlying representation:
- Gaussian Splatting: After graph-transformer embedding refinement, the per-class occupancy at a query point $\mathbf{x}$ is obtained by splatting each Gaussian's density weighted by its semantic logits:

  $$\hat{o}_c(\mathbf{x}) = \sum_i g_i(\mathbf{x})\, s_{i,c}, \qquad g_i(\mathbf{x}) = \alpha_i \exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x}-\mu_i)\Big),$$

  where $\mu_i$, $\Sigma_i$, $\alpha_i$, and $s_{i,c}$ denote the $i$-th Gaussian's mean, covariance, opacity, and semantic weight for class $c$ (Song et al., 13 Jun 2025).
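The splatting step can be sketched directly: sum each Gaussian's opacity-scaled density at the query point, weighted by its per-class semantic weights. Variable names here are illustrative, not the cited system's API.

```python
import numpy as np

def splat_occupancy(x, mus, covs, alphas, sem_logits):
    """Evaluate per-class occupancy at query point x by summing each
    Gaussian's density (scaled by its opacity) times its semantic weights.
    Minimal sketch; a real renderer would batch and prune this loop."""
    acc = np.zeros(sem_logits.shape[1])
    for mu, cov, alpha, s in zip(mus, covs, alphas, sem_logits):
        d = x - mu
        g = alpha * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
        acc += g * s
    return acc

mus = np.array([[0.0, 0, 0], [2.0, 0, 0]])
covs = np.array([np.eye(3) * 0.25, np.eye(3) * 0.25])
alphas = np.array([1.0, 1.0])
sem_logits = np.array([[1.0, 0.0], [0.0, 1.0]])  # two classes, one per Gaussian
occ = splat_occupancy(np.zeros(3), mus, covs, alphas, sem_logits)
```

At the first Gaussian's center, that Gaussian dominates and the query point inherits its class weighting, while the distant Gaussian contributes only a vanishing tail.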
- Sparse Voxel Decoding: SparseOcc scatters mask logits onto active voxel coordinates, aggregates via learned multi-scale interpolation, and classifies via a sparse transformer (Tang et al., 2024).
- CRF Graphs: Unary terms from 2D CNN segmentation guide label propagation; pairwise/high-order CRF terms penalize spatial, appearance, and instance-inconsistent assignments. Efficient mean-field updates implement iterative refinement (Yang et al., 2017).
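A mean-field update of this kind can be sketched with a Potts-style pairwise penalty: each node's class distribution is re-estimated from its unary term plus disagreement messages from its graph neighbors. This is a simplified illustration of the general scheme, not the cited system's potentials.

```python
import numpy as np

def mean_field_step(Q, unary, neighbors, pairwise_weight=1.0):
    """One mean-field update over the voxel graph. Q holds each node's
    current class beliefs; unary holds negative log-probabilities from
    the 2D segmentation; neighbors is an adjacency list."""
    new_Q = np.empty_like(Q)
    for i, nbrs in enumerate(neighbors):
        # message: expected cost of disagreeing with each neighbor's belief
        msg = sum(pairwise_weight * (1.0 - Q[j]) for j in nbrs) if nbrs else 0.0
        logits = -unary[i] - msg
        z = np.exp(logits - logits.max())
        new_Q[i] = z / z.sum()
    return new_Q

# three voxels in a chain; the middle voxel has an ambiguous unary term
unary = np.array([[0.1, 2.0], [1.0, 1.0], [0.1, 2.0]])   # negative log-probs
neighbors = [[1], [0, 2], [1]]
Q = np.full((3, 2), 0.5)
for _ in range(5):
    Q = mean_field_step(Q, unary, neighbors)
```

After a few iterations the confident end voxels pull the ambiguous middle voxel toward their label, which is exactly the label-propagation behavior the pairwise terms are designed to produce.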
- Octree Path Queries: Occupancy is queried per octree node or recursively within leaves. Semantic retrieval leverages learned descriptors and spatial relations (Wang et al., 2024).
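The recursive per-leaf occupancy query can be sketched as a descent from the root cell to the leaf containing the query point. This is an illustrative structure, not the cited Octree-Graph implementation; points outside the root cell are treated as free space here.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OctreeNode:
    """Adaptive octree cell: either a leaf holding an occupancy value, or
    an interior node with children covering subdivided regions."""
    center: tuple
    half_size: float
    occupancy: float = 0.0
    children: Optional[List["OctreeNode"]] = None

def query_occupancy(node: OctreeNode, point: tuple) -> float:
    """Recursively descend to the leaf containing `point` and return its
    occupancy; queries outside the root cell count as free space."""
    for c, p in zip(node.center, point):
        if abs(p - c) > node.half_size:
            return 0.0                       # outside this cell
    if node.children is None:
        return node.occupancy                # leaf reached
    for child in node.children:              # descend into the covering child
        if all(abs(p - c) <= child.half_size
               for c, p in zip(child.center, point)):
            return query_occupancy(child, point)
    return 0.0                               # gap not covered by any child

leaf_occ = OctreeNode(center=(0.5, 0.5, 0.5), half_size=0.5, occupancy=0.9)
leaf_free = OctreeNode(center=(1.5, 0.5, 0.5), half_size=0.5, occupancy=0.0)
root = OctreeNode(center=(1.0, 1.0, 1.0), half_size=1.0,
                  children=[leaf_occ, leaf_free])
```

Because subdivision is adaptive, simple objects stay shallow while geometrically complex ones recurse deeper, which is what keeps per-object storage small.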
Loss functions commonly include standard cross-entropy per voxel or node, with optional terms for geometric regularization, temporal consistency, and covariance penalty.
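The base cross-entropy term is standard; a numerically stable per-node version can be sketched as follows, with regularizers added on top as the text notes.

```python
import numpy as np

def node_cross_entropy(logits, labels):
    """Mean per-node cross-entropy over graph nodes (log-sum-exp form
    for numerical stability). Geometric, temporal, or covariance
    regularizers would be summed with this base term."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.0], [0.0, 3.0]])  # two nodes, two classes
labels = np.array([0, 1])                    # both predictions are correct
loss = node_cross_entropy(logits, labels)
```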
6. Empirical Performance and Benchmarks
Recent advancements demonstrate state-of-the-art accuracy and substantial efficiency gains by leveraging 3D semantic occupancy graphs:
- GraphGSOcc: On SurroundOcc-nuScenes (17 classes, 0.4 m voxels), improves mIoU by +1.97% over GaussianWorld while reducing memory usage to 6.1 GB (–13.7%) (Song et al., 13 Jun 2025).
- SparseOcc: On nuScenes-Occupancy, improves mIoU by +1.3% over a dense baseline while substantially reducing FLOPs. On SemanticKITTI, it reaches similar accuracy at roughly half the computation (Tang et al., 2024).
- Octree-Graph: On ScanNet, boosts zero-shot 3D semantic segmentation mIoU by up to +17.1% and path-planning success at 0.25 m endpoints to 97%. Storage is reduced to <0.1 MB per object, versus >6 MB for point clouds (Wang et al., 2024).
- Voxel-CRF: On KITTI, the hierarchical CRF over the voxel graph improves mean IoU by 15% over prior 3D systems, with mean-field updates feasible in near-real time on large voxel maps (Yang et al., 2017).
Ablative studies confirm the value of dual graph attention, multi-scale transformer hierarchies, and high-order CRF terms for both accuracy gains and robustness.
7. Applications and Significance
3D semantic occupancy graphs are integral to multiple downstream applications:
- Autonomous Driving: Provide explicit modeling of free and occupied space, essential for navigation, collision avoidance, and trajectory planning (Song et al., 13 Jun 2025, Tang et al., 2024).
- Robotic Navigation: Support efficient incremental mapping, semantic path planning, and dynamic/static object decoupling (Yang et al., 2017).
- Open-vocabulary Scene Understanding: Enable zero-shot semantic retrieval, object-centric queries, and language–vision integration in complex environments (Wang et al., 2024).
- Graph-based Reasoning: Facilitate efficient neural and probabilistic operations via GNNs, transformer heads, or CRFs, leveraging the structured graph substrate for high-level reasoning and generalization.
The recurring trend is a shift from dense, monolithic grids to lightweight, adaptive, and semantically expressive graph-based representations—balancing memory, accuracy, and versatility for a broad range of scene understanding tasks.