GS3LAM: Gaussian Semantic Splatting SLAM

Published 29 Mar 2026 in cs.CV | (2603.27781v1)

Abstract: Recently, the multi-modal fusion of RGB, depth, and semantics has shown great potential in dense Simultaneous Localization and Mapping (SLAM). However, a prerequisite for generating consistent semantic maps is the availability of dense, efficient, and scalable scene representations. Existing semantic SLAM systems based on explicit representations are often limited by resolution and an inability to predict unknown areas. Conversely, implicit representations typically rely on time-consuming ray tracing, failing to meet real-time requirements. Fortunately, 3D Gaussian Splatting (3DGS) has emerged as a promising representation that combines the efficiency of point-based methods with the continuity of geometric structures. To this end, we propose GS3LAM, a Gaussian Semantic Splatting SLAM framework that processes multimodal data to render consistent, dense semantic maps in real-time. GS3LAM models the scene as a Semantic Gaussian Field (SG-Field) and jointly optimizes camera poses and the field via multimodal error constraints. Furthermore, a Depth-adaptive Scale Regularization (DSR) scheme is introduced to resolve misalignments between scale-invariant Gaussians and geometric surfaces. To mitigate catastrophic forgetting, we propose a Random Sampling-based Keyframe Mapping (RSKM) strategy, which demonstrates superior performance over common local covisibility optimization methods. Extensive experiments on benchmark datasets show that GS3LAM achieves increased tracking robustness, superior rendering quality, and enhanced semantic precision compared to state-of-the-art methods. Source code is available at https://github.com/lif314/GS3LAM.

Authors (4)

Summary

The paper introduces a Semantic Gaussian Field that models scene elements as 3D Gaussians, enabling efficient joint geometric and semantic optimization.
It employs Depth-Adaptive Scale Regularization (DSR) to sharpen semantic boundaries and Random Sampling-Based Keyframe Mapping (RSKM) to reduce optimization bias, both enhancing map fidelity.
Benchmark results show that GS3LAM achieves higher PSNR, mIoU, and rendering FPS than NeRF-based and other 3DGS-based SLAM methods.

GS³LAM: A Detailed Analysis of Gaussian Semantic Splatting SLAM

Introduction and Motivation

GS³LAM is a dense semantic SLAM framework that fuses RGB, depth, and semantic cues on top of a 3D Gaussian Splatting (3DGS) scene representation. Prior approaches built on explicit representations (points, surfels, meshes) are limited in spatial resolution and cannot predict unobserved regions, which hinders dense semantic mapping. NeRF-based implicit methods improve continuity but rely on time-consuming ray-based volume rendering, which prevents real-time SLAM deployment. GS³LAM exploits the efficiency, locality, and editability of 3DGS while maintaining geometric-semantic consistency and scalability, targeting robust, real-time, high-fidelity dense semantic mapping.

Figure 1: Framework overview of GS³LAM illustrating SG-Field modeling, adaptive Gaussian expansion, RSKM, and DSR integration.

Semantic Gaussian Field (SG-Field)

GS³LAM models the scene as a Semantic Gaussian Field (SG-Field), encoding each scene element as a parameterized 3D Gaussian that carries spatial, appearance, and low-dimensional implicit semantic features. A lightweight CNN decodes the semantic features into categorical segmentation labels. This design enables fast conversion between 3D features and 2D semantic labels, so geometry and semantics can be optimized jointly.
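To make the decoding step concrete, here is a minimal sketch that assumes the simplest possible decoder: a single 1x1 convolution, which is the same linear map applied at every pixel. The summary only says "lightweight CNN", so the layer choice, the 16-dimensional feature size, the 5 classes, and the name `decode_semantics` are all illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def decode_semantics(feature_map, weight, bias):
    """Map a rendered semantic feature map (H, W, D) to class labels (H, W).

    Assumption: the decoder is one 1x1 convolution, i.e. a per-pixel
    linear map from D feature channels to C class logits.
    """
    logits = feature_map @ weight + bias  # shape (H, W, C)
    return logits.argmax(axis=-1)         # most likely class per pixel

# Hypothetical sizes: 4x6 image, 16-dim features, 5 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6, 16))
W = rng.normal(size=(16, 5))
b = np.zeros(5)
labels = decode_semantics(features, W, b)
```

Keeping the per-Gaussian feature low-dimensional and decoding only after rendering means the expensive per-class computation happens once per pixel rather than once per Gaussian.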

Splatting-based differentiable rendering is central: RGB, depth, and semantic features are projected onto the image plane and $\alpha$-blended in front-to-back order, accumulating contributions per pixel. This procedure supports joint optimization of camera poses and field attributes under appearance, geometric, and semantic losses with tractable gradient flow.
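The front-to-back blending described above can be sketched for a single pixel as follows. This is the standard compositing rule used in 3DGS-style renderers, where each Gaussian contributes with weight $\alpha_i \prod_{j<i}(1-\alpha_j)$. The function name and the channel layout (RGB, depth, and a 2-dimensional semantic feature stacked into one vector) are assumptions made for illustration.

```python
import numpy as np

def composite_front_to_back(alphas, values):
    """Blend per-Gaussian values for one pixel.

    alphas: (N,) opacities of the Gaussians covering the pixel, sorted
            from nearest to farthest (N >= 1).
    values: (N, K) per-Gaussian channels (e.g. RGB, depth, semantics).
    Returns the blended (K,) vector and the (N,) blending weights.
    """
    alphas = np.asarray(alphas, dtype=float)
    values = np.asarray(values, dtype=float)
    # Transmittance before Gaussian i: T_i = prod_{j<i} (1 - alpha_j)
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ values, weights

# Three Gaussians; columns are [R, G, B, depth, sem_0, sem_1] (hypothetical).
alphas = [0.6, 0.5, 0.9]
values = [
    [1.0, 0.0, 0.0, 1.2, 0.9, 0.1],
    [0.0, 1.0, 0.0, 1.5, 0.2, 0.8],
    [0.0, 0.0, 1.0, 3.0, 0.1, 0.9],
]
pixel, weights = composite_front_to_back(alphas, values)
```

Because colour, depth, and semantic features share the same blending weights, a gradient from any of the three losses reaches the same opacities and positions, which is what makes joint optimization possible.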

Depth-Adaptive Scale Regularization (DSR)

The SG-Field suffers from scale misalignment: irregular Gaussian scales do not conform to the underlying geometric surfaces, which degrades surface fidelity, notably at semantic boundaries. DSR constrains Gaussian scales to a depth-dependent interval using statistically derived thresholds ($\mu_s \pm 2\sigma_s$), which reduces boundary blur, sharpens surface edges, and enforces explicit spatial alignment of geometry and semantics. Ablations show that DSR improves PSNR, mIoU, and tracking precision.
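One way to picture an interval constraint of the form $\mu_s \pm 2\sigma_s$ is a hinge penalty that is zero inside the interval and grows linearly outside it. The sketch below is an assumption about the general shape of such a regularizer, not the paper's loss: it computes $\mu_s$ and $\sigma_s$ directly from the scales it is given, and it does not model how the thresholds depend on depth, which this summary does not specify. The name `scale_interval_penalty` is hypothetical.

```python
import numpy as np

def scale_interval_penalty(scales, k=2.0):
    """Hinge penalty for scales outside [mu - k*sigma, mu + k*sigma].

    scales: (N,) one scalar scale per Gaussian (assumed representation).
    Returns the mean penalty and the (lower, upper) interval bounds.
    """
    scales = np.asarray(scales, dtype=float)
    mu, sigma = scales.mean(), scales.std()
    lower, upper = mu - k * sigma, mu + k * sigma
    too_small = np.maximum(lower - scales, 0.0)  # distance below the interval
    too_large = np.maximum(scales - upper, 0.0)  # distance above the interval
    return float(np.mean(too_small + too_large)), (lower, upper)

# Nine well-behaved Gaussians and one oversized outlier.
penalty, bounds = scale_interval_penalty([1.0] * 9 + [10.0])
```

In this toy input only the oversized Gaussian falls outside the interval, so it alone is penalized.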

Figure 2: DSR ablation highlights reduction of blurring and improved spatial boundary precision.

Random Sampling-Based Keyframe Mapping (RSKM)

Incremental optimization in 3DGS-based SLAM often exhibits catastrophic forgetting: co-visible regions are overfit while sparsely observed areas are under-optimized, resulting in optimization bias and map inconsistency. Standard Local Covisibility Keyframe Mapping (LCKM) introduces high variance in PSNR and spatial reliability. RSKM, a probability-weighted random sampling strategy, improves global convergence, yields higher mean PSNR, and drastically lowers PSNR variance. Empirically, RSKM mitigates forgetting and fosters consistent semantic-geometric reconstruction across all viewpoints.
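The contrast between the two selection strategies can be sketched as below. This is a simplified illustration under stated assumptions: `local_window` stands in for covisibility-based selection by simply taking the most recent keyframes, and `sample_keyframes` draws from the whole history with optional weights. How GS³LAM actually assigns sampling probabilities is not given in this summary, so the `weights` argument is a placeholder.

```python
import numpy as np

def sample_keyframes(num_keyframes, batch_size, weights=None, rng=None):
    """Draw distinct keyframe indices from the whole history.

    weights: optional non-negative per-keyframe weights (hypothetical);
             None means uniform sampling.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = None
    if weights is not None:
        w = np.asarray(weights, dtype=float)
        probs = w / w.sum()  # normalize to a probability distribution
    k = min(batch_size, num_keyframes)
    return rng.choice(num_keyframes, size=k, replace=False, p=probs)

def local_window(num_keyframes, batch_size):
    """Simplified stand-in for local selection: the most recent keyframes."""
    return np.arange(max(0, num_keyframes - batch_size), num_keyframes)

rng = np.random.default_rng(0)
global_batch = sample_keyframes(50, 8, rng=rng)  # spread over all 50 keyframes
recent_batch = local_window(50, 8)               # always indices 42..49
```

A local window revisits the same recent views at every mapping step, whereas random sampling over the full history gives every keyframe a chance to be refreshed, which matches the intuition behind the reported reduction in forgetting.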

Figure 3: Optimization bias on Replica "Office 3", contrasting LCKM and RSKM strategies; RSKM reduces bias and enhances global map fidelity.

Figure 4: RSKM ablation visualizing PSNR increases and variance reduction, confirming higher rendering consistency.

Rendering, Tracking, and Semantic Performance

GS³LAM reports strong results across benchmarks. On Replica, it achieves an average PSNR of 36.26 dB, SSIM of 0.989, and LPIPS of 0.052, outperforming prior NeRF- and 3DGS-based methods, with notably higher accuracy in boundary regions and edge rendering. Semantic reconstruction attains an mIoU of 96.63%, outperforming SNI-SLAM, DNS-SLAM, NIDS-SLAM, and the concurrent SGS-SLAM, SemGauss-SLAM, and NEDS-SLAM by up to 9.22%. Tracking error (ATE RMSE) is competitive, with only minor degradation attributed to semantic-focused field optimization.
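For readers unfamiliar with the two headline metrics, the sketch below shows their textbook definitions: PSNR as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ and mIoU as the mean of per-class intersection over union. The paper's exact evaluation protocol (image range, which classes are averaged, per-frame versus per-scene averaging) is not stated in this summary and may differ; the function names are illustrative.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(reference, dtype=float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    pred, target = np.asarray(pred), np.asarray(target)
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, target == c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        intersection = np.logical_and(pred == c, target == c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

example_psnr = psnr(np.full((8, 8), 0.1), np.zeros((8, 8)))
example_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```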

Figure 5: Qualitative comparison with SOTA methods on virtual Replica scenes illustrating GS³LAM's accurate geometry, semantics, and appearance.

GS³LAM demonstrates real-time rendering throughput: 109.12 FPS at $1200 \times 680$ resolution on Replica and 499.78 FPS on ScanNet ($640 \times 480$), far exceeding NeRF-based approaches while maintaining superior spatial consistency (see the tables in the paper).

Figure 6: Semantic Gaussian fields constructed by GS³LAM, visualizing robust tracking and high-fidelity rendering.

Figure 7: Decoupled semantic, geometric, and appearance maps from GS³LAM, evidencing consistency for downstream real-time tasks.

Theoretical Contributions and Optimization Analysis

GS³LAM provides explicit analytic Jacobians for camera pose optimization, leveraging chain rule differentiation through SG-Field projections, covariance updates, and splatting accumulations. Optimization bias assessment demonstrates the influence of sampling strategies on global map consistency; empirical results substantiate the superiority of RSKM in both local convergence and global variance minimization.

Practical and Theoretical Implications

GS³LAM advances semantic SLAM by unifying real-time renderability, scene continuity, and semantic-geometric alignment. The SG-Field structure is amenable to efficient expansion, pruning, and multimodal feature fusion, supporting scalable deployment in robotics, AR/VR, and autonomous navigation. The framework exposes avenues for further research: integration of higher-level features (e.g., CLIP or DINOv2), optimization of Gaussian attributes via neural MLPs for robustness to real-world sensor noise, and open-vocabulary semantic mapping via language embeddings.

Future Directions

Continued research may focus on:

Generalization to Open-Vocabulary Semantics: Embedding large-scale language features for zero-shot semantic segmentation.
Hybrid Neural-Explicit Representations: Incorporating neural point attributes for real-world sensor robustness.
Memory-Efficient Incremental Mapping: Dynamic memory allocation and field pruning for lifelong SLAM.
Uncertainty Quantification: Gaussian variance modeling for uncertainty-aware semantic reconstruction.

Conclusion

GS³LAM establishes an efficient, scalable, and accurate dense semantic SLAM system, leveraging the strengths of Gaussian Splatting for joint geometric-semantic optimization. The introduced SG-Field, DSR, and RSKM mechanisms contribute measurably to spatial fidelity and global rendering consistency. Empirical evaluations confirm its advancement over prior NeRF- and 3DGS-based SLAM frameworks. GS³LAM positions itself as a viable foundation for real-time multimodal mapping and as a launchpad for subsequent developments in semantic SLAM and 3D scene understanding.
