OpenGS-SLAM: 3D Gaussian-based SLAM

Updated 6 March 2026

OpenGS-SLAM is a dense SLAM system that represents scenes with anisotropic 3D Gaussians, enabling real-time tracking and high-quality image rendering.
It optimizes scene parameters through differentiable splatting, integrating photometric, geometric, and semantic information for enhanced mapping accuracy.
The framework extends to visual-inertial and monocular setups, demonstrating state-of-the-art performance on diverse benchmarks in both geometric and semantic domains.

OpenGS-SLAM refers to a class of dense SLAM (Simultaneous Localization and Mapping) systems based on 3D Gaussian Splatting (3DGS) that achieve robust, real-time tracking, high-fidelity rendering, and, in some variants, open-vocabulary semantic mapping in diverse environments. These systems explicitly represent the scene as a set of anisotropic 3D Gaussians and optimize this representation using differentiable splatting renderers. OpenGS-SLAM methods demonstrate state-of-the-art performance on both geometric and semantic SLAM tasks, offer efficient open-source implementations, and are extensible to a wide range of visual and multi-modal settings, including outdoor, monocular, and visual-inertial inputs (Yu et al., 21 Feb 2025, Yang et al., 3 Mar 2025, Yugay et al., 2023, Zhu et al., 2 Dec 2025, Yoo et al., 9 Dec 2025, Yan et al., 2023).

1. Scene Representation via 3D Gaussian Splatting

OpenGS-SLAM systems represent the environment as a set of $N$ anisotropic 3D Gaussian primitives:

$G_i = (\mu_i \in \mathbb{R}^3,\, \Sigma_i \in \mathbb{R}^{3 \times 3},\, o_i \in [0,1],\, c_i \in \mathbb{R}^3)$

where $\mu_i$ is the Gaussian mean (position), $\Sigma_i$ its covariance (shape/orientation/scale), $o_i$ an opacity or density weight, and $c_i$ the RGB color. Advanced systems (e.g., OpenGS-SLAM for open-set semantics) attach explicit integer-valued semantic labels $\ell_i$ to each Gaussian, supporting label-aware rendering and real-time semantic updating (Yang et al., 3 Mar 2025).
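The per-Gaussian parameters can be illustrated with a small sketch. It assumes the factorization $\Sigma = R S S^\top R^\top$ (rotation from a unit quaternion, per-axis scales on the diagonal of $S$) that is common in 3DGS implementations; the text above does not state which parameterization each OpenGS-SLAM variant uses, and the helper names and dictionary layout here are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def build_covariance(scale, quat):
    """Assumed parameterization: Sigma = R S S^T R^T (symmetric PSD by construction)."""
    R = quat_to_rotmat(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scale, dtype=float))
    M = R @ S
    return M @ M.T

# One illustrative primitive G_i = (mu_i, Sigma_i, o_i, c_i) plus an optional label.
gaussian = {
    "mu": np.array([0.5, -0.2, 2.0]),                  # position in world frame
    "sigma": build_covariance([0.3, 0.1, 0.05],        # per-axis standard deviations
                              [0.9239, 0.0, 0.0, 0.3827]),  # about 45 deg around z
    "opacity": 0.8,                                    # o_i in [0, 1]
    "color": np.array([0.9, 0.4, 0.1]),                # RGB
    "label": 3,                                        # integer semantic label (semantic variants)
}
```

Optimizing scales and a quaternion instead of the raw matrix keeps $\Sigma_i$ symmetric and positive semi-definite throughout gradient descent, which is the usual motivation for this choice.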

Rendering is performed by projecting each Gaussian to the image plane as a 2D ellipse and accumulating its contribution using front-to-back $\alpha$-blending:

$C(p) = \sum_{i \in N_p} c_i\,\alpha_i\,\prod_{j<i}(1-\alpha_j)$

where $\alpha_i$ is the effective opacity of Gaussian $i$ at pixel $p$ (its opacity $o_i$ weighted by the projected 2D Gaussian falloff), and $N_p$ is the depth-sorted list of Gaussians overlapping $p$. Color, depth, and (for semantic variants) per-pixel label maps are all computed in this compositing pipeline. Analytical gradients of the renderer allow end-to-end differentiable optimization of both scene parameters and camera poses (Yan et al., 2023, Yu et al., 21 Feb 2025).
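The compositing formula can be checked on a single pixel with a few lines of NumPy. This is a minimal sketch, not a tile-based GPU rasterizer: it assumes the Gaussians covering the pixel are already sorted near to far and that their per-pixel $\alpha_i$ values have been computed. The function name is hypothetical.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending for one pixel.

    colors: (M, 3) colors c_i of the Gaussians covering the pixel, near to far.
    alphas: (M,) per-pixel opacities alpha_i in the same order.
    Returns the blended color C(p) and the blending weights alpha_i * T_i.
    """
    colors = np.asarray(colors, dtype=float).reshape(-1, 3)
    alphas = np.asarray(alphas, dtype=float).reshape(-1)
    if alphas.size == 0:
        return np.zeros(3), alphas
    # Transmittance T_i = prod_{j<i} (1 - alpha_j); T_0 = 1 for the nearest Gaussian.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return (weights[:, None] * colors).sum(axis=0), weights

# Three Gaussians (red, green, blue) along one ray, nearest first.
color, weights = composite_pixel(
    colors=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    alphas=[0.5, 0.5, 1.0],
)
print(color)    # [0.5  0.25 0.25]
print(weights)  # [0.5  0.25 0.25]

# The same weights blend any per-Gaussian quantity, e.g. depth.
depth = float(weights @ np.array([1.0, 2.0, 4.0]))
print(depth)    # 2.0
```

Reusing the weights for depth (and, in semantic variants, labels) is why those maps come out of the same compositing pass.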

2. Core System Architecture and Pipeline

The canonical OpenGS-SLAM pipeline integrates several core modules:

Tracking: Estimation of camera pose for each frame via direct photometric/geometric residuals between observed and rendered images from the current Gaussian map. Methods include coarse-to-fine optimization and, for RGB-only SLAM, learning-based pointmap initialization followed by PnP+RANSAC and differentiable photometric refinement (Yu et al., 21 Feb 2025, Yan et al., 2023).
Mapping: When a new keyframe is detected, new Gaussians are seeded adaptively at unexplained (low-coverage) regions or where the current map is inconsistent with observations. Existing Gaussians may be pruned if they diverge from new geometry (Yugay et al., 2023, Yan et al., 2023).
Optimization: Joint or local bundle adjustment optimizes Gaussian parameters and camera poses, usually via a sum of rendering-based photometric, geometric, and regularization losses.
Label Fusion (semantic variants): Explicit 2D-to-3D label propagation and consensus, leveraging foundational models (e.g., RAM, YOLO-World, SAM) for open-set semantics (Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025).
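The adaptive seeding rule in the mapping step can be sketched as a per-pixel mask. The sketch assumes an RGB-D keyframe, a rendered accumulated-opacity map, and a rendered depth map. The function names and both thresholds are hypothetical and not taken from any of the cited systems.

```python
import numpy as np

def seeding_mask(accumulated_alpha, rendered_depth, observed_depth,
                 alpha_thresh=0.5, depth_err_thresh=0.1):
    """Pixels where new Gaussians should be seeded (illustrative rule).

    A pixel qualifies if it has a valid depth reading and is either poorly
    covered by the current map or rendered at a depth that disagrees with
    the sensor. Threshold values are placeholders.
    """
    valid = observed_depth > 0
    low_coverage = accumulated_alpha < alpha_thresh
    inconsistent = np.abs(rendered_depth - observed_depth) > depth_err_thresh
    return valid & (low_coverage | inconsistent)

def backproject(mask, depth, K):
    """Lift masked pixels to 3D camera-frame points with pinhole intrinsics K."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

# Tiny 2x2 example: one uncovered pixel, one depth mismatch, one invalid reading.
alpha = np.array([[0.9, 0.2], [0.95, 0.9]])
rendered = np.array([[1.0, 0.0], [1.0, 1.0]])
observed = np.array([[1.0, 2.0], [1.5, 0.0]])
K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])

mask = seeding_mask(alpha, rendered, observed)
new_means = backproject(mask, observed, K)  # candidate means for new Gaussians
```

In a full system the new means would be transformed by the keyframe pose into the world frame and given initial covariance, opacity, and color before joint optimization.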

The architectural foundation is extensible: for instance, visual-inertial OpenGS-SLAM pipelines integrate IMU preintegration, pose-velocity-bias optimization, and loop-closure pose-graph optimization with consistent Gaussian coordinate updates (Zhu et al., 2 Dec 2025).

3. Key Algorithmic Modules

Module	Functionality	Notable Techniques
Pointmap Regression Network	Learns per-frame dense 3D pointmaps for pose estimation in RGB-only SLAM	Vision Transformer encoder/decoder, cross-attention (Yu et al., 21 Feb 2025)
Adaptive Scale Mapper	Recovers metric scale in monocular or scale-drifting pointmaps	Triple-frame matching, geometric averaging (Yu et al., 21 Feb 2025)
Gaussian Voting Splatting	Propagates 2D label assignments incrementally into the 3D map	Top-K splat indexing, label voting (non-differentiable) (Yang et al., 3 Mar 2025)
Confidence-Based Label Consensus	Cross-view semantic label fusion using area-weighted confidence and consensus	Match/split/decay rules, partial/whole matching, confidence thresholds (Yang et al., 3 Mar 2025)
Segmentation Counter Pruning	Removes erroneous or overgrown Gaussians from ambiguous regions	Eigenvalue thresholding on covariance, semantic disagreement (Yang et al., 3 Mar 2025)
Sub-map Partitioning	Divides maps for memory/compute scaling	Active/frozen sub-map switching, meshing on closure (Yugay et al., 2023)
Visual-Inertial Joint Optimization	Fuses visual and inertial data for robust tracking	Sliding-window BA, time-varying bias, Sim(3) loop correction (Zhu et al., 2 Dec 2025)
Monocular Open-Set Semantics	Attaches open-vocabulary semantic features to Gaussians without 3D supervision	Visual foundation models (MASt3R, SAM, CLIP), memory bank (Yoo et al., 9 Dec 2025)

The mathematical and algorithmic details of each module are given in the cited papers.

4. Semantic Mapping and Open-Set Scene Understanding

OpenGS-SLAM advances open-set semantic SLAM via explicit per-Gaussian labeling, leveraging 2D foundational models for category-agnostic instance segmentation and unambiguous object-level parsing. The Gaussian Voting Splatting process updates labels on the 3D map in real time by assigning and fusing 2D semantic labels with confidence-based rules, avoiding the constraints of closed-set classifiers (Yang et al., 3 Mar 2025).
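A rough sketch of how 2D labels might be voted into the 3D map is shown below. It assumes the rasterizer exposes, for every pixel, the indices and blending weights of its top-K contributing Gaussians. Weighting each vote by the blending weight, the vote-table layout, and the function names are assumptions for illustration and are not the published update rule.

```python
import numpy as np

def vote_labels(votes, topk_ids, topk_weights, label_map):
    """Accumulate label votes from one segmented frame (non-differentiable).

    votes:        (N, L) running vote table, one row per Gaussian.
    topk_ids:     (H, W, K) indices of the top-K Gaussians per pixel, -1 if empty.
    topk_weights: (H, W, K) blending weights of those Gaussians.
    label_map:    (H, W) integer 2D labels, -1 where unlabeled.
    """
    k = topk_ids.shape[-1]
    ids = topk_ids.reshape(-1)
    weights = topk_weights.reshape(-1)
    labels = np.repeat(label_map.reshape(-1), k)  # one label per (pixel, slot)
    keep = (ids >= 0) & (labels >= 0)
    # np.add.at handles repeated (gaussian, label) pairs correctly.
    np.add.at(votes, (ids[keep], labels[keep]), weights[keep])
    return votes

def current_labels(votes):
    """Winning label per Gaussian, or -1 if it has received no votes."""
    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = -1
    return labels

# 3 Gaussians, 2 labels, a 1x2 image with K = 2 contributors per pixel.
votes = np.zeros((3, 2))
topk_ids = np.array([[[0, 1], [1, -1]]])
topk_weights = np.array([[[0.6, 0.3], [0.8, 0.0]]])
label_map = np.array([[0, 1]])

vote_labels(votes, topk_ids, topk_weights, label_map)
print(current_labels(votes))  # [ 0  1 -1]
```

Because the update is a scatter-add rather than a gradient step, labels can be refreshed at frame rate without touching the differentiable rendering losses.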

Semantic Consensus: Merges multi-view predictions by dynamically weighting confidences and decaying fragmentary labels, preventing label fragmentation and over-segmentation.
Pruning and Storage: Eigenvalue-based counter-splat pruning refines object boundaries and lowers storage cost by preventing “swelling” of Gaussians into ambiguous spaces.
Open-vocabulary Segmentation: Extensions such as OpenMonoGS-SLAM integrate memory-optimized CLIP embeddings and multi-scale mask fusion to achieve prompt-driven, open-set segmentation from monocular input without 3D semantic ground-truth (Yoo et al., 9 Dec 2025).
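The eigenvalue test behind the pruning step can be illustrated in isolation. The sketch below keeps only the geometric criterion (largest covariance eigenvalue above a threshold) and leaves out the semantic-disagreement counter. The function name and the `max_extent` value are hypothetical.

```python
import numpy as np

def prune_by_eigenvalue(covariances, max_extent=0.5):
    """Return a boolean keep-mask over Gaussians.

    covariances: (N, 3, 3) symmetric covariance matrices.
    A Gaussian is dropped when the standard deviation along its longest
    principal axis exceeds max_extent (scene units; placeholder value).
    """
    eigvals = np.linalg.eigvalsh(covariances)          # (N, 3), ascending order
    largest_std = np.sqrt(np.clip(eigvals[:, -1], 0.0, None))
    return largest_std <= max_extent

covs = np.stack([
    np.diag([0.01, 0.01, 0.01]),   # compact Gaussian, std 0.1 on every axis
    np.diag([1.00, 0.01, 0.01]),   # "swollen" Gaussian, std 1.0 along x
])
keep = prune_by_eigenvalue(covs)
print(keep)  # [ True False]
```

Dropping the second Gaussian removes a primitive that would otherwise bleed across object boundaries and also reduces the stored parameter count.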

Reported performance includes mIoU ≈ 62% (open-set, SAM-1.0), ≈2× lower memory, >10× faster label rendering (165 FPS), and SOTA novel-view segmentation (Yang et al., 3 Mar 2025).

5. Robustness Extensions: Monocular, RGB-Only, and Visual-Inertial SLAM

RGB-Only/Outdoor: OpenGS-SLAM for unbounded outdoor scenes eliminates reliance on depth input by employing a pointmap regression network and adaptive scale mapping. Joint optimization achieves ATE RMSE = 0.839 m on the Waymo dataset, reducing tracking error to 9.8% of that of previous 3DGS baselines, with significant improvements in PSNR and SSIM (Yu et al., 21 Feb 2025).
Monocular Open-Set Semantics: OpenMonoGS-SLAM unifies monocular 3DGS SLAM and open-set semantic features via visual foundation models. A memory mechanism allows efficient, on-the-fly language-driven semantic fusion. Achieved results include closed-set mIoU = 0.896, prompt-driven open-set IoU = 0.845, PSNR = 34.47, ATE RMSE = 1.6 cm on Replica, all without depth labels (Yoo et al., 9 Dec 2025).
Visual-Inertial: VIGS-SLAM achieves robust tracking under challenging conditions (motion blur, low texture) by integrating IMU preintegration, bias modeling, and visual-inertial optimization. It maintains ATE RMSE of 1.68–6.08 cm across multiple datasets, surpassing state-of-the-art visual-inertial SLAM systems (Zhu et al., 2 Dec 2025).
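IMU preintegration, mentioned for the visual-inertial variant, can be sketched as follows. This is a textbook-style discrete integration of the relative rotation, velocity, and position increments between two keyframes. It assumes constant biases over the interval and omits noise covariance propagation and bias Jacobians. It is not taken from the VIGS-SLAM implementation, and the function names are hypothetical.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues formula: rotation vector -> rotation matrix."""
    theta = np.linalg.norm(phi)
    K = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    if theta < 1e-8:
        return np.eye(3) + K  # first-order approximation for tiny angles
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))

def preintegrate(gyro, accel, dt, bias_g=None, bias_a=None):
    """Accumulate (dR, dv, dp) in the frame of the first sample.

    gyro, accel: (M, 3) raw measurements; dt: sample period in seconds.
    Gravity is not removed here; it enters later, in the residual that
    relates these increments to the keyframe states.
    """
    bias_g = np.zeros(3) if bias_g is None else np.asarray(bias_g, dtype=float)
    bias_a = np.zeros(3) if bias_a is None else np.asarray(bias_a, dtype=float)
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(np.asarray(gyro, dtype=float), np.asarray(accel, dtype=float)):
        a_rot = dR @ (a - bias_a)                  # acceleration in the start frame
        dp = dp + dv * dt + 0.5 * a_rot * dt**2
        dv = dv + a_rot * dt
        dR = dR @ so3_exp((w - bias_g) * dt)
    return dR, dv, dp

# One second of constant 1 m/s^2 acceleration along x, no rotation, 100 Hz.
dR, dv, dp = preintegrate(np.zeros((100, 3)), np.tile([1.0, 0.0, 0.0], (100, 1)), dt=0.01)
print(dv, dp)  # approximately [1, 0, 0] and [0.5, 0, 0]
```

The increments depend only on the IMU samples and the bias estimate, so they can be reused across optimizer iterations instead of re-integrating raw measurements each time the keyframe poses change.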

6. Benchmark Results and Implementation Characteristics

OpenGS-SLAM systems have been extensively benchmarked:

Geometric Accuracy: Replica (ATE 0.16 cm (Yang et al., 3 Mar 2025), 0.31 cm (Yugay et al., 2023)), TUM RGB-D (ATE 2.15–2.9 cm (Yang et al., 3 Mar 2025, Yugay et al., 2023, Zhu et al., 2 Dec 2025)), Waymo (ATE 0.839 m (Yu et al., 21 Feb 2025)).
Photorealistic Rendering: PSNR up to 39.5 dB (Replica), LPIPS ≈ 0.034, rendering speed >30 FPS (photo-realistic) and >2000 FPS (color-only) (Yugay et al., 2023, Yang et al., 3 Mar 2025).
Semantic Segmentation: Open-set mIoU up to 62% (SAM-1.0), closed-set up to 94%.
Efficiency and Storage: 165 FPS semantic rendering at 302 MB (OpenGS-SLAM, Replica), 2× lower storage cost than prior methods.
Open Source Support: CUDA/PyTorch-based implementations, CMake/Docker packaging, ROS integration, and support for dataset plugins and development via universal interfaces (Zhao et al., 2019, Yugay et al., 2023).
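For readers unfamiliar with the two headline metrics above, the following sketch shows one common way to compute them. It assumes ATE RMSE is evaluated after a rigid (rotation plus translation) Kabsch alignment of time-matched estimated and ground-truth positions; monocular evaluations often also align scale, which this sketch does not do. The function names are hypothetical, and the exact protocol of each cited paper may differ.

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) after rigid alignment of est onto gt.

    est, gt: (N, 3) time-matched camera positions.
    """
    est, gt = np.asarray(est, dtype=float), np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard so that R is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    aligned = (est - mu_e) @ R.T + mu_g
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((np.asarray(img, dtype=float) - np.asarray(ref, dtype=float)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val**2 / mse))

# A trajectory that differs from ground truth only by a rigid motion has zero ATE.
gt = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]], dtype=float)
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
est = gt @ Rz.T + np.array([2.0, -1.0, 0.5])
print(ate_rmse(est, gt))                                  # close to 0

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
print(psnr(np.zeros((4, 4, 3)), np.full((4, 4, 3), 0.1)))
```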

7. Limitations and Prospects

OpenGS-SLAM systems, as described in current literature, assume static scenes and rely on the capabilities of off-the-shelf 2D models for semantic processing. They do not natively address dynamic scene elements, and the precision of semantic mapping is bounded by foundational model performance. Key future directions include per-object motion segmentation for dynamics, deeper integration of learned feature embeddings, and creation of large-scale, open-set RGB-D datasets incorporating rare categories for robust generalization (Yang et al., 3 Mar 2025, Yoo et al., 9 Dec 2025).

In sum, OpenGS-SLAM formalizes a paradigm in SLAM that combines the efficiency, scalability, and photorealism of 3D Gaussian Splatting with extensibility to semantic, visual-inertial, and open-set contexts. It sets new standards on geometric, appearance, and semantic SLAM benchmarks while remaining accessible to the research community via open, modular frameworks.