
VIGS-SLAM: 3D Gaussian Splatting SLAM

Updated 8 December 2025
  • VIGS-SLAM's main contribution is the fusion of visual and inertial sensor data with 3D Gaussian splatting for near-real-time, high-fidelity scene reconstruction and precise camera tracking.
  • VIGS-SLAM employs both loose and tight coupling methods, leveraging advanced sensor fusion and nonlinear optimization to handle motion blur, low texture, and rapid movement.
  • The system utilizes efficient GPU-enabled rendering and optimization techniques, outperforming state-of-the-art benchmarks in trajectory accuracy and visual quality.

VIGS-SLAM refers to a class of Simultaneous Localization and Mapping (SLAM) systems that leverage Visual-Inertial sensor fusion and 3D Gaussian Splatting (3DGS) for dense, photorealistic scene reconstruction and robust camera trajectory estimation. VIGS-SLAM tightly or loosely couples visual (typically RGB or RGB-D) and inertial (IMU) data streams within advanced optimization frameworks, aiming for near-real-time performance, high-fidelity mapping, and strong resilience under challenging real-world conditions such as motion blur, low texture, or rapid motion.

1. Foundations and Architectural Principles

VIGS-SLAM systems are constructed around the representation of scene geometry and appearance through a set of 3D Gaussian primitives. Each primitive is parameterized by a center $\mu \in \mathbb{R}^3$, a covariance $\Sigma \in \mathbb{R}^{3 \times 3}$ (typically decomposed as $R \Lambda^2 R^\top$), a color $c \in \mathbb{R}^3$, and an opacity $o \in [0,1]$ (sometimes denoted $\sigma$). These anisotropic Gaussians collectively support differentiable rasterization, enabling fast, realistic rendering and backpropagation-based map refinement, similar to NeRFs but with greatly reduced computational burden and superior efficiency for SLAM scenarios (Zhu et al., 2 Dec 2025).
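
As a minimal sketch of this parameterization (not the papers' exact data layout), the snippet below stores each Gaussian with a quaternion and log-scales, common choices in 3DGS implementations, and rebuilds $\Sigma = R \Lambda^2 R^\top$ on demand; all field names are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gaussian3D:
    """One anisotropic 3D Gaussian primitive (illustrative layout)."""
    mu: np.ndarray         # center, shape (3,)
    quat: np.ndarray       # unit quaternion (w, x, y, z) encoding rotation R
    log_scale: np.ndarray  # log of per-axis standard deviations, shape (3,)
    color: np.ndarray      # RGB in [0, 1], shape (3,)
    opacity: float         # o in [0, 1]

    def rotation(self) -> np.ndarray:
        w, x, y, z = self.quat / np.linalg.norm(self.quat)
        return np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])

    def covariance(self) -> np.ndarray:
        # Sigma = R * Lambda^2 * R^T with Lambda = diag(exp(log_scale))
        R = self.rotation()
        Lam2 = np.diag(np.exp(self.log_scale) ** 2)
        return R @ Lam2 @ R.T
```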

The incoming visual stream (from monocular, stereo, or RGB-D sensors) supplies dense photometric information, while the IMU stream provides high-frequency orientation and acceleration data. Sensor synchronization is assumed to be precise, with hardware-level timestamp alignment and known extrinsic calibration between camera and IMU. Key assumptions include fixed extrinsic parameters and sufficient signal overlap between the visual and inertial domains.

2. Visual-Inertial Sensor Fusion and Tracking

VIGS-SLAM systems fuse visual and inertial modalities using either loose or tight coupling. Loosely coupled systems (e.g., (Pak et al., 23 Jan 2025)) use IMU preintegration to yield a high-rate initial pose prediction that seeds visual–geometric alignment via Generalized Iterative Closest Point (GICP) on depth-derived Gaussian point clouds. This strategy ensures rapid convergence even under rapid motion, as the IMU provides an inertial prior that mitigates visual degeneracy.
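
A hedged sketch of this loose-coupling flow is given below: the IMU-preintegrated increment predicts the next camera pose, which seeds point-cloud alignment. Here `gicp_align` is a hypothetical placeholder for any GICP solver, and `imu_delta` is assumed to already contain the preintegrated rotation and translation increments.

```python
import numpy as np

def predict_pose_from_imu(T_prev: np.ndarray, delta_R: np.ndarray,
                          delta_p: np.ndarray) -> np.ndarray:
    """Compose the last estimated pose with the IMU-preintegrated increment
    to obtain an initial guess for geometric alignment (simplified: gravity
    and velocity contributions are assumed folded into delta_p)."""
    T_pred = np.eye(4)
    T_pred[:3, :3] = T_prev[:3, :3] @ delta_R
    T_pred[:3, 3] = T_prev[:3, :3] @ delta_p + T_prev[:3, 3]
    return T_pred

def track_frame(T_prev, imu_delta, source_points, target_points, gicp_align):
    """Loosely coupled tracking step: IMU prior first, then GICP refinement.
    `gicp_align(src, tgt, init)` is a placeholder for an actual GICP routine."""
    T_init = predict_pose_from_imu(T_prev, imu_delta["delta_R"], imu_delta["delta_p"])
    return gicp_align(source_points, target_points, init=T_init)
```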

Tightly coupled systems (e.g., (Zhu et al., 2 Dec 2025)) embed visual and inertial measurement residuals in a single nonlinear optimization. Here, a joint cost function over camera poses $T_i$, per-pixel disparities $d_i$, velocities $v_i$, and IMU biases $b_i$ is minimized:

$$\min_{\{T_i, d_i, v_i, b_i\}} \sum_{(i,j)\in \mathcal{E}} \rho\left( E_{\text{vis},ij} + E_{\text{iner},ij}\right) + \dots$$

Visual residuals ($E_{\text{vis}}$) comprise dense photometric errors with learned correspondences (often predicted by recurrent ConvGRU modules, as in DROID-SLAM), while inertial residuals ($E_{\text{iner}}$) enforce consistency with IMU-preintegrated rotation, position, velocity, and bias increments. Both are weighted by their respective covariance estimates. This approach yields robust real-time tracking even under challenging lighting or dynamic conditions, and enables accurate metric-scale recovery, time-varying bias handling, and global pose consistency when loop-closure constraints are integrated.
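
The sketch below illustrates how such a robustified, covariance-weighted joint cost might be assembled, assuming a Huber loss; `visual_residual`, `inertial_residual`, and the covariance matrices are placeholders rather than the papers' exact factors.

```python
import numpy as np

def huber(r: np.ndarray, delta: float = 1.345) -> float:
    """Huber robust cost applied to the norm of a residual vector."""
    a = np.linalg.norm(r)
    return 0.5 * a**2 if a <= delta else delta * (a - 0.5 * delta)

def joint_cost(states, edges, visual_residual, inertial_residual,
               Sigma_vis, Sigma_imu):
    """Sum of robustified, covariance-whitened visual and inertial residuals
    over all factor-graph edges (i, j). The residual functions stand in for
    dense photometric/flow terms and IMU preintegration terms."""
    # whitening matrices W with W^T W = Sigma^{-1}
    Wv = np.linalg.cholesky(np.linalg.inv(Sigma_vis)).T
    Wi = np.linalg.cholesky(np.linalg.inv(Sigma_imu)).T
    total = 0.0
    for (i, j) in edges:
        r_vis = Wv @ visual_residual(states[i], states[j])
        r_imu = Wi @ inertial_residual(states[i], states[j])
        total += huber(r_vis) + huber(r_imu)
    return total
```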

3. Mapping and 3D Gaussian Splatting

The mapping module maintains a global set $\mathcal{G}$ of anisotropic Gaussians, with each keyframe seeding new splats from its unprojected depth/disparity observations. Initially, $\mu$ is set to the back-projected camera points; $\Sigma$ is isotropic or determined by local point-cloud geometry (e.g., via $k$-NN); $c$ is taken from the image RGB; and $o$ is initialized to a default value (e.g., $o = 0.5$). Upon insertion, Gaussian parameters are refined via backpropagation on a differentiable rendering loss:

$$L = \lambda_{d} \| \hat{D} - D \|_1 + \lambda_{c} \| \hat{I} - I \|_1 + \lambda_{\text{iso}} \mathcal{L}_{\text{iso}}$$

where $\mathcal{L}_{\text{iso}}$ regularizes covariance elongation. Rendering uses alpha-blending along projected rays:

$$C_p = \sum_{m=1}^{N} c_m \alpha_m \prod_{n < m} (1 - \alpha_n)$$

to produce per-pixel color $C_p$ and opacity $O_p$. Efficient CUDA kernels and hash-indexed data structures ensure map scalability and fast ray-marching even with $10^4$–$10^5$ Gaussians (Zhu et al., 2 Dec 2025).
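
As a small sketch of the compositing formula above for a single pixel, assuming the contributing Gaussians are already depth-sorted and their per-pixel alphas evaluated (the 2D projection and tile-based CUDA rasterization are omitted):

```python
import numpy as np

def composite_pixel(colors: np.ndarray, alphas: np.ndarray):
    """Front-to-back alpha blending for one pixel.
    colors: (N, 3) RGB of depth-sorted Gaussians covering the pixel
    alphas: (N,) their per-pixel alphas after evaluating the projected Gaussian
    Returns (C_p, O_p): blended color and accumulated opacity."""
    C = np.zeros(3)
    transmittance = 1.0  # running prod_{n<m} (1 - alpha_n)
    for c_m, a_m in zip(colors, alphas):
        C += transmittance * a_m * c_m
        transmittance *= (1.0 - a_m)
        if transmittance < 1e-4:   # early termination, as in tile rasterizers
            break
    O_p = 1.0 - transmittance
    return C, O_p
```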

Keyframe selection is performed via optical flow, coverage thresholds, or overlap metrics, ensuring efficient map growth and redundancy suppression.
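
A hedged sketch of such a keyframe gate follows, combining mean optical-flow magnitude with a map-coverage ratio; the specific metric and thresholds are illustrative assumptions, not the published criteria.

```python
import numpy as np

def is_keyframe(flow: np.ndarray, coverage: float,
                flow_thresh: float = 8.0, coverage_thresh: float = 0.85) -> bool:
    """Decide whether the current frame should become a keyframe.
    flow:     (H, W, 2) dense optical flow to the last keyframe (pixels)
    coverage: fraction of pixels already explained by the current Gaussian map
    A frame is promoted when it has moved enough OR sees too much unmapped area."""
    mean_flow = float(np.linalg.norm(flow, axis=-1).mean())
    return mean_flow > flow_thresh or coverage < coverage_thresh
```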

4. Optimization and Loop Closure

Local optimization includes bundle adjustment over a sliding window (typically $\sim$10 keyframes), jointly refining all state variables, including visual-inertial parameters and Gaussian map properties. For globally consistent mapping, VIGS-SLAM employs a global pose graph $\mathcal{E}^+$, augmented with loop edges $\mathcal{E}^*$ detected by optical-flow and orientation criteria.

After pose-graph BA (typically in Sim(3)), map consistency is restored by updating the splats seeded from each keyframe. This is accomplished via a batched rigid (and if necessary, scale) transformation:

$$x_{\text{loc}} = (R_k^-)^\top(\mu_i^- - t_k^-), \quad x'_{\text{loc}} = \delta s_k\, x_{\text{loc}}, \quad \mu_i^+ = R_k^+ x'_{\text{loc}} + t_k^+, \quad \Sigma_i^+ = R_k^+ \left(\delta s_k^2\, (R_k^-)^\top \Sigma_i^- R_k^-\right) (R_k^+)^\top$$

avoiding full re-optimization of the map after loop closure corrections (Zhu et al., 2 Dec 2025). This approach enables real-time, globally consistent mapping and trajectory estimation in large-scale and revisited environments.
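
The batched update above can be written compactly; the sketch below applies it to all Gaussians seeded by one keyframe, with $(R_k^-, t_k^-)$ the pose before correction, $(R_k^+, t_k^+)$ the corrected pose, and $\delta s_k$ the scale factor. Array shapes and names are assumptions for illustration.

```python
import numpy as np

def update_splats_after_loop_closure(mu, Sigma, R_minus, t_minus,
                                     R_plus, t_plus, ds):
    """Re-anchor Gaussians seeded by one keyframe after a Sim(3) pose-graph update.
    mu: (N, 3) centers, Sigma: (N, 3, 3) covariances,
    (R_minus, t_minus): keyframe pose before correction,
    (R_plus, t_plus, ds): pose and scale factor after correction."""
    # express centers in the keyframe's local frame, rescale, re-express in world
    x_loc = (mu - t_minus) @ R_minus             # row-wise R_minus^T (mu_i - t_minus)
    mu_new = (ds * x_loc) @ R_plus.T + t_plus
    # Sigma^+ = R^+ (ds^2 * R^{-T} Sigma^- R^-) R^{+T}, batched over N
    Sigma_loc = np.einsum('ab,nbc,cd->nad', R_minus.T, Sigma, R_minus)
    Sigma_new = (ds ** 2) * np.einsum('ab,nbc,cd->nad', R_plus, Sigma_loc, R_plus.T)
    return mu_new, Sigma_new
```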

5. IMU Preintegration and Bias Handling

IMU preintegration follows a discrete-time propagation model that accumulates linear acceleration $a_t$ and angular velocity $\omega_t$ between keyframes, accounting for known gravity and estimated biases:

$$\begin{aligned} p_{t+1} &= p_t + v_t \Delta t + \tfrac{1}{2}\left(a_t - R_t^\top g - b_a - n_a\right)\Delta t^2 \\ v_{t+1} &= v_t + \left(a_t - R_t^\top g - b_a - n_a\right)\Delta t \\ R_{t+1} &= R_t \exp\!\left[(\omega_t - b_\omega - n_\omega)\,\Delta t\right] \end{aligned}$$

Preintegration outputs are used for initial pose guesses and as residual constraints in joint optimization.
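
A minimal sketch that mirrors the propagation equations above is shown below, with noise terms dropped; a production VIO stack would additionally track preintegration Jacobians for bias correction and handle frame conventions more carefully. All names are illustrative.

```python
import numpy as np

def so3_exp(phi: np.ndarray) -> np.ndarray:
    """Rodrigues' formula: exponential map from a rotation vector to SO(3)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3)
    k = phi / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def propagate_imu(p, v, R, accels, gyros, dts, g, b_a, b_w):
    """Discrete-time IMU state propagation between two keyframes,
    mirroring the update equations above with noise terms omitted."""
    for a_t, w_t, dt in zip(accels, gyros, dts):
        a_corr = a_t - R.T @ g - b_a          # bias- and gravity-corrected acceleration
        p = p + v * dt + 0.5 * a_corr * dt**2
        v = v + a_corr * dt
        R = R @ so3_exp((w_t - b_w) * dt)
    return p, v, R
```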

Recent advances incorporate time-varying bias modeling, per-keyframe bias variables $b_i$, and penalty terms $r_{\text{bias}} = b_j - b_i$ to enforce slow drift and regularity (Zhu et al., 2 Dec 2025). A robust three-stage initialization (pure vision, inertial-only, then joint optimization) guarantees metric-scale recovery and stable VIO state estimation.

6. Experimental Evaluation and Performance

Comprehensive results across indoor/outdoor and low-/high-texture datasets demonstrate that VIGS-SLAM consistently achieves or surpasses state-of-the-art benchmarks for both trajectory and rendering quality:

| Dataset | VIGS-SLAM ATE RMSE (cm) | Reference Method(s) |
|---|---|---|
| EuRoC | 2.79 | HI-SLAM2: 2.94; DROID: 18.60 |
| RPNG | 1.68 | VINS-Mono: 3.46; ORB-SLAM3: 4.30 |
| UTMM | 3.24 | HI-SLAM2: 8.10; DROID: 7.91 |
| FAST-LIVO2 | 6.08 | HI-SLAM2: 10.22 |

Rendering quality metrics (averaged PSNR/SSIM/LPIPS) consistently exceed previous 3DGS SLAM variants:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Splat-SLAM | 17.32 | 0.543 | 0.465 |
| HI-SLAM2 | 21.12 | 0.685 | 0.358 |
| VIGS-SLAM | 22.21 | 0.723 | 0.314 |

VIGS-SLAM demonstrates reliable tracking in the presence of blur, low texture, and rapid exposure changes. It maintains real-time throughput on high-end GPUs ($\sim$15 fps tracking, $\sim$7 fps mapping for $480 \times 480$ images and $N \sim 10{,}000$ Gaussians) (Zhu et al., 2 Dec 2025).

7. Limitations and Prospective Directions

Current VIGS-SLAM frameworks support only monocular pinhole cameras; extension to stereo, RGB-D, or fisheye inputs is a logical direction. While mapping kernels and real-time tracking rely on GPU acceleration, not all IMU routines are ported to CUDA, leaving some overhead. Large-scale runs (especially with loosely coupled schemes) can accumulate slow drift in the absence of frequent keyframes or global loop closure. Dynamic objects may still introduce mapping and tracking outliers, despite GICP/IMU initialization. Ongoing research aims to:

  • Tightly integrate visual, geometric (ICP), and inertial residuals in unified optimization (Pak et al., 23 Jan 2025).
  • Incorporate place recognition-based loop-closing and global pose graph optimization.
  • Extend beyond room- or hall-scale to outdoor, multi-agent, or LIDAR-enhanced settings.
  • Implement efficient, GPU-native IMU preintegration and bundle adjustment.
  • Explore learning-based splat pruning or saliency-driven resource allocation.

The empirical superiority of VIGS-SLAM in robustness, accuracy, and fidelity, across multiple datasets and system configurations, indicates its role as a state-of-the-art backbone for visual-inertial dense SLAM in complex and variable environments (Zhu et al., 2 Dec 2025, Pak et al., 23 Jan 2025).
