- The paper introduces an EM-based joint optimization framework coupling DSO with 2D Gaussian splatting to achieve real-time, accurate dense mapping.
- It presents a novel, gradient-driven Gaussian splat initialization that reduces convergence time and enhances geometric consistency.
- Experimental results show state-of-the-art photometric and geometric performance at 30 FPS with robust tracking under challenging conditions.
GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry
Introduction and Context
Dense monocular SLAM remains a critical challenge for robotics and AR/VR because it must balance fast, robust pose tracking with high-fidelity, photorealistic reconstruction. Conventional explicit representations such as surfels and signed distance fields (SDFs) enable real-time operation but lack flexible, high-quality rendering. Recent systems based on implicit neural representations (INRs), such as NeRF-SLAM and iMAP, achieve visually impressive results, but their reliance on neural rendering significantly limits real-time feasibility. In contrast, Gaussian Splatting (GS) provides a GPU-accelerated, explicit scene representation with favorable rendering and optimization characteristics, which has led to a recent trend of GS-based SLAM pipelines. However, existing frameworks either tightly couple tracking and mapping, which incurs high redundancy and computational burden, or decouple them, which sacrifices joint geometric consistency and optimality.
GSO-SLAM introduces a bidirectionally coupled SLAM framework integrating Direct Sparse Odometry (DSO) with 2D Gaussian Splatting-based dense mapping. The system leverages an EM-based joint optimization paradigm along with a novel, gradient-driven Gaussian Splat Initialization, enabling simultaneous refinement of poses, depths, and scene geometry while maintaining real-time constraints.
Figure 1: Comparison of reconstructed dense 3D scene with depth and RGB renderings, demonstrating the system’s geometric accuracy and photometric fidelity.
System Architecture and Methodology
Bidirectional Coupling with EM-based Joint Optimization
The core contribution is the formulation of pose, depth, and scene optimization as a coupled EM process. In the E-step, the 2D Gaussian parameters (means, covariances, opacities, colors) are updated to minimize photometric and geometric losses, using the current depth and pose estimates from DSO. Specifically, images and depths rendered from GS are compared against the observed images and the DSO semi-dense maps, enforcing both RGB consistency and scene regularity (normal constraints). The depth and normal consistency losses tie the map to the geometry estimated by the odometry, so that poses and geometry are refined toward a mutually consistent solution.
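As a rough illustration of the E-step objective described above, the following NumPy sketch combines a photometric term, a depth term evaluated only where semi-dense depth is available, and a normal-consistency term. The function name `mapping_loss`, the L1 and cosine forms of the terms, and the weights `w_depth` and `w_normal` are assumptions made for illustration; this summary does not state the paper's exact loss.

```python
import numpy as np


def mapping_loss(rendered_rgb, observed_rgb,
                 rendered_depth, semi_dense_depth, depth_mask,
                 rendered_normals, depth_normals,
                 w_depth=0.5, w_normal=0.05):
    """Toy combined mapping loss (weights are illustrative assumptions)."""
    # Photometric term: mean absolute RGB difference over all pixels.
    photometric = np.abs(rendered_rgb - observed_rgb).mean()

    # Depth term: only pixels where the odometry provides a depth estimate.
    n_valid = max(int(depth_mask.sum()), 1)
    depth = (np.abs(rendered_depth - semi_dense_depth) * depth_mask).sum() / n_valid

    # Normal term: 1 - cosine similarity between unit normals.
    cosine = (rendered_normals * depth_normals).sum(axis=-1)
    normal = (1.0 - cosine).mean()

    return float(photometric + w_depth * depth + w_normal * normal)


if __name__ == "__main__":
    rgb = np.zeros((4, 4, 3))
    depth = np.ones((4, 4))
    mask = np.zeros((4, 4))
    mask[0, 0] = 1.0                      # a single semi-dense depth sample
    normals = np.zeros((4, 4, 3))
    normals[..., 2] = 1.0                 # all normals face the camera
    print(mapping_loss(rgb + 0.1, rgb, depth, depth, mask, normals, normals))
```

In an actual system this scalar would be differentiated with respect to the Gaussian parameters through the rasterizer; the sketch only shows how the three terms could be combined.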
In the M-step, camera poses and keyframe depths are updated within a DSO-style optimization energy, using the scene predictions rendered from GS as priors. Importantly, redundant computation is avoided: the depths act as pseudo-observations, which accelerates optimization compared with classical coupled or decoupled setups.
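The alternation between a map update and a pose update can be illustrated on a deliberately tiny problem. The sketch below is not the paper's optimizer: it assumes a 1-D world in which each observation is `point - pose`, and both steps have closed-form least-squares solutions. The names `alternate` and `energy` are hypothetical. It only shows the general pattern of holding one block of variables fixed while refining the other, with the joint energy never increasing.

```python
import numpy as np


def energy(z, poses, points):
    """Joint squared error; z[i, j] observes point j from pose i as point - pose."""
    residual = z - (points[None, :] - poses[:, None])
    return float((residual ** 2).sum())


def alternate(z, n_iters=200):
    """Alternate closed-form map and pose updates on the toy 1-D problem."""
    n_poses, n_points = z.shape
    poses = np.zeros(n_poses)
    points = np.zeros(n_points)
    history = []
    for _ in range(n_iters):
        # Map update (the "E-step" in the text): poses are held fixed.
        points = (z + poses[:, None]).mean(axis=0)
        # Pose update (the "M-step" in the text): the map is held fixed.
        poses = (points[None, :] - z).mean(axis=1)
        poses[0] = 0.0  # fix the gauge: the first pose defines the origin
        history.append(energy(z, poses, points))
    return poses, points, history


if __name__ == "__main__":
    true_poses = np.array([0.0, 1.0, 2.5, -0.5, 3.0])
    true_points = np.array([4.0, 7.0, 5.5])
    z = true_points[None, :] - true_poses[:, None]
    poses, points, history = alternate(z)
    print(np.round(poses, 6), np.round(points, 6), history[-1])
```

On noise-free data the estimates converge to the true poses and points; the per-iteration energy in `history` makes the monotone decrease easy to inspect.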
Figure 2: Overview of the GSO-SLAM system—tracking, keyframe management, Gaussian Splat initialization, and joint EM-based optimization.
Novel Gaussian Splat Initialization
A persistent difficulty in GS-based systems is the initialization of Gaussians: naive approaches (KNN-based or constant spread) lead to slow convergence, suboptimal regularization, and geometric inaccuracies. GSO-SLAM instead initializes Gaussian parameters from the image gradients and depth/geometric cues already estimated in DSO. For each scene point, the covariance in the image plane is inferred from the intensity/gradient distribution across multiple views and then lifted to a 3D covariance by a least-squares fit. Corrections ensure that the covariances are positive definite and well conditioned. The final parameters set the means and anisotropic shapes of the Gaussians, regularized using eigenbasis scaling.
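The lifting and correction steps can be sketched as follows, under stated assumptions. The sketch takes the per-view 2D covariances as given and assumes the usual linearized splatting relation `cov_2d = J @ cov_3d @ J.T`, where `J` is a 2x3 projection Jacobian for that view. Since this relation is linear in the six unique entries of the 3D covariance, stacking it over views gives an ordinary least-squares problem. The function names, the eigenvalue floor `min_eig`, and the condition-number cap `max_condition` are hypothetical choices, not values from the paper.

```python
import numpy as np

# Index pairs of the six unique entries of a symmetric 3x3 matrix.
_PAIRS = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (2, 2)]


def lift_covariance(jacobians, covs_2d):
    """Least-squares 3D covariance from per-view 2D covariances.

    Assumes cov_2d = J @ cov_3d @ J.T for each view (at least 3 generic views).
    """
    rows, rhs = [], []
    for J, C in zip(jacobians, covs_2d):
        for a, b in [(0, 0), (0, 1), (1, 1)]:   # unique entries of the 2x2 matrix
            coeffs = []
            for p, q in _PAIRS:
                c = J[a, p] * J[b, q]
                if p != q:                      # off-diagonal entries appear twice
                    c += J[a, q] * J[b, p]
                coeffs.append(c)
            rows.append(coeffs)
            rhs.append(C[a, b])
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    cov = np.zeros((3, 3))
    for value, (p, q) in zip(s, _PAIRS):
        cov[p, q] = cov[q, p] = value
    return cov


def to_splat(cov, min_eig=1e-6, max_condition=1e4):
    """Clamp eigenvalues so the covariance is positive definite and well conditioned.

    Returns per-axis scales (standard deviations) and the eigenvector basis.
    """
    cov = 0.5 * (cov + cov.T)                   # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(cov)
    floor = max(min_eig, eigvals.max() / max_condition)
    eigvals = np.maximum(eigvals, floor)
    return np.sqrt(eigvals), eigvecs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3))
    cov_true = A @ A.T + 0.1 * np.eye(3)
    Js = rng.normal(size=(4, 2, 3))             # stand-ins for projection Jacobians
    covs_2d = [J @ cov_true @ J.T for J in Js]
    scales, basis = to_splat(lift_covariance(Js, covs_2d))
    print(scales)
```

Two views alone leave the fit rank-deficient, because the two image planes share one direction, so the sketch assumes at least three views with distinct viewing directions.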
Figure 3: Three-stage Gaussian Splat Initialization: extracting 2D covariance from gradients, aggregating to 3D covariance, and parameterizing initial splats by eigen-decomposition.
Parallelization and Runtime Considerations
Tracking (DSO) and mapping (GS optimization) run in parallel threads. When a new keyframe is inserted, the EM framework triggers joint optimization, which incorporates the new keyframe's semi-dense map for a rapid, accurate geometry update. The GS rendering module is accelerated by GPU rasterization, and windowed bundle adjustment (BA) keeps optimization cost bounded regardless of map scale. The result is a low-latency pipeline (30 FPS) that is robust to scene complexity and size.
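The threading pattern described above can be sketched with the Python standard library. This is a structural illustration only: the tracking and optimization bodies are placeholders, and the keyframe rule (`keyframe_every`), the window size, and the function name `run_pipeline` are assumptions rather than details from the paper. The bounded `deque` stands in for the sliding window that keeps optimization cost independent of map size.

```python
import queue
import threading
from collections import deque


def run_pipeline(n_frames, keyframe_every=5, window_size=3):
    """Run a toy tracker and mapper in parallel; return the window per optimization."""
    keyframes = queue.Queue()            # hand-off from tracking to mapping
    window = deque(maxlen=window_size)   # bounded active window of keyframes
    optimized = []                       # window contents at each optimization

    def tracker():
        for frame_id in range(n_frames):
            # Placeholder for direct image alignment of this frame.
            if frame_id % keyframe_every == 0:
                keyframes.put(frame_id)  # new keyframe triggers mapping
        keyframes.put(None)              # sentinel: no more frames

    def mapper():
        while True:
            keyframe = keyframes.get()
            if keyframe is None:
                break
            window.append(keyframe)      # oldest keyframe drops out when full
            # Placeholder for joint optimization over the active window.
            optimized.append(list(window))

    threads = [threading.Thread(target=tracker), threading.Thread(target=mapper)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return optimized


if __name__ == "__main__":
    for active in run_pipeline(20):
        print(active)
```

With a single producer and a single consumer on a FIFO queue, the keyframes are processed in insertion order, so the recorded windows are deterministic even though the two threads run concurrently.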
Experimental Evaluation
Geometric and Photometric Fidelity
GSO-SLAM achieves state-of-the-art geometric and photometric results among monocular systems, frequently outperforming even RGB-D SLAM baselines. On the Replica dataset, the system yields the highest PSNR (34.48 dB) and SSIM (0.943) and the lowest LPIPS (0.060) and L1 depth error (8.1 cm) among monocular approaches. Real-time performance (30 FPS) is maintained, up to 36× faster than competing coupled approaches (MonoGS). Results on TUM-RGBD reflect strong resilience to scene noise and challenging motion, despite the fundamental noise susceptibility of direct methods.
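For readers unfamiliar with the metrics, PSNR and the L1 depth error can be computed as in the sketch below. It uses the standard definitions and assumes images scaled to [0, 1] and a ground-truth depth map in which non-positive values mark invalid pixels; the paper's exact evaluation protocol (masking, any scale alignment of monocular depth) is not given in this summary.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images with the given peak value."""
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


def depth_l1(rendered_depth, gt_depth):
    """Mean absolute depth error over pixels with valid (positive) ground truth."""
    valid = gt_depth > 0
    return float(np.abs(rendered_depth[valid] - gt_depth[valid]).mean())


if __name__ == "__main__":
    reference = np.zeros((2, 2, 3))
    rendered = np.full((2, 2, 3), 0.1)
    print(psnr(rendered, reference))

    gt = np.array([1.0, 2.0, 0.0])       # last pixel has no ground truth
    pred = np.array([1.1, 1.8, 5.0])
    print(depth_l1(pred, gt))
```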
Figure 4: Rendering results on the Replica dataset, showing superior geometric accuracy and photometric fidelity over competing SLAM approaches.
Robustness and Scalability
On the INS long-corridor suite, GSO-SLAM maintains low trajectory drift (ATE RMSE as low as 0.47 m) without relying on global loop closure, while alternatives with separate mapping and tracking accumulate significant error and often lose tracking under out-of-distribution motion. The design prevents error propagation between tracking and mapping: errors introduced by challenging frames (e.g., motion blur, textureless regions) stay localized due to the EM update rules and splat regularization.
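ATE RMSE is commonly computed after aligning the estimated trajectory to ground truth. Because monocular trajectories have no absolute scale, the sketch below assumes a similarity alignment (rotation, translation, and scale) using the closed-form Umeyama solution; whether the paper uses exactly this alignment is an assumption, and `ate_rmse` is a hypothetical helper name.

```python
import numpy as np


def ate_rmse(estimated, ground_truth, with_scale=True):
    """RMSE of position error after Umeyama alignment of estimated to ground truth.

    Both inputs are (N, 3) arrays of time-associated camera positions.
    """
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g

    # Cross-covariance between ground truth and estimate.
    cov = g.T @ e / len(est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                   # avoid returning a reflection
    R = U @ S @ Vt

    if with_scale:
        variance = (e ** 2).sum(axis=1).mean()
        scale = np.trace(np.diag(D) @ S) / variance
    else:
        scale = 1.0
    t = mu_g - scale * (R @ mu_e)

    aligned = scale * (est @ R.T) + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    gt = rng.normal(size=(10, 3))
    angle = 0.3
    Rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
    est = 2.0 * gt @ Rz.T + np.array([1.0, 2.0, 3.0])  # same path, other frame
    print(ate_rmse(est, gt))
```

A trajectory that differs from ground truth only by a similarity transform yields an error of numerically zero, which is a convenient sanity check before evaluating real runs.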
Figure 5: RGB renders from scenes captured by a quadrupedal robot, illustrating sharp, photorealistic reconstruction even with strong motion perturbations.
Ablation Studies
Ablation results consistently underline the critical role of both the EM-based joint optimization and the proposed Gaussian Splat Initialization. Joint optimization improves both mapping and tracking accuracy (PSNR, ATE RMSE, L1 error), and proper initialization drastically reduces convergence time: KNN or constant initialization typically requires thousands of additional iterations to reach GSO-SLAM’s final quality.
Analysis and Implications
GSO-SLAM empirically validates that explicit, differentiable scene parameterizations such as 2DGS, if properly initialized and tightly integrated with visual odometry, can outperform both decoupled explicit and state-of-the-art INR-based approaches in terms of runtime, geometric fidelity, and render quality. The bidirectional EM-driven coupling avoids the inconsistency that arises when tracking and mapping are optimized separately, and it leverages all observations for optimization. The discrete-to-Gaussian initialization efficiently bridges sparse and dense mapping, a paradigm likely extensible to other splatting/point-based representations.
In principle, the EM framework for joint optimization can be generalized to further hybrid representations, e.g., neural-inpainted splats, or to scalable multi-agent SLAM architectures. The initialization procedure suggests that information already available in image gradients and pixel associations can drive unsupervised, robust mapping, which is a possible foundation for active perception and self-supervised learning in embodied agents.
Limitations and Future Directions
The system, as with most direct SLAM and explicit GS pipelines, degrades under strong photometric inconsistency (e.g., heavy motion blur) and struggles in extensively textureless environments, where depth estimation and subsequent splat initialization produce incomplete geometry. Future improvements could include joint inertial integration, adaptive regularization, or global priors to bridge performance gaps under these conditions.
Conclusion
GSO-SLAM introduces a new standard for real-time, dense monocular SLAM: a bidirectionally coupled, EM-optimized framework pairing Direct Sparse Odometry with explicit, differentiable 2D Gaussian Splatting. By fusing direct image alignment, depth prediction, and efficient parametric density representations, the system simultaneously achieves real-time performance, superior geometric/photometric accuracy, and scalability. The framework’s modularity suggests broad applicability to SLAM variants (multi-agent, RGB-D, VI-SLAM) and future extensions incorporating neural priors or learning-based modules.