- The paper introduces a novel recurrent feed-forward SLAM method that predicts camera poses and dense 2D Gaussian surfel attributes in a single pass.
- It leverages a hidden state memory for efficient loop closure and Sim(3) optimization, significantly reducing drift and computational overhead.
- Empirical evaluations show state-of-the-art tracking accuracy and rendering quality on diverse datasets while achieving real-time performance.
Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM
Introduction
Flash-Mono presents a monocular SLAM system that leverages a recurrent feed-forward paradigm to address the inherent efficiency, consistency, and geometric fidelity challenges in Gaussian Splatting-based monocular SLAM. The core contributions include a recurrent network for incremental pose and Gaussian attribute prediction, a hidden state memory for efficient loop closure and Sim(3) optimization, and the adoption of 2D Gaussian surfels as mapping primitives. The methodology departs from the canonical “train-from-scratch” approach, enabling real-time SLAM with significant computational and accuracy advantages.
System Architecture
Recurrent Feed-Forward Frontend
Flash-Mono uses a recurrent transformer-based frontend model. For each incoming monocular RGB frame, visual features are extracted via a ViT encoder and fused with a persistent hidden state. The joint architecture predicts the absolute camera pose and dense, per-pixel 2D Gaussian attributes in a single forward pass, with the hidden state aggregating multi-frame geometry and appearance information. This architectural choice circumvents the iterative optimization required in prior GS-SLAM methods, enabling high frame rates and improved multi-view consistency.
The model is trained end-to-end with multi-task objectives on pose, geometry, and rendering, using ground truth RGB, depth, and camera poses for supervision. The loss combines Euclidean pose loss, surfel geometry loss, and differentiable rendering losses (including MSE, LPIPS, and depth error).
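As a sketch, the combined objective described above can be written as a weighted sum. The weights \lambda and the term names are illustrative assumptions; the summary does not give the exact weighting or the precise form of each term.

```latex
\mathcal{L}
  = \lambda_{\text{pose}}\,\mathcal{L}_{\text{pose}}
  + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}}
  + \lambda_{\text{render}}\,\mathcal{L}_{\text{render}},
\qquad
\mathcal{L}_{\text{render}}
  = \mathcal{L}_{\text{MSE}}
  + \lambda_{\text{lpips}}\,\mathcal{L}_{\text{LPIPS}}
  + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
```

Here \mathcal{L}_{\text{pose}} stands for the Euclidean pose loss, \mathcal{L}_{\text{geo}} for the surfel geometry loss, and \mathcal{L}_{\text{render}} for the differentiable rendering terms.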
Loop Closure with Hidden State
To mitigate cumulative pose and scale drift, the hidden state mechanism is leveraged during loop closure. Each submap’s hidden state, representing the local context, is cached. Upon loop detection, the system performs a single feed-forward relocalization conditioned on the historical hidden state, obtaining a direct Sim(3) constraint between non-consecutive submaps. This enables robust global pose graph optimization based on Sim(3) constraints, effectively addressing accumulated trajectory errors.
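To make the notion of a Sim(3) constraint between two submaps concrete, here is a minimal sketch of the underlying algebra in plain Python. It does not reproduce the feed-forward relocalization itself, which in Flash-Mono produces the constraint directly. The class `Sim3`, the helper `relative_constraint`, and the example values are all hypothetical.

```python
import math
from dataclasses import dataclass


def mat_vec(R, v):
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))


def mat_mul(A, B):
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transpose(R):
    return tuple(tuple(R[j][i] for j in range(3)) for i in range(3))


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


@dataclass(frozen=True)
class Sim3:
    """Similarity transform x -> s * R @ x + t (hypothetical helper type)."""
    s: float   # scale
    R: tuple   # 3x3 rotation, as nested tuples
    t: tuple   # translation

    def apply(self, x):
        rx = mat_vec(self.R, x)
        return tuple(self.s * rx[i] + self.t[i] for i in range(3))

    def compose(self, other):
        """Return the transform that applies `other` first, then `self`."""
        rt = mat_vec(self.R, other.t)
        return Sim3(
            self.s * other.s,
            mat_mul(self.R, other.R),
            tuple(self.s * rt[i] + self.t[i] for i in range(3)),
        )

    def inverse(self):
        Rt = transpose(self.R)
        rt = mat_vec(Rt, self.t)
        return Sim3(1.0 / self.s, Rt, tuple(-rt[i] / self.s for i in range(3)))


def relative_constraint(T_world_a, T_world_b):
    """Sim(3) edge mapping points from submap b's frame into submap a's frame."""
    return T_world_a.inverse().compose(T_world_b)


# Two submap frames with different scale, heading, and offset (made-up values).
T_world_a = Sim3(0.5, rot_z(0.3), (1.0, 2.0, 3.0))
T_world_b = Sim3(2.0, rot_z(-1.0), (0.0, 1.0, 0.0))
T_a_b = relative_constraint(T_world_a, T_world_b)
```

The scale component of `T_a_b` is what distinguishes a Sim(3) edge from an SE(3) edge: it lets pose graph optimization correct scale drift between submaps as well as rotation and translation.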
Backend Mapping and Optimization
Predicted 2D Gaussian surfel maps are incrementally voxelized and fused into a global scene map. The predict-and-refine strategy minimizes backend workload: only a lightweight local Gaussian optimization is performed after each fusion step, in contrast to the exhaustive per-frame optimization of its predecessors. Loop closure corrections are applied as direct rigid transformations, efficiently aligning the global map to the updated pose graph while avoiding expensive global map re-optimization.
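The voxelized fusion step can be illustrated with a simple hash-grid sketch. The merge rule used here (a running mean of positions per voxel) is an assumption chosen for brevity and is not necessarily what Flash-Mono does. The names `voxel_key` and `fuse` are hypothetical, and only surfel centers are fused.

```python
import math


def voxel_key(point, voxel_size):
    """Integer grid index of the voxel containing `point`."""
    return tuple(math.floor(c / voxel_size) for c in point)


def fuse(global_map, points, voxel_size):
    """Merge `points` into `global_map` in place.

    `global_map` maps voxel index -> (mean position, sample count).
    Returns the number of voxels that were newly occupied.
    """
    newly_occupied = 0
    for p in points:
        key = voxel_key(p, voxel_size)
        if key in global_map:
            mean, n = global_map[key]
            # Assumed merge rule: running average of the positions in this voxel.
            merged = tuple((m * n + c) / (n + 1) for m, c in zip(mean, p))
            global_map[key] = (merged, n + 1)
        else:
            global_map[key] = (tuple(p), 1)
            newly_occupied += 1
    return newly_occupied


# Fuse two small batches of surfel centers into one map (made-up values).
scene = {}
added_first = fuse(scene, [(0.01, 0.01, 0.01), (0.03, 0.03, 0.03), (0.25, 0.0, 0.0)], 0.1)
added_second = fuse(scene, [(0.02, 0.02, 0.02), (-0.01, 0.0, 0.0)], 0.1)
```

Because each voxel holds at most one primitive, the map size is bounded by the number of occupied voxels rather than by the number of frames, which is the property that keeps the global map compact.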
2D Gaussian Splatting Representation
Recognizing the geometric limitations of pure 3DGS (noise, floaters, poor surfacing), Flash-Mono adopts 2D Gaussian surfels as mapping primitives. Each surfel encodes position, rotation, scale, color, and opacity in image space. This representation imposes a stronger surface prior, suppressing floating artifacts and delivering improved geometric accuracy, especially critical for drift-prone monocular systems. The adaptive voxelization process further enforces map compactness without significant loss in rendering fidelity.
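A surfel with the attributes listed above could be held in a record like the following. This is a hypothetical layout, not the paper's data structure: the field names, the (w, x, y, z) quaternion convention, the two-component scale, and the storage of the center as a 3D point are all assumptions. The surface normal is taken as the third column of the rotation matrix, as is usual for planar disk primitives.

```python
import math
from dataclasses import dataclass


@dataclass
class Surfel2D:
    """Hypothetical container for one 2D Gaussian surfel."""
    position: tuple   # (x, y, z) center of the disk
    rotation: tuple   # quaternion (w, x, y, z); normalized on use
    scale: tuple      # (s_u, s_v) extents along the two tangent axes
    color: tuple      # (r, g, b), assumed to lie in [0, 1]
    opacity: float    # assumed to lie in [0, 1]

    def normal(self):
        """Disk normal: the local z axis rotated by the quaternion."""
        w, x, y, z = self.rotation
        norm = math.sqrt(w * w + x * x + y * y + z * z)
        w, x, y, z = w / norm, x / norm, y / norm, z / norm
        # Third column of the rotation matrix built from (w, x, y, z).
        return (2 * (x * z + w * y), 2 * (y * z - w * x), 1 - 2 * (x * x + y * y))


# An unrotated surfel, and one rotated by 90 degrees about the x axis.
flat = Surfel2D((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), (0.1, 0.1), (1.0, 1.0, 1.0), 0.9)
half = math.sqrt(0.5)
tilted = Surfel2D((0.0, 0.0, 1.0), (half, half, 0.0, 0.0), (0.1, 0.05), (0.2, 0.4, 0.6), 0.8)
```

Having only two scale components is what makes the primitive flat: it has no thickness along its normal, which is the source of the stronger surface prior compared with a full 3D Gaussian.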
Empirical Evaluation
Flash-Mono is evaluated on challenging large-scale indoor (ScanNet, BundleFusion) and outdoor (KITTI) datasets, benchmarking against state-of-the-art GS-SLAM pipelines (MonoGS, S3PO-GS, DepthGS) and strong visual SLAM systems (ORB-SLAM3, DROID-SLAM, MASt3R-SLAM).
Tracking Accuracy
Flash-Mono achieves state-of-the-art ATE RMSE. It reports values of 11.69 cm on ScanNet and 7.34 cm on BundleFusion, outperforming MASt3R-SLAM and all GS baselines and substantially reducing drift (Section 5.2). On KITTI, it tracks robustly despite challenging outdoor dynamics and outperforms S3PO-GS by a large margin (e.g., 12.85 m vs 32.49 m ATE RMSE on KITTI sequences).
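For reference, ATE RMSE is the root-mean-square of the per-frame translation errors between the estimated and ground-truth trajectories. The sketch below assumes the two trajectories are already time-associated and aligned. For monocular systems that alignment is usually a similarity fit, which is omitted here. The function name `ate_rmse` and the sample values are illustrative.

```python
import math


def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two aligned, equal-length trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Two-frame toy trajectory with a 0.1 m lateral error on the second frame.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
error_m = ate_rmse(est, gt)
```

Because the errors are squared before averaging, a few badly drifted frames dominate the score, which is why loop closure has such a visible effect on this metric.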
Mapping and Rendering Quality
Rendering quality matches or exceeds prior GS methods while cutting optimization to 20 iterations per keyframe, versus 250 in MonoGS/S3PO-GS (a 12.5x reduction). On key metrics, Flash-Mono sets new benchmarks: up to 21.73 dB PSNR and 0.80 SSIM on ScanNet, with LPIPS scores consistently lower than all baselines. Depth L1 errors are the lowest among all compared methods (0.34/0.21 m on ScanNet/BundleFusion), confirming the underlying geometric fidelity. The map compactness (number of Gaussian primitives) is competitive, balancing efficiency and accuracy.
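As a reminder of what the PSNR figures measure, the sketch below computes PSNR from the mean squared error between a rendered and a reference image. Images are represented as flat lists of intensities in [0, 1], which is an assumption, and the function name `psnr` is illustrative.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on every pixel gives an MSE of 0.01.
value_db = psnr([0.5, 0.5], [0.6, 0.4])
```

Under the same [0, 1] normalization, a PSNR of 21.73 dB corresponds to a mean squared error of about 0.0067.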
Computational Efficiency
Flash-Mono runs in real time at over 12 FPS, compared to approximately 1 FPS for prior GS-SLAM approaches. High-quality feed-forward prediction keeps backend optimization to a minimum. Additional acceleration techniques (FP16 attention, CUDA Graphs) make deployment on resource-constrained hardware feasible.
Ablation Studies
Ablations show the impact of the key design choices: loop closure through hidden-state Sim(3) constraints outperforms both the traditional PnP variant and the variant without loop closure; well-chosen submap lengths mitigate forgetting in the recurrent model; and adaptive voxelization achieves over 58% map compaction with only a minor impact on PSNR.
Practical and Theoretical Implications
The departure from optimize-from-scratch to predict-and-refine marks a clear theoretical advance: dense scene attributes and poses can be predicted accurately and efficiently from learned priors given sufficient multi-frame context. The hidden state architecture offers a mechanism not only for within-session loop closure but also for future generalization to life-long mapping and multi-condition relocalization. Experiments indicate robust relocalization under significant environmental changes (e.g., day/night, dynamic objects).
The 2D Gaussian surfel paradigm offers a useful geometric and differentiable representation for real-time mapping and rendering tasks, with evident applicability to a wider range of visual geometric perception problems (e.g., foundation models for SLAM, online scene understanding).
The recurrent feed-forward predictor also informs future SLAM system designs: it highlights the trade-off between temporal context length and drift, and underscores the need for explicit mechanisms to counter catastrophic forgetting in recurrent models over long sequences.
Future Directions
Potential avenues include explicit training on temporally-varying datasets for life-long mapping, continual adaptation of hidden state representations, and further architectural optimization (quantized networks, efficient attention mechanisms) for resource-constrained applications. The approach generalizes toward multi-modal SLAM (e.g., leveraging inertial or depth sensing), multi-agent mapping, and foundation models for embodied perception.
Conclusion
Flash-Mono establishes a highly efficient, accurate, and robust monocular SLAM framework built upon recurrent feed-forward Gaussian Splatting, hidden state-based loop closure, and adaptive compact mapping. It defines a scalable paradigm for integrating data-driven priors and real-time performance in dense SLAM and sets new quantitative standards for tracking, mapping, and system-level efficiency (2604.03092).