Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

Published 3 Apr 2026 in cs.RO | (2604.03092v1)

Abstract: Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming $\textit{Train-from-Scratch}$ optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a $\textbf{10x}$ speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global $\mathrm{Sim}(3)$ optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications. Project page: https://victkk.github.io/flash-mono.


Authors (6)

Summary

The paper introduces a novel recurrent feed-forward SLAM method that predicts camera poses and dense 2D Gaussian surfel attributes in a single pass.
It leverages a hidden state memory for efficient loop closure and Sim(3) optimization, significantly reducing drift and computational overhead.
Empirical evaluations show state-of-the-art tracking accuracy and rendering quality on diverse datasets while achieving real-time performance.


Introduction

Flash-Mono presents a monocular SLAM system that leverages a recurrent feed-forward paradigm to address the inherent efficiency, consistency, and geometric fidelity challenges in Gaussian Splatting-based monocular SLAM. The core contributions include a recurrent network for incremental pose and Gaussian attribute prediction, a hidden state memory for efficient loop closure and Sim(3) optimization, and the adoption of 2D Gaussian surfels as mapping primitives. The methodology departs from the canonical “train-from-scratch” approach, enabling real-time SLAM with significant computational and accuracy advantages.

System Architecture

Recurrent Feed-Forward Frontend

Flash-Mono utilizes a recurrent transformer-based frontend model. For each incoming monocular RGB frame, visual features are extracted via a ViT encoder and fused with a persistent hidden state. The joint architecture predicts absolute camera pose and dense, per-pixel 2D Gaussian attributes in a single forward pass, with the hidden state aggregating multiframe geometry and appearance information. This architectural choice circumvents the iterative optimization required in prior GS-SLAM methods, enabling high frame rates and improved multi-view consistency.
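
The recurrent fusion described above can be illustrated with a minimal NumPy sketch. It assumes single-head cross-attention, a residual "write" step that updates the hidden state from the frame tokens, and a "read" step that conditions the frame tokens on the updated state. The token counts, dimensions, random weights, and the names `recurrent_step`, `write`, and `read` are all hypothetical. The ViT encoder and the pose and Gaussian heads are omitted, so this is not the paper's architecture, only the general pattern.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends to all context tokens."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def recurrent_step(hidden, frame_tokens, params):
    """Process one frame: write frame features into the state, then read them back."""
    # Write: state tokens query the current frame (residual update).
    hidden = hidden + cross_attention(hidden, frame_tokens, *params["write"])
    # Read: frame tokens query the updated state to pick up multi-frame context.
    fused = frame_tokens + cross_attention(frame_tokens, hidden, *params["read"])
    return hidden, fused

rng = np.random.default_rng(0)
dim, n_state, n_patches = 16, 8, 32  # hypothetical sizes
params = {
    name: [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]
    for name in ("write", "read")
}

hidden = np.zeros((n_state, dim))
for _ in range(3):  # three stand-in frames of random "ViT features"
    frame_tokens = rng.normal(size=(n_patches, dim))
    hidden, fused = recurrent_step(hidden, frame_tokens, params)
```

In a full system, `fused` would feed the heads that regress the camera pose and the per-pixel surfel attributes, while `hidden` is carried to the next frame.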

The model is trained end-to-end with multi-task objectives on pose, geometry, and rendering, using ground truth RGB, depth, and camera poses for supervision. The loss combines Euclidean pose loss, surfel geometry loss, and differentiable rendering losses (including MSE, LPIPS, and depth error).
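
One plausible way to write the combined objective is as a weighted sum of the listed terms. The weights $\lambda$ are an assumption for illustration; this summary does not state their values or whether the terms are weighted at all.

$$
\mathcal{L} = \lambda_{\text{pose}}\,\mathcal{L}_{\text{pose}} + \lambda_{\text{geo}}\,\mathcal{L}_{\text{geo}} + \lambda_{\text{mse}}\,\mathcal{L}_{\text{MSE}} + \lambda_{\text{lpips}}\,\mathcal{L}_{\text{LPIPS}} + \lambda_{\text{depth}}\,\mathcal{L}_{\text{depth}}
$$

Here $\mathcal{L}_{\text{pose}}$ is the Euclidean pose loss, $\mathcal{L}_{\text{geo}}$ the surfel geometry loss, and the last three terms are the rendering losses computed through the differentiable rasterizer.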

Loop Closure with Hidden State

To mitigate cumulative pose and scale drift, the hidden state mechanism is leveraged during loop closure. Each submap’s hidden state, representing the local context, is cached. Upon loop detection, the system performs a single feed-forward relocalization conditioned on the historical hidden state, obtaining a direct Sim(3) constraint between non-consecutive submaps. This enables robust global pose graph optimization based on Sim(3) constraints, effectively addressing accumulated trajectory errors.
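
To make the Sim(3) constraint concrete, the sketch below represents a similarity transform as scale, rotation, and translation, and measures how far a loop measurement disagrees with the current submap estimates. The conventions are assumptions, not taken from the paper: `S_i` maps submap-i coordinates to the world frame, and `S_ij` maps submap-j coordinates into submap i. The class `Sim3` and the function `loop_residual` are hypothetical names, and the loop measurement is synthesized here rather than produced by a network.

```python
import numpy as np

class Sim3:
    """Similarity transform x -> s * R @ x + t."""

    def __init__(self, s, R, t):
        self.s = float(s)
        self.R = np.asarray(R, dtype=float)
        self.t = np.asarray(t, dtype=float)

    def compose(self, other):
        """Return the transform that applies `other` first, then `self`."""
        return Sim3(self.s * other.s,
                    self.R @ other.R,
                    self.s * (self.R @ other.t) + self.t)

    def inverse(self):
        R_inv = self.R.T
        return Sim3(1.0 / self.s, R_inv, -(R_inv @ self.t) / self.s)

    def apply(self, points):
        """Transform an (N, 3) array of points."""
        return self.s * points @ self.R.T + self.t

def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def loop_residual(S_i, S_j, S_ij):
    """Return (log-scale error, rotation angle in rad, translation norm).

    All three are zero when the measured S_ij agrees with S_i^-1 * S_j.
    """
    predicted = S_i.inverse().compose(S_j)
    err = S_ij.inverse().compose(predicted)
    cos_angle = np.clip((np.trace(err.R) - 1.0) / 2.0, -1.0, 1.0)
    return np.log(err.s), np.arccos(cos_angle), np.linalg.norm(err.t)

S_i = Sim3(1.0, rot_z(0.1), [0.0, 0.0, 0.0])
S_j_true = Sim3(1.2, rot_z(0.8), [2.0, 1.0, 0.0])
# Stand-in for the constraint returned by the feed-forward relocalization.
S_ij = S_i.inverse().compose(S_j_true)
# Odometry estimate of submap j with scale, rotation, and translation drift.
S_j_drifted = Sim3(1.0, rot_z(0.7), [2.3, 1.0, 0.0])

consistent = loop_residual(S_i, S_j_true, S_ij)
drifted = loop_residual(S_i, S_j_drifted, S_ij)
```

A pose graph optimizer would adjust the submap transforms to drive residuals like `drifted` toward zero across all loop and odometry edges. The scale component is what distinguishes Sim(3) from SE(3) and lets the optimization absorb monocular scale drift.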

Backend Mapping and Optimization

Predicted 2D Gaussian surfel maps are incrementally voxelized and fused into a global scene map. The predict-and-refine strategy minimizes backend workload: only a lightweight local Gaussian optimization is performed after each fusion step, in contrast to the exhaustive per-frame optimization of earlier systems. Loop closure corrections are applied as direct rigid transformations, efficiently aligning the global map to the updated pose graph while avoiding expensive global map re-optimization.
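
The voxelized fusion step can be sketched as a hash map keyed by integer voxel coordinates. The version below is a simplification under stated assumptions: it uses a fixed voxel size rather than the adaptive scheme the paper describes, tracks only surfel centers and opacities, and keeps the most opaque surfel per voxel, which is a hypothetical selection rule. The function name `fuse_into_voxel_map` is likewise illustrative.

```python
import numpy as np

def fuse_into_voxel_map(voxel_map, positions, opacities, voxel_size):
    """Insert surfel centers into a voxel hash map, one surfel per voxel.

    voxel_map maps an integer (i, j, k) key to a (position, opacity) pair.
    When a voxel is already occupied, the more opaque surfel is kept.
    """
    keys = np.floor(np.asarray(positions) / voxel_size).astype(int).tolist()
    for key, pos, alpha in zip(map(tuple, keys), positions, opacities):
        kept = voxel_map.get(key)
        if kept is None or alpha > kept[1]:
            voxel_map[key] = (np.asarray(pos, dtype=float), float(alpha))
    return voxel_map

global_map = {}
frame_a = np.array([[0.01, 0.01, 0.01], [0.02, 0.02, 0.02], [0.55, 0.00, 0.00]])
frame_b = np.array([[0.03, 0.03, 0.03], [-0.05, 0.00, 0.00]])

fuse_into_voxel_map(global_map, frame_a, [0.4, 0.9, 0.7], voxel_size=0.1)
fuse_into_voxel_map(global_map, frame_b, [0.6, 0.8], voxel_size=0.1)

# Fraction of predicted surfels that were merged away by voxelization.
compaction = 1.0 - len(global_map) / (len(frame_a) + len(frame_b))
```

Because every frame is fused through the same map, overlapping predictions from consecutive frames collapse into shared voxels instead of accumulating duplicates.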

2D Gaussian Splatting Representation

Recognizing the geometric limitations of pure 3DGS (noise, floaters, poorly defined surfaces), Flash-Mono adopts 2D Gaussian surfels as mapping primitives. Each surfel encodes position, rotation, scale, color, and opacity, and is predicted per pixel by the frontend. This representation imposes a stronger surface prior, suppressing floating artifacts and improving geometric accuracy, which is especially critical for drift-prone monocular systems. The adaptive voxelization process further enforces map compactness without significant loss in rendering fidelity.
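
A surfel with these attributes might be stored as follows. The sketch assumes the standard 2D Gaussian Splatting parameterization, in which a surfel is a planar Gaussian disk embedded in 3D with two tangent axes and a normal. The field layout and the class name `Surfel` are hypothetical, and the paper's exact encoding may differ.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    center: np.ndarray    # (3,) position of the disk center
    rotation: np.ndarray  # (3, 3) columns are tangent u, tangent v, normal
    scale: np.ndarray     # (2,) standard deviations along the two tangents
    color: np.ndarray     # (3,) RGB in [0, 1]
    opacity: float

    @property
    def normal(self):
        # A planar primitive has a well-defined normal: the third rotation axis.
        return self.rotation[:, 2]

    def point(self, u, v):
        """Map local disk coordinates (u, v) to a 3D point on the surfel plane."""
        t_u, t_v = self.rotation[:, 0], self.rotation[:, 1]
        return self.center + self.scale[0] * u * t_u + self.scale[1] * v * t_v

    def weight(self, u, v):
        """Opacity-weighted Gaussian falloff at local coordinates (u, v)."""
        return self.opacity * np.exp(-0.5 * (u * u + v * v))

surfel = Surfel(
    center=np.array([0.0, 0.0, 2.0]),
    rotation=np.eye(3),
    scale=np.array([0.05, 0.02]),
    color=np.array([0.8, 0.3, 0.1]),
    opacity=0.9,
)
edge_point = surfel.point(1.0, 0.0)
offset_along_normal = float(np.dot(edge_point - surfel.center, surfel.normal))
```

Every point generated by `point` lies in the plane through `center`, so the primitive has no thickness along its normal. That is the surface prior that discourages floaters.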

Empirical Evaluation

Flash-Mono is evaluated on challenging large-scale indoor (ScanNet, BundleFusion) and outdoor (KITTI) datasets, benchmarking against state-of-the-art GS-SLAM pipelines (MonoGS, S3PO-GS, DepthGS) and strong visual SLAM systems (ORB-SLAM3, DROID-SLAM, MASt3R-SLAM).

Tracking Accuracy

Flash-Mono reports lower ATE RMSE than the compared systems. It reaches ATE RMSE as low as 11.69 cm on ScanNet and 7.34 cm on BundleFusion, outperforming MASt3R-SLAM and all GS baselines and substantially reducing drift (Section 5.2). On KITTI, it tracks robustly despite challenging outdoor dynamics and outperforms S3PO-GS by a large margin (e.g., 12.85 m vs 32.49 m ATE RMSE on KITTI sequences).
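
For reference, ATE RMSE for a monocular system is commonly computed after aligning the estimated trajectory to ground truth with a similarity transform, since the absolute scale is unobservable. The sketch below uses Umeyama's closed-form alignment. Whether the paper follows exactly this protocol is an assumption, and the trajectories are synthetic.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity (s, R, t) such that dst ~ s * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x_src, x_dst = src - mu_src, dst - mu_dst
    cov = x_dst.T @ x_src / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (x_src ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def ate_rmse(est_positions, gt_positions):
    """Root-mean-square translation error after Sim(3) alignment."""
    s, R, t = umeyama_alignment(est_positions, gt_positions)
    aligned = s * est_positions @ R.T + t
    errors = np.linalg.norm(aligned - gt_positions, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))

rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
# Estimate differs from ground truth only by scale, rotation, and translation.
est = 0.5 * gt @ R_true.T + np.array([1.0, -2.0, 0.5])

error_clean = ate_rmse(est, gt)
error_noisy = ate_rmse(est + rng.normal(scale=0.05, size=est.shape), gt)
```

The clean trajectory aligns to numerical precision, while the noisy one leaves a residual. This shows that the metric captures drift and jitter but ignores the global gauge.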

Mapping and Rendering Quality

Rendering quality matches or exceeds prior GS methods while using roughly a tenth of the per-keyframe optimization iterations (20 iterations per keyframe vs 250 in MonoGS/S3PO-GS, a 12.5x reduction). On the headline metrics, Flash-Mono reports up to 21.73 dB PSNR and 0.80 SSIM on ScanNet, with LPIPS scores consistently lower than all baselines. Depth L1 errors are the lowest among all compared methods (0.34 m on ScanNet and 0.21 m on BundleFusion), supporting the claim of geometric fidelity. The map size (number of Gaussian primitives) is competitive, balancing efficiency and accuracy.
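
The figures above can be put in perspective with a short calculation. It assumes the usual PSNR definition for images normalized to [0, 1] and treats the reported value as if it came from a single image. A dataset-averaged PSNR does not map exactly onto one MSE, so the result is indicative only.

```python
import math

def psnr_from_mse(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_from_psnr(psnr_db, max_value=1.0):
    """Invert the PSNR definition to recover the mean squared error."""
    return max_value ** 2 / 10.0 ** (psnr_db / 10.0)

# Iteration counts quoted in the text: 250 per keyframe vs 20 per keyframe.
iteration_ratio = 250 / 20

# Pixel error implied by the reported ScanNet PSNR, for intensities in [0, 1].
mse_at_reported_psnr = mse_from_psnr(21.73)
rmse_at_reported_psnr = math.sqrt(mse_at_reported_psnr)
```

The ratio of iteration counts is 12.5, which is where the rounded "10x" figure comes from.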

Computational Efficiency

Flash-Mono operates at over 12 FPS (real-time), compared to approximately 1 FPS for prior GS-SLAM approaches. High-quality feed-forward prediction keeps backend optimization to a minimum. Additional acceleration techniques (FP16 attention and CUDA Graphs) make deployment on resource-constrained hardware feasible.

Ablation Studies

Ablations isolate the contribution of each design choice: loop closure through hidden-state Sim(3) constraints outperforms both a traditional PnP-based variant and a variant without loop closure; an appropriately chosen submap length mitigates forgetting in the recurrent model; and adaptive voxelization reduces map size by over 58% with only a minor PSNR impact.

Practical and Theoretical Implications

The shift from optimize-from-scratch to predict-and-refine suggests that dense scene attributes and poses can be predicted accurately and efficiently from learned priors, given sufficient multi-frame context. The hidden state architecture supports within-session loop closure and may also generalize to life-long mapping and multi-condition relocalization: experiments indicate robust relocalization under significant environmental changes (e.g., day/night, dynamic objects).

The 2D Gaussian surfel paradigm offers a useful geometric and differentiable representation for real-time mapping and rendering tasks, with evident applicability to a wider range of visual geometric perception problems (e.g., foundation models for SLAM, online scene understanding).

Adopting a recurrent feed-forward predictor informs future SLAM system designs, highlighting the trade-off between temporal context length and drift, and underscoring the need for explicit mechanisms to counter recurrent catastrophic forgetting in long sequences.

Future Directions

Potential avenues include explicit training on temporally-varying datasets for life-long mapping, continual adaptation of hidden state representations, and further architectural optimization (quantized networks, efficient attention mechanisms) for resource-constrained applications. The approach generalizes toward multi-modal SLAM (e.g., leveraging inertial or depth sensing), multi-agent mapping, and foundation models for embodied perception.

Conclusion

Flash-Mono establishes a highly efficient, accurate, and robust monocular SLAM framework built upon recurrent feed-forward Gaussian Splatting, hidden state-based loop closure, and adaptive compact mapping. It defines a scalable paradigm for integrating data-driven priors and real-time performance in dense SLAM and sets new quantitative standards for tracking, mapping, and system-level efficiency (2604.03092).
