SING3R-SLAM: Compact Global Dense SLAM
- The paper presents a novel framework that fuses local monocular submap reconstruction with global 3D Gaussian mapping to achieve state-of-the-art tracking and memory efficiency.
- It employs a tightly integrated pipeline—Sub-Track3R for local pose estimation, a Gaussian-based global mapper, and bidirectional loop closure to mitigate drift and optimize scene parameters.
- Empirical evaluations on indoor benchmarks demonstrate superior performance with reduced map storage (7 MB) and improved metrics such as lower ATE and enhanced photorealistic view synthesis.
SING3R-SLAM is a globally consistent and compact dense RGB SLAM framework that fuses local monocular 3D reconstruction priors with global scene modeling based on 3D Gaussian Splatting. Designed for indoor environments, it employs a submap-based approach to tracking and mapping, enabling efficient integration of local geometric detail while mitigating map drift and memory inefficiency typical of prior SLAM frameworks. The architecture features three tightly interwoven modules—Sub-Track3R for local geometry and pose estimation, a global Gaussian Mapper for multi-view optimization, and a bidirectional loop closure mechanism. Through joint optimization of camera trajectories and volumetric scene parameters, SING3R-SLAM achieves state-of-the-art tracking, robust loop closure, detailed and compact 3D geometry, and high-fidelity novel view synthesis (Li et al., 21 Nov 2025).
1. Pipeline Architecture
The SING3R-SLAM pipeline processes an input monocular video of RGB frames via tightly coupled local and global modules:
- Sub-Track3R splits the sequence into short overlapping submaps, with a shared boundary frame providing temporal continuity. Each submap is encoded by a 3D encoder (CUT3R [wang2025continuous]) to yield dense point maps and local camera poses.
Inter-submap registration aligns each submap to a global coordinate frame using the pose of the shared overlap frame together with a per-submap scale-drift correction.
- Global Gaussian Mapper maintains the global scene as a set of 3D Gaussians $\mathcal{G} = \{g_k\}$, each $g_k = (\mu_k, \Sigma_k, \alpha_k, c_k)$, where $\mu_k$ is the mean, $\Sigma_k$ the covariance, $\alpha_k$ the opacity, and $c_k$ the color. Differentiable 3DGS rasterization [kerbl2023, zhang2024rade] renders synthesized views from current pose estimates.
Intra-submap pose refinement minimizes a weighted combination of a photometric loss and a scale-invariant depth loss over the submap's frames.
Map update is formalized as a sliding-window multi-view optimization whose objective combines color, depth, depth-normal consistency, and Gaussian shape-regularization losses.
- Bidirectional Loop Closure detects loops via reprojection-based covisibility, forms “loop submaps,” and solves for the rigid transforms that jointly enforce both adjacency and loop constraints.
These transforms are applied to update global Gaussian parameters and camera poses, propagating loop closure globally.
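The inter-submap registration step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the Sim(3)-style similarity transform, and the argument layout are assumptions; the paper's actual alignment is solved jointly with the other modules.

```python
import numpy as np

def align_submap(points_local, pose_overlap_local, pose_overlap_global, scale):
    """Map a submap's dense points into the global frame via the shared
    overlap frame (illustrative sketch, not the paper's exact formulation).

    points_local: (N, 3) point map in the submap's local frame.
    pose_overlap_local / pose_overlap_global: 4x4 poses of the overlap frame
        expressed in the local and global frames, respectively.
    scale: scalar correcting scale drift between submap and global map.
    """
    # Relative rigid transform taking local coordinates to global coordinates.
    T = pose_overlap_global @ np.linalg.inv(pose_overlap_local)
    R, t = T[:3, :3], T[:3, 3]
    # Apply the scale-corrected similarity transform to every point.
    return scale * (points_local @ R.T) + t
```

With identity overlap poses the transform reduces to pure scaling, which makes the role of the scale-drift term easy to verify in isolation.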
2. Global Gaussian Scene Representation
SING3R-SLAM models the scene as a set of volumetric “atoms” (3D Gaussians), each parameterized as $g_k = (\mu_k, \Sigma_k, \alpha_k, c_k)$ with mean $\mu_k \in \mathbb{R}^3$. Collectively, these form a differentiable global scene model $\mathcal{G}$.
Rendering from the global model with the current pose estimates provides per-view synthesized color, depth, and normals. Joint global bundle adjustment minimizes an energy combining photometric, depth, and normal-consistency residuals over these renderings.
This tightly couples geometric and photometric information across all views contributing to the map.
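A minimal sketch of this kind of multi-term energy, assuming L1 residuals and illustrative weights (the paper's exact loss terms and weights are not reproduced here):

```python
import numpy as np

def make_gaussian(mu, Sigma, alpha, c):
    """One Gaussian atom: mean (3,), covariance (3, 3), opacity, color (3,)."""
    return {"mu": np.asarray(mu, float),
            "Sigma": np.asarray(Sigma, float),
            "alpha": float(alpha),
            "c": np.asarray(c, float)}

def bundle_adjust_energy(rendered, observed,
                         w_color=1.0, w_depth=0.5, w_normal=0.1):
    """Weighted sum of per-view color, depth, and normal L1 residuals.

    rendered / observed: dicts holding 'color', 'depth', 'normal' arrays for
    one view. Weights here are illustrative placeholders.
    """
    energy = 0.0
    for key, w in (("color", w_color), ("depth", w_depth), ("normal", w_normal)):
        energy += w * np.abs(rendered[key] - observed[key]).mean()
    return energy
```

In the full system this energy is differentiated through the 3DGS rasterizer, so its gradients update both Gaussian parameters and camera poses simultaneously.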
3. Submap Formation, Alignment, and Fusion
Each submap is constructed from a short overlapping window of frames, with temporal overlap enforced by a shared boundary frame. Alignment into the global frame transforms each submap's local points and poses via the pose of the overlap frame and corrects for scale drift with a per-submap scale factor.
Local-to-global fusion proceeds by inserting newly observed 3D Gaussians only into previously unmapped regions while jointly reoptimizing all parameters.
Implicit enforcement of cross-view geometric constraints occurs via intra-submap photometric and depth consistency and multi-view global bundle adjustment losses. Loop closure utilizes the bidirectional formulation above, adjusting both the rigid submap transforms and global scene representation.
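The rigid per-submap correction solved at loop closure can be illustrated with a standard least-squares rigid alignment (Kabsch). This is a sketch of the core subproblem only; the paper's bidirectional formulation solves all submap transforms jointly under adjacency and loop constraints.

```python
import numpy as np

def rigid_align(P, Q):
    """Least-squares rigid transform (R, t) such that R @ P[i] + t ≈ Q[i].

    Classic Kabsch solution via SVD; illustrates the rigid correction a
    loop constraint imposes between two observations of the same region.
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    # Cross-covariance of the centered correspondences.
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

Applying the recovered transform to a drifted submap's Gaussians and poses is what propagates the loop correction into the global map.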
4. Optimization, Corrections, and Feedback Loop
SING3R-SLAM operates as a continuous feedback system. After each global optimization, updated depths and poses are provided to Sub-Track3R, which then uses these improved estimates as priors for the formation and alignment of subsequent submaps. All submap construction is thus grounded in the latest available globally consistent geometry.
This closed-loop strategy leads to robust correction of local drift, with each new submap benefiting from globally optimized camera and scene parameters. As a result, global errors remain tightly bounded even over long sequences.
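The feedback loop can be summarized as a small driver sketch. All names here are illustrative stand-ins for the paper's modules, not its API; the point is only the data flow, where each tracking call consumes priors produced by the previous global optimization.

```python
def run_slam(frames, track_submap, global_optimize, submap_len=4):
    """Closed-loop driver sketch: each new submap is tracked with priors
    taken from the latest globally optimized depths and poses.

    track_submap(window, priors): Sub-Track3R stage (local geometry/poses).
    global_optimize(state, local): Gaussian Mapper stage; returns the
        updated global state plus refined priors for the next submap.
    """
    priors = None
    global_state = []
    i = 0
    while i < len(frames):
        window = frames[i : i + submap_len]
        local = track_submap(window, priors)                  # local stage
        global_state, priors = global_optimize(global_state, local)
        i += submap_len - 1                                   # one-frame overlap
    return global_state
```

Because consecutive windows share exactly one frame, drift corrections made globally are immediately visible to the next local tracking step.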
5. Implementation Specifics and Memory Efficiency
Sub-Track3R is implemented using the CUT3R encoder over fixed-length submaps with a one-frame overlap, removing the need for costly feature matching and yielding tracking speeds comparable to MASt3R-SLAM. The Gaussian Mapper leverages RaDe-GS for fast differentiable rasterization; the map-update objective uses fixed weights on its color, depth, depth-normal consistency, and shape-regularization terms.
A salient property of the approach is its memory efficiency: the use of continuous volumetric Gaussians reduces global map size to approximately 7 MB for large indoor scenes, in contrast to 110 MB for MASt3R-SLAM and 9 MB for HI-SLAM2. This reduction is achieved without loss of detail or accuracy, demonstrating the effectiveness of the Gaussian representation for map compactness (Li et al., 21 Nov 2025).
Table: Comparative Memory Footprint
| Method | Map Size (MB) |
|---|---|
| SING3R-SLAM | 7 |
| HI-SLAM2 | 9 |
| MASt3R-SLAM | 110 |
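As a rough sanity check on the figures above, a back-of-the-envelope capacity estimate, assuming an uncompressed layout of 13 float32 values per Gaussian (mean 3, compact covariance 6, opacity 1, color 3). This layout is an assumption for illustration; the paper's on-disk encoding may differ.

```python
# Assumed per-Gaussian storage: 13 float32 values = 52 bytes.
BYTES_PER_GAUSSIAN = 13 * 4

def gaussian_capacity(map_mb):
    """Approximate number of Gaussians storable in map_mb megabytes."""
    return int(map_mb * 1024 * 1024) // BYTES_PER_GAUSSIAN
```

Under this assumption a 7 MB map holds on the order of 1.4e5 Gaussians, roughly 15x fewer bytes than the 110 MB MASt3R-SLAM map.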
6. Empirical Evaluation
SING3R-SLAM demonstrates state-of-the-art quantitative performance across major indoor SLAM and reconstruction benchmarks. On the 7-Scenes dataset, SING3R-SLAM achieves an average Absolute Trajectory Error (ATE) of 4.8 cm, improving upon HI-SLAM2 (5.5 cm), MASt3R-SLAM (6.6 cm), and VGGT-SLAM (6.7 cm)—an improvement of over 12%. On ScanNet-v2, photorealistic view synthesis achieves PSNR=30.47 dB (compared to 29.48 dB for Splat-SLAM and 29.27 dB for HI-SLAM2), SSIM=0.89, and LPIPS=0.21, indicating that enhanced global geometric consistency directly improves view synthesis.
Surface reconstruction quality on 7-Scenes attains Accuracy/Completeness/Chamfer values of (0.056 / 0.057 / 0.057). Ablation studies on ScanNet (scene_0059) demonstrate progressive accuracy gains as each system component is added, with the full pipeline reaching ATE=7.20 cm and PSNR=29.44 dB. Tracking runtime is ≈5 min, mapping ≈10 min, and global bundle adjustment ≈8 min.
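The quoted "over 12%" ATE gain follows from a simple relative-reduction computation, shown here for transparency:

```python
def improvement_pct(baseline, ours):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - ours) / baseline

# ATE on 7-Scenes: 5.5 cm (HI-SLAM2) down to 4.8 cm is about a 12.7% reduction;
# against MASt3R-SLAM (6.6 cm) the reduction is about 27.3%.
```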
Table: Ablation Study Results (ScanNet scene_0059)
| Configuration | ATE (cm) | PSNR (dB) |
|---|---|---|
| Sub-Track3R only | 104.2 | — |
| + point-map loop closure | 34.3 | — |
| + Gaussian mapping | 24.5 | 20.17 |
| + point loop + Gauss map | 11.2 | 26.72 |
| + bidirectional loop + Gauss map | 12.25 | — |
| + intra-submap registration | 9.39 | — |
| Full pipeline | 7.20 | 29.44 |
7. Applications and Implications
SING3R-SLAM provides a unified and efficient map representation supporting multiple downstream tasks, including precise visual tracking, dense 3D reconstruction, and high-quality novel view synthesis. The framework demonstrates that locally accurate monocular submaps, when fused into a globally optimized Gaussian scene model, can substantially advance both geometric and photometric SLAM performance while retaining minimal memory overhead (Li et al., 21 Nov 2025).
A plausible implication is that future visual SLAM and NeRF-based scene modeling pipelines may increasingly adopt such combined local-global architectures, leveraging learned monocular priors for semi-dense reconstruction and volumetric Gaussian representations for cross-view optimization and compact storage.