ACE-SLAM: Neural RGB-D SLAM System

Updated 18 December 2025

ACE-SLAM is a neural RGB-D SLAM system that uses scene coordinate regression as an implicit map representation for efficient, real-time, and privacy-preserving 3D mapping.
It employs a lightweight TriMLP architecture with triplane voting grids, which yields a 30–50% RMSE reduction over a HomMLP baseline in ablations, and it performs well in dynamic settings.
The system integrates always-on relocalization and loop closure, processing RGB-D data at 29.7 FPS while maintaining a compact map size of approximately 1 MB.

ACE-SLAM is a neural RGB-D Simultaneous Localization and Mapping (SLAM) system that establishes Scene Coordinate Regression (SCR) as the core implicit map representation for real-time SLAM. The approach involves a lightweight neural network that directly maps image features to global 3D scene coordinates, enabling compact, efficient, and privacy-preserving map representations. ACE-SLAM achieves strict real-time performance and robust relocalization, supporting both sparse and dense features, and operates reliably in dynamic environments without the need for specialized adaptation (Alzugaray et al., 16 Dec 2025).

1. System Pipeline and Implicit Map Representation

ACE-SLAM processes a live RGB-D input stream $\{I^t, D^t\}$. Feature extraction is performed by a frozen pre-trained encoder $\varphi$ (either ACE or SuperPoint) operating asynchronously. This module produces a set of image features $\{(u_i^t, d_i^t, f_i^t)\}$, where $u_i^t \in \mathbb{R}^2$ is the keypoint, $d_i^t \in \mathbb{R}$ is the depth, and $f_i^t \in \mathbb{R}^D$ is the appearance descriptor.

The SCR network, denoted $f_0$, acts as the implicit map: for each feature, it predicts a global 3D coordinate $X_i = f_0(f_i)$. Pose estimation is executed via “always-on relocalization”: the system creates 2D–3D correspondences $\{u_i \leftrightarrow X_i\}$, lifts each keypoint to a local 3D point using its measured depth $d_i$, and recovers the camera pose $P^t$ through RANSAC followed by closed-form Kabsch (Umeyama) refinement. Mapping is handled in a parallel thread that optimizes over a small window of keyframes using stochastic gradient descent (SGD) to minimize the geometric residuals with respect to the SCR parameters $\Theta$.

SCR stores the entire scene in the network weights $\Theta$, which occupy approximately 1 MB, so memory usage is decoupled from scene size. Inference scales as $O(M)$ with the number of features $M$, and relocalization takes approximately 11 ms. The method naturally supports loop closure and privacy, since global coordinates are produced only for features actually observed in the scene (Alzugaray et al., 16 Dec 2025).
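As a rough sanity check on the 1 MB figure, the sketch below converts a map size into a weight count. The storage precisions and the use of a decimal megabyte are assumptions; the source does not state how the weights are stored, and `params_for_budget` is a hypothetical helper.

```python
def params_for_budget(map_bytes: int, bytes_per_param: int) -> int:
    """Number of weights that fit into a map of the given size."""
    return map_bytes // bytes_per_param

MAP_BYTES = 1_000_000  # ~1 MB as quoted in the text (decimal megabyte assumed)

# Assumed storage precisions; the source does not specify one.
for name, width in [("float32", 4), ("float16", 2)]:
    print(name, params_for_budget(MAP_BYTES, width))
```

Under these assumptions the map corresponds to a network with a few hundred thousand weights, regardless of how large the mapped scene is.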

2. Scene Coordinate Regression Network Architecture

ACE-SLAM’s mapping function is formalized as:

$f_0: \mathbb{R}^D \rightarrow \mathbb{R}^3,\quad X_i = f_0(f_i)$

The system introduces TriMLP, a lightweight MLP that computes logits for triplane voting grids on the $XY$, $XZ$, and $YZ$ coordinate planes:

$C_i^{XY},\, C_i^{XZ},\, C_i^{YZ} = \operatorname{softmax}(\text{MLP}(f_i))$

with $C_i^{IJ} \in \mathbb{R}^{r_I \times r_J}$. Each grid votes for its 2D plane coordinates:

$(\tilde{x}_i^{XY}, \tilde{y}_i^{XY}) = \sum_{p,q} B^{XY}_{p,q} \cdot C^{XY}_{i,p,q}$

Similar expressions hold for $XZ$ and $YZ$, each with a predefined basis $B^{IJ}$. The 3D prediction averages the two estimates available for each axis:

$X_i = \frac{1}{2} \begin{bmatrix} \tilde{x}_i^{XY} + \tilde{x}_i^{XZ} \\ \tilde{y}_i^{XY} + \tilde{y}_i^{YZ} \\ \tilde{z}_i^{XZ} + \tilde{z}_i^{YZ} \end{bmatrix}$
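The voting step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the MLP is omitted and replaced by hand-made logits, each plane is assumed to be softmax-normalized on its own, and the basis $B^{IJ}$ is assumed to hold evenly spaced cell centres over a bounding box. All function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Softmax over all cells of one voting grid."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def plane_basis(lo_i, hi_i, r_i, lo_j, hi_j, r_j):
    """Assumed basis B^{IJ}: cell-centre coordinates, shape (r_i, r_j, 2)."""
    gi, gj = np.meshgrid(np.linspace(lo_i, hi_i, r_i),
                         np.linspace(lo_j, hi_j, r_j), indexing="ij")
    return np.stack([gi, gj], axis=-1)

def vote(logits, basis):
    """Expected 2D coordinate: sum_{p,q} B_{p,q} * C_{p,q}."""
    c = softmax(logits)
    return (basis * c[..., None]).sum(axis=(0, 1))

def triplane_to_xyz(l_xy, l_xz, l_yz, b_xy, b_xz, b_yz):
    """Average the two per-axis estimates coming from the three planes."""
    x_xy, y_xy = vote(l_xy, b_xy)
    x_xz, z_xz = vote(l_xz, b_xz)
    y_yz, z_yz = vote(l_yz, b_yz)
    return 0.5 * np.array([x_xy + x_xz, y_xy + y_yz, z_xz + z_yz])

# 5x5 grids over [-1, 1]^2; cell centres are -1, -0.5, 0, 0.5, 1.
b = plane_basis(-1.0, 1.0, 5, -1.0, 1.0, 5)
l_xy, l_xz, l_yz = np.zeros((5, 5)), np.zeros((5, 5)), np.zeros((5, 5))
l_xy[3, 1] = 50.0   # strong vote for x = 0.5, y = -0.5
l_xz[3, 4] = 50.0   # strong vote for x = 0.5, z = 1.0
l_yz[1, 4] = 50.0   # strong vote for y = -0.5, z = 1.0
xyz = triplane_to_xyz(l_xy, l_xz, l_yz, b, b, b)
print(np.round(xyz, 3))
```

With sharply peaked logits the prediction collapses onto a single cell centre, while flat logits return the centre of the bounding box. This shows how the grids act as soft votes rather than hard classifications.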

The voting-based factorization introduces an inductive bias that improves interpretability and accelerates online adaptation. Self-supervised training is performed during SLAM by minimizing the discrepancy between the predicted and back-projected 3D coordinates:

$r_i^t(\Theta, P^t) = \| f_0(f_i^t; \Theta) - P^t x_{i,\text{local}}^t \|^2$

where $x_{i,\text{local}}^t$ is the 3D back-projection from $(u_i^t, d_i^t)$, with the global loss:

$L(\Theta, P) = \sum_t \sum_i \| f_0(f_i^t; \Theta) - P^t x_{i,\text{local}}^t \|^2$

Supervision derives exclusively from RGB-D geometric consistency, without external ground-truth scene coordinates.
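The residual can be made concrete with a short sketch for a single frame. It assumes a pinhole camera with intrinsics matrix `K` and a 4x4 camera-to-world pose `P`; neither convention is stated in the source, and the function names are hypothetical. The network output is stood in for by an array `pred_world`.

```python
import numpy as np

def back_project(u, d, K):
    """Lift pixels u (N, 2) with depths d (N,) to camera-frame points (N, 3)."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u[:, 0] - cx) / fx * d
    y = (u[:, 1] - cy) / fy * d
    return np.stack([x, y, d], axis=1)

def to_world(P, x_local):
    """Apply an assumed 4x4 camera-to-world pose P to points (N, 3)."""
    return x_local @ P[:3, :3].T + P[:3, 3]

def scr_loss(pred_world, P, x_local):
    """Sum of squared residuals between predicted and back-projected coordinates."""
    r = pred_world - to_world(P, x_local)
    return float((r ** 2).sum())

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
u = np.array([[320.0, 240.0], [820.0, 240.0]])
d = np.array([2.0, 1.0])
x_local = back_project(u, d, K)          # [[0, 0, 2], [1, 0, 1]]

P = np.eye(4)
P[:3, 3] = [1.0, 0.0, 0.0]               # camera shifted 1 m along x
target = to_world(P, x_local)

print(scr_loss(target, P, x_local))                      # perfect prediction
print(scr_loss(target + [0.0, 0.0, 0.1], P, x_local))    # 10 cm error on each point
```

The loss depends only on the depth measurements and the current pose estimate, which is why no ground-truth scene coordinates are needed.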

3. Real-Time SLAM Integration and Optimization

Upon arrival of each RGB-D frame, feature descriptors are extracted and keypoints are back-projected. Scene coordinates $X_i^t = f_0(f_i^t)$ are inferred, and the camera pose $P^t$ is solved by minimizing:

$P^t = \arg\min_{P \in SE(3)} \sum_{i=1}^n \| X_i^t - P x_{i,\text{local}}^t \|^2$

This estimation employs RANSAC on feature triplets followed by closed-form refinement (see equations 5–7 in the original source). Pose estimation reliability is tracked via the inlier ratio $\rho^t$.
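A minimal sketch of this hypothesize-and-test scheme follows. It uses the textbook SVD-based Kabsch solution and a plain RANSAC loop over triplets; it is not a transcription of equations 5–7 in the source. The inlier threshold, iteration count, and function names are assumptions chosen for illustration.

```python
import numpy as np

def kabsch(src, dst):
    """Closed-form rigid fit (R, t) minimising sum ||R @ src_i + t - dst_i||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

def ransac_pose(x_local, X_world, thresh=0.05, iters=64, seed=0):
    """Return (R, t, inlier_ratio); thresh is in metres (assumed value)."""
    rng = np.random.default_rng(seed)
    n = len(x_local)
    best = np.zeros(n, dtype=bool)
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)     # minimal sample: a triplet
        R, t = kabsch(x_local[idx], X_world[idx])
        err = np.linalg.norm(x_local @ R.T + t - X_world, axis=1)
        inl = err < thresh
        if inl.sum() > best.sum():
            best = inl
    if best.sum() < 3:
        raise ValueError("no hypothesis gathered enough inliers")
    R, t = kabsch(x_local[best], X_world[best])        # refine on all inliers
    return R, t, float(best.mean())

# Synthetic frame: 50 correspondences, 10 of them gross outliers.
rng = np.random.default_rng(1)
x_local = rng.uniform(-1.0, 1.0, size=(50, 3))
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
X_world = x_local @ R_true.T + t_true
X_world[:10] += 5.0
R_est, t_est, rho = ransac_pose(x_local, X_world)
print(np.round(t_est, 3), rho)
```

The returned inlier ratio plays the role of $\rho^t$: it is a by-product of pose estimation and needs no extra computation.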

Keyframes are inserted if the elapsed time exceeds $K_t^{\text{min}}$ or if $\rho^t < \rho_{\min}$. The mapping thread maintains a constant-size optimization window $\mathcal{W}$ containing recent keyframes and selected lower-quality keyframes (weighted by $1-\rho^t$). Each window iteration relocalizes keyframes, samples feature tuples (emphasizing low-inlier frames), and performs several SGD steps to minimize the residual loss with respect to the SCR parameters. Because the window size is constant, the computational cost per mapping cycle is fixed, which supports real-time operation.
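The keyframe rule and window selection can be sketched as follows. The threshold values, the split between recent and older keyframes, and both function names are assumptions for illustration; the source specifies only the insertion condition and the $1-\rho^t$ weighting.

```python
import numpy as np

def should_insert_keyframe(t_now, t_last_kf, rho, min_interval=1.0, rho_min=0.5):
    """Insert when enough time has passed or the inlier ratio drops too low."""
    return (t_now - t_last_kf) > min_interval or rho < rho_min

def sample_window(rhos, window_size, n_recent, rng):
    """Newest n_recent keyframes plus older ones drawn with weight 1 - rho."""
    n = len(rhos)
    n_recent = min(n_recent, window_size, n)
    recent = list(range(n - n_recent, n))
    older = np.arange(0, n - n_recent)
    k = min(window_size - n_recent, len(older))
    if k <= 0:
        return recent
    w = 1.0 - np.asarray(rhos, dtype=float)[older]
    w = w + 1e-6                      # keep every keyframe reachable
    picked = rng.choice(older, size=k, replace=False, p=w / w.sum())
    return sorted(picked.tolist()) + recent

rhos = [0.9, 0.2, 0.95, 0.3, 0.9, 0.85, 0.1, 0.9, 0.8, 0.7]
window = sample_window(rhos, window_size=5, n_recent=2, rng=np.random.default_rng(0))
print(window)
```

The window never grows beyond `window_size`, so the cost of a mapping cycle stays bounded while poorly localized keyframes are revisited more often.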

Relocalization in ACE-SLAM is "always-on": every frame is localized by hypothesize-and-test, with no reliance on pose priors. Robustness to dynamic objects is achieved by automatically downweighting regions with low $\rho^t$, while loop closure arises as keyframes re-establish correspondences through the global implicit map.

4. Quantitative Evaluation and Benchmarks

ACE-SLAM demonstrates highly competitive accuracy on standard benchmarks compared to prior neural SLAM systems. In static scenarios (Replica, TUM, ScanNet), the absolute trajectory error (ATE) RMSE for TriMLP with ACE dense features falls within:

Replica: 0.027–0.049 m
TUM: 0.083 m
ScanNet: 0.164–0.212 m
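For reference, the ATE RMSE metric can be computed as in the sketch below. It assumes the estimated and ground-truth trajectories are already time-associated and expressed in a common frame; benchmark tooling usually performs a rigid alignment first, which is omitted here. The function name is hypothetical.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Root-mean-square of per-pose translational errors, in metres."""
    err = np.linalg.norm(np.asarray(est_xyz) - np.asarray(gt_xyz), axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

gt = np.zeros((4, 3))
est = gt + np.array([0.03, 0.04, 0.0])   # every pose is off by 5 cm
print(ate_rmse(est, gt))
```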

Ablations show TriMLP yields a 30–50% RMSE reduction over HomMLP. Dense features from ACE are critical; sparse SuperPoint descriptors degrade performance in larger environments.

In dynamic scenes (TUM-RGBD dynamic), ACE-SLAM achieves RMSEs matching or surpassing dynamic-specialized pipelines (e.g., FND-SLAM, NID-SLAM) without any explicit segmentation. On fr3_s_half, for example, it reaches 0.049 m, the second-best result behind FND-SLAM at 0.015 m.

Performance metrics highlight real-time operation:

End-to-end: 29.7 FPS on RTX 4090 (99% real-time factor)
Per-frame relocalization: 11 ms (ACE) or 13 ms (SuperPoint)
Map size: ≈1.1 MB (TriMLP+ACE), 0.86 MB (SuperPoint)
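The relationship between these figures can be checked with simple arithmetic. The 30 Hz sensor rate is an assumption; it is not stated in the source, but it is the rate at which 29.7 FPS corresponds to a 99% real-time factor.

```python
def real_time_factor(processing_fps, sensor_fps):
    """Fraction of incoming frames the system can keep up with."""
    return processing_fps / sensor_fps

def frame_budget_ms(fps):
    """Time available per frame at the given rate, in milliseconds."""
    return 1000.0 / fps

SENSOR_FPS = 30.0   # assumed camera rate
print(round(real_time_factor(29.7, SENSOR_FPS), 3))
print(round(frame_budget_ms(29.7), 2))
print(round(11.0 / frame_budget_ms(29.7), 3))   # share taken by ACE relocalization
```

Under this assumption, relocalization at 11 ms uses roughly a third of the per-frame budget.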

A comparative overview:

System	FPS	Map Size
ACE-SLAM	29.7	1.1 MB
iMAP	0.15	0.99 MB
NICE-SLAM	0.33	95 MB
ESLAM	7.35	45 MB
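To put the table in relative terms, the following sketch computes how much slower and how much larger each baseline is compared with ACE-SLAM. The numbers are copied from the table; the helper name is hypothetical.

```python
# (FPS, map size in MB), copied from the table above.
systems = {
    "ACE-SLAM": (29.7, 1.1),
    "iMAP": (0.15, 0.99),
    "NICE-SLAM": (0.33, 95.0),
    "ESLAM": (7.35, 45.0),
}

def relative_to(reference, table):
    """Per system: (speed-up of the reference, map-size ratio vs the reference)."""
    ref_fps, ref_mb = table[reference]
    return {name: (ref_fps / fps, mb / ref_mb)
            for name, (fps, mb) in table.items() if name != reference}

ratios = relative_to("ACE-SLAM", systems)
for name, (speedup, size_ratio) in ratios.items():
    print(f"{name}: {speedup:.1f}x slower, map {size_ratio:.1f}x the size")
```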

Explicit BA/LC-based pipelines (e.g., GO-SLAM, CO-SLAM) achieve higher accuracy, but with more computational overhead and complexity (Alzugaray et al., 16 Dec 2025).

5. Advantages, Trade-Offs, and Limitations

ACE-SLAM enables fully online real-time neural SLAM, combines mapping and tracking at close to 30 FPS (29.7 FPS measured), and maintains a highly compact implicit representation independent of scene size. Its always-on relocalization, loop closure, and robustness to dynamic motion are realized without recourse to semantic, volumetric, or segmentation modules. Privacy is inherently preserved, as scene geometry is only accessible via feature descriptors present in observed frames.

Noted limitations include:

Suboptimal accuracy compared to state-of-the-art pipelines with explicit bundle adjustment (BA) and loop closure (LC) or semantic components.
Sensitivity to sparse or non-discriminative features, particularly in larger environments with only SuperPoint features.
The TriMLP’s architectural simplicity favors rapid adaptation but may suppress recovery of fine geometric detail relative to larger-capacity or volumetric networks.
In rapidly changing or previously unseen regions, the inlier ratio temporarily declines until mapping adapts, though the conservative keyframe approach mitigates extended failures.

A plausible implication is that while ACE-SLAM is optimized for real-time deployment and minimal memory, its current architecture may be best suited for scenarios prioritizing efficiency, robustness to dynamic scenes, and privacy, rather than maximal geometric fidelity (Alzugaray et al., 16 Dec 2025).