SCE-SLAM: Scale-Consistent Monocular SLAM

Updated 21 January 2026

SCE-SLAM is a monocular visual SLAM system that achieves global scale consistency using learned patch-level scene coordinate embeddings and geometry-guided propagation.
It employs a dual-branch architecture combining local optical flow and global 3D coordinate constraints to deliver real-time performance at approximately 36 FPS.
Empirical evaluations on benchmarks such as KITTI and Waymo highlight its superior drift control and accuracy without relying on external depth priors.

SCE-SLAM (Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings) is a monocular visual SLAM system designed to address scale drift in long-horizon frame-to-frame camera tracking. Incorporating learned patch-level scene coordinate embeddings, SCE-SLAM achieves global scale consistency by coupling geometry-guided representation propagation with explicit 3D coordinate constraints across sliding windows. The approach operates in real time, maintaining high accuracy on challenging benchmarks (e.g., KITTI, Waymo, vKITTI), without reliance on external depth priors or global all-to-all attention, and is architected to be modular for future sensor fusion (Wu et al., 14 Jan 2026).

1. System Architecture and Pipeline

SCE-SLAM builds upon DPVO's dual-branch SLAM formulation, introducing an explicit mechanism for scale consistency:

Input/output: The system ingests a monocular video stream $(I_1, I_2, \dots)$ and outputs camera poses %%%%1%%%% along with a sparse 3D map $\{X_i \in \mathbb{R}^3\}$ , all expressed in a unified metric scale.
Dual branches:
- Flow branch: Enforces local, scale-agnostic optical flow constraints.
- Scene-coordinate branch: Maintains global scale via learned patch embeddings $h_k^{\rm xyz}$ .
Processing pipeline:
- Feature extraction using a lightweight CNN and frozen DINOv3 backbone, followed by keypoint-based patch sampling (with SuperPoint).
- Recurrent optimization over sliding temporal windows:
- Geometry-Guided Scale Propagation updates embeddings $h^{\rm xyz}$ via 3D spatial attention.
- Scene Coordinate Bundle Adjustment (SCBA) jointly refines camera poses and patch depths, subject to photometric and 3D coordinate penalties.
- Scale consistency is enforced by propagating a canonical scale reference through the embedding updates and by anchoring poses and depths with embedding-decoded 3D coordinate constraints.

This architecture supports efficient parallelization, and achieves a per-frame runtime of approximately 28 ms on an NVIDIA A100 GPU, corresponding to 36 FPS. The computational breakdown is: 50% backbone, 25% flow, 12% scene branch, and 3% bundle adjustment (Wu et al., 14 Jan 2026).

2. Scene Coordinate Embedding Mechanism

Central to SCE-SLAM is the patch-wise scene coordinate embedding:

Embedding definition: For each patch $p_k$ , the system learns

$h^{\rm xyz}_k \in \mathbb{R}^{384}$

encoding its 3D context under a learned canonical scale.

Patch features: Inputs are $3 \times 3$ DINOv3 + CNN feature patches $\{f^{\rm match}_k, f^{\rm ctx}_k\}$ extracted at SuperPoint keypoints, enabling robust multi-view representation.
Canonical scale reference accumulation: Initialized in the first BA pass, embeddings are updated to carry absolute scale information of the local neighborhood.
Decoding scene coordinates:

$(\Delta X_k, w_k) = \mathrm{SCHead}(h_k^{\rm xyz}), \quad X_k^{\rm prior} = X_k + \Delta X_k$

where $X_k = T_t \pi^{-1}(u_k, d_k)$ .

Supervision: Decoded 3D priors $X_k^{\rm prior}$ are supervised with an MSE loss:

$\mathcal{L}_{\rm SC} = \sum_k \| X_k^{\rm prior} - X_k^{\rm GT} \|^2$

Training objective: A composite of flow, pose, and embedding losses,

$\mathcal{L}_{\rm total} = \lambda_1 \mathcal{L}_{\rm flow} + \lambda_2 \mathcal{L}_{\rm pose} + \mathcal{L}_{\rm SC}$

with $\lambda_1=0.1$ , $\lambda_2=10$ (Wu et al., 14 Jan 2026).

3. Geometry-Guided Aggregation of Embeddings

To mitigate scale drift accumulation, SCE-SLAM leverages geometry-guided attention to propagate scale-consistent memory across observations:

Reference set selection: Among patches observed in the closest 30 frames, the 50% with lowest residuals are selected, yielding $|\mathcal{R}| \approx 1200$ patches for aggregation.
Attention mechanism: For active patch $a$ and reference patch $r$ :

$e_{ar} = \frac{Q_a^\top K_r}{\sqrt{d}} - \lambda_{\rm geo} \| X_a - X_r \|^2$

with $Q_a$ , $K_r$ as linear projections of $f^{\rm match}$ , and $\lambda_{\rm geo}$ the geometric penalty coefficient.

Value fusion: Each value vector is augmented geometrically:

$V_r^{\rm geo} = V_r + \mathrm{MLP}_{\rm pos}(X_a - X_r)$

Aggregation and update:

$f_a^{\rm sc} = \mathrm{MLP}_{\rm sc} \left( \sum_r \alpha_{ar} V_r^{\rm geo} \right)$

The embedding is updated via a GRU:

$\tilde{h}_a^{\rm xyz} = h_a^{\rm xyz} + f_a^{\rm sc} + f^{\rm ctx}_a, \quad h_a^{\rm xyz} \leftarrow \mathrm{GRU}(\tilde{h}_a^{\rm xyz})$

Frame-level coupling: Embeddings for patches from the same frame are enforced to be mutually consistent.

This module enables efficient propagation of scale information and enhances robustness in challenging tracking regimes (Wu et al., 14 Jan 2026).

4. Scene Coordinate Bundle Adjustment (SCBA)

The SCBA stage integrates photometric and 3D geometric constraints to anchor trajectories and depths to the canonical scale:

Optimized variables: Camera poses $T_t \in \mathrm{SE}(3)$ , patch depths $d_k$ , and corresponding 3D points $X_k$ .
Residual formulations:
- Optical flow-based reprojection (scale-agnostic):
$r_{ij}^{\rm flow} = w_{ij}^{\rm flow} \left( u_j^{\rm prior} - \pi(T_j T_i^{-1} \pi^{-1}(u_i, d_i)) \right)$ - Scene coordinate penalty (explicitly penalizes metric drift):

$r_k^{\rm xyz} = w_k^{\rm xyz} (X_k^{\rm prior} - X_k)$
Joint optimization objective:

$\min_{\{T_t\}, \{d_k\}} \sum_{(i,j)} \rho(\|r_{ij}^{\rm flow}\|^2) + \sum_k \rho(\|r_k^{\rm xyz}\|^2)$

employing a robust loss $\rho$ .

Optimization schedule:

Flow-only bundle adjustment to initialize canonical scale.
Alternating (one SC-BA pass and two flow-only passes per iteration) for joint refinement.

Numerical solver: Sparse Gauss–Newton, leveraging Schur complement, with Levenberg–Marquardt damping for increased stability.

This alternation ensures efficient convergence while preventing the embedding branch from destabilizing the minimal initialization (Wu et al., 14 Jan 2026).

5. Training, Implementation, and Datasets

SCE-SLAM employs a modular lightweight backbone and is trained on synthetic and real-world datasets:

Datasets: Pre-trained on synthetic TartanAir (240k iterations), fine-tuned and evaluated on KITTI, Waymo, and vKITTI.
Optimization: AdamW optimizer, learning rate $8\times10^{-5}$ , weight decay $1\times10^{-6}$ , batch size 1 (sequence length 15).
Network specifics:
- Backbone: frozen DINOv3 + CNN; 1×1 convolution for fusion.
- Flow branch: DPVO-derived, 384-D GRU state.
- Scene-branch: 384-D GRU for $h^{\rm xyz}$ , SCHead MLP for $\Delta X_k$ and $w_k$ .
- Reference patch graph is updated every window (typically 1200 patches processed).
Computational profile: End-to-end runtime $\approx 28$ ms/frame. Processing-time breakdown:

| Module | % Runtime | Description | |-------------|-----------|--------------------------------| | Backbone | 50% | Feature extraction (DINOv3+CNN)| | Flow Branch | 25% | Optical flow tracking | | Scene Coord | 12% | $h^{\rm xyz}$ embedding updates| | Bundle Adj. | 3% | SCBA joint optimization |

This composition enables real-time operation on high-throughput hardware (Wu et al., 14 Jan 2026).

6. Empirical Evaluation and Performance

SCE-SLAM demonstrates scale-consistent accuracy across realistic evaluation settings:

KITTI Odometry (no loop closure, 11 sequences):
- SCE-SLAM: ATE RMSE $25.79$ m (std $20.7$ m)
- Next best (DPV-SLAM++ w/o loop closure): $25.75$ m; SCE-SLAM is real-time with a better drift profile.
- With loop closure: SCE-SLAM $14.07$ m vs $22.91$ m (DPV-SLAM++).
Waymo (9 sequences):
- SCE-SLAM: mean error $0.915$ m vs $1.996$ m (VGGT-Long).
vKITTI (6 conditions):
- SCE-SLAM: mean error $0.28$ m vs $0.343$ m (DPV-SLAM++).
Qualitative findings:
- Scale-aligned trajectories on KITTI exhibit minimal drift (confirmed by uniform color segments in visualizations).
- Successful global loop closures on 4Seasons dataset; DPV-SLAM++ fails due to scale fragmentation.
- SCE-SLAM maintains real-time performance while matching or surpassing state-of-the-art frame-to-frame SLAM systems (Wu et al., 14 Jan 2026).

7. Limitations and Prospects

SCE-SLAM achieves real-time scale-consistent tracking without heavy external priors or computationally expensive global attention, but presents notable operational limits and avenues for future development:

Limitations:
- Reduced reliability in textureless or feature-poor environments impacts aggregation and thus overall scale robustness.
- Dependence on high-performance GPUs for DINOv3 feature extraction restricts platform deployment.
Advantages:
- Modular SLAM design facilitates future integration of inertial or stereo information in the flow branch.
- Avoids reliance on depth priors and excessive global context; thus, tractable for embedded and resource-constrained systems.
Extensions (as noted by authors):
- Multi-camera or visual–inertial fusion to strengthen global scale cues.
- Online embedding adaptation to novel scenes for improved generalization.
- Hierarchical aggregation for global memory propagation beyond sliding window length.

A plausible implication is that SCE-SLAM's approach to scene coordinate embedding and geometry-aware memory propagation could inform broader monocular SLAM methods where metric consistency in unconstrained long-term deployments is essential (Wu et al., 14 Jan 2026).

Markdown Upgrade to Chat

References (1)

SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SCE-SLAM.