SCE-SLAM: Scale-Consistent Monocular SLAM

Updated 21 January 2026
  • SCE-SLAM is a monocular visual SLAM system that achieves global scale consistency using learned patch-level scene coordinate embeddings and geometry-guided propagation.
  • It employs a dual-branch architecture combining local optical flow and global 3D coordinate constraints to deliver real-time performance at approximately 36 FPS.
  • Empirical evaluations on benchmarks such as KITTI and Waymo highlight its superior drift control and accuracy without relying on external depth priors.

SCE-SLAM (Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings) is a monocular visual SLAM system designed to address scale drift in long-horizon frame-to-frame camera tracking. Incorporating learned patch-level scene coordinate embeddings, SCE-SLAM achieves global scale consistency by coupling geometry-guided representation propagation with explicit 3D coordinate constraints across sliding windows. The approach operates in real time, maintaining high accuracy on challenging benchmarks (e.g., KITTI, Waymo, vKITTI), without reliance on external depth priors or global all-to-all attention, and is architected to be modular for future sensor fusion (Wu et al., 14 Jan 2026).

1. System Architecture and Pipeline

SCE-SLAM builds upon DPVO's dual-branch SLAM formulation, introducing an explicit mechanism for scale consistency:

  • Input/output: The system ingests a monocular video stream $(I_1, I_2, \dots)$ and outputs camera poses $\{T_t \in \mathrm{SE}(3)\}$ along with a sparse 3D map $\{X_i \in \mathbb{R}^3\}$, all expressed in a unified metric scale.
  • Dual branches:
    • Flow branch: Enforces local, scale-agnostic optical flow constraints.
    • Scene-coordinate branch: Maintains global scale via learned patch embeddings $h_k^{\rm xyz}$.
  • Processing pipeline:
    • Feature extraction using a lightweight CNN and a frozen DINOv3 backbone, followed by keypoint-based patch sampling (with SuperPoint).
    • Recurrent optimization over sliding temporal windows:
      • Geometry-Guided Scale Propagation updates embeddings $h^{\rm xyz}$ via 3D spatial attention.
      • Scene Coordinate Bundle Adjustment (SCBA) jointly refines camera poses and patch depths, subject to photometric and 3D coordinate penalties.
    • Scale consistency is enforced by propagating a canonical scale reference through the embedding updates and by anchoring poses and depths with embedding-decoded 3D coordinate constraints.

This architecture supports efficient parallelization and achieves a per-frame runtime of approximately 28 ms on an NVIDIA A100 GPU, corresponding to about 36 FPS. The reported computational breakdown is 50% backbone, 25% flow branch, 12% scene-coordinate branch, and 3% bundle adjustment; these shares sum to 90%, and the remainder is not itemized (Wu et al., 14 Jan 2026).

2. Scene Coordinate Embedding Mechanism

Central to SCE-SLAM is the patch-wise scene coordinate embedding:

  • Embedding definition: For each patch $p_k$, the system learns

$$h^{\rm xyz}_k \in \mathbb{R}^{384}$$

encoding its 3D context under a learned canonical scale.

  • Patch features: Inputs are $3 \times 3$ DINOv3 + CNN feature patches $\{f^{\rm match}_k, f^{\rm ctx}_k\}$ extracted at SuperPoint keypoints, enabling robust multi-view representation.
  • Canonical scale reference accumulation: The canonical scale is initialized in the first BA pass, after which the embeddings are updated to carry the absolute scale information of their local neighborhood.
  • Decoding scene coordinates:

$$(\Delta X_k, w_k) = \mathrm{SCHead}(h_k^{\rm xyz}), \quad X_k^{\rm prior} = X_k + \Delta X_k$$

where $X_k = T_t \pi^{-1}(u_k, d_k)$ is the current estimate of the patch's 3D point, obtained by back-projecting pixel $u_k$ at depth $d_k$ and applying the pose $T_t$.

  • Supervision: Decoded 3D priors $X_k^{\rm prior}$ are supervised with an MSE loss:

$$\mathcal{L}_{\rm SC} = \sum_k \| X_k^{\rm prior} - X_k^{\rm GT} \|^2$$

  • Training objective: A composite of flow, pose, and embedding losses,

$$\mathcal{L}_{\rm total} = \lambda_1 \mathcal{L}_{\rm flow} + \lambda_2 \mathcal{L}_{\rm pose} + \mathcal{L}_{\rm SC}$$

with $\lambda_1 = 0.1$, $\lambda_2 = 10$ (Wu et al., 14 Jan 2026).
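The decoding and loss terms above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, treats `T` as a 4×4 camera-to-world matrix, treats `d` as depth (not inverse depth), and takes the SCHead offset `delta_X` as a given input. All function names are hypothetical.

```python
import numpy as np

def backproject(u, d, K):
    """pi^{-1}: pixel u = (x, y) at depth d -> 3D point in the camera frame (pinhole model, assumed)."""
    x, y = u
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(x - cx) / fx * d, (y - cy) / fy * d, d])

def patch_world_point(T, u, d, K):
    """X_k = T_t pi^{-1}(u_k, d_k), with T a 4x4 camera-to-world matrix (assumed convention)."""
    p_cam = np.append(backproject(u, d, K), 1.0)  # homogeneous camera-frame point
    return (T @ p_cam)[:3]

def scene_coordinate_loss(X, delta_X, X_gt):
    """L_SC = sum_k ||X_k^prior - X_k^GT||^2 with X_k^prior = X_k + dX_k."""
    X_prior = X + delta_X
    return float(np.sum((X_prior - X_gt) ** 2))

def total_loss(l_flow, l_pose, l_sc, lam1=0.1, lam2=10.0):
    """L_total = lam1 * L_flow + lam2 * L_pose + L_SC, with the weights quoted in the text."""
    return lam1 * l_flow + lam2 * l_pose + l_sc

# Toy usage: one patch at the principal point, camera shifted 1 m along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[:3, 3] = [1.0, 0.0, 0.0]
X = np.stack([patch_world_point(T, (320.0, 240.0), 2.0, K)])
l_sc = scene_coordinate_loss(X, np.array([[0.1, 0.0, 0.0]]), np.array([[1.0, 0.0, 2.0]]))
print(X, l_sc, total_loss(1.0, 0.5, l_sc))
```

Because the decoded offset is added to the current estimate, a well-trained head only has to predict a correction to $X_k$ and not an absolute position.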

3. Geometry-Guided Aggregation of Embeddings

To mitigate scale drift accumulation, SCE-SLAM leverages geometry-guided attention to propagate scale-consistent memory across observations:

  • Reference set selection: Among patches observed in the closest 30 frames, the 50% with lowest residuals are selected, yielding $|\mathcal{R}| \approx 1200$ patches for aggregation.
  • Attention mechanism: For active patch $a$ and reference patch $r$:

$$e_{ar} = \frac{Q_a^\top K_r}{\sqrt{d}} - \lambda_{\rm geo} \| X_a - X_r \|^2$$

with $Q_a$, $K_r$ as linear projections of $f^{\rm match}$, $d$ the projection dimension, and $\lambda_{\rm geo}$ the geometric penalty coefficient.

  • Value fusion: Each value vector is augmented geometrically:

$$V_r^{\rm geo} = V_r + \mathrm{MLP}_{\rm pos}(X_a - X_r)$$

  • Aggregation and update:

$$f_a^{\rm sc} = \mathrm{MLP}_{\rm sc} \left( \sum_r \alpha_{ar} V_r^{\rm geo} \right)$$

where $\alpha_{ar}$ are the attention weights obtained by a softmax of $e_{ar}$ over the reference patches $r$. The embedding is updated via a GRU:

$$\tilde{h}_a^{\rm xyz} = h_a^{\rm xyz} + f_a^{\rm sc} + f^{\rm ctx}_a, \quad h_a^{\rm xyz} \leftarrow \mathrm{GRU}(\tilde{h}_a^{\rm xyz})$$

  • Frame-level coupling: Embeddings for patches from the same frame are enforced to be mutually consistent.

This module enables efficient propagation of scale information and enhances robustness in challenging tracking regimes (Wu et al., 14 Jan 2026).
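The reference selection and geometry-penalized attention described in this section can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: "closest 30 frames" is interpreted as the 30 frame indices nearest to the current one, $\mathrm{MLP}_{\rm pos}$ is replaced by a caller-supplied function (a random linear map in the usage lines), and $\mathrm{MLP}_{\rm sc}$ and the GRU update are omitted. The function names and the value of `lam_geo` are hypothetical.

```python
import numpy as np

def select_reference(residuals, frame_ids, current_frame, n_frames=30, keep=0.5):
    """Indices of the lowest-residual fraction of patches from the n_frames nearest frames."""
    frames = np.unique(frame_ids)
    nearest = frames[np.argsort(np.abs(frames - current_frame), kind="stable")[:n_frames]]
    near = np.flatnonzero(np.isin(frame_ids, nearest))
    order = near[np.argsort(residuals[near], kind="stable")]
    return order[: max(1, int(len(order) * keep))]

def geo_attention(Q, K, V, X_a, X_r, pos_mlp, lam_geo=1.0):
    """Softmax attention with logits e_ar = Q_a.K_r / sqrt(d) - lam_geo * ||X_a - X_r||^2."""
    d = Q.shape[-1]
    diff = X_a[:, None, :] - X_r[None, :, :]                      # (A, R, 3) offsets X_a - X_r
    logits = Q @ K.T / np.sqrt(d) - lam_geo * np.sum(diff ** 2, axis=-1)
    logits = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)              # (A, R), rows sum to 1
    V_geo = V[None, :, :] + pos_mlp(diff)                         # (A, R, D) geometry-augmented values
    return alpha, np.einsum("ar,ard->ad", alpha, V_geo)           # weights and aggregated features

# Toy usage with random features; W_pos stands in for MLP_pos.
rng = np.random.default_rng(0)
A, R, D = 2, 5, 8
W_pos = rng.normal(size=(3, D))
alpha, agg = geo_attention(
    rng.normal(size=(A, D)), rng.normal(size=(R, D)), rng.normal(size=(R, D)),
    rng.normal(size=(A, 3)), rng.normal(size=(R, 3)),
    pos_mlp=lambda diff: diff @ W_pos, lam_geo=0.5,
)
print(alpha.shape, agg.shape)
```

The squared-distance penalty means that, for equal feature similarity, spatially closer reference patches receive larger weights, so scale information flows mainly between geometrically nearby patches.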

4. Scene Coordinate Bundle Adjustment (SCBA)

The SCBA stage integrates photometric and 3D geometric constraints to anchor trajectories and depths to the canonical scale:

  • Optimized variables: Camera poses $T_t \in \mathrm{SE}(3)$, patch depths $d_k$, and corresponding 3D points $X_k$.
  • Residual formulations:
    • Optical flow-based reprojection (scale-agnostic):

$$r_{ij}^{\rm flow} = w_{ij}^{\rm flow} \left( u_j^{\rm prior} - \pi(T_j T_i^{-1} \pi^{-1}(u_i, d_i)) \right)$$

    • Scene coordinate penalty (explicitly penalizes metric drift):

$$r_k^{\rm xyz} = w_k^{\rm xyz} (X_k^{\rm prior} - X_k)$$

  • Joint optimization objective:

$$\min_{\{T_t\}, \{d_k\}} \sum_{(i,j)} \rho(\|r_{ij}^{\rm flow}\|^2) + \sum_k \rho(\|r_k^{\rm xyz}\|^2)$$

employing a robust loss $\rho$.

  • Optimization schedule:
    1. Flow-only bundle adjustment to initialize the canonical scale.
    2. Alternating passes for joint refinement: one SCBA pass and two flow-only passes per iteration.
  • Numerical solver: Sparse Gauss–Newton with Schur complement elimination and Levenberg–Marquardt damping for increased stability.
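The joint objective and pass schedule listed above can be illustrated with a short sketch. It makes two assumptions the source does not confirm: the robust loss $\rho$ is taken to be a Huber loss with threshold `delta`, and within each iteration the SCBA pass is placed before the two flow-only passes. The residual arrays are assumed to be already weighted, and all names are hypothetical.

```python
import numpy as np

def huber(s, delta=1.0):
    """Huber loss on a squared residual norm s = ||r||^2 (assumed choice of rho)."""
    r = np.sqrt(s)
    return np.where(r <= delta, s, 2.0 * delta * r - delta ** 2)

def scba_cost(r_flow, r_xyz, delta=1.0):
    """sum_(i,j) rho(||r_ij^flow||^2) + sum_k rho(||r_k^xyz||^2) for pre-weighted residuals."""
    s_flow = np.sum(r_flow ** 2, axis=-1)   # one squared norm per 2D flow residual
    s_xyz = np.sum(r_xyz ** 2, axis=-1)     # one squared norm per 3D coordinate residual
    return float(huber(s_flow, delta).sum() + huber(s_xyz, delta).sum())

def pass_schedule(n_iters):
    """Flow-only initialization, then one SCBA pass and two flow-only passes per iteration."""
    passes = ["flow"]
    for _ in range(n_iters):
        passes += ["scba", "flow", "flow"]  # ordering inside an iteration is an assumption
    return passes

# Toy usage: one inlier flow residual and one outlier scene-coordinate residual.
cost = scba_cost(np.array([[0.3, 0.4]]), np.array([[0.0, 0.0, 2.0]]))
print(cost, pass_schedule(2))
```

With a robust loss of this kind, a large scene-coordinate residual grows only linearly in the cost, which limits how strongly a single bad decoded prior can pull the poses and depths.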

This alternation ensures efficient convergence while preventing the embedding branch from destabilizing the flow-only initialization (Wu et al., 14 Jan 2026).
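For reference, a damped Gauss–Newton step with Schur complement elimination of the depth variables typically takes the form below. This is the standard textbook structure, not a transcription of the authors' solver: the block names $B$, $E$, $C$, the right-hand sides $v$, $w$, the Jacobian $J$, the weight matrix $W$, and the damping $\lambda$ are notation introduced here for illustration.

```latex
\begin{aligned}
\left(J^\top W J + \lambda I\right)
\begin{bmatrix} \Delta\xi \\ \Delta d \end{bmatrix}
&=
\begin{bmatrix} B & E \\ E^\top & C \end{bmatrix}
\begin{bmatrix} \Delta\xi \\ \Delta d \end{bmatrix}
=
\begin{bmatrix} v \\ w \end{bmatrix}, \\
\left(B - E\, C^{-1} E^\top\right) \Delta\xi &= v - E\, C^{-1} w, \\
\Delta d &= C^{-1}\left(w - E^\top \Delta\xi\right).
\end{aligned}
```

Here $\Delta\xi$ collects the pose updates and $\Delta d$ the depth updates. Since each residual involves a single patch depth, $C$ is diagonal and cheap to invert, so the reduced system only has the size of the pose block.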

5. Training, Implementation, and Datasets

SCE-SLAM employs a modular lightweight backbone and is trained on synthetic and real-world datasets:

  • Datasets: Pre-trained on synthetic TartanAir (240k iterations), fine-tuned and evaluated on KITTI, Waymo, and vKITTI.

  • Optimization: AdamW optimizer, learning rate $8 \times 10^{-5}$, weight decay $1 \times 10^{-6}$, batch size 1 (sequence length 15).

  • Network specifics:

    • Backbone: frozen DINOv3 + CNN; 1×1 convolution for fusion.
    • Flow branch: DPVO-derived, 384-D GRU state.
    • Scene branch: 384-D GRU for $h^{\rm xyz}$, SCHead MLP for $\Delta X_k$ and $w_k$.
    • Reference patch graph is updated every window (typically 1200 patches processed).
  • Computational profile: End-to-end runtime $\approx 28$ ms/frame. Processing-time breakdown:

| Module      | % Runtime | Description                       |
|-------------|-----------|-----------------------------------|
| Backbone    | 50%       | Feature extraction (DINOv3 + CNN) |
| Flow Branch | 25%       | Optical flow tracking             |
| Scene Coord | 12%       | $h^{\rm xyz}$ embedding updates   |
| Bundle Adj. | 3%        | SCBA joint optimization           |

This composition enables real-time operation on high-throughput hardware (Wu et al., 14 Jan 2026).
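The arithmetic behind these figures is easy to check. The sketch below converts the quoted 28 ms per frame into a frame rate and into per-module times using the table's percentages; the variable and function names are hypothetical, and the per-module milliseconds are derived values, not measurements from the paper.

```python
FRAME_MS = 28.0  # end-to-end runtime per frame quoted in the text
SHARES = {       # runtime shares from the table above
    "Backbone": 0.50,
    "Flow Branch": 0.25,
    "Scene Coord": 0.12,
    "Bundle Adj.": 0.03,
}

def fps(frame_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / frame_ms

def module_ms(frame_ms, shares):
    """Per-module time in milliseconds implied by the runtime shares."""
    return {name: frame_ms * share for name, share in shares.items()}

def unaccounted(shares):
    """Fraction of the frame time not covered by the listed modules."""
    return 1.0 - sum(shares.values())

print(f"{fps(FRAME_MS):.1f} FPS")
for name, ms in module_ms(FRAME_MS, SHARES).items():
    print(f"{name}: {ms:.2f} ms")
print(f"unlisted share: {unaccounted(SHARES):.0%}")
```

The listed shares cover 90% of the frame time, so roughly 2.8 ms per frame is not attributed to any module in the table.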

6. Empirical Evaluation and Performance

SCE-SLAM demonstrates scale-consistent accuracy across realistic evaluation settings:

  • KITTI Odometry (no loop closure, 11 sequences):
    • SCE-SLAM: ATE RMSE $25.79$ m (std $20.7$ m)
    • Closest competitor (DPV-SLAM++ without loop closure): $25.75$ m, a comparable ATE; SCE-SLAM runs in real time and is reported to have a better drift profile.
    • With loop closure: SCE-SLAM $14.07$ m vs $22.91$ m (DPV-SLAM++).
  • Waymo (9 sequences):
    • SCE-SLAM: mean error $0.915$ m vs $1.996$ m (VGGT-Long).
  • vKITTI (6 conditions):
    • SCE-SLAM: mean error $0.28$ m vs $0.343$ m (DPV-SLAM++).
  • Qualitative findings:
    • Scale-aligned trajectories on KITTI exhibit minimal drift (confirmed by uniform color segments in visualizations).
    • Successful global loop closures on 4Seasons dataset; DPV-SLAM++ fails due to scale fragmentation.
    • SCE-SLAM maintains real-time performance while matching or surpassing state-of-the-art frame-to-frame SLAM systems (Wu et al., 14 Jan 2026).
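To put the reported numbers in relative terms, the short sketch below computes the fractional error reduction for the three comparisons where both values are quoted above. The helper name and dictionary keys are hypothetical; the input values are taken directly from the list.

```python
# (SCE-SLAM error, baseline error) in metres, as quoted in the list above.
RESULTS = {
    "KITTI with loop closure, ATE RMSE vs DPV-SLAM++": (14.07, 22.91),
    "Waymo, mean error vs VGGT-Long": (0.915, 1.996),
    "vKITTI, mean error vs DPV-SLAM++": (0.28, 0.343),
}

def relative_reduction(ours, baseline):
    """Fractional error reduction relative to the baseline: (baseline - ours) / baseline."""
    return (baseline - ours) / baseline

for name, (ours, baseline) in RESULTS.items():
    print(f"{name}: {relative_reduction(ours, baseline):.1%} lower error")
```

These ratios are derived from the summary figures only and say nothing about per-sequence variance, which the KITTI standard deviation quoted above suggests is substantial.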

7. Limitations and Prospects

SCE-SLAM achieves real-time scale-consistent tracking without heavy external priors or computationally expensive global attention, but presents notable operational limits and avenues for future development:

  • Limitations:
    • Reduced reliability in textureless or feature-poor environments impacts aggregation and thus overall scale robustness.
    • Dependence on high-performance GPUs for DINOv3 feature extraction restricts platform deployment.
  • Advantages:
    • Modular SLAM design facilitates future integration of inertial or stereo information in the flow branch.
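    • Avoids reliance on depth priors and global all-to-all attention, which keeps the scale-consistency mechanism itself lightweight, although the DINOv3 backbone remains the main obstacle to embedded and resource-constrained deployment.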
  • Extensions (as noted by authors):
    • Multi-camera or visual–inertial fusion to strengthen global scale cues.
    • Online embedding adaptation to novel scenes for improved generalization.
    • Hierarchical aggregation for global memory propagation beyond sliding window length.

A plausible implication is that SCE-SLAM's approach to scene coordinate embedding and geometry-aware memory propagation could inform broader monocular SLAM methods where metric consistency in unconstrained long-term deployments is essential (Wu et al., 14 Jan 2026).
