DropD-SLAM: Real-Time Monocular SLAM System

Updated 8 October 2025
  • The paper introduces DropD-SLAM, a system that substitutes hardware depth sensors with pretrained vision modules to achieve RGB-D-level accuracy.
  • It employs a modular pipeline integrating depth estimation, keypoint detection, and instance segmentation to filter dynamic regions and enable robust 3D mapping.
  • Performance evaluations on the TUM RGB-D benchmark demonstrate competitive trajectory accuracy and real-time operation at 22 FPS, highlighting its efficiency and scalability.

DropD-SLAM is a monocular real-time SLAM system that achieves RGB-D-level accuracy without active depth sensing by leveraging modern, pretrained vision modules. The approach substitutes hardware depth sensors with software-based metric depth prediction, keypoint detection, and instance segmentation, yielding robust tracking and mapping capabilities in dynamic environments. DropD-SLAM matches or surpasses state-of-the-art RGB-D systems on public benchmarks while operating efficiently on commodity hardware, demonstrating the effectiveness of deep vision models as reliable, real-time sources of metric scale.

1. System Architecture

DropD-SLAM employs a modular pipeline, with the front end composed of three pretrained vision modules:

  • Monocular Metric Depth Estimator: Provides dense, metrically scaled depth maps (e.g., DepthAnythingV2 or UniDepthV2).
  • Learned Keypoint Detector: Extracts 2D keypoints with binary descriptors (e.g., Key.Net), optimized for repeatability under visual distortion and low texture.
  • Instance Segmentation Network: Generates binary masks for object instances (e.g., YOLOv11) to identify dynamic regions.

Each incoming RGB frame undergoes parallel processing:

  1. The depth estimator produces a dense depth map in real-world units.
  2. The keypoint detector locates salient image points and computes descriptors.
  3. The segmentation network identifies potential dynamic objects, whose masks are then morphologically dilated to encompass boundary uncertainty.

Dynamic regions are suppressed by excluding keypoints within the dilated masks. The remaining static keypoints are associated with their corresponding depth estimates, then backprojected into 3D using the camera intrinsic matrix K:

P_i = d_i \cdot K^{-1} \cdot [u_i^{\mathsf{T}},\, 1]^{\mathsf{T}}

where P_i is the 3D position, d_i is the predicted depth, and u_i is the keypoint's pixel coordinate. The set of 3D keypoints and descriptors is forwarded to an unmodified RGB-D SLAM backend (e.g., ORB-SLAM3), which handles geometric optimization, tracking, mapping, and loop closure without any architectural changes.
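
A minimal sketch of this front-end step, assuming the depth map, keypoints, descriptors, and instance masks have already been produced by the three modules (array shapes, the dilation kernel size, and the helper name are illustrative assumptions, not details from the paper):

```python
import numpy as np
import cv2

def backproject_static_keypoints(depth, keypoints, descriptors, instance_masks, K,
                                 dilate_px=15):
    """Suppress keypoints on (dilated) dynamic-object masks and lift the rest to 3D.

    depth          : (H, W) metric depth map from the monocular depth estimator
    keypoints      : (N, 2) pixel coordinates (u, v) from the keypoint detector
    descriptors    : (N, D) descriptors associated with the keypoints
    instance_masks : (M, H, W) binary masks of potentially dynamic objects
    K              : (3, 3) camera intrinsic matrix
    """
    # Union of all instance masks, dilated to absorb boundary uncertainty.
    dynamic = np.zeros(depth.shape, dtype=np.uint8)
    for m in instance_masks:
        dynamic |= m.astype(np.uint8)
    dynamic = cv2.dilate(dynamic, np.ones((dilate_px, dilate_px), np.uint8))

    # Keep only keypoints that fall outside the dilated dynamic regions.
    u = keypoints[:, 0].round().astype(int)
    v = keypoints[:, 1].round().astype(int)
    static = dynamic[v, u] == 0
    kp, desc = keypoints[static], descriptors[static]

    # Backproject each static keypoint: P_i = d_i * K^{-1} [u_i, v_i, 1]^T
    d = depth[v[static], u[static]]
    homog = np.column_stack([kp, np.ones(len(kp))])          # (N_static, 3)
    points_3d = d[:, None] * (np.linalg.inv(K) @ homog.T).T  # (N_static, 3)
    return points_3d, kp, desc
```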

2. Key Innovations

DropD-SLAM introduces several novel mechanisms for monocular SLAM:

  • Replacement of active depth sensing with pretrained neural depth estimators delivering metrically scaled depth for every pixel, thus providing absolute scale for 3D reconstruction.
  • Dynamic Object Suppression via instance segmentation and morphological dilation, ensuring that keypoints originating from moving objects are removed prior to SLAM optimization. This minimizes pose drift and mis-triangulation in dynamic scenes.
  • Uniform Keypoint Distribution exploiting modern detectors (e.g., Key.Net) that maintain feature repeatability under motion blur and variable texture, thereby increasing system stability and robustness.
  • Direct compatibility with legacy RGB-D back ends by providing metrically accurate, static keypoints in 3D, allowing seamless drop-in replacement of hardware depth input.

This architectural choice demonstrates that pretrained vision models can be harnessed to recover metric scale at each frame, and that a modular, plug-and-play integration is viable for high-precision SLAM.

3. Performance Analysis

Evaluation on the TUM RGB-D benchmark reveals:

  • Static sequences: Mean Absolute Trajectory Error (ATE) of 7.4 cm,
  • Dynamic sequences: Mean ATE of 1.8 cm,

delivering competitive or superior accuracy compared to state-of-the-art RGB-D methods. In dynamic scenes, where accurate tracking is challenging due to object motion, DropD-SLAM’s semantic suppression and depth association techniques provide resilience, resulting in notably lower trajectory error. The system operates at 22 FPS on a single GPU, confirming suitability for real-time deployment.
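
For context, ATE is the root-mean-square position error after aligning the estimated trajectory to ground truth. The sketch below uses a rigid (rotation + translation) Kabsch/Umeyama-style alignment, which suffices when, as here, the estimate is already metrically scaled; the function name and the choice of rigid rather than Sim(3) alignment are assumptions for illustration:

```python
import numpy as np

def absolute_trajectory_error(est_xyz, gt_xyz):
    """RMSE of position residuals after rigid (rotation + translation) alignment.

    est_xyz, gt_xyz : (N, 3) time-synchronized camera positions.
    Because the trajectory is metrically scaled, no scale factor is estimated.
    """
    mu_e, mu_g = est_xyz.mean(axis=0), gt_xyz.mean(axis=0)
    E, G = est_xyz - mu_e, gt_xyz - mu_g
    # Kabsch/Umeyama rotation aligning the estimate to the ground truth.
    U, _, Vt = np.linalg.svd(E.T @ G)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = Vt.T @ D @ U.T
    t = mu_g - R @ mu_e
    residuals = gt_xyz - (est_xyz @ R.T + t)
    return np.sqrt(np.mean(np.sum(residuals ** 2, axis=1)))
```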

4. Technical Workflow

Keypoint Filtering and Backprojection

After mask dilation, static keypoints are selected and associated with their depth estimates; each is then transformed into a metric 3D coordinate via

P_i = d_i \cdot K^{-1} \cdot [u_i^{\mathsf{T}},\, 1]^{\mathsf{T}}

where range clipping may be applied to d_i for reliability. This process ensures that only reliable, static features are used for SLAM, eliminating contributions from potentially dynamic regions.
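
The exact gating policy is not detailed here; one plausible reading, sketched below with placeholder near/far thresholds, is to discard keypoints whose predicted depth falls outside a trusted interval before backprojection:

```python
import numpy as np

def gate_by_depth_range(keypoints, descriptors, depths, d_min=0.3, d_max=8.0):
    """Drop keypoints whose predicted metric depth lies outside [d_min, d_max].

    Monocular depth predictions are typically least reliable very close to the
    camera and at long range, so gating d_i stabilizes the backprojected points.
    d_min/d_max here are placeholders, not thresholds from the paper.
    """
    valid = (depths >= d_min) & (depths <= d_max)
    return keypoints[valid], descriptors[valid], depths[valid]
```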

SLAM Backend Integration

No modification is required for the SLAM backend, which expects 3D feature points and associated descriptors. Camera tracking, mapping, and loop closure proceed as in a conventional RGB-D system.
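
Conceptually, integration amounts to handing the backend the same per-frame data a depth-camera pipeline would produce. The sketch below uses a hypothetical `RGBDBackend` interface purely to illustrate the hand-off (ORB-SLAM3 itself is a C++ system with a different API):

```python
class RGBDBackend:
    """Hypothetical stand-in for an unmodified RGB-D backend (e.g., ORB-SLAM3).

    The real backend is a C++ system with its own API; this placeholder only
    names the per-frame data it consumes in this setup.
    """
    def track(self, keypoints_2d, descriptors, points_3d, timestamp):
        raise NotImplementedError  # tracking, mapping, and loop closure live here

def feed_frame(backend, keypoints_2d, descriptors, points_3d, timestamp):
    # The front end has already removed dynamic keypoints and backprojected the
    # remainder to metric 3D, so the backend is driven exactly as it would be
    # by a depth camera; no backend code changes are needed.
    assert len(keypoints_2d) == len(descriptors) == len(points_3d)
    return backend.track(keypoints_2d, descriptors, points_3d, timestamp)
```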

5. Practical Implications and System Advantages

DropD-SLAM eliminates the need for active depth sensors such as RGB-D cameras or LiDAR, offering several advantages:

  • Reduced hardware cost and system complexity: All computations use RGB imagery and GPU, avoiding the expense and integration challenges of specialized depth sensors.
  • Robustness across environmental conditions: Unlike hardware depth sensors, which may fail in sunlight, reflective surfaces, or low-light situations, vision modules operate reliably across standard indoor scenarios.
  • Low power and weight requirements: No additional sensor suite is required, making the system suitable for lightweight robots and embedded platforms.

A plausible implication is that the approach facilitates scalable, cost-effective SLAM deployment on mobile devices and consumer-grade robotics.

6. Future Directions

The authors propose several areas for continued research:

  • Domain adaptation: Pretrained vision models exhibit decreased accuracy in out-of-distribution scenarios; future efforts may focus on improving robustness to environmental variation.
  • Uncertainty modeling: Current depth estimation treats depth as a fixed prior; modeling its uncertainty within the SLAM backend could reduce bias and improve convergence.
  • Modular upgradeability: As pretrained depth, keypoint, or segmentation models advance, DropD-SLAM can incorporate these improvements directly—an open pathway for progressive enhancement.
  • Dynamic environment handling: Although instance masks and dilation suppress moving objects, heavily dynamic scenes present difficulties; research may explore more sophisticated suppression or multi-camera setups.

This suggests that DropD-SLAM defines a new direction for SLAM research, with modular, vision-based pipelines supplanting hardware-centric architectures for robust, scalable, and precise localization and mapping.
