
DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras (2108.10869v2)

Published 24 Aug 2021 in cs.CV

Abstract: We introduce DROID-SLAM, a new deep learning based SLAM system. DROID-SLAM consists of recurrent iterative updates of camera pose and pixelwise depth through a Dense Bundle Adjustment layer. DROID-SLAM is accurate, achieving large improvements over prior work, and robust, suffering from substantially fewer catastrophic failures. Despite training on monocular video, it can leverage stereo or RGB-D video to achieve improved performance at test time. The URL to our open source code is https://github.com/princeton-vl/DROID-SLAM.

Citations (453)

Summary

  • The paper introduces a deep SLAM method that integrates dense bundle adjustment with iterative updates to drastically reduce tracking errors across benchmarks.
  • The paper employs robust recurrent optimization to achieve zero catastrophic failures on multiple datasets.
  • The paper demonstrates strong generalization by training on synthetic monocular data while seamlessly handling stereo and RGB-D inputs without retraining.

DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras

The paper introduces DROID-SLAM, a deep learning-based approach to Simultaneous Localization and Mapping (SLAM) built around recurrent iterative updates through a Dense Bundle Adjustment (DBA) layer. The method jointly estimates camera poses and pixelwise depth, and handles input from monocular, stereo, and RGB-D cameras. DROID-SLAM delivers significant gains in accuracy and robustness over existing SLAM systems.

Key Contributions and Performance

  1. High Accuracy: DROID-SLAM achieves substantial improvements across multiple benchmarks. On the TartanAir SLAM competition, it reduces error by 62% on the monocular track and 60% on the stereo track. It ranks first on the ETH-3D RGB-D SLAM leaderboard, outperforming the second-place method by 35% under the AUC metric. On EuRoC, it reduces error by 82% with monocular input and 71% with stereo relative to other zero-failure methods, and on TUM-RGBD it achieves an 83% error reduction among zero-failure methods.
  2. High Robustness: The system suffers far fewer catastrophic failures. On ETH-3D, it successfully tracks 30 of the 32 RGB-D sequences, whereas the next-best method tracks only 19 of 32. On TartanAir, EuRoC, and TUM-RGBD, it records zero failures.
  3. Strong Generalization: Although trained only on monocular input, DROID-SLAM directly leverages stereo or RGB-D input at test time without retraining. A single model, trained solely on synthetic monocular video from the TartanAir dataset, handles data accurately across four datasets and three sensor modalities.

Methodological Insights

DROID-SLAM's name reflects its "Differentiable Recurrent Optimization-Inspired Design" (DROID). The architecture is end-to-end differentiable, combining the strengths of classical optimization techniques with deep learning. It draws inspiration from the RAFT optical-flow algorithm, extending RAFT's iterative updates from optical flow alone to camera poses and depth.

  1. Iterative Updates: Unlike RAFT, DROID-SLAM updates camera poses and depth for multiple frames, crucial for reducing drift over long trajectories and enabling effective loop closures.
  2. Dense Bundle Adjustment Layer: This layer computes Gauss-Newton updates to optimize camera poses and dense per-pixel depth. It leverages geometric constraints to enhance robustness and accuracy, facilitating the handling of stereo and RGB-D input.
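
The structure of such a joint update can be illustrated with a toy Gauss-Newton step. The sketch below is an assumption-laden illustration, not the paper's implementation: it uses a simplified 1-D motion model (predicted flow for pixel i is `tx * d_i`, translation times inverse depth), scalar per-pixel confidence weights as in the paper's weighted objective, and a Schur-complement solve that mirrors how a DBA-style layer eliminates the dense (diagonal) depth block before solving for the pose. All function names and the flow model itself are invented for this example.

```python
# Toy sketch of one damped Gauss-Newton (DBA-style) update.
# Simplified model (an assumption, not the paper's): predicted flow for
# pixel i is u_i = tx * d_i, where tx is a 1-D camera translation and
# d_i is the pixel's inverse depth. Residuals carry per-pixel
# confidence weights w_i.

def dba_step(tx, d, obs, w, lam=1e-2):
    """One damped Gauss-Newton step jointly updating the pose variable
    tx and all inverse depths d. The depth block of the Hessian is
    diagonal, so it is eliminated via a Schur complement: solve for the
    pose update first, then back-substitute for each depth."""
    r = [o - tx * di for o, di in zip(obs, d)]           # residuals
    B = sum(wi * di * di for wi, di in zip(w, d)) + lam  # pose block
    E = [wi * di * tx for wi, di in zip(w, d)]           # pose-depth cross terms
    C = [wi * tx * tx + lam for wi in w]                 # diagonal depth block
    v = sum(wi * di * ri for wi, di, ri in zip(w, d, r))
    u = [wi * tx * ri for wi, ri in zip(w, r)]
    # Schur complement: pose update first ...
    S = B - sum(Ei * Ei / Ci for Ei, Ci in zip(E, C))
    dtx = (v - sum(Ei * ui / Ci for Ei, ui, Ci in zip(E, u, C))) / S
    # ... then back-substitute for each per-pixel depth update.
    return (tx + dtx,
            [di + (ui - Ei * dtx) / Ci
             for di, ui, Ei, Ci in zip(d, u, E, C)])

def cost(tx, d, obs, w):
    """Weighted sum of squared flow residuals."""
    return sum(wi * (o - tx * di) ** 2 for wi, o, di in zip(w, obs, d))

# Synthetic data: ground-truth translation 0.5, exact observations,
# varying per-pixel confidences; start from a perturbed estimate.
obs = [0.5, 1.0, 0.25, 0.75]
w = [1.0, 0.5, 2.0, 1.0]
tx, d = 0.4, [0.9, 1.8, 0.6, 1.4]
tx, d = dba_step(tx, d, obs, w)
# One step drives the weighted cost down by roughly two orders of magnitude.
```

The damping term `lam` also regularizes the scale gauge ambiguity of this toy model (scaling `tx` up and all `d_i` down leaves the predicted flow unchanged, so the undamped system is singular along that direction). The real DBA layer performs the analogous computation over SE(3) poses and dense depth maps, with Jacobians of the reprojection function, and gradients are backpropagated through the solve.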

Distinct Comparisons

The paper positions DROID-SLAM against prior deep architectures like DeepV2D and BA-Net. While DeepV2D alternates between depth and camera pose updates, DROID-SLAM optimizes both jointly through its dense bundle adjustment layer. Unlike BA-Net, which optimizes depth over a small set of learned basis coefficients, DROID-SLAM optimizes pixelwise depth directly, which improves generalization because the depth representation is not tied to the training dataset.

Implications and Future Directions

DROID-SLAM's performance signifies a robust step forward in SLAM technology, particularly in environments requiring high accuracy and robustness against failures. Its design allows it to process and adapt to new input configurations with a high degree of precision, suggesting potential applications in autonomous vehicles and robotics.

Future research may expand upon DROID-SLAM's adaptable architecture to include more diverse sensor inputs and further optimize computational efficiency. The integration of additional deep learning paradigms could also be explored to enhance its real-time processing capabilities. Moreover, as VR and AR technologies advance, optimizing SLAM systems for such applications might become a pertinent area of exploration.
