Dense RGB SLAM with Neural Implicit Maps (2301.08930v2)

Published 21 Jan 2023 in cs.CV, cs.LG, and cs.RO

Abstract: There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneer works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves more favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. The code is at poptree.github.io/DIM-SLAM/.

Authors (6)
  1. Heng Li (138 papers)
  2. Xiaodong Gu (62 papers)
  3. Weihao Yuan (34 papers)
  4. Luwei Yang (12 papers)
  5. Zilong Dong (34 papers)
  6. Ping Tan (101 papers)
Citations (41)

Summary

  • The paper introduces a dense RGB SLAM framework that jointly optimizes camera motion and an implicit map reconstruction using hierarchical feature volumes, achieving an average camera tracking accuracy of 4.03 cm on the Replica dataset.
  • It bypasses the need for depth sensors by employing a neural implicit map decoder, outperforming some recent RGB-D methods such as iMAP on standard benchmarks.
  • Experimental evaluations on the Replica, TUM RGB-D, and EuRoC datasets validate its efficacy for AR, VR, and robotics applications.

Dense RGB SLAM with Neural Implicit Maps

This paper presents an approach to Simultaneous Localization and Mapping (SLAM) that combines dense RGB inputs with a neural implicit map representation. The method addresses the challenge of performing dense visual SLAM without depth information, using neural implicit functions to represent the scene map. It differs from conventional methods by exploiting a hierarchical feature volume that helps reconstruct the 3D environment from RGB data alone. The goal is a robust and efficient SLAM system for applications such as augmented reality (AR), virtual reality (VR), and robotics, where dense environmental understanding is pivotal.

Core Methodology

The proposed SLAM system operates with regular RGB cameras, bypassing the need for more expensive and scene-constrained RGB-D cameras. The approach integrates a hierarchical feature volume to support the implicit map decoder, which fuses shape cues across multiple scales. This hierarchical design allows the system to accommodate larger scenes than previous methods such as NICE-SLAM and iMAP, which are bound to room-scale environments by the limited capacity of multilayer perceptrons (MLPs).
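
To make the idea concrete, below is a minimal sketch of a hierarchical feature volume paired with a small shared MLP decoder. The grid resolutions, feature dimensions, layer widths, and the occupancy/color output split are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch of a hierarchical feature volume with a shared MLP
# decoder. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalFeatureVolume(nn.Module):
    def __init__(self, resolutions=(16, 32, 64), feat_dim=4):
        super().__init__()
        # One learnable 3D feature grid per scale: (1, C, D, H, W).
        self.grids = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, r, r, r))
             for r in resolutions]
        )
        # Small MLP maps concatenated multi-scale features to
        # occupancy (1 channel) plus color (3 channels).
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim * len(resolutions), 64), nn.ReLU(),
            nn.Linear(64, 4),
        )

    def forward(self, xyz):
        # xyz: (N, 3) query points normalized to [-1, 1]^3.
        pts = xyz.view(1, -1, 1, 1, 3)      # grid_sample expects 5D coords
        feats = [
            F.grid_sample(g, pts, align_corners=True)  # (1, C, N, 1, 1)
             .squeeze(-1).squeeze(-1).squeeze(0).t()   # -> (N, C)
            for g in self.grids
        ]
        out = self.decoder(torch.cat(feats, dim=-1))   # (N, 4)
        return torch.sigmoid(out[:, :1]), torch.sigmoid(out[:, 1:])

# Usage: query occupancy and color at 1024 random map points.
vol = HierarchicalFeatureVolume()
occ, rgb = vol(torch.rand(1024, 3) * 2 - 1)
```

Because trilinear interpolation is differentiable in both the query coordinates and the grid values, the same losses can update the map features and, through rendered rays, the camera poses.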

The innovation here lies in the joint optimization of camera motion and the implicit map. The optimization process involves minimizing a rendering loss and a sophisticated photometric warping loss drawn from multi-view stereo methodologies. This loss function ensures that the reconstructed scene geometry is coherent by evaluating the similarity of image patches across different views.
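
A minimal sketch of the warping idea follows: pixels from a source frame are back-projected using depth rendered from the map, re-projected into a second view, and compared photometrically. A simple per-pixel L1 term stands in here for the paper's patch-based similarity, and all names and shapes are illustrative assumptions.

```python
# Sketch of a photometric warping loss between two views.
import torch
import torch.nn.functional as F

def photometric_warping_loss(colors_src, img_dst, depth_src, uv_src,
                             K, T_dst_src):
    """colors_src: (N, 3) colors at uv_src in the source frame.
    img_dst: (3, H, W) destination image; depth_src: (N,) rendered depth;
    uv_src: (N, 2) pixel coordinates; K: (3, 3) intrinsics;
    T_dst_src: (4, 4) relative pose from source to destination."""
    ones = torch.ones_like(uv_src[:, :1])
    # Back-project source pixels to 3D using the rendered depth.
    pix_h = torch.cat([uv_src, ones], dim=-1)                      # (N, 3)
    pts_src = (torch.linalg.inv(K) @ pix_h.t()).t() * depth_src[:, None]
    # Transform into the destination camera and project.
    pts_dst = (T_dst_src @ torch.cat([pts_src, ones], -1).t()).t()[:, :3]
    proj = (K @ pts_dst.t()).t()
    uv_dst = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
    # Bilinearly sample destination colors at the warped coordinates.
    h, w = img_dst.shape[-2:]
    grid = torch.stack([2 * uv_dst[:, 0] / (w - 1) - 1,
                        2 * uv_dst[:, 1] / (h - 1) - 1], dim=-1)
    colors_dst = F.grid_sample(img_dst[None], grid.view(1, 1, -1, 2),
                               align_corners=True)[0, :, 0].t()    # (N, 3)
    # Colors should agree if the depth and the poses are correct.
    return (colors_src - colors_dst).abs().mean()
```

Because the rendered depth and both poses enter this loss differentiably, gradients flow to the map and to the cameras simultaneously, which is what couples geometry and motion estimation.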

Experimental Evaluation

To assess the efficacy of the proposed SLAM framework, comprehensive evaluations were conducted on standard benchmark datasets: Replica, TUM RGB-D, and EuRoC. The results show that the method surpasses several recent RGB-D SLAM systems, such as iMAP, in camera tracking accuracy. Notably, it achieves these outcomes without depth sensors, demonstrating its practicality for broader SLAM applications.

Specifically, the proposed method achieves an average camera tracking accuracy of 4.03 cm across the challenging Replica dataset, competing favorably with approaches that use depth information. This RGB-only system thus approaches, and in several scenarios exceeds, the performance of some modern RGB-D systems.
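
For context, the figure reported here is presumably the absolute trajectory error (ATE) RMSE, the standard tracking metric on Replica and TUM RGB-D. A minimal sketch of how it is computed, assuming the estimated trajectory has already been aligned to the ground truth:

```python
# Sketch of ATE RMSE over aligned trajectories.
import numpy as np

def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """Root-mean-square of per-frame translation errors (meters)."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Example with four camera positions (values are made up):
est = np.array([[0.0, 0, 0], [1.01, 0, 0], [2.0, 0.03, 0], [3.0, 0, 0.02]])
gt  = np.array([[0.0, 0, 0], [1.00, 0, 0], [2.0, 0.00, 0], [3.0, 0, 0.00]])
print(f"ATE RMSE: {100 * ate_rmse(est, gt):.2f} cm")
```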

Contributions and Future Directions

Key contributions of this research include:

  • The first dense RGB SLAM framework built on neural implicit maps (a simplified optimization sketch follows this list).
  • Hierarchical feature volumes that improve occupancy estimation, compensating for the absence of depth measurements in RGB-only input.
  • Strong experimental performance, with state-of-the-art results on mapping and tracking benchmarks that surpass even some RGB-D methods.
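
As a capstone, here is a deliberately simplified toy of the joint pose-and-map optimization pattern: pose parameters and map parameters receive gradients from the same photometric objective, typically with different learning rates. The translation-only "pose" and the tiny MLP "map" are stand-ins for illustration; the actual system optimizes full 6-DoF poses against the rendering and warping losses described above.

```python
# Toy joint optimization of a "pose" and a "map" from one objective.
import torch

torch.manual_seed(0)
obs_colors = torch.rand(256, 3)                  # observed pixel colors
sample_pts = torch.rand(256, 3)                  # 3D sample points
map_mlp = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 3))
pose_t = torch.zeros(3, requires_grad=True)      # translation-only "pose"

optimizer = torch.optim.Adam([
    {"params": map_mlp.parameters(), "lr": 1e-2},  # map updates
    {"params": [pose_t], "lr": 1e-3},              # gentler pose updates
])

for step in range(500):
    # "Render" colors at pose-shifted points and compare to observations.
    pred = torch.sigmoid(map_mlp(sample_pts + pose_t))
    loss = (pred - obs_colors).abs().mean()        # photometric L1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```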

Looking ahead, this dense SLAM approach could inspire further exploration of neural implicit representations in real-time applications. Future research may focus on enhancing the system's scalability, optimizing computational efficiency, and refining the accuracy of scene reconstructions under varying environmental conditions. As AI and machine learning technologies advance, this framework lays a promising foundation for more universally applicable SLAM systems that integrate seamlessly into diverse real-world applications without needing specialized sensors.