MUTE-SLAM: Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations (2403.17765v3)

Published 26 Mar 2024 in cs.CV

Abstract: We introduce MUTE-SLAM, a real-time neural RGB-D SLAM system employing multiple tri-plane hash-encodings for efficient scene representation. MUTE-SLAM effectively tracks camera positions and incrementally builds a scalable multi-map representation for both small and large indoor environments. As previous methods often require pre-defined scene boundaries, MUTE-SLAM dynamically allocates sub-maps for newly observed local regions, enabling constraint-free mapping without prior scene information. Unlike traditional grid-based methods, we use three orthogonal axis-aligned planes for hash-encoding scene properties, significantly reducing hash collisions and the number of trainable parameters. This hybrid approach not only ensures real-time performance but also enhances the fidelity of surface reconstruction. Furthermore, our optimization strategy concurrently optimizes all sub-maps intersecting with the current camera frustum, ensuring global consistency. Extensive testing on both real-world and synthetic datasets has shown that MUTE-SLAM delivers state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings. The code is available at https://github.com/lumennYan/MUTE_SLAM.

Summary

The paper introduces a novel real-time neural SLAM method that leverages multiple tri-plane hash representations to efficiently map diverse indoor environments.
It achieves high-fidelity surface reconstruction and robust camera tracking by dynamically allocating sub-maps and reducing hash collisions.
It employs a global optimization strategy that concurrently refines all intersecting sub-maps, ensuring scalable and globally consistent 3D scene reconstructions.

Real-Time Neural SLAM with Multiple Tri-Plane Hash Representations: MUTE-SLAM

Overview of MUTE-SLAM

MUTE-SLAM is a real-time neural RGB-D SLAM system that represents a scene with multiple tri-plane hash-encodings. It tracks camera poses while incrementally building a multi-map representation that scales from small to large indoor environments. Instead of requiring scene boundaries up front, it allocates a new sub-map whenever a newly observed local region is not yet covered, so mapping needs no prior scene information. Each sub-map hash-encodes scene properties on three orthogonal axis-aligned planes rather than on a 3D grid, which reduces both hash collisions and the number of trainable parameters. According to the authors, this design keeps the system real-time while improving the fidelity of surface reconstruction. During optimization, all sub-maps that intersect the current camera frustum are optimized together to keep the maps globally consistent. In tests on real-world and synthetic datasets, the authors report state-of-the-art surface reconstruction quality and competitive tracking performance across diverse indoor settings.
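
The sub-map allocation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `SubMap` class, the `allocate_submaps` function, the `uncovered_ratio` threshold, and the `margin` padding are all hypothetical names and values chosen for illustration, and sub-maps are assumed to be axis-aligned boxes.

```python
from dataclasses import dataclass


@dataclass
class SubMap:
    """Hypothetical axis-aligned sub-map, stored as min and max corners."""
    lo: tuple
    hi: tuple

    def contains(self, p):
        # True when point p = (x, y, z) lies inside the box (inclusive).
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))


def allocate_submaps(points, submaps, uncovered_ratio=0.2, margin=0.5):
    """Return the sub-map list, extended by one box if too many of the
    back-projected points fall outside every existing sub-map.

    uncovered_ratio and margin are illustrative assumptions, not values
    taken from the paper.
    """
    outside = [p for p in points if not any(m.contains(p) for m in submaps)]
    if not points or len(outside) / len(points) <= uncovered_ratio:
        return submaps
    # Bound the uncovered points and pad the box by a fixed margin.
    lo = tuple(min(p[i] for p in outside) - margin for i in range(3))
    hi = tuple(max(p[i] for p in outside) + margin for i in range(3))
    return submaps + [SubMap(lo, hi)]


maps = allocate_submaps([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [])
maps = allocate_submaps([(5.0, 5.0, 5.0), (0.0, 0.0, 0.0)], maps)
print(len(maps))
```

Starting from an empty list, the first call creates a box around the observed points; the second call adds another box only because half of the new points lie outside the first one.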

Key Contributions

MUTE-SLAM makes several notable contributions to the field of 3D computer vision and SLAM:

A multi-map-based scene representation that facilitates reconstruction scalable to diverse indoor scenarios without the need for predefined boundaries.
A tri-plane hash-encoding method for sub-maps which enables real-time tracking and anti-aliasing dense mapping with high-fidelity details.
An optimization strategy that jointly optimizes all sub-maps observed concurrently, ensuring global consistency.
Extensive experimental validation on various datasets, demonstrating the system's scalability and effectiveness in both tracking and mapping.
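
To make the tri-plane hash-encoding contribution more concrete, the following is a simplified single-resolution sketch. Several details are assumptions rather than facts from the paper: the `HashPlane` class and `triplane_encode` function are hypothetical, the 2D hash is an Instant-NGP-style XOR hash, features from the three planes are summed, and coordinates are assumed to be normalized to [0, 1] within a sub-map.

```python
import random

PRIME_Y = 2654435761  # large prime from the Instant-NGP-style spatial hash


def hash2d(ix, iy, table_size):
    # Map an integer 2D grid vertex to a slot in the feature table.
    return (ix ^ (iy * PRIME_Y)) % table_size


class HashPlane:
    """One axis-aligned feature plane backed by a hash table (hypothetical)."""

    def __init__(self, resolution, table_size, feat_dim, rng):
        self.res = resolution  # grid vertices per axis, must be >= 2
        self.size = table_size
        self.table = [[rng.uniform(-1e-4, 1e-4) for _ in range(feat_dim)]
                      for _ in range(table_size)]

    def lookup(self, u, v):
        """Bilinearly interpolate the features at (u, v) in [0, 1]^2."""
        x, y = u * (self.res - 1), v * (self.res - 1)
        x0, y0 = min(int(x), self.res - 2), min(int(y), self.res - 2)
        tx, ty = x - x0, y - y0
        out = [0.0] * len(self.table[0])
        for dx, wx in ((0, 1.0 - tx), (1, tx)):
            for dy, wy in ((0, 1.0 - ty), (1, ty)):
                feat = self.table[hash2d(x0 + dx, y0 + dy, self.size)]
                out = [o + wx * wy * f for o, f in zip(out, feat)]
        return out


def triplane_encode(p, planes):
    """Project p onto the xy, xz and yz planes and sum the three features."""
    x, y, z = p
    fxy = planes["xy"].lookup(x, y)
    fxz = planes["xz"].lookup(x, z)
    fyz = planes["yz"].lookup(y, z)
    return [a + b + c for a, b, c in zip(fxy, fxz, fyz)]


rng = random.Random(0)
planes = {k: HashPlane(16, 2 ** 10, 4, rng) for k in ("xy", "xz", "yz")}
feature = triplane_encode((0.2, 0.5, 0.9), planes)
print(len(feature))
```

The collision argument follows from simple counting: at resolution 512 per axis, a dense 3D grid has 512^3 = 134,217,728 vertices competing for table slots, whereas three planes have 3 x 512^2 = 786,432.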

Related Works

Traditional dense SLAM methods and learning-based approaches achieve accurate localization and detailed 3D point estimates. More recently, integrating Neural Radiance Fields (NeRF) and other neural implicit representations into SLAM has emerged as a promising direction, because these representations can render novel views and reconstruct 3D surfaces. Existing NeRF-based SLAM techniques work across a variety of scenes, but their computational cost makes real-time operation difficult. MUTE-SLAM targets this limitation with a system designed for efficient, detail-preserving reconstruction that remains scalable and real-time.

Methodology

MUTE-SLAM's methodology has four parts: a multi-map scene representation, tri-plane hash-encoding, TSDF-based volume rendering, and optimization for tracking and mapping. The multi-map representation lets the system grow to cover scenes whose extent is not known in advance. The tri-plane hash-encoding stores geometric and color features on planes instead of a 3D grid, which cuts the number of scene parameters and mitigates hash collisions. The volume rendering step uses the predicted TSDF and color values to render depth and color images, and the difference between rendered and observed images drives the optimization of sub-maps and camera poses. A global bundle adjustment step further refines all trainable parameters to ensure global consistency.
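
The TSDF-based rendering step can be sketched for a single ray as follows. The weighting used here, a bell-shaped product of two sigmoids that peaks where the TSDF crosses zero, appears in some related neural SLAM systems; whether MUTE-SLAM uses exactly this form is not stated in this summary, so treat it as an assumption. The function name `render_ray` and the `trunc` parameter are likewise hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def render_ray(depths, tsdf, colors, trunc=0.1):
    """Render depth and color along one ray from per-sample TSDF values.

    depths: sample distances along the ray
    tsdf:   predicted TSDF value at each sample
    colors: predicted (r, g, b) at each sample
    trunc:  truncation distance controlling how sharp the weights are
            (illustrative default, not taken from the paper)
    """
    # Bell-shaped weight that is largest where the TSDF is zero.
    raw = [sigmoid(s / trunc) * sigmoid(-s / trunc) for s in tsdf]
    total = sum(raw)
    weights = [r / total for r in raw]
    depth = sum(w * d for w, d in zip(weights, depths))
    color = tuple(sum(w * c[k] for w, c in zip(weights, colors))
                  for k in range(3))
    return depth, color, weights


# Five samples on a ray whose surface lies at distance 2.0.
depths = [1.0, 1.5, 2.0, 2.5, 3.0]
tsdf = [1.0, 0.5, 0.0, -0.5, -1.0]
colors = [(0.5, 0.5, 0.5)] * 5
depth, color, weights = render_ray(depths, tsdf, colors)
print(round(depth, 6))
```

Because the weights are symmetric around the zero crossing in this example, the rendered depth lands on the surface distance. In a SLAM system the rendered depth and color would be compared with the RGB-D observation to form the losses used for tracking and mapping.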

Experimental Evaluation and Performance Analysis

MUTE-SLAM was evaluated against state-of-the-art NeRF-based dense SLAM approaches on several benchmarks, covering both synthetic scenes and real-world datasets. The authors report higher surface reconstruction quality, more detailed geometry, and robust camera tracking. They also report shorter run time and lower memory consumption than competing methods, particularly in large-scale environments.

Conclusion and Future Directions

MUTE-SLAM offers a scalable and efficient approach to real-time dense mapping and tracking across a wide range of indoor environments. Its combination of multiple tri-plane hash-encodings with a multi-map scene representation addresses two limitations of existing systems: the need for predefined scene boundaries and the cost of grid-based encodings. Future work could explore scene understanding and semantic mapping, as well as adaptations for outdoor environments and more challenging conditions such as dynamic lighting and moving objects.
