- The paper presents a novel SLAM framework that integrates visual-inertial odometry with Gaussian Splatting for robust mapping in diverse, large-scale environments.
- It achieves state-of-the-art mapping quality by refining pose estimates and handling loop closures with a 2D Gaussian map and novel view synthesis.
- The system operates in real time with a monocular camera and inertial sensors, demonstrating scalability with 32.5 million Gaussian ellipsoids over a 3.7 km trajectory.
Overview of VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM
The paper presents VINGS-Mono, a monocular visual-inertial SLAM framework that leverages Gaussian Splatting (GS) to efficiently handle large-scale scenes. The framework is notable for mapping extensive environments using only a monocular camera and inertial measurements, a capability not previously demonstrated by monocular GS-based SLAM systems.
System Composition and Methodology
VINGS-Mono is structured around four principal components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. Each component plays a specific role in enhancing the overall SLAM process.
- VIO Front End: This component performs visual-inertial odometry (VIO) with dense bundle adjustment and uncertainty estimation to estimate sensor poses and scene geometry, supplying the inputs on which the mapping stages build.
- 2D Gaussian Map: The mapping module pairs a Sample-based Rasterizer with a Score Manager to insert and prune Gaussian primitives in real time, balancing map quality against computational cost. A Pose Refinement module further counters drift by propagating rendering errors across multiple frames to refine keyframe poses (a minimal pruning sketch follows this list).
- NVS Loop Closure: This component exploits the map's Novel View Synthesis (NVS) capability to detect and correct loop closures, keeping the global map consistent; its use of synthesized views for place recognition and pose correction is a distinguishing feature (see the loop-check sketch below).
- Dynamic Eraser: Because dynamic objects corrupt static mapping, the Dynamic Eraser identifies and suppresses them. Combining semantic segmentation with a re-rendering loss, it masks transient objects so that the reconstructed static scene stays clean (see the masking sketch below).
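To make the Score Manager's role concrete, here is a minimal sketch of score-based pruning, assuming each Gaussian carries an accumulated rendering-contribution score; the function name `prune_gaussians`, the tensor layout, and the threshold are illustrative assumptions rather than the paper's implementation.

```python
import torch

def prune_gaussians(means, opacities, scores, score_threshold=0.05):
    """Keep only Gaussians whose accumulated contribution score exceeds a threshold.

    means:      (N, 3) tensor of Gaussian centers
    opacities:  (N,)   tensor of per-Gaussian opacities
    scores:     (N,)   tensor of accumulated rendering-contribution scores
    """
    keep = scores > score_threshold          # boolean mask of Gaussians worth retaining
    return means[keep], opacities[keep], scores[keep]

# Toy usage: prune a random map of 10k Gaussians
means = torch.randn(10_000, 3)
opacities = torch.rand(10_000)
scores = torch.rand(10_000)
means, opacities, scores = prune_gaussians(means, opacities, scores)
print(f"{means.shape[0]} Gaussians retained after pruning")
```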
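In the same spirit, a toy loop-detection check might compare a view synthesized from the map near a past keyframe pose against the current frame; the whole-image cosine similarity and threshold below are placeholders, not the place-recognition criterion the paper actually uses.

```python
import torch
import torch.nn.functional as F

def nvs_loop_candidate(rendered_past_view, current_frame, sim_thresh=0.85):
    """Return True if a view synthesized near a past keyframe pose looks similar
    enough to the current frame to suggest a loop-closure candidate.

    Both inputs are (H, W, 3) tensors with values in [0, 1].
    """
    a = rendered_past_view.reshape(1, -1)
    b = current_frame.reshape(1, -1)
    similarity = F.cosine_similarity(a, b).item()   # crude whole-image similarity
    return similarity > sim_thresh
```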
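Finally, the Dynamic Eraser's combination of semantics and re-rendering error can be sketched as a per-pixel mask; the threshold and the exact way the semantic mask and photometric error are combined are assumptions made for illustration.

```python
import torch

def dynamic_mask(rendered, observed, movable_mask, photometric_thresh=0.1):
    """Flag pixels that likely belong to transient objects.

    rendered:     (H, W, 3) image rendered from the static Gaussian map
    observed:     (H, W, 3) current camera frame
    movable_mask: (H, W)    boolean mask of semantically movable classes (cars, people, ...)
    """
    # Per-pixel photometric (re-rendering) error between the map's prediction and the observation
    error = (rendered - observed).abs().mean(dim=-1)
    # A pixel is treated as dynamic if it is both movable and poorly explained by the static map
    return movable_mask & (error > photometric_thresh)
```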
Experimental Evaluation and Findings
The paper provides a comprehensive evaluation across several datasets, including indoor benchmarks such as ScanNet and BundleFusion and challenging outdoor datasets such as KITTI and Waymo. VINGS-Mono matches or exceeds the localization accuracy of traditional VIO solutions while clearly surpassing existing GS-based SLAM methods in mapping and rendering quality, achieving state-of-the-art results even in large-scale environments with dynamic objects and illumination changes.
In particular, VINGS-Mono reconstructs a map of 32.5 million Gaussian ellipsoids over a 3.7-kilometer trajectory spanning driving scenes, urban landscapes, and indoor settings. These capabilities extend to real-time mobile applications, underscoring the system's practical robustness and utility.
Implications and Future Directions
The development of VINGS-Mono has far-reaching implications for the AI field, particularly in robotics and autonomous navigation. By effectively merging Gaussian Splatting with monocular VIO, this framework opens new avenues for low-cost, high-quality environmental mapping without relying on expensive sensors such as LiDAR. The system's scalability to kilometer-scale scenes marks a notable advancement in SLAM technology, providing a pathway for future research to explore more efficient real-time processing and robust dynamic scene handling in SLAM systems.
Future work could aim to improve robustness under higher-speed motion and more complex dynamic-object interactions. Further optimization might integrate neural networks for incremental learning of Gaussian updates to improve scalability and processing efficiency, and hybrid sensor configurations could be explored to tune performance for autonomous devices and augmented reality systems.