NeRF-VO: Real-Time Sparse Visual Odometry with Neural Radiance Fields (2312.13471v2)

Published 20 Dec 2023 in cs.CV

Abstract: We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.

Authors (4)
  1. Jens Naumann
  2. Binbin Xu
  3. Stefan Leutenegger
  4. Xingxing Zuo

Summary

  • The paper introduces NeRF-VO, a system that combines sparse visual odometry with neural radiance fields to enable real-time 3D scene reconstruction using a single RGB camera.
  • The method efficiently tracks camera poses using key visual landmarks, achieving high-frequency pose estimates with low latency through multi-threaded processing.
  • Experimental results show improved 3D mapping accuracy, reduced GPU memory usage, and promise for diverse applications from robotics to augmented reality.

Overview

NeRF-VO is a recent computer-vision system for 3D scene reconstruction and camera tracking with notable results. The system is monocular, meaning it requires only a single camera to operate, and it combines visual odometry with the power of Neural Radiance Fields (NeRF) to build a highly accurate map of an environment. It takes frames from a standard RGB camera and processes them with learning-based components to track the camera's movement and construct a detailed 3D model of the surroundings.

Visual Odometry and Tracking

NeRF-VO tracks camera positions efficiently through sparse visual odometry. This technique identifies key points in the visual field and follows their movement across successive frames to estimate the camera's trajectory and orientation with low latency. Because it relies on these distinct landmarks, this part of the system is termed the sparse visual tracking front-end. The front-end delivers high-frequency pose estimates, which are crucial for real-time applications.
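
To make this concrete, the sketch below shows a classical sparse visual-odometry step with OpenCV: detect landmarks, track them into the next frame with optical flow, and recover the relative pose from the essential matrix. This is an illustrative stand-in, not the paper's learning-based tracker, and the intrinsics matrix K holds placeholder values.

```python
# Illustrative sparse-VO step (classical OpenCV pipeline, not the paper's
# learning-based front-end). K is a hypothetical pinhole intrinsics matrix.
import cv2
import numpy as np

K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])

def track_pair(prev_gray, curr_gray):
    """Estimate the relative camera motion between two grayscale frames."""
    # Detect sparse landmarks in the previous frame.
    pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=8)
    # Track the landmarks into the current frame (pyramidal Lucas-Kanade).
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts0, None)
    good0 = pts0[status.ravel() == 1]
    good1 = pts1[status.ravel() == 1]
    # Recover the up-to-scale relative pose from the essential matrix.
    E, inliers = cv2.findEssentialMat(good0, good1, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good0, good1, K, mask=inliers)
    return R, t  # rotation matrix and unit-norm translation direction
```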

Dense Geometry and Neural Mapping

The advances do not stop with visual tracking: NeRF-VO incorporates a monocular depth prediction network, which it uses to generate dense geometric priors, including depth maps and surface normals, from single RGB images. These priors are then scaled and aligned to the sparse landmarks for a coherent understanding of the scene.
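
The summary does not spell out the alignment procedure, but one common way to harmonize scales is a closed-form least-squares fit of a scale and shift between the predicted depth and the metric depths of the sparse landmarks. The sketch below assumes that per-frame scale-and-shift formulation purely for illustration.

```python
# Hedged sketch: align a monocular depth prediction to sparse VO depths with
# a closed-form least-squares scale and shift. The per-frame scale-and-shift
# model is an assumption, not necessarily the paper's exact procedure.
import numpy as np

def align_depth(pred_depth, sparse_depth, sparse_mask):
    """Solve min over (s, b) of ||s * pred + b - sparse||^2 at valid pixels."""
    p = pred_depth[sparse_mask]    # predicted depth at landmark pixels
    d = sparse_depth[sparse_mask]  # metric depth from triangulated landmarks
    A = np.stack([p, np.ones_like(p)], axis=1)   # design matrix [p, 1]
    (s, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred_depth + b      # dense prediction brought to the VO scale
```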

Integral to its design is the neural implicit scene representation: a NeRF tailored for real-time 3D reconstruction. By jointly optimizing a sliding window of keyframe poses and the underlying dense geometry through volume rendering, the system produces a detailed and photorealistic 3D map of the environment.
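
As a rough illustration of this optimization, the sketch below renders per-ray color and expected depth with standard NeRF volume rendering, then combines a photometric loss with a depth term supervised by the aligned priors. The tensor shapes, the squared-error depth term, and the weight lam are assumptions; the actual system builds on a hash-grid radiance field and also exploits surface-normal cues.

```python
# Hedged sketch of NeRF volume rendering with RGB and depth supervision.
import torch

def render_rays(sigmas, rgbs, z_vals):
    """Composite per-sample densities and colors along a batch of rays.
    sigmas: (R, S) densities, rgbs: (R, S, 3) colors, z_vals: (R, S) depths."""
    deltas = z_vals[:, 1:] - z_vals[:, :-1]                    # sample spacing
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)                  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                        # transmittance
    weights = alpha * trans                                    # rendering weights
    color = (weights[..., None] * rgbs).sum(dim=1)             # (R, 3) pixel color
    depth = (weights * z_vals).sum(dim=1)                      # (R,) expected depth
    return color, depth

def mapping_loss(color, depth, gt_rgb, prior_depth, lam=0.1):
    """Photometric loss plus depth supervision from the aligned dense prior."""
    return ((color - gt_rgb) ** 2).mean() + lam * ((depth - prior_depth) ** 2).mean()
```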

Real-Time Capability and Efficiency

One of the striking features of NeRF-VO is its real-time operation. It processes information quickly enough for live applications, unlike systems that lag under their computational demands. The key lies in its multi-threaded architecture, which lets the sparse tracking, dense geometry enhancement, and dense mapping modules run concurrently and independently. This parallelism contributes directly to its speed and efficiency.
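
The sketch below mimics this decoupling in miniature: a tracking thread produces pose estimates at high frequency and passes selected keyframes through a queue to a mapping thread, so the slower NeRF optimization never blocks tracking. The stand-in functions, sleep durations, and every-fifth-frame keyframe rule are illustrative assumptions.

```python
# Toy version of the multi-threaded layout: tracking and mapping run
# concurrently and communicate over a queue. All workloads are simulated.
import queue
import threading
import time

def track(frame):
    time.sleep(0.005)   # sparse VO is fast: high-frequency pose estimates
    return {"frame": frame, "pose": None}

def optimize_map(keyframe):
    time.sleep(0.05)    # NeRF optimization is slower; runs off the hot path

keyframes = queue.Queue(maxsize=8)

def tracking_loop(frames):
    for i, frame in enumerate(frames):
        est = track(frame)
        if i % 5 == 0:              # naive keyframe selection: every 5th frame
            keyframes.put(est)
    keyframes.put(None)             # sentinel: no more keyframes

def mapping_loop():
    while (kf := keyframes.get()) is not None:
        optimize_map(kf)

t = threading.Thread(target=tracking_loop, args=(range(100),))
m = threading.Thread(target=mapping_loop)
t.start(); m.start()
t.join(); m.join()
```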

Results and Performance

When put to the test against other state-of-the-art methods, NeRF-VO excels in 3D reconstruction accuracy, pose estimation, and novel view synthesis within a captured scene. It also outperforms competitors on camera-tracking latency and GPU memory usage. This makes it appealing not only for robotic navigation and augmented reality but also for applications that demand precise 3D models from visual data, such as architecture and heritage preservation.

Concluding Thoughts

The integration of NeRF into the SLAM pipeline with a system like NeRF-VO shows promising directions for future enhancements in visual mapping technologies. The capability it offers for detailed, real-time mapping with a single camera opens new frontiers for automation and spatial understanding applications, ensuring that as environments and situations evolve, so too will our capability to capture and interact with them digitally.
