NeRF-VO: Real-Time Sparse Visual Odometry with Neural Radiance Fields (2312.13471v2)

Published 20 Dec 2023 in cs.CV

Abstract: We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.

Authors (4)
  1. Jens Naumann
  2. Binbin Xu
  3. Stefan Leutenegger
  4. Xingxing Zuo

Summary

  • The paper introduces NeRF-VO, a system that combines sparse visual odometry with neural radiance fields to enable real-time 3D scene reconstruction using a single RGB camera.
  • The method efficiently tracks camera poses using key visual landmarks, achieving high-frequency pose estimates with low latency through multi-threaded processing.
  • Experimental results show improved 3D mapping accuracy, reduced GPU memory usage, and promise for diverse applications from robotics to augmented reality.

Overview

NeRF-VO is a recent computer-vision system for 3D scene reconstruction and camera tracking with notable results. The system is monocular, meaning it requires only a single camera to operate, and it combines visual odometry with the power of Neural Radiance Fields (NeRF) to build a highly accurate map of an environment. It takes frames from a standard RGB camera and processes them with learning-based components to track the camera's movement and construct a detailed 3D model of the surroundings.

Visual Odometry and Tracking

NeRF-VO tracks camera positions efficiently through sparse visual odometry. This technique identifies key points in the visual field and follows their movement across successive frames to estimate the camera's trajectory and orientation with low latency. Because it relies on these distinct landmarks, this part of the system is termed the sparse visual tracking front-end. The front-end delivers high-frequency pose estimates, which are crucial for real-time applications.
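
To make this concrete, the sketch below shows a classical sparse visual-odometry step with OpenCV: detect landmarks, track them into the next frame with optical flow, and recover the relative pose from the essential matrix. This is an illustrative stand-in, not the paper's learning-based tracker, and the intrinsics matrix K holds placeholder values.

```python
# Illustrative sparse-VO step (classical OpenCV pipeline, not the paper's
# learning-based front-end). K is a hypothetical pinhole intrinsics matrix.
import cv2
import numpy as np

K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])

def track_pair(prev_gray, curr_gray):
    """Estimate the relative camera motion between two grayscale frames."""
    # Detect sparse landmarks in the previous frame.
    pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=8)
    # Track the landmarks into the current frame (pyramidal Lucas-Kanade).
    pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts0, None)
    good0 = pts0[status.ravel() == 1]
    good1 = pts1[status.ravel() == 1]
    # Recover the up-to-scale relative pose from the essential matrix.
    E, inliers = cv2.findEssentialMat(good0, good1, K,
                                      method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, good0, good1, K, mask=inliers)
    return R, t  # rotation matrix and unit-norm translation direction
```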

Dense Geometry and Neural Mapping

The advances do not stop with visual tracking: NeRF-VO incorporates a monocular depth prediction network, which it uses to generate dense geometric priors, including depth maps and surface normals, from single RGB images. These priors are then scaled and aligned to the sparse landmarks for a coherent understanding of the scene.
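
The summary does not spell out the alignment procedure, but one common way to harmonize scales is a closed-form least-squares fit of a scale and shift between the predicted depth and the metric depths of the sparse landmarks. The sketch below assumes that per-frame scale-and-shift formulation purely for illustration.

```python
# Hedged sketch: align a monocular depth prediction to sparse VO depths with
# a closed-form least-squares scale and shift. The per-frame scale-and-shift
# model is an assumption, not necessarily the paper's exact procedure.
import numpy as np

def align_depth(pred_depth, sparse_depth, sparse_mask):
    """Solve min over (s, b) of ||s * pred + b - sparse||^2 at valid pixels."""
    p = pred_depth[sparse_mask]    # predicted depth at landmark pixels
    d = sparse_depth[sparse_mask]  # metric depth from triangulated landmarks
    A = np.stack([p, np.ones_like(p)], axis=1)   # design matrix [p, 1]
    (s, b), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * pred_depth + b      # dense prediction brought to the VO scale
```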

Integral to its design is the neural implicit scene representation: a NeRF tailored for real-time 3D reconstruction. By jointly optimizing a sliding window of keyframe poses and the underlying dense geometry through volume rendering, the system produces a detailed and photorealistic 3D map of the environment.
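
As a rough illustration of this optimization, the sketch below renders per-ray color and expected depth with standard NeRF volume rendering, then combines a photometric loss with a depth term supervised by the aligned priors. The tensor shapes, the squared-error depth term, and the weight lam are assumptions; the actual system builds on a hash-grid radiance field and also exploits surface-normal cues.

```python
# Hedged sketch of NeRF volume rendering with RGB and depth supervision.
import torch

def render_rays(sigmas, rgbs, z_vals):
    """Composite per-sample densities and colors along a batch of rays.
    sigmas: (R, S) densities, rgbs: (R, S, 3) colors, z_vals: (R, S) depths."""
    deltas = z_vals[:, 1:] - z_vals[:, :-1]                    # sample spacing
    deltas = torch.cat([deltas, 1e10 * torch.ones_like(deltas[:, :1])], dim=-1)
    alpha = 1.0 - torch.exp(-sigmas * deltas)                  # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]                                        # transmittance
    weights = alpha * trans                                    # rendering weights
    color = (weights[..., None] * rgbs).sum(dim=1)             # (R, 3) pixel color
    depth = (weights * z_vals).sum(dim=1)                      # (R,) expected depth
    return color, depth

def mapping_loss(color, depth, gt_rgb, prior_depth, lam=0.1):
    """Photometric loss plus depth supervision from the aligned dense prior."""
    return ((color - gt_rgb) ** 2).mean() + lam * ((depth - prior_depth) ** 2).mean()
```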

Real-Time Capability and Efficiency

One of the striking features of NeRF-VO is its real-time operation. It processes information quickly enough for live applications, unlike systems that lag under their computational demands. The key lies in its multi-threaded architecture, which lets the sparse tracking, dense geometry enhancement, and dense mapping modules run concurrently and independently. This parallelism contributes directly to its speed and efficiency.
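
The sketch below mimics this decoupling in miniature: a tracking thread produces pose estimates at high frequency and passes selected keyframes through a queue to a mapping thread, so the slower NeRF optimization never blocks tracking. The stand-in functions, sleep durations, and every-fifth-frame keyframe rule are illustrative assumptions.

```python
# Toy version of the multi-threaded layout: tracking and mapping run
# concurrently and communicate over a queue. All workloads are simulated.
import queue
import threading
import time

def track(frame):
    time.sleep(0.005)   # sparse VO is fast: high-frequency pose estimates
    return {"frame": frame, "pose": None}

def optimize_map(keyframe):
    time.sleep(0.05)    # NeRF optimization is slower; runs off the hot path

keyframes = queue.Queue(maxsize=8)

def tracking_loop(frames):
    for i, frame in enumerate(frames):
        est = track(frame)
        if i % 5 == 0:              # naive keyframe selection: every 5th frame
            keyframes.put(est)
    keyframes.put(None)             # sentinel: no more keyframes

def mapping_loop():
    while (kf := keyframes.get()) is not None:
        optimize_map(kf)

t = threading.Thread(target=tracking_loop, args=(range(100),))
m = threading.Thread(target=mapping_loop)
t.start(); m.start()
t.join(); m.join()
```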

Results and Performance

When put to the test against other state-of-the-art methods, NeRF-VO excels in 3D reconstruction accuracy, pose estimation, and novel view synthesis within a captured scene. It also outperforms competitors on camera-tracking latency and GPU memory usage. This makes it appealing not only for robotic navigation and augmented reality but also for applications that demand precise 3D models from visual data, such as architecture and heritage preservation.

Concluding Thoughts

The integration of NeRF into the SLAM pipeline with a system like NeRF-VO shows promising directions for future enhancements in visual mapping technologies. The capability it offers for detailed, real-time mapping with a single camera opens new frontiers for automation and spatial understanding applications, ensuring that as environments and situations evolve, so too will our capability to capture and interact with them digitally.
