NeSLAM: Neural Implicit Mapping and Self-Supervised Feature Tracking With Depth Completion and Denoising (2403.20034v1)

Published 29 Mar 2024 in cs.CV and cs.RO

Abstract: In recent years, there have been significant advancements in 3D reconstruction and dense RGB-D SLAM systems. One notable development is the application of Neural Radiance Fields (NeRF) in these systems, which utilizes implicit neural representation to encode 3D scenes. This extension of NeRF to SLAM has shown promising results. However, the depth images obtained from consumer-grade RGB-D sensors are often sparse and noisy, which poses significant challenges for 3D reconstruction and affects the accuracy of the representation of the scene geometry. Moreover, the original hierarchical feature grid with occupancy value is inaccurate for scene geometry representation. Furthermore, the existing methods select random pixels for camera tracking, which leads to inaccurate localization and is not robust in real-world indoor environments. To this end, we present NeSLAM, an advanced framework that achieves accurate and dense depth estimation, robust camera tracking, and realistic synthesis of novel views. First, a depth completion and denoising network is designed to provide dense geometry prior and guide the neural implicit representation optimization. Second, the occupancy scene representation is replaced with Signed Distance Field (SDF) hierarchical scene representation for high-quality reconstruction and view synthesis. Furthermore, we also propose a NeRF-based self-supervised feature tracking algorithm for robust real-time tracking. Experiments on various indoor datasets demonstrate the effectiveness and accuracy of the system in reconstruction, tracking quality, and novel view synthesis.

Summary

  • The paper introduces a neural implicit mapping framework that integrates depth completion, denoising, and an SDF-based scene representation to enhance SLAM performance.
  • The paper employs a NeRF-based self-supervised feature tracking algorithm that refines camera localization in large, complex indoor environments.
  • The paper reports improved reconstruction accuracy, tracking quality, and novel view synthesis on indoor datasets compared with prior dense RGB-D SLAM methods.

NeSLAM: Enhancing Dense RGB-D SLAM with Neural Implicit Mapping and Self-Supervised Feature Tracking

Introduction

Simultaneous Localization and Mapping (SLAM) is central to robotics and virtual reality, enabling devices to understand and navigate complex environments. Neural Radiance Fields (NeRF) have opened new avenues for detailed and accurate 3D reconstruction through implicit neural representations. Despite promising results, challenges persist: depth images from consumer-grade RGB-D sensors are often sparse and noisy, which degrades the fidelity of 3D reconstructions, and tracking schemes that rely on randomly sampled pixels lack the robustness needed for accurate localization in real-world indoor environments. To address these challenges, NeSLAM integrates a depth completion and denoising network, a Signed Distance Field (SDF) based hierarchical scene representation, and a NeRF-based self-supervised feature tracking algorithm into a dense RGB-D SLAM system.
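
To make the overall architecture concrete, the following is a minimal per-frame sketch of how such a system's components could interact. The interfaces (depth_net, tracker, mapper, scene) are hypothetical placeholders for illustration, not the authors' released code.

```python
def run_neslam_style_slam(rgbd_stream, depth_net, tracker, mapper, scene):
    """Alternate tracking and mapping over an RGB-D stream (illustrative only)."""
    trajectory = []
    for rgb, raw_depth in rgbd_stream:
        # 1. Complete and denoise the raw sensor depth to obtain a dense prior.
        dense_depth = depth_net(rgb, raw_depth)

        # 2. Tracking: estimate the camera pose of the current frame against
        #    the implicit map, guided by the self-supervised feature tracker.
        pose = tracker.estimate_pose(rgb, dense_depth, scene)
        trajectory.append(pose)

        # 3. Mapping: optimize the SDF-based hierarchical scene representation
        #    using the new frame and its estimated pose.
        mapper.update(scene, rgb, dense_depth, pose)
    return trajectory
```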

Neural Implicit Mapping

NeSLAM's mapping component uses a depth completion and denoising network to turn the sparse, noisy depth images from standard RGB-D sensors into dense depth maps. These dense maps provide a geometry prior that guides the optimization of the neural implicit representation. In addition, NeSLAM replaces the occupancy-based hierarchical feature grid with an SDF hierarchical scene representation, which encodes scene geometry more accurately and yields higher-quality reconstruction and view synthesis in complex environments.
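
As a brief illustration of why SDF values are convenient for rendering, the sketch below converts SDF samples along each ray into compositing weights that peak at the surface (the SDF zero crossing), following the bell-shaped weighting used in earlier neural RGB-D reconstruction work; the exact formulation in NeSLAM may differ, and the truncation parameter `trunc` is an assumed illustrative value.

```python
import torch

def sdf_render_weights(sdf: torch.Tensor, trunc: float = 0.05) -> torch.Tensor:
    """Turn per-sample SDF values of shape (num_rays, num_samples) into ray weights."""
    # Bell-shaped weight that peaks near the SDF zero crossing (the surface).
    w = torch.sigmoid(sdf / trunc) * torch.sigmoid(-sdf / trunc)
    # Normalize so the weights along each ray sum to one.
    return w / (w.sum(dim=-1, keepdim=True) + 1e-8)

def render_depth_and_color(sdf, colors, z_vals, trunc=0.05):
    """Composite per-ray depth and color from SDF and color samples."""
    w = sdf_render_weights(sdf, trunc)              # (R, S)
    depth = (w * z_vals).sum(dim=-1)                # (R,)
    color = (w.unsqueeze(-1) * colors).sum(dim=-2)  # (R, 3)
    return depth, color
```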

Self-Supervised Feature Tracking

A key contribution of NeSLAM is its NeRF-based self-supervised feature tracking algorithm, which provides robust, real-time tracking in large and complex indoor scenes. Because the feature tracker is refined in a self-supervised manner during operation, the system generalizes well across diverse environments. This tracking method yields precise camera localization, a prerequisite for effective SLAM.
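
The sketch below shows the general form of NeRF-style tracking: the camera pose is refined by gradient descent on the discrepancy between rendered and observed color and depth. The `render_fn` interface and the se(3) increment parameterization are assumptions made for illustration; NeSLAM additionally selects and weights pixels using its self-supervised feature network, which is omitted here.

```python
import torch

def track_frame(render_fn, rgb_gt, depth_gt, init_pose, iters=50, lr=1e-3):
    """Refine a camera pose against a frozen implicit map (illustrative only)."""
    delta = torch.zeros(6, requires_grad=True)  # se(3) pose increment
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        # Differentiable rendering of color and depth at the candidate pose.
        rgb_hat, depth_hat = render_fn(init_pose, delta)
        loss = (rgb_hat - rgb_gt).abs().mean() \
             + 0.1 * (depth_hat - depth_gt).abs().mean()
        loss.backward()
        opt.step()
    return delta.detach()
```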

Experimental Results

Experiments on various indoor datasets demonstrate the effectiveness of NeSLAM in reconstruction accuracy, tracking quality, and novel view synthesis. The depth completion and denoising network, together with the SDF-based hierarchical scene representation, yields clear improvements in capturing detailed scene geometry and generating photo-realistic novel views, while the NeRF-based self-supervised feature tracking improves localization accuracy and robustness relative to existing and concurrent methods based on implicit mapping.

Implications and Future Directions

NeSLAM presents a significant advancement in dense RGB-D SLAM by addressing critical challenges associated with depth image sparsity, noise, and robust feature tracking. The integration of neural implicit techniques with traditional SLAM frameworks holds promise for developing more accurate, robust, and versatile systems capable of detailed 3D reconstruction and precise localization in dynamic and complex environments. Future research may explore the extension of this work to outdoor environments, dynamic scene understanding, and applications in autonomous navigation and augmented reality.

Conclusion

NeSLAM introduces an innovative approach to enhance dense RGB-D SLAM systems through neural implicit mapping and self-supervised feature tracking. Its ability to produce highly accurate 3D reconstructions, robust camera localization, and realistic novel view synthesis represents a notable contribution to the field. This research paves the way for future advancements in SLAM technology, with potential applications spanning robotics, virtual/augmented reality, and beyond.