
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment (2312.09866v2)

Published 15 Dec 2023 in cs.CV

Abstract: Neural implicit scene representations have recently shown encouraging results in dense visual SLAM. However, existing methods produce low-quality scene reconstruction and low-accuracy localization performance when scaling up to large indoor scenes and long sequences. These limitations are mainly due to their single, global radiance field with finite capacity, which does not adapt to large scenarios. Their end-to-end pose networks are also not robust enough with the growth of cumulative errors in large scenes. To this end, we introduce PLGSLAM, a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real-time. To handle large-scale indoor scenes, PLGSLAM proposes a progressive scene representation method which dynamically allocates new local scene representation trained with frames within a local sliding window. This allows us to scale up to larger indoor scenes and improves robustness (even under pose drifts). In local scene representation, PLGSLAM utilizes tri-planes for local high-frequency features with multi-layer perceptron (MLP) networks for the low-frequency feature, achieving smoothness and scene completion in unobserved areas. Moreover, we propose local-to-global bundle adjustment method with a global keyframe database to address the increased pose drifts on long sequences. Experimental results demonstrate that PLGSLAM achieves state-of-the-art scene reconstruction results and tracking performance across various datasets and scenarios (both in small and large-scale indoor environments).

References (29)
  1. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6290–6301, June 2022.
  2. Transformerfusion: Monocular rgb scene reconstruction using transformers. Advances in Neural Information Processing Systems, 34:1403–1414, 2021.
  3. Volumefusion: Deep depth fusion for 3d scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16086–16095, October 2021.
  4. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  5. Bundlefusion: Real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Trans. Graph., 36(4), jul 2017.
  6. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  7. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568, 2011.
  8. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17408–17419, 2023.
  9. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234, 2007.
  10. Progressively optimized local radiance fields for robust view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16539–16548, 2023.
  11. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  12. Orb-slam: A versatile and accurate monocular slam system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.
  13. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
  14. Dtam: Dense tracking and mapping in real-time. In 2011 international conference on computer vision, pages 2320–2327. IEEE, 2011.
  15. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5589–5599, October 2021.
  16. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Transactions on Robotics, 34(4):1004–1020, 2018.
  17. Bad slam: Bundle adjusted direct rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  18. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
  19. imap: Implicit mapping and positioning in real-time. In ICCV, pages 6229–6238, October 2021.
  20. Neuralrecon: Real-time coherent 3d reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15598–15607, June 2021.
  21. Ba-net: Dense bundle adjustment network. ICLR, 2018.
  22. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12922–12931, 2022.
  23. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13293–13302, 2023.
  24. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 27171–27183. Curran Associates, Inc., 2021.
  25. Nerf–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
  26. Bungeenerf: Progressive neural radiance field for extreme multi-scale scene rendering. In European conference on computer vision, pages 106–122. Springer, 2022.
  27. Vox-fusion: Dense tracking and mapping with voxel-based neural implicit representation. In 2022 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 499–507, 2022.
  28. inerf: Inverting neural radiance fields for pose estimation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330, 2021.
  29. Nice-slam: Neural implicit scalable encoding for slam. In CVPR, pages 12786–12796, June 2022.
Citations (46)

Summary

  • The paper introduces PLGSLAM, achieving high-fidelity 3D reconstruction and robust pose tracking with a progressive scene representation strategy.
  • It combines tri-plane feature grids with MLP networks to capture high- and low-frequency features, yielding detailed yet smooth reconstructions even in unobserved areas.
  • Local-to-global bundle adjustment effectively mitigates pose drift over long sequences, outperforming existing SLAM methods across diverse indoor environments.

Overview

Neural implicit scene representations have emerged as a promising approach for dense visual SLAM (Simultaneous Localization and Mapping), producing dense 3D reconstructions directly from visual data. However, scaling to larger indoor scenes and longer sequences has been a challenge, often degrading both scene reconstruction and localization. This is largely due to the limited capacity of a single, global radiance field and to end-to-end pose networks that are not robust to the error that accumulates in large environments.

Methodology

The paper introduces PLGSLAM, a neural visual SLAM system designed to deliver high-fidelity surface reconstruction and robust camera tracking in real time for large-scale indoor scenes. The system addresses scalability and robustness issues through several innovations:

  • Progressive Scene Representation: PLGSLAM dynamically allocates a new local scene representation, trained on frames within a local sliding window, as the camera explores the environment. Dividing the scene into manageable parts in this way improves scalability and robustness, even under pose drift.
  • Joint Tri-Planes and MLP Networks: Each local representation uses tri-planes to encode high-frequency local detail and an MLP branch for low-frequency structure, providing detailed, smooth, and complete reconstructions even in previously unobserved areas (a sketch of the allocation and decoding steps follows this list).
  • Local-to-Global Bundle Adjustment: A local-to-global bundle adjustment method uses a global keyframe database to mitigate pose drift over long sequences (a sketch appears below, after the results note).
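
To make the first two components concrete, the following is a minimal, hedged sketch in PyTorch: a local block that fuses sampled tri-plane features with an MLP branch and decodes an SDF value and colour, plus a manager that allocates a new local block once the camera leaves the current one. The class names (LocalSceneRep, ProgressiveMap), block size, feature dimensions, and decoder layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalSceneRep(nn.Module):
    """One local block: tri-plane features (high-frequency detail) fused with a
    small MLP branch (smooth low-frequency structure), decoded to SDF and colour."""

    def __init__(self, center, side=6.0, res=128, feat_dim=16):
        super().__init__()
        self.center = torch.as_tensor(center, dtype=torch.float32)
        self.side = side  # edge length of this local block (assumed, in metres)
        # Three axis-aligned feature planes: xy, xz, yz.
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, res, res)) for _ in range(3)]
        )
        # MLP branch encouraging smoothness and hole filling in unobserved areas.
        self.low_freq = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        # Decoder maps fused features to one SDF value and an RGB colour.
        self.decoder = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 4))

    def contains(self, p):
        # True if a world point lies inside this block's axis-aligned bounds.
        return bool(((p - self.center).abs() <= self.side / 2).all())

    def forward(self, pts):
        # pts: (N, 3) world points, normalised to [-1, 1] within the block.
        x = (pts - self.center) / (self.side / 2)
        feats = []
        for plane, coords in zip(self.planes, (x[:, [0, 1]], x[:, [0, 2]], x[:, [1, 2]])):
            grid = coords.view(1, -1, 1, 2)                     # (1, N, 1, 2)
            f = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feats.append(f.squeeze(0).squeeze(-1).t())          # (N, C)
        fused = torch.cat([sum(feats), self.low_freq(x)], dim=-1)
        out = self.decoder(fused)
        sdf, rgb = out[:, :1], torch.sigmoid(out[:, 1:])
        return sdf, rgb


class ProgressiveMap:
    """Allocates a new local representation once the camera leaves the current one."""

    def __init__(self, side=6.0):
        self.side = side
        self.blocks = []

    def active_block(self, cam_center):
        c = torch.as_tensor(cam_center, dtype=torch.float32)
        if not self.blocks or not self.blocks[-1].contains(c):
            self.blocks.append(LocalSceneRep(center=c, side=self.side))
        return self.blocks[-1]
```

In this sketch, summing the three sampled plane features keeps the per-point cost low while the MLP branch regularizes regions the tri-planes have not observed; the actual fusion scheme and decoder design in the paper may differ.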

Experimental results show that PLGSLAM outperforms existing methods in both scene reconstruction fidelity and pose estimation accuracy across various datasets and scenarios.
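
The local-to-global bundle adjustment component can be sketched in a similarly hedged way: poses from the current sliding window are refined jointly with a random sample of keyframes from a global keyframe database, so long-sequence drift is constrained by earlier observations. The 6-vector pose model, the SDF-style loss on back-projected depth points, and the field argument (for example, a local representation like the one above) are simplifying assumptions for illustration, not the paper's actual optimization.

```python
import random
import torch


def se3_apply(pose, pts):
    # Minimal pose model for this sketch: pose = (tx, ty, tz, wx, wy, wz) with a
    # first-order rotation approximation; a real system would use a proper SE(3) map.
    t, w = pose[:3], pose[3:]
    zero = torch.zeros_like(w[0])
    wx = torch.stack([
        torch.stack([zero, -w[2], w[1]]),
        torch.stack([w[2], zero, -w[0]]),
        torch.stack([-w[1], w[0], zero]),
    ])
    R = torch.eye(3) + wx
    return pts @ R.t() + t


def local_to_global_ba(field, poses, keyframes, local_window,
                       n_global=10, iters=50, lr=1e-3):
    """field: module mapping world points (N, 3) -> (sdf, rgb).
    poses: dict frame_id -> 6-vector tensor with requires_grad=True.
    keyframes: dict frame_id -> (pts_cam, sdf_target) back-projected from depth.
    local_window: ids of recent frames (assumed to also appear in keyframes)."""
    # Keyframe set: the whole local window plus a random global sample.
    global_ids = [i for i in keyframes if i not in local_window]
    chosen = list(local_window) + random.sample(global_ids, min(n_global, len(global_ids)))

    optim = torch.optim.Adam(list(field.parameters()) + [poses[i] for i in chosen], lr=lr)
    for _ in range(iters):
        optim.zero_grad()
        loss = torch.zeros(())
        for i in chosen:
            pts_cam, sdf_target = keyframes[i]
            pts_world = se3_apply(poses[i], pts_cam)   # current pose estimate
            sdf_pred, _ = field(pts_world)             # query the scene representation
            loss = loss + (sdf_pred.squeeze(-1) - sdf_target).abs().mean()
        loss.backward()
        optim.step()
    return poses
```

Sampling only a handful of global keyframes per iteration keeps the cost close to that of a purely local window while still anchoring the trajectory to older observations, which is the intuition behind mitigating drift on long sequences.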

Experimental Results

PLGSLAM has been evaluated on multiple datasets spanning different indoor environments, from single rooms to large multi-room apartments. The system achieved state-of-the-art 3D reconstruction accuracy and pose estimation compared with existing methods, and the experiments highlighted its ability to handle long video sequences and large-scale indoor scenes effectively.

Conclusion and Future Work

The paper presents PLGSLAM as a system that addresses key challenges in scaling neural implicit scene representations for dense visual SLAM in large indoor settings. With its progressive scene representation and local-to-global bundle adjustment, the system shows significant improvements in both scene reconstruction and localization. The proposed method paves the way for more robust and accurate SLAM systems that could benefit applications in autonomous driving, robotics, and augmented reality. The authors state that the code will be open-sourced upon paper acceptance, enabling wider use and further development.
