Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes (2404.06050v2)
Abstract: Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and autonomous vehicles. However, most existing methods struggle in large-scale scenes due to three core challenges: *(a) inaccurate depth input.* Accurate depth input is unobtainable in real-world large-scale scenes. *(b) inaccurate pose estimation.* Most existing approaches rely on accurate pre-estimated camera poses. *(c) insufficient scene representation capability.* A single global radiance field lacks the capacity to scale effectively to large-scale scenes. To this end, we propose an incremental joint learning framework that achieves accurate depth and pose estimation as well as large-scale scene reconstruction. A vision-transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed for accurate and robust camera tracking in large-scale scenes. For implicit scene representation, we propose an incremental method that constructs the entire large-scale scene as multiple local radiance fields, enhancing the scalability of the 3D scene representation. Extensive experiments demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction.
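The abstract does not include code, but the feature-metric bundle adjustment (FBA) idea is concrete enough to illustrate. Below is a minimal PyTorch sketch of the residual such a method typically minimizes: instead of comparing raw pixel intensities (photometric BA), reference pixels are back-projected with depth, warped into a second frame by the current pose estimate, and compared in a learned feature space. All names here (`feature_metric_residual`, `F_ref`, `T_ref_to_src`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def feature_metric_residual(F_ref, F_src, depth_ref, K, T_ref_to_src):
    """Feature-metric reprojection residual between two frames.

    F_ref, F_src : (C, H, W) feature maps from a shared encoder
    depth_ref    : (H, W) depth of the reference frame
    K            : (3, 3) camera intrinsics
    T_ref_to_src : (4, 4) relative pose (reference -> source)
    """
    C, H, W = F_ref.shape
    device = F_ref.device

    # Back-project every reference pixel to a 3D point using its depth.
    v, u = torch.meshgrid(torch.arange(H, device=device),
                          torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()   # (3, H, W)
    rays = torch.linalg.inv(K) @ pix.reshape(3, -1)                # (3, H*W)
    pts = rays * depth_ref.reshape(1, -1)                          # (3, H*W)

    # Transform the points into the source frame and project them.
    pts_h = torch.cat([pts, torch.ones(1, H * W, device=device)], dim=0)
    pts_src = (T_ref_to_src @ pts_h)[:3]
    proj = K @ pts_src
    uv = proj[:2] / proj[2].clamp(min=1e-6)                        # (2, H*W)

    # Normalize to [-1, 1] and bilinearly sample the source feature map.
    grid = torch.stack([2 * uv[0] / (W - 1) - 1,
                        2 * uv[1] / (H - 1) - 1], dim=-1)          # (H*W, 2)
    sampled = F.grid_sample(F_src[None], grid.reshape(1, H, W, 2),
                            align_corners=True)[0]                 # (C, H, W)

    # Feature-metric error: difference in deep features, not raw pixels.
    return (sampled - F_ref).reshape(C, -1)
```

In an actual FBA step, the relative pose (and optionally the depth) would be parameterized on SE(3) and updated by minimizing a robust norm of this residual, e.g. with Gauss-Newton or a first-order optimizer; learned features tend to make this objective smoother and more robust to lighting changes than its photometric counterpart.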