SLCF-Net: Sequential LiDAR-Camera Fusion for Semantic Scene Completion using a 3D Recurrent U-Net (2403.08885v1)
Abstract: We introduce SLCF-Net, a novel approach to the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net, and a dense depth prior is estimated from a depth-conditioned pipeline built on Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along each line of sight, weighted by a Gaussian-decay function centered at the depth prior. Volumetric semantics is then computed by a 3D U-Net. We propagate the hidden state of the 3D U-Net using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate SLCF-Net on the SemanticKITTI dataset and compare it with leading SSC methods. SLCF-Net excels in all SSC metrics and shows strong temporal consistency.
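To make the GDP idea concrete, here is a minimal PyTorch sketch of how a per-pixel 2D feature could be distributed along its camera ray with Gaussian weights centered at the depth prior. The tensor shapes, the function name, and the `sigma` bandwidth are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def gaussian_decay_projection(feat_2d, depth_prior, ray_depths, sigma=1.0):
    """Illustrative sketch of Gaussian-decay Depth-prior Projection (GDP).

    feat_2d:     (N, C)  per-pixel 2D image features
    depth_prior: (N,)    estimated depth prior per pixel
    ray_depths:  (N, D)  depths of the voxel samples along each pixel's ray
    returns:     (N, D, C) features distributed along each line of sight

    Shapes and `sigma` are assumptions for this sketch, not the
    paper's exact configuration.
    """
    # Gaussian weight: peaks at the depth prior and decays with
    # distance from it along the ray.
    w = torch.exp(-((ray_depths - depth_prior.unsqueeze(1)) ** 2)
                  / (2.0 * sigma ** 2))                      # (N, D)
    # Broadcast each pixel's feature along its ray, scaled by the weight.
    return w.unsqueeze(-1) * feat_2d.unsqueeze(1)            # (N, D, C)
```

The effect is that voxels near the estimated depth receive the strongest feature contribution, while voxels far from it along the same ray are progressively suppressed, rather than smearing the feature uniformly along the line of sight.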
- A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.
- P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo Open Dataset,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2446–2454.
- J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences,” in IEEE International Conference on Computer Vision (ICCV), 2019.
- G. Fahim, K. Amin, and S. Zarif, “Single-view 3D reconstruction: A survey of deep learning methods,” Computers & Graphics, vol. 94, pp. 164–190, 2021.
- L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” arXiv preprint arXiv:2401.10891, 2024.
- M. Kazhdan, M. Bolitho, and H. Hoppe, “Poisson surface reconstruction,” in 4th Eurographics Symposium on Geometry Processing, vol. 7, 2006.
- M. Kazhdan and H. Hoppe, “Screened Poisson surface reconstruction,” ACM Transactions on Graphics, vol. 32, no. 3, pp. 1–13, 2013.
- S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1746–1754.
- L. Roldao, R. De Charette, and A. Verroust-Blondet, “3D semantic scene completion: A survey,” International Journal of Computer Vision (IJCV), vol. 130, no. 8, pp. 1978–2005, 2022.
- J. Li, Y. Liu, X. Yuan, C. Zhao, R. Siegwart, I. Reid, and C. Cadena, “Depth based semantic scene completion with position importance aware loss,” IEEE Robotics and Automation Letters (RA-L), vol. 5, no. 1, pp. 219–226, 2019.
- S.-C. Wu, K. Tateno, N. Navab, and F. Tombari, “SCFusion: Real-time incremental scene reconstruction with semantic completion,” in International Conference on 3D Vision (3DV), 2020, pp. 801–810.
- L. Roldao, R. de Charette, and A. Verroust-Blondet, “LMSCNet: Lightweight multiscale 3D semantic completion,” in International Conference on 3D Vision (3DV), 2020, pp. 111–119.
- X. Yan, J. Gao, J. Li, R. Zhang, Z. Li, R. Huang, and S. Cui, “Sparse single sweep LiDAR point cloud segmentation via learning contextual shape priors from scene completion,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 35, no. 4, 2021, pp. 3101–3109.
- R. Cheng, C. Agia, Y. Ren, X. Li, and L. Bingbing, “S3CNet: A sparse semantic scene completion network for LiDAR point clouds,” in Conference on Robot Learning (CoRL), 2021, pp. 2148–2161.
- M. Zhong and G. Zeng, “Semantic point completion network for 3D semantic scene completion,” in European Conference on Artificial Intelligence (ECAI), 2020, pp. 2824–2831.
- Y. Cai, X. Chen, C. Zhang, K.-Y. Lin, X. Wang, and H. Li, “Semantic scene completion via integrating instances and scene in-the-loop,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 324–333.
- I. Cherabier, J. L. Schönberger, M. R. Oswald, M. Pollefeys, and A. Geiger, “Learning priors for semantic 3D reconstruction,” in European Conference on Computer Vision (ECCV), 2018, pp. 314–330.
- A.-Q. Cao and R. de Charette, “MonoScene: Monocular 3D semantic scene completion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3991–4001.
- G. Riegler, A. O. Ulusoy, H. Bischof, and A. Geiger, “OctNetFusion: Learning depth fusion from data,” in International Conference on 3D Vision (3DV), 2017, pp. 57–66.
- J. Czarnowski, T. Laidlow, R. Clark, and A. J. Davison, “DeepFactors: Real-time probabilistic dense monocular SLAM,” IEEE Robotics and Automation Letters (RA-L), vol. 5, no. 2, pp. 721–728, 2020.
- S. Bultmann and S. Behnke, “3D semantic scene perception using distributed smart edge sensors,” in International Conference on Intelligent Autonomous Systems (IAS), 2022, pp. 313–329.
- S. Bultmann, R. Memmesheimer, and S. Behnke, “External camera-based mobile robot pose estimation for collaborative perception with smart edge sensors,” in IEEE International Conference on Robotics and Automation (ICRA), 2023.
- J. Hau, S. Bultmann, and S. Behnke, “Object-level 3D semantic mapping using a network of smart edge sensors,” in IEEE International Conference on Robotic Computing (IRC), 2022, pp. 198–206.
- S. Bultmann, J. Quenzel, and S. Behnke, “Real-time multi-modal semantic fusion on unmanned aerial vehicles with label propagation for cross-domain adaptation,” Robotics and Autonomous Systems, vol. 159, p. 104286, 2023.
- L. R. Medsker and L. Jain, Recurrent Neural Networks: Design and Applications. CRC Press, 2001.
- K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
- S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in Neural Information Processing Systems (NeurIPS), vol. 27, 2014.
- A. Villar-Corrales, A. Karapetyan, A. Boltres, and S. Behnke, “MSPred: Video prediction at multiple spatio-temporal scales with hierarchical recurrent networks,” in British Machine Vision Conference (BMVC), 2022.
- A. Villar-Corrales, I. Wahdan, and S. Behnke, “Object-centric video prediction via decoupling of object dynamics and interactions,” in IEEE International Conference on Image Processing (ICIP), 2023, pp. 570–574.
- J. Huang and G. Huang, “BEVDet4D: Exploit temporal cues in multi-camera 3D object detection,” arXiv preprint arXiv:2203.17054, 2022.
- M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning (ICML), 2019, pp. 6105–6114.
- Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves ImageNet classification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10687–10698.
- J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D,” in European Conference on Computer Vision (ECCV), 2020, pp. 194–210.
- J. Li, K. Han, P. Wang, Y. Liu, and X. Yuan, “Anisotropic convolutional networks for 3D semantic scene completion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3351–3359.
- P. Schütt, R. A. Rosu, and S. Behnke, “Abstract flow for temporal semantic segmentation on the permutohedral lattice,” in IEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 5139–5145.
- Y. Lin and H. Caesar, “ICP-Flow: LiDAR scene flow estimation with ICP,” arXiv preprint arXiv:2402.17351, 2024.