EPRecon: An Efficient Framework for Real-Time Panoptic 3D Reconstruction from Monocular Video

Published 3 Sep 2024 in cs.CV | (2409.01807v2)

Abstract: Panoptic 3D reconstruction from a monocular video is a fundamental perceptual task in robotic scene understanding. However, existing methods fall short in both inference speed and accuracy, which limits their practical applicability. We present EPRecon, an efficient real-time panoptic 3D reconstruction framework. Current volumetric reconstruction methods usually rely on multi-view depth map fusion to obtain scene depth priors, which is time-consuming and makes real-time scene reconstruction difficult. To address this issue, we propose a lightweight module that estimates scene depth priors directly in a 3D volume by generating occupancy probabilities for all voxels, improving reconstruction quality. In addition, unlike existing panoptic segmentation methods, EPRecon extracts panoptic features from both voxel features and the corresponding image features, obtaining more detailed and comprehensive instance-level semantic information and achieving more accurate segmentation results. Experimental results on the ScanNetV2 dataset demonstrate the superiority of EPRecon over current state-of-the-art methods in both panoptic 3D reconstruction quality and real-time inference. Code is available at https://github.com/zhen6618/EPRecon.
