
Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration (2401.12452v3)

Published 23 Jan 2024 in cs.CV

Abstract: This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes. Specifically, our approach, NCLR, focuses on 2D-3D neural calibration, a novel pretext task that estimates the rigid pose aligning the camera and LiDAR coordinate systems. First, we propose a learnable transformation alignment to bridge the domain gap between image and point cloud data, converting features into a unified representation space for effective comparison and matching. Second, we identify the overlapping area between the image and point cloud using the fused features. Third, we establish dense 2D-3D correspondences to estimate the rigid pose. The framework not only learns fine-grained matching from points to pixels but also aligns the image and point cloud at a holistic level by understanding their relative pose. We demonstrate the efficacy of NCLR by applying the pre-trained backbone to downstream tasks such as LiDAR-based 3D semantic segmentation, object detection, and panoptic segmentation. Comprehensive experiments on various datasets illustrate the superiority of NCLR over existing self-supervised methods. The results confirm that joint learning from different modalities significantly enhances the network's understanding and the effectiveness of the learned representations. The code is publicly available at https://github.com/Eaphan/NCLR.
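
The pretext task at the heart of NCLR is recovering the rigid transform between the camera and LiDAR frames from dense 2D-3D correspondences. As a rough illustration of that geometric step only, and not the authors' implementation (the network and training details are in the repository above), the sketch below assumes hypothetical matched pixel/point pairs and known camera intrinsics, and solves for the pose with OpenCV's EPnP-based PnP solver inside RANSAC:

```python
# A minimal sketch, not the NCLR code: given hypothetical dense 2D-3D
# correspondences between image pixels and LiDAR points, recover the
# rigid camera-LiDAR pose with a standard PnP solver.
import numpy as np
import cv2


def estimate_rigid_pose(points_3d, points_2d, K):
    """Estimate the rotation/translation aligning the LiDAR frame to the camera.

    points_3d: (N, 3) LiDAR points in the LiDAR coordinate system.
    points_2d: (N, 2) matched pixel coordinates in the image.
    K:         (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        points_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,            # assumes undistorted pixel coordinates
        flags=cv2.SOLVEPNP_EPNP,    # EPnP (Lepetit et al., 2009) inside RANSAC
    )
    if not ok:
        raise RuntimeError("PnP failed to recover a pose")
    R, _ = cv2.Rodrigues(rvec)      # rotation vector -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers
```

In NCLR itself, the correspondences are predicted by the network from fused image and point-cloud features, and it is the pre-trained backbone, rather than the recovered pose, that transfers to the downstream segmentation and detection tasks.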
