PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition (2207.03128v4)
Abstract: As two fundamental representation modalities of 3D objects, 3D point clouds and multi-view 2D images record shape information from the different domains of geometric structures and visual appearances. In the current deep learning era, remarkable progress in processing these two data modalities has been achieved by customizing compatible 3D and 2D network architectures. However, unlike multi-view image-based 2D visual modeling paradigms, which have shown leading performance on several common 3D shape recognition benchmarks, point cloud-based 3D geometric modeling paradigms are still highly limited by insufficient learning capacity, owing to the difficulty of extracting discriminative features from irregular geometric signals. In this paper, we explore the possibility of boosting deep 3D point cloud encoders by transferring visual knowledge extracted from deep 2D image encoders under a standard teacher-student distillation workflow. To this end, we propose PointMCD, a unified multi-view cross-modal distillation architecture that includes a pretrained deep image encoder as the teacher and a deep point encoder as the student. To perform heterogeneous feature alignment between the 2D visual and 3D geometric domains, we further investigate visibility-aware feature projection (VAFP), by which point-wise embeddings are reasonably aggregated into view-specific geometric descriptors. By aligning multi-view visual and geometric descriptors in a pairwise manner, we obtain more powerful deep point encoders without exhaustive and complicated network modification. Experiments on 3D shape classification, part segmentation, and unsupervised learning strongly validate the effectiveness of our method. The code and data will be publicly available at https://github.com/keeganhk/PointMCD.
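The following PyTorch sketch is a minimal, hypothetical rendition of the distillation workflow summarized in the abstract, not the released implementation: the toy point encoder, the masked max-pooling used as a stand-in for visibility-aware feature projection (VAFP), and the cosine-distance alignment loss are illustrative assumptions, and the per-view visibility masks and teacher descriptors are supplied as placeholder tensors.

```python
# Minimal sketch of multi-view cross-modal distillation (NOT the authors' code).
# Assumptions: a toy student encoder, masked max-pooling for VAFP, cosine alignment.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPointEncoder(nn.Module):
    """Student: maps (B, N, 3) point clouds to point-wise embeddings (B, N, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):
        return self.mlp(pts)


def vafp(point_feats, vis_mask):
    """Visibility-aware feature projection, assumed here as masked max-pooling.

    point_feats: (B, N, D) point-wise embeddings from the student.
    vis_mask:    (B, V, N) boolean mask, True where a point is visible in a view.
    Returns view-specific geometric descriptors of shape (B, V, D).
    """
    B, N, D = point_feats.shape
    V = vis_mask.shape[1]
    feats = point_feats.unsqueeze(1).expand(B, V, N, D)            # (B, V, N, D)
    masked = feats.masked_fill(~vis_mask.unsqueeze(-1), float('-inf'))
    return masked.max(dim=2).values                                 # (B, V, D)


def distill_loss(geo_desc, vis_desc):
    """Pairwise alignment of view-specific geometric and visual descriptors."""
    geo = F.normalize(geo_desc, dim=-1)
    vis = F.normalize(vis_desc, dim=-1)
    return (1.0 - (geo * vis).sum(dim=-1)).mean()                   # cosine distance


if __name__ == "__main__":
    B, N, V, D = 2, 1024, 6, 128
    pts = torch.rand(B, N, 3)
    vis_mask = torch.rand(B, V, N) > 0.5     # placeholder; e.g. from hidden-point removal
    vis_desc = torch.rand(B, V, D)           # placeholder outputs of a frozen 2D teacher

    student = ToyPointEncoder(D)
    loss = distill_loss(vafp(student(pts), vis_mask), vis_desc)
    loss.backward()
    print(loss.item())
```

In an actual setup, the teacher descriptors would come from a frozen image encoder applied to the rendered views, and only the student's parameters would be updated by the alignment loss.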
Authors: Qijian Zhang, Junhui Hou, Yue Qian