Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 164 tok/s
Gemini 2.5 Pro 46 tok/s Pro
GPT-5 Medium 21 tok/s Pro
GPT-5 High 27 tok/s Pro
GPT-4o 72 tok/s Pro
Kimi K2 204 tok/s Pro
GPT OSS 120B 450 tok/s Pro
Claude Sonnet 4.5 34 tok/s Pro
2000 character limit reached

Point Clouds Are Specialized Images: A Knowledge Transfer Approach for 3D Understanding (2307.15569v2)

Published 28 Jul 2023 in cs.CV

Abstract: Self-supervised representation learning (SSRL) has gained increasing attention in point cloud understanding, in addressing the challenges posed by 3D data scarcity and high annotation costs. This paper presents PCExpert, a novel SSRL approach that reinterprets point clouds as "specialized images". This conceptual shift allows PCExpert to leverage knowledge derived from large-scale image modality in a more direct and deeper manner, via extensively sharing the parameters with a pre-trained image encoder in a multi-way Transformer architecture. The parameter sharing strategy, combined with a novel pretext task for pre-training, i.e., transformation estimation, empowers PCExpert to outperform the state of the arts in a variety of tasks, with a remarkable reduction in the number of trainable parameters. Notably, PCExpert's performance under LINEAR fine-tuning (e.g., yielding a 90.02% overall accuracy on ScanObjectNN) has already approached the results obtained with FULL model fine-tuning (92.66%), demonstrating its effective and robust representation capability.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (50)
  1. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9902–9912, 2022.
  2. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  3. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9297–9307, 2019.
  4. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  5. Autoencoders as cross-modal teachers: Can pretrained 2d image transformers help 3d representation learning? In The Eleventh International Conference on Learning Representations, 2023.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  7. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
  8. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1179–1189, 2023.
  9. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  10. Lidarclip or: How i learned to talk to point clouds. arXiv preprint arXiv:2212.06858, 2022.
  11. Frozen clip model is efficient point cloud backbone. arXiv preprint arXiv:2212.04098, 2022.
  12. Cross-modal center loss for 3d cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3142–3151, 2021.
  13. Toward extracting and exploiting generalizable knowledge of deep 2d transformations in computer vision. Neurocomputing, 562:126882, 2023.
  14. A closer look at invariances in self-supervised pre-training for 3d vision. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXX, pages 656–673. Springer, 2022.
  15. Pointcnn: Convolution on x-transformed points. Advances in neural information processing systems, 31, 2018.
  16. Simipu: Simple 2d image and 3d point cloud unsupervised pre-training for spatial-aware visual representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1500–1508, 2022.
  17. Learning from 2d: Contrastive pixel-to-point knowledge transfer for 3d pretraining. arXiv preprint arXiv:2104.04687, 2021.
  18. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  19. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. In International Conference on Learning Representations, 2021.
  20. Masked autoencoders for point cloud self-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022.
  21. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  22. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
  23. Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining. In International Conference on Machine Learning (ICML), 2023.
  24. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
  25. Pix4point: Image pretrained transformers for 3d point cloud understanding. arXiv preprint arXiv:2208.12259, 2022.
  26. Geometric back-projection network for point cloud classification. IEEE Transactions on Multimedia, 24:1943–1955, 2021.
  27. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  28. A survey on lidar scanning mechanisms. Electronics, 9(5):741, 2020.
  29. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020.
  30. Efficient 3d scene semantic segmentation via active learning on rendered 2d images. IEEE Transactions on Image Processing, 2023.
  31. Vipformer: Efficient vision-and-pointcloud transformer for unsupervised pointcloud understanding. arXiv preprint arXiv:2303.14376, 2023.
  32. Point-lgmask: Local and global contexts embedding for point cloud pre-training with multi-ratio masking. IEEE Transactions on Multimedia, 2023.
  33. Self-supervised learning with multi-view rendering for 3d point cloud analysis. In Proceedings of the Asian Conference on Computer Vision, pages 3086–3103, 2022.
  34. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
  35. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  36. Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9782–9792, 2021.
  37. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023.
  38. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
  39. P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting. Advances in neural information processing systems, 35:14388–14402, 2022.
  40. Self-supervised intra-modal and cross-modal contrastive learning for point cloud understanding. IEEE Transactions on Multimedia, 2023.
  41. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  42. Image2point: 3d point-cloud understanding with 2d image pretrained models. In European Conference on Computer Vision, pages 638–656. Springer, 2022.
  43. Semantic navigation of powerpoint-based lecture video for autonote generation. IEEE Transactions on Learning Technologies, 16(1):1–17, 2022.
  44. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. Advances in neural information processing systems, 32, 2019.
  45. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  46. Clip-fo3d: Learning free open-world 3d scene representations from 2d dense clip. arXiv preprint arXiv:2303.04748, 2023.
  47. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In Advances in Neural Information Processing Systems, 2022.
  48. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.
  49. Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10252–10263, 2021.
  50. Pointcmc: Cross-modal multi-scale correspondences learning for point cloud understanding. arXiv preprint arXiv:2211.12032, 2022.
Citations (1)

Summary

We haven't generated a summary for this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.