EPCL: Frozen CLIP Transformer is An Efficient Point Cloud Encoder (2212.04098v3)

Published 8 Dec 2022 in cs.CV

Abstract: The pretrain-finetune paradigm has achieved great success in NLP and 2D image fields because of the high-quality representation ability and transferability of their pretrained models. However, pretraining such a strong model is difficult in the 3D point cloud field due to the limited amount of point cloud sequences. This paper introduces Efficient Point Cloud Learning (EPCL), an effective and efficient point cloud learner for directly training high-quality point cloud models with a frozen CLIP transformer. EPCL connects the 2D and 3D modalities by semantically aligning image features and point cloud features without paired 2D-3D data. Specifically, the input point cloud is divided into a series of local patches, which are converted to token embeddings by the designed point cloud tokenizer. These token embeddings are concatenated with a task token and fed into the frozen CLIP transformer to learn the point cloud representation. The intuition is that the proposed point cloud tokenizer projects the input point cloud into a unified token space similar to that of 2D images. Comprehensive experiments on 3D detection, semantic segmentation, classification, and few-shot learning demonstrate that the CLIP transformer can serve as an efficient point cloud encoder, and that the method achieves promising performance on both indoor and outdoor benchmarks. In particular, the performance gains brought by EPCL are 19.7 AP50 on ScanNet V2 detection, 4.4 mIoU on S3DIS segmentation, and 1.2 mIoU on SemanticKITTI segmentation compared to contemporary pretrained models. Code is available at https://github.com/XiaoshuiHuang/EPCL.
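
To make the pipeline concrete, below is a minimal PyTorch sketch of the approach the abstract describes: a tokenizer splits the cloud into local patches and embeds each patch as one token, a learnable task token is prepended, and the sequence is passed through a frozen transformer. Every concrete choice here (the farthest-point patch sampling, the small MLP tokenizer, the 768-dimensional width matching CLIP ViT-B, the classification head) is an illustrative assumption, not the authors' exact implementation.

```python
# Minimal EPCL-style sketch (PyTorch assumed). Only the overall idea --
# point-patch tokens plus a task token fed through a frozen transformer --
# follows the paper; the concrete components below are assumptions.
import torch
import torch.nn as nn


def farthest_point_sample(xyz: torch.Tensor, n_centers: int) -> torch.Tensor:
    """Pick n_centers well-spread patch centers from a (B, N, 3) cloud."""
    B, N, _ = xyz.shape
    idx = torch.zeros(B, n_centers, dtype=torch.long, device=xyz.device)
    dist = torch.full((B, N), float("inf"), device=xyz.device)
    farthest = torch.zeros(B, dtype=torch.long, device=xyz.device)
    for i in range(n_centers):
        idx[:, i] = farthest
        center = xyz[torch.arange(B, device=xyz.device), farthest]
        dist = torch.minimum(dist, ((xyz - center.unsqueeze(1)) ** 2).sum(-1))
        farthest = dist.argmax(dim=-1)  # next center: farthest remaining point
    return idx


class PointTokenizer(nn.Module):
    """Embed each local patch (center + k nearest neighbors) as one token."""

    def __init__(self, n_patches: int = 196, k: int = 32, dim: int = 768):
        super().__init__()
        self.n_patches, self.k = n_patches, k
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.GELU(), nn.Linear(128, dim))

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:          # (B, N, 3)
        B = xyz.size(0)
        centers = farthest_point_sample(xyz, self.n_patches)       # (B, P)
        c = torch.gather(xyz, 1, centers.unsqueeze(-1).expand(-1, -1, 3))
        knn = torch.cdist(c, xyz).topk(self.k, largest=False).indices
        batch = torch.arange(B, device=xyz.device).view(B, 1, 1)
        patches = xyz[batch, knn] - c.unsqueeze(2)   # (B, P, k, 3), centered
        return self.mlp(patches).max(dim=2).values   # max-pool -> (B, P, dim)


class EPCLClassifier(nn.Module):
    """Point tokens + task token through a frozen transformer, then a head."""

    def __init__(self, frozen_blocks: nn.Module, dim: int = 768, n_classes: int = 40):
        super().__init__()
        self.tokenizer = PointTokenizer(dim=dim)
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = frozen_blocks
        for p in self.blocks.parameters():  # 2D-pretrained weights stay fixed
            p.requires_grad_(False)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        tok = self.tokenizer(xyz)                            # (B, P, dim)
        task = self.task_token.expand(tok.size(0), -1, -1)   # (B, 1, dim)
        x = self.blocks(torch.cat([task, tok], dim=1))       # frozen pass
        return self.head(x[:, 0])            # read out at the task token
```

In the paper, `frozen_blocks` would be the transformer of a pretrained CLIP image encoder; a randomly initialized stand-in keeps the sketch self-contained and runnable:

```python
stand_in = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)
model = EPCLClassifier(stand_in)
logits = model(torch.randn(2, 1024, 3))  # two clouds of 1024 points each
print(logits.shape)                      # torch.Size([2, 40])
```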
