
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification (2305.15957v3)

Published 25 May 2023 in cs.CV

Abstract: Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, with the CLIP model achieving impressive results in image classification, object detection, and semantic segmentation. However, its performance on 3D point cloud processing tasks is limited by the domain gap between the depth maps obtained from 3D projection and CLIP's training images. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong 3D understanding capabilities. Using stable diffusion and style-prompt generation, DiffCLIP achieves 43.2% zero-shot classification accuracy on the OBJ_BG split of ScanObjectNN, which is state-of-the-art, and 80.6% zero-shot classification accuracy on ModelNet10, which is comparable to state-of-the-art performance.
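
The visual-branch idea described in the abstract can be sketched in a few lines: a depth map rendered from the point cloud conditions a ControlNet-guided stable diffusion model, and the photorealistic output is classified zero-shot by CLIP against per-class text prompts. The sketch below is a minimal illustration assuming the Hugging Face `diffusers` and `transformers` libraries; the checkpoint names, the fixed text prompt standing in for the paper's learned style prompts, and the omitted point-cloud-to-depth projection are all assumptions, not the authors' exact configuration.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import CLIPModel, CLIPProcessor

# Depth-conditioned ControlNet bridges the domain gap: a depth map rendered
# from the point cloud is translated into a photorealistic RGB image.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# CLIP then scores the generated image against one text prompt per class.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(depth_map, class_names):
    """depth_map: a PIL image rendered from the point cloud (projection omitted)."""
    # A fixed prompt stands in for DiffCLIP's learned style prompts (illustrative).
    rgb = pipe("a photo of an object", image=depth_map,
               num_inference_steps=20).images[0]
    texts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=texts, images=rgb, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image  # shape (1, num_classes)
    return class_names[int(logits.argmax(dim=-1))]
```

The diffusion step is what lets CLIP's frozen image encoder see inputs resembling its training distribution; per the abstract, the style-prompt generation module plays the analogous role on the text side for few-shot tasks.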

Authors (5)
  1. Sitian Shen
  2. Zilin Zhu
  3. Linqian Fan
  4. Harry Zhang
  5. Xinxiao Wu