MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition (2311.18402v3)

Published 30 Nov 2023 in cs.CV

Abstract: Large-scale pre-trained models have demonstrated impressive performance on vision and language tasks in open-world scenarios. Because no comparable pre-trained models exist for 3D shapes, recent methods leverage language-image pre-training to realize zero-shot 3D shape recognition. However, owing to the modality gap, pre-trained language-image models are not confident enough when generalizing to 3D shape recognition. This paper therefore aims to improve that confidence through view selection and hierarchical prompts. Taking the CLIP model as an example, we apply view selection on the vision side by identifying the views with the highest prediction confidence among multiple rendered views of a 3D shape. On the textual side, we propose, for the first time, a hierarchical prompting strategy: the first layer proposes several classification candidates using traditional class-level descriptions, while the second layer refines the prediction using function-level descriptions or further distinctions between the candidates. Remarkably, without any additional training, the proposed method achieves zero-shot 3D classification accuracies of 84.44%, 91.51%, and 66.17% on ModelNet40, ModelNet10, and ShapeNet Core55, respectively. We will also make the code publicly available to facilitate reproducibility and further research in this area.
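
For illustration, here is a minimal sketch of the two ideas the abstract describes: confidence-based view selection and a two-layer prompt hierarchy. It assumes pre-rendered view images and OpenAI's clip package; the label set, function-level descriptions, confidence threshold, and aggregation rule are hypothetical stand-ins, not the paper's exact procedure.

```python
# Sketch of confidence-based view selection plus a two-layer prompt
# hierarchy, in the spirit of the abstract. Thresholds, labels, and
# description texts below are illustrative assumptions only.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical label set with function-level descriptions for layer 2.
class_names = ["chair", "stool", "bench"]
function_prompts = {
    "chair": "a seat with a backrest for one person",
    "stool": "a backless seat for one person",
    "bench": "a long seat for several people",
}

@torch.no_grad()
def encode_texts(texts):
    tokens = clip.tokenize(texts).to(device)
    feats = model.encode_text(tokens)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def classify_shape(view_images, conf_threshold=0.5, top_k=2):
    """view_images: list of PIL images rendered from one 3D shape."""
    views = torch.stack([preprocess(im) for im in view_images]).to(device)
    img = model.encode_image(views)
    img = img / img.norm(dim=-1, keepdim=True)

    # Layer 1: class-level prompts, per-view class probabilities.
    txt = encode_texts([f"a photo of a {c}" for c in class_names])
    probs = (100.0 * img @ txt.T).softmax(dim=-1)      # (views, classes)

    # View selection: keep views whose top prediction is confident;
    # fall back to all views if none qualify.
    conf, _ = probs.max(dim=-1)
    keep = conf >= conf_threshold
    if not keep.any():
        keep = torch.ones_like(keep, dtype=torch.bool)
    mean_probs = probs[keep].mean(dim=0)               # (classes,)

    # Layer 2: re-rank the top-k candidates with function-level prompts.
    cand_idx = mean_probs.topk(top_k).indices.tolist()
    cand = [class_names[i] for i in cand_idx]
    txt2 = encode_texts([function_prompts[c] for c in cand])
    probs2 = (100.0 * img[keep] @ txt2.T).softmax(dim=-1).mean(dim=0)
    return cand[probs2.argmax().item()]
```

In the paper the second-layer descriptions come from the method itself; here they are hand-written purely to show the control flow of selecting confident views first and then refining among the remaining candidates.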

Authors (8)
  1. Dan Song (28 papers)
  2. Xinwei Fu (6 papers)
  3. Weizhi Nie (20 papers)
  4. Wenhui Li (41 papers)
  5. Lanjun Wang (36 papers)
  6. You Yang (6 papers)
  7. Anan Liu (20 papers)
  8. Ning Liu (199 papers)
Citations (4)
