Open-Pose 3D Zero-Shot Learning: Benchmark and Challenges (2312.07039v2)
Abstract: With the explosive growth of 3D data, the need for zero-shot learning to ease data labeling has become evident. Recently, methods that transfer language or language-image pre-training models such as Contrastive Language-Image Pre-training (CLIP) to 3D vision have made significant progress on the 3D zero-shot classification task. These methods primarily address 3D object classification under an aligned pose; such a setting is, however, rather restrictive, as it overlooks the recognition of 3D objects in open poses typically encountered in real-world scenarios, such as an overturned chair or a lying teddy bear. To this end, we propose a more realistic and challenging scenario named open-pose 3D zero-shot classification, which targets the recognition of 3D objects regardless of their orientation. First, we revisit current research on 3D zero-shot classification and propose two benchmark datasets specifically designed for the open-pose setting. We empirically validate many of the most popular methods on the proposed open-pose benchmark. Our investigation reveals that most current 3D zero-shot classification models perform poorly in this setting, indicating substantial room for exploration in this new direction. Furthermore, we study a concise pipeline with an iterative angle refinement mechanism that automatically searches for an ideal viewing angle from which to classify open-pose 3D objects. In particular, to make the validation more compelling and not limited to existing CLIP-based methods, we also pioneer the exploration of knowledge transfer based on diffusion models. While the proposed solutions can serve as a new benchmark for open-pose 3D zero-shot classification, we discuss the complexities and challenges of this scenario that remain open for future research. The code is publicly available at https://github.com/weiguangzhao/Diff-OP3D.
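The iterative angle refinement mechanism mentioned above can be illustrated with a minimal coordinate-ascent sketch. This is an assumption-laden illustration, not the paper's exact algorithm: `score_fn` stands in for the step that renders the 3D object at a candidate azimuth/elevation and returns a zero-shot confidence (e.g. from CLIP or a diffusion-based classifier); here it is an arbitrary callable so the sketch is self-contained.

```python
def refine_angle(score_fn, init=(0.0, 0.0), steps=(30.0, 10.0, 3.0), iters=5):
    """Coarse-to-fine angle search (hypothetical sketch, not the paper's
    exact mechanism).

    score_fn(azim, elev) -> float: in the open-pose setting this would
    render the object at that pose and score it with a zero-shot model;
    any callable works here.
    """
    azim, elev = init
    for step in steps:  # shrink the search step from coarse to fine
        for _ in range(iters):
            improved = False
            # Try moving one step in each of the four axis directions.
            for da, de in [(step, 0), (-step, 0), (0, step), (0, -step)]:
                cand = (azim + da, elev + de)
                if score_fn(*cand) > score_fn(azim, elev):
                    azim, elev = cand
                    improved = True
            if not improved:  # local optimum at this resolution
                break
    return azim, elev

# Toy stand-in score, peaked at azimuth=45 deg, elevation=-20 deg.
def toy_score(a, e):
    return -((a - 45.0) ** 2 + (e + 20.0) ** 2)

best_azim, best_elev = refine_angle(toy_score)
```

With the toy score above, the search settles within a few degrees of the peak at (45, -20); in the actual pipeline, the returned angle would be the pose used for final zero-shot classification.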
- Weiguang Zhao
- Guanyu Yang
- Chaolong Yang
- Chenru Jiang
- Yuyao Yan
- Rui Zhang
- Kaizhu Huang
- Amir Hussain