OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding (2305.10764v2)
Abstract: We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
- Satr: Zero-shot semantic segmentation of 3d shapes. arXiv preprint arXiv:2304.04909, 2023.
- Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 123–133, 2021.
- Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
- Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
- Clipface: Text-guided editing of textured 3d morphable models. arXiv preprint arXiv:2212.01406, 2022.
- Romain Beaumont. Clip retrieval: Easily compute clip embeddings and build a clip retrieval system with them. https://github.com/rom1504/clip-retrieval, 2022.
- Text and image guided 3d avatar generation and manipulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4421–4431, 2023.
- Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
- Clip2scene: Towards label-efficient 3d scene understanding by clip. arXiv preprint arXiv:2301.04926, 2023.
- 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019.
- Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.
- Objaverse: A universe of annotated 3d objects. arXiv preprint arXiv:2212.08051, 2022.
- Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. In Proceedings of the European conference on computer vision (ECCV), pages 602–618, 2018.
- Language-driven open-vocabulary 3d scene understanding. arXiv preprint arXiv:2211.16312, 2022.
- Self-supervised learning on 3d point clouds by learning discrete generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8248–8257, 2021.
- 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
- Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
- Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. In Conference on Robot Learning, 2022.
- Clip goes 3d: Leveraging prompt tuning for language grounded 3d recognition. arXiv preprint arXiv:2303.11313, 2023.
- Masked autoencoder for self-supervised pre-training on lidar point clouds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 350–359, 2023.
- Lidarclip or: How i learned to talk to point clouds. arXiv preprint arXiv:2212.06858, 2022.
- Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv preprint arXiv:2205.08535, 2022.
- Joint representation learning for text and 3d point cloud. arXiv preprint arXiv:2301.07584, 2023.
- Clip2point: Transfer clip to point cloud classification with image-depth pre-training. arXiv preprint arXiv:2210.01055, 2022.
- Openclip, July 2021. If you use this software, please cite it as below.
- Zero-shot text-guided object generation with dream fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 867–876, 2022.
- Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
- Nikolay Jetchev. Clipmatrix: Text-controlled creation of 3d textured meshes. arXiv preprint arXiv:2109.12922, 2021.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33:21798–21809, 2020.
- Lerf: Language embedded radiance fields. arXiv preprint arXiv:2303.09553, 2023.
- Text to mesh without 3d supervision using limit subdivision. arXiv preprint arXiv:2203.13333, 2022.
- Understanding pure clip guidance for voxel grid nerf models. arXiv preprint arXiv:2209.15172, 2022.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
- Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
- Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. arXiv preprint arXiv:2212.01558, 2022.
- Iss: Image as stetting stone for text-guided 3d shape generation. arXiv preprint arXiv:2209.04145, 2022.
- Open-vocabulary point-cloud object detection without 3d annotation. arXiv preprint arXiv:2304.00788, 2023.
- Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
- Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 922–928. IEEE, 2015.
- Self-supervised point cloud prediction using 3d spatio-temporal convolutional networks. In Conference on Robot Learning, pages 1444–1454. PMLR, 2022.
- Text2mesh: Text-driven neural stylization for meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13492–13502, 2022.
- Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- OpenAI. Gpt-4 technical report, 2023.
- Masked autoencoders for point cloud self-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022.
- Openscene: 3d scene understanding with open vocabularies. arXiv preprint arXiv:2211.15654, 2022.
- Self-supervised learning of point clouds via orientation estimation. In 2020 International Conference on 3D Vision (3DV), pages 1018–1028. IEEE, 2020.
- Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
- Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017.
- Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining. arXiv preprint arXiv:2302.02318, 2023.
- Pointnext: Revisiting pointnet++ with improved training and scaling strategies. arXiv:2206.04670, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
- Global-local bidirectional reasoning for unsupervised representation learning of 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5376–5385, 2020.
- Contrastive learning with hard negative samples. arXiv preprint arXiv:2010.04592, 2020.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- Language-grounded indoor 3d semantic segmentation in the wild. arXiv preprint arXiv:2204.07761, 2022.
- Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
- Aditya Sanghi. Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIX 16, pages 626–642. Springer, 2020.
- Clip-forge: Towards zero-shot text-to-shape generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18603–18613, 2022.
- Self-supervised deep learning on point clouds by reconstructing space. Advances in Neural Information Processing Systems, 32, 2019.
- Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
- Self-supervised few-shot learning on point clouds. Advances in Neural Information Processing Systems, 33:7212–7221, 2020.
- Point cloud pre-training by mixing and disentangling. arXiv e-prints, pages arXiv–2109, 2021.
- Self-supervised learning of local features in 3d point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 938–939, 2020.
- Vishaal Udandarao. Understanding and fixing the modality gap in vision-language models. Master’s thesis, University of Cambridge, 2022.
- Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
- Unsupervised point cloud pre-training via occlusion completion. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9782–9792, 2021.
- Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
- Sim2real 3d object classification using spherical kernel point convolution and a deep center voting scheme. arXiv preprint arXiv:2103.06134, 2021.
- 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 574–591. Springer, 2020.
- Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. arXiv preprint arXiv:2212.14704, 2022.
- Ulip: Learning unified representation of language, image and point cloud for 3d understanding. arXiv preprint arXiv:2212.05171, 2022.
- Regionplc: Regional point-language contrastive learning for open-world 3d scene understanding. arXiv preprint arXiv:2304.00962, 2023.
- Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
- Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Clip^ 2: Contrastive language-image-point pretraining from real-world point cloud data. arXiv preprint arXiv:2303.12417, 2023.
- Glipv2: Unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836, 2022.
- Clip-fo3d: Learning free open-world 3d scene representations from 2d dense clip. arXiv preprint arXiv:2303.04748, 2023.
- Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8552–8562, 2022.
- Self-supervised pretraining of 3d features on any point-cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10252–10263, 2021.
- Pointclip v2: Adapting clip for powerful 3d open-world learning. arXiv preprint arXiv:2211.11682, 2022.
- Minghua Liu (22 papers)
- Ruoxi Shi (20 papers)
- Kaiming Kuang (11 papers)
- Yinhao Zhu (14 papers)
- Xuanlin Li (18 papers)
- Shizhong Han (26 papers)
- Hong Cai (51 papers)
- Fatih Porikli (141 papers)
- Hao Su (218 papers)