Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
Abstract: Contrastive learning has emerged as a promising paradigm for 3D open-world understanding, i.e., aligning point cloud representations to the image and text embedding spaces individually. In this paper, we introduce MixCon3D, a simple yet effective method that aims to sculpt holistic 3D representations in contrastive language-image-3D pre-training. Rather than relying on the point cloud alone, we build the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images together with the point cloud. MixCon3D then performs language-3D contrastive learning on this joint representation, depicting real-world 3D objects more comprehensively and bolstering text alignment. Additionally, we present the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments on three representative benchmarks show that our method significantly improves over the baseline, surpassing the previous state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%. The versatility of MixCon3D is showcased in applications such as text-to-3D retrieval and point cloud captioning, further evidencing its efficacy in diverse scenarios. The code is available at https://github.com/UCSC-VLAA/MixCon3D.
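The idea of contrasting a joint 3D representation against text can be illustrated with a minimal NumPy sketch. It is not the released MixCon3D implementation: the function names (`fuse_3d`, `symmetric_info_nce`), the fusion by averaging view features and adding them to the point cloud feature, the temperature of 0.07, and the synthetic features are all assumptions made for illustration. The abstract does not specify how the modalities are combined.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` and row i of `b` are positives."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (B, B) scaled cosine similarities
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def fuse_3d(point_feat, view_feats):
    """Hypothetical fusion: average the view features, add the point feature.

    point_feat: (B, D) point cloud embeddings
    view_feats: (B, V, D) embeddings of V rendered views per object
    """
    pooled_views = l2_normalize(view_feats).mean(axis=1)
    return l2_normalize(l2_normalize(point_feat) + pooled_views)

# Synthetic features standing in for encoder outputs (assumption).
rng = np.random.default_rng(0)
B, V, D = 4, 3, 8
text_feat = rng.normal(size=(B, D))
point_feat = text_feat + 0.1 * rng.normal(size=(B, D))
view_feats = text_feat[:, None, :] + 0.1 * rng.normal(size=(B, V, D))

joint = fuse_3d(point_feat, view_feats)
aligned_loss = symmetric_info_nce(joint, text_feat)         # matched pairs
shuffled_loss = symmetric_info_nce(joint, text_feat[::-1])  # mismatched pairs
print(aligned_loss, shuffled_loss)
```

In this toy setup the loss for correctly paired objects and captions is lower than the loss for deliberately mismatched pairs, which is the signal a contrastive objective of this kind optimizes. A real pipeline would replace the synthetic arrays with outputs of trained point cloud, image, and text encoders.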