Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training

Published 3 Nov 2023 in cs.CV | (2311.01734v2)

Abstract: Contrastive learning has emerged as a promising paradigm for 3D open-world understanding, i.e., aligning point cloud representation to image and text embedding space individually. In this paper, we introduce MixCon3D, a simple yet effective method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training. In contrast to point cloud only, we develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud. Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment. Additionally, we pioneer the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments conducted on three representative benchmarks reveal that our method significantly improves over the baseline, surpassing the previous state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%. The versatility of MixCon3D is showcased in applications such as text-to-3D retrieval and point cloud captioning, further evidencing its efficacy in diverse scenarios. The code is available at https://github.com/UCSC-VLAA/MixCon3D.
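
The abstract describes aligning a 3D object representation, built from a point cloud together with multi-view rendered images, to text through contrastive learning. The following is a minimal NumPy sketch of that general idea, not the MixCon3D implementation. The function names (`fuse_3d`, `symmetric_info_nce`), the fusion rule (averaging view features and adding them to the point cloud feature), and the temperature of 0.07 are all assumptions made for illustration. The random arrays stand in for encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: row i of `a` is the positive for row i of `b`.

    The temperature value is an assumed default, not taken from the paper.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature            # (B, B) cosine similarities
    idx = np.arange(a.shape[0])
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def fuse_3d(point_feat, view_feats):
    """Hypothetical fusion of point cloud and multi-view image features.

    point_feat: (B, D) point cloud embeddings.
    view_feats: (B, V, D) embeddings of V rendered views per object.
    Averaging the views and summing with the point feature is an
    assumption chosen for simplicity.
    """
    image_feat = l2_normalize(view_feats.mean(axis=1))
    return l2_normalize(l2_normalize(point_feat) + image_feat)

# Stand-in embeddings: batch of 4 objects, 3 views each, 8 dimensions.
rng = np.random.default_rng(0)
point_feat = rng.normal(size=(4, 8))
view_feats = rng.normal(size=(4, 3, 8))
text_feat = rng.normal(size=(4, 8))

fused = fuse_3d(point_feat, view_feats)
loss = symmetric_info_nce(fused, text_feat)
print("fused shape:", fused.shape, "loss:", float(loss))
```

In this sketch the loss is lowest when each fused 3D embedding is most similar to its own caption embedding and dissimilar to the other captions in the batch, which is the alignment behavior the abstract refers to.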
