TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding (2402.18490v2)
Abstract: The limited scale of current 3D shape datasets hinders advances in 3D shape understanding and motivates multi-modal learning approaches that transfer knowledge learned from data-abundant 2D image and language modalities to 3D shapes. However, even though image and language representations have been aligned by cross-modal models like CLIP, we find that the image modality fails to contribute as much as language in existing multi-modal 3D representation learning methods. This is attributed to the domain shift in the 2D images and the distinct focus of each modality. To leverage both modalities more effectively during pre-training, we introduce TriAdapter Multi-Modal Learning (TAMM) -- a novel two-stage learning approach based on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures a more comprehensive and effective multi-modal pre-training. Extensive experiments demonstrate that TAMM consistently enhances 3D representations across a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7%, and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%. Project page: https://alanzhangcs.github.io/tamm-page.
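The abstract describes adapters that refine frozen CLIP features and a contrastive alignment between 3D features and the adapted image/text features. The paper's exact architecture is not given here, so the following is only a minimal sketch under stated assumptions: a residual bottleneck MLP adapter in the CLIP-Adapter style and a standard symmetric InfoNCE objective. All module names, feature dimensions, and the residual ratio are illustrative, not TAMM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Residual bottleneck MLP adapter (CLIP-Adapter style).

    Refines a frozen feature with a small MLP and blends the result with the
    original feature via a residual ratio. Hyper-parameters are illustrative.
    """

    def __init__(self, dim: int = 512, bottleneck: int = 128, ratio: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        return self.ratio * self.mlp(x) + (1.0 - self.ratio) * x


def symmetric_infonce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric CLIP-style contrastive loss between two aligned feature batches."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    # Toy usage with random placeholder features (batch of 8, dim 512):
    # stage 1 adapts CLIP image features to the rendered-image domain; stage 2
    # aligns two decoupled branches of the 3D feature with image and text.
    image_adapter = Adapter()                       # CLIP Image Adapter (stage 1)
    visual_branch, semantic_branch = Adapter(), Adapter()  # Dual Adapters (stage 2)

    point_feat = torch.randn(8, 512)   # placeholder 3D encoder output
    clip_img = torch.randn(8, 512)     # placeholder frozen CLIP image features
    clip_txt = torch.randn(8, 512)     # placeholder frozen CLIP text features

    loss = (symmetric_infonce(visual_branch(point_feat), image_adapter(clip_img))
            + symmetric_infonce(semantic_branch(point_feat), clip_txt))
    print(loss.item())
```

The design choice sketched here mirrors the abstract's description: one branch of the 3D representation is pulled toward (domain-adapted) image features and the other toward text features, so each sub-space specializes rather than competing for a single embedding.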
Authors: Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang