Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding (2405.18523v2)

Published 28 May 2024 in cs.CV and cs.AI

Abstract: We introduce MM-Mixing, a multi-modal mixing alignment framework for 3D understanding. MM-Mixing applies mixing-based methods to multi-modal data, preserving and optimizing cross-modal connections while enhancing diversity and improving alignment across modalities. Our proposed two-stage training pipeline combines feature-level and input-level mixing to optimize the 3D encoder. The first stage employs feature-level mixing with contrastive learning to align 3D features with their corresponding modalities. The second stage incorporates both feature-level and input-level mixing, introducing mixed point cloud inputs to further refine 3D feature representations. MM-Mixing enhances intermodality relationships, promotes generalization, and ensures feature consistency while providing diverse and realistic training samples. We demonstrate that MM-Mixing significantly improves baseline performance across various learning scenarios, including zero-shot 3D classification, linear probing 3D classification, and cross-modal 3D shape retrieval. Notably, we improved the zero-shot classification accuracy on ScanObjectNN from 51.3% to 61.9%, and on Objaverse-LVIS from 46.8% to 51.4%. Our findings highlight the potential of multi-modal mixing-based alignment to significantly advance 3D object recognition and understanding while remaining straightforward to implement and integrate into existing frameworks.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (74)
  1. Satr: Zero-shot semantic segmentation of 3d shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15166–15179, 2023.
  2. Self-supervised learning for domain adaptation on point clouds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 123–133, 2021.
  3. Learning representations and generative models for 3d point clouds. In International conference on machine learning, pages 40–49. PMLR, 2018.
  4. Clipface: Text-guided editing of textured 3d morphable models. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  5. Text and image guided 3d avatar generation and manipulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4421–4431, 2023.
  6. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  7. Signvtcl: Multi-modal continuous sign language recognition enhanced by visual-textual contrastive learning. arXiv preprint arXiv:2401.11847, 2024.
  8. Clip2scene: Towards label-efficient 3d scene understanding by clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7020–7030, 2023.
  9. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1907–1915, 2017.
  10. Pointmixup: Augmentation for point clouds. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 330–345. Springer, 2020.
  11. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3075–3084, 2019.
  12. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022.
  13. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  14. Ppf-foldnet: Unsupervised learning of rotation invariant 3d local descriptors. In Proceedings of the European conference on computer vision (ECCV), pages 602–618, 2018.
  15. Pla: Language-driven open-vocabulary 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7010–7019, 2023.
  16. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, 129:3313–3337, 2021.
  17. Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.
  18. Joint-mae: 2d-3d joint masked autoencoders for 3d point cloud pre-training. arXiv preprint arXiv:2302.14007, 2023a.
  19. Point-bind & point-llm: Aligning point cloud with multi-modality for 3d understanding, generation, and instruction following. arXiv preprint arXiv:2309.00615, 2023b.
  20. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.
  21. Semantic abstraction: Open-world 3d scene understanding from 2d vision-language models. arXiv preprint arXiv:2207.11514, 2022.
  22. Clip goes 3d: Leveraging prompt tuning for language grounded 3d recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2028–2038, 2023.
  23. Masked autoencoder for self-supervised pre-training on lidar point clouds. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 350–359, 2023.
  24. Avatarclip: Zero-shot text-driven generation and animation of 3d avatars. arXiv preprint arXiv:2205.08535, 2022.
  25. Joint representation learning for text and 3d point cloud. Pattern Recognition, 147:110086, 2024.
  26. Clip2point: Transfer clip to point cloud classification with image-depth pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22157–22167, 2023.
  27. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241, 2023.
  28. Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5010–5019, 2018.
  29. Point cloud augmentation with weighted local transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 548–557, 2021.
  30. Regularization strategy for point cloud via rigidly mixed sample. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15900–15909, 2021.
  31. Sagemix: Saliency-guided mixup for point clouds. Advances in Neural Information Processing Systems, 35:23580–23592, 2022.
  32. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
  33. Pointaugment: an auto-augmentation framework for point cloud classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6378–6387, 2020.
  34. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  35. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21736–21746, 2023.
  36. Openshape: Scaling up 3d shape representation towards open-world understanding. Advances in Neural Information Processing Systems, 36, 2024.
  37. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  38. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123, 2022.
  39. Masked autoencoders for point cloud self-supervised learning. In European conference on computer vision, pages 604–621. Springer, 2022.
  40. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 815–824, 2023.
  41. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017a.
  42. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017b.
  43. Contrast with reconstruct: Contrastive 3d representation learning guided by generative pretraining. In International Conference on Machine Learning, pages 28223–28243. PMLR, 2023.
  44. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
  45. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  46. Randomrooms: Unsupervised pre-training from synthetic shapes and randomized layouts for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3283–3292, 2021.
  47. Benchmarking and analyzing point cloud classification under corruptions. In International Conference on Machine Learning, pages 18559–18575. PMLR, 2022.
  48. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3577–3586, 2017.
  49. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
  50. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746–1754, 2017.
  51. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
  52. Point mixswap: Attentional point cloud mixing via swapping matched structural divisions. In European Conference on Computer Vision, pages 596–611. Springer, 2022.
  53. Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1588–1597, 2019.
  54. Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  55. Category-level 6d object pose estimation via cascaded relation and recurrent reconstruction networks. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4807–4814. IEEE, 2021.
  56. Pointpatchmix: Point cloud mixing with patch scoring. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024.
  57. Point transformer v2: Grouped vector attention and partition-based pooling. Advances in Neural Information Processing Systems, 35:33330–33342, 2022.
  58. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  59. Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1179–1189, 2023a.
  60. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. arXiv preprint arXiv:2305.08275, 2023b.
  61. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322, 2022.
  62. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004, 2021.
  63. Clip2: Contrastive language-image-point pretraining from real-world point cloud data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15244–15253, 2023.
  64. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022a.
  65. Pointcutmix: Regularization strategy for point cloud classification. Neurocomputing, 505:58–67, 2022b.
  66. Clip-fo3d: Learning free open-world 3d scene representations from 2d dense clip. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2048–2059, 2023a.
  67. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. Advances in neural information processing systems, 35:27061–27074, 2022c.
  68. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8552–8562, 2022d.
  69. Parameter is not all you need: Starting from non-parametric networks for 3d point cloud analysis. arXiv preprint arXiv:2303.08134, 2023b.
  70. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21769–21780, 2023c.
  71. Tamm: Triadapter multi-modal learning for 3d shape understanding. arXiv preprint arXiv:2402.18490, 2024.
  72. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
  73. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.
  74. No time to train: Empowering non-parametric networks for few-shot 3d scene segmentation. CVPR 2024 Highlight, 2024.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (8)
  1. Jiaze Wang (15 papers)
  2. Yi Wang (1038 papers)
  3. Ziyu Guo (49 papers)
  4. Renrui Zhang (100 papers)
  5. Donghao Zhou (15 papers)
  6. Guangyong Chen (55 papers)
  7. Anfeng Liu (10 papers)
  8. Pheng-Ann Heng (196 papers)