Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels (2404.10146v1)

Published 15 Apr 2024 in cs.CV

Abstract: Large-scale 2D vision-language models such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of the resulting 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework, Cross-MoST: Cross-Modal Self-Training, to improve the label-free classification performance of a zero-shot 3D vision model by simply leveraging unlabeled 3D data and their accompanying 2D views. We propose a student-teacher framework that simultaneously processes 2D views and 3D point clouds and generates joint pseudo-labels to train a classifier and guide cross-modal feature alignment. We thereby demonstrate that 2D vision-language models such as CLIP can complement 3D representation learning to improve classification performance without the need for expensive class annotations. Using synthetic and real-world 3D datasets, we further demonstrate that Cross-MoST enables efficient cross-modal knowledge exchange, with the image and point cloud modalities learning from each other's rich representations.
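The joint pseudo-labelling idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: fusing the two modalities by averaging their class probabilities and filtering with a fixed confidence threshold of 0.7 are assumptions made here for clarity.

```python
import math

def softmax(logits):
    """Convert a list of raw logits to class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def joint_pseudo_label(logits_2d, logits_3d, threshold=0.7):
    """Fuse teacher logits from an object's 2D views and its 3D point cloud
    into a single pseudo-label.

    Returns (label, is_confident): the argmax class of the averaged
    probabilities, and whether that class clears the confidence threshold
    (only confident samples would be used to train the student classifier).
    """
    p2d, p3d = softmax(logits_2d), softmax(logits_3d)
    joint = [0.5 * (a + b) for a, b in zip(p2d, p3d)]  # average the modalities
    label = max(range(len(joint)), key=joint.__getitem__)
    return label, joint[label] >= threshold

# Both modalities agree and are confident -> usable pseudo-label.
print(joint_pseudo_label([4.0, 0.0, 0.0, 0.0], [3.0, 0.0, 0.0, 0.0]))
# Near-uniform teacher predictions -> sample is filtered out.
print(joint_pseudo_label([0.0, 0.1, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0]))
```

In a full training loop, the teacher would typically be an exponential moving average of the student, and only samples passing the confidence filter would contribute to the classification and cross-modal alignment losses.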

Authors (4)
  1. Amaya Dharmasiri
  2. Muzammal Naseer
  3. Salman Khan
  4. Fahad Shahbaz Khan