
M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding (2309.15313v1)

Published 26 Sep 2023 in cs.CV

Abstract: We present a new pre-training strategy called M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders that leverage 3D priors and learned cross-modal representations in RGB-D data. We integrate two major self-supervised learning frameworks, Masked Image Modeling (MIM) and contrastive learning, aiming to effectively embed masked 3D priors and modality-complementary features and to enhance the correspondence between modalities. In contrast to recent approaches that either focus on specific downstream tasks or require multi-view correspondence, we show that our pre-training strategy is ubiquitous: it enables improved representation learning that transfers into improved performance on various downstream tasks such as video action recognition, video action detection, 2D semantic segmentation, and depth estimation. Experiments show that M$^{3}$3D outperforms existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101, and OR-AR, most notably with a +1.3\% mIoU improvement over Mask3D on ScanNet semantic segmentation. We further evaluate our method in the low-data regime and demonstrate its superior data efficiency compared to current state-of-the-art approaches.
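The abstract does not give the loss formulation, but the two frameworks it names are commonly combined as a weighted sum of a masked-reconstruction term and a cross-modal InfoNCE term. Below is a minimal NumPy sketch of that pattern; the function names, shapes, and the weighting factor `lam` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def mim_loss(pred, target, mask):
    """MAE-style objective: mean squared error over masked patches only.

    pred, target: (B, num_patches, patch_dim); mask: (B, num_patches),
    1 where a patch was masked (and must be reconstructed), 0 elsewhere.
    """
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (B, num_patches)
    return (per_patch * mask).sum() / mask.sum()

def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """Cross-modal InfoNCE: matching RGB/depth pairs in the batch are
    positives; all other pairings are negatives."""
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    dep = depth_emb / np.linalg.norm(depth_emb, axis=1, keepdims=True)
    logits = rgb @ dep.T / temperature               # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives on diagonal

def combined_loss(pred, target, mask, rgb_emb, depth_emb, lam=1.0):
    """Joint objective: masked reconstruction plus contrastive alignment."""
    return mim_loss(pred, target, mask) + lam * info_nce(rgb_emb, depth_emb)
```

In this arrangement, the reconstruction term drives the encoder to embed masked 3D priors while the contrastive term pulls RGB and depth embeddings of the same scene together; how the paper actually balances the two terms is not specified in the abstract.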

References (74)
  1. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text, 2021.
  2. Self-supervised multimodal versatile networks, 2020.
  3. Look, listen and learn, 2017.
  4. Sit: Self-supervised vision transformer, 2022.
  5. MultiMAE: Multi-modal multi-task masked autoencoders. 2022.
  6. data2vec: A general framework for self-supervised learning in speech, vision and language, 2022.
  7. Bootstrap your own correspondences, 2021.
  8. Adamae: Adaptive masking for efficient spatiotemporal learning with masked autoencoders, 2022.
  9. BEit: BERT pre-training of image transformers. In International Conference on Learning Representations, 2022.
  10. Deep clustering for unsupervised learning of visual features, 2019.
  11. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, volume 33, pages 9912–9924. Curran Associates, Inc., 2020.
  12. Emerging properties in self-supervised vision transformers, 2021.
  13. Learning aligned cross-modal representations from weakly aligned data, 2016.
  14. Pimae: Point cloud and image interactive masked autoencoders for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  15. Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2020.
  16. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  17. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9640–9649, October 2021.
  18. 4dcontrast: Contrastive learning with dynamic correspondences for 3d scene understanding, 2022.
  19. Uniter: Universal image-text representation learning, 2020.
  20. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
  21. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019.
  22. Vi2clr: Video and image for visual contrastive learning of representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1502–1512, October 2021.
  23. Multi-task self-supervised visual learning, 2017.
  24. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  25. Are large-scale datasets necessary for self-supervised pre-training?, 2021.
  26. Masked autoencoders as spatiotemporal learners. arXiv:2205.09113, 2022.
  27. Convmae: Masked convolution meets masked autoencoders. arXiv preprint arXiv:2205.03892, 2022.
  28. Unsupervised representation learning by predicting image rotations, 2018.
  29. Omnivore: A single model for many visual modalities, 2022.
  30. Digging into self-supervised monocular depth prediction. October 2019.
  31. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, volume 33, 2020.
  32. Self-supervised co-training for video representation learning. In NeurIPS, 2020.
  33. Masked autoencoders are scalable vision learners, 2021.
  34. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  35. An empirical study on activity recognition in long surgical videos. In Proceedings of the 2nd Machine Learning for Health symposium, Proceedings of Machine Learning Research. PMLR, 2022.
  36. Mask3d: Pre-training 2d vision transformers by learning masked 3d priors, 2023.
  37. Exploring data-efficient 3d scene understanding with contrastive scene contexts, 2021.
  38. Pri3d: Can 3d priors help 2d representation learning?, 2021.
  39. Unit: Multimodal multitask learning with a unified transformer, 2021.
  40. Perceiver IO: A general architecture for structured inputs & outputs. In International Conference on Learning Representations, 2022.
  41. Surgmae: Masked autoencoders for long surgical video analysis, 2023.
  42. One model to learn them all, 2017.
  43. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  44. Unsupervised representation learning by sorting sequences. In IEEE International Conference on Computer Vision, 2017.
  45. Semmae: Semantic-guided masking for learning masked autoencoders, 2022.
  46. A convnet for the 2020s, 2022.
  47. End-to-end learning of visual representations from uncurated instructional videos, 2020.
  48. Attention bottlenecks for multimodal fusion, 2022.
  49. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
  50. Unsupervised learning of visual representations by solving jigsaw puzzles, 2017.
  51. Audio-visual scene analysis with self-supervised multisensory features, 2018.
  52. Masked autoencoders for point cloud self-supervised learning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part II, pages 604–621. Springer, 2022.
  53. Learning transferable visual models from natural language supervision, 2021.
  54. Improving language understanding by generative pre-training. 2018.
  55. Vision transformers for dense prediction. ArXiv preprint, 2021.
  56. Automatic operating room surgical activity recognition for robot-assisted surgery. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, 2020.
  57. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  58. Masked motion encoding for self-supervised video representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  59. Lxmert: Learning cross-modality encoder representations from transformers, 2019.
  60. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems, volume 35, pages 10078–10093. Curran Associates, Inc., 2022.
  61. Representation learning with contrastive predictive coding, 2019.
  62. Videomae v2: Scaling video masked autoencoders with dual masking, 2023.
  63. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022.
  64. CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion. In NeurIPS, 2022.
  65. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding, 2020.
  66. Simmim: A simple framework for masked image modeling. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  67. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  68. E2e-vlp: End-to-end vision-language pre-training enhanced by visual learning, 2021.
  69. Dual temperature helps contrastive learning without many negative samples: Towards understanding and simplifying moco, 2022.
  70. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. arXiv preprint arXiv:2205.14401, 2022.
  71. Colorful image colorization, 2016.
  72. Pcr-cg: Point cloud registration via deep color and geometry, 2023.
  73. Self-supervised pretraining of 3d features on any point-cloud, 2021.
  74. iBOT: Image BERT pre-training with online tokenizer, 2022.
Authors (2)
  1. Muhammad Abdullah Jamal
  2. Omid Mohareri