Masked Modeling for Self-supervised Representation Learning on Vision and Beyond (2401.00897v2)

Published 31 Dec 2023 in cs.CV and cs.AI

Abstract: As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and its low dependence on labeled data. Among these varied self-supervised techniques, masked modeling has emerged as a distinctive approach: a proportion of the original data is masked during training, and the model is trained to predict the missing parts. This paradigm enables deep models to learn robust representations and has demonstrated exceptional performance in computer vision, natural language processing, and other modalities. In this survey, we present a comprehensive review of the masked modeling framework and its methodology. We detail the techniques within masked modeling, including diverse masking strategies, recovery targets, network architectures, and more. We then systematically investigate its wide-ranging applications across domains, and explore the commonalities and differences between masked modeling methods in different fields. We conclude by discussing the limitations of current techniques and pointing out several potential avenues for advancing masked modeling research. A paper-list project accompanying this survey is available at https://github.com/Lupin1998/Awesome-MIM.
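The masking-and-reconstruction objective the abstract describes can be sketched in a few lines. The following is a minimal, illustrative NumPy example (function names and the toy "patch embedding" data are assumptions, not from the paper): a fixed ratio of patches is hidden from the encoder, and the reconstruction loss is computed only on the masked positions, in the style of MAE-like methods.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly partition patches into a visible set and a masked set."""
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

def masked_reconstruction_loss(pred, target, mask_idx):
    """MSE computed only on the masked patches, not the visible ones."""
    diff = pred[mask_idx] - target[mask_idx]
    return float(np.mean(diff ** 2))

# Toy example: 16 "patch embeddings" of dimension 8.
patches = np.arange(16 * 8, dtype=np.float64).reshape(16, 8)
visible, keep_idx, mask_idx = random_masking(patches, mask_ratio=0.75)
# With a 75% mask ratio, the encoder sees only 4 of the 16 patches.
# A perfect decoder (here: the ground truth itself) yields zero loss.
loss = masked_reconstruction_loss(patches, patches, mask_idx)
```

In practice the visible patches are fed to an encoder, a lightweight decoder predicts the masked patches from the encoded ones plus mask tokens, and the recovery target varies by method (raw pixels, discrete tokens, or features).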

  265. Point-m2ae: Multi-scale masked autoencoders for hierarchical point cloud pre-training. ArXiv, abs/2205.14401, 2022.
  266. Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders. In CVPR, 2023.
  267. Graph masked autoencoders with transformers. arXiv preprint arXiv:2202.08391, 2022.
  268. Contextual image masking modeling via synergized contrasting without view augmentation for faster and better visual pretraining. In The Eleventh International Conference on Learning Representations, 2023.
  269. Cae v2: Context autoencoder with clip target. ArXiv, abs/2211.09799, 2022.
  270. Integrally migrating pre-trained transformer encoder-decoders for visual object detection. 2022.
  271. Hivit: Hierarchical vision transformer meets masked image modeling. ArXiv, abs/2205.14949, 2022.
  272. Meta-transformer: A unified framework for multimodal learning. ArXiv, abs/2307.10802, 2023.
  273. Protein representation learning by geometric structure pretraining. In International Conference on Learning Representations (ICLR), 2023.
  274. Masked retraining teacher-student framework for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19039–19049, October 2023.
  275. Cim: Constrained intrinsic motivation for sparse-reward continuous control. ArXiv, abs/2211.15205, 2022.
  276. Sparsemae: Sparse training meets masked autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16176–16186, October 2023.
  277. Places: An image database for deep scene understanding. ArXiv, abs/1610.02055, 2016.
  278. Scene parsing through ade20k dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5122–5130, 2017.
  279. ibot: Image bert pre-training with online tokenizer. International Conference on Learning Representations (ICLR), 2022.
  280. Self pre-training with masked autoencoders for medical image analysis. ArXiv, abs/2203.05573, 2022.
  281. Vl-gpt: A generative pre-trained transformer for vision and language understanding and generation. 2023.
Authors (11)
  1. Siyuan Li
  2. Luyuan Zhang
  3. Zedong Wang
  4. Di Wu
  5. Lirong Wu
  6. Zicheng Liu
  7. Jun Xia
  8. Cheng Tan
  9. Yang Liu
  10. Baigui Sun
  11. Stan Z. Li
Citations (9)

Summary

Masked Modeling Framework

Masked modeling is a self-supervised learning framework that has gained considerable attention for its ability to learn robust representations from unlabeled data. During training, a portion of the input is hidden (masked), and the model is trained to predict the missing content from what remains visible. This paradigm has produced strong results in computer vision and natural language processing, and its influence now extends across many other data types and tasks.
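As a concrete illustration, the core recipe can be sketched in a few lines: hide a random subset of input tokens and compute the training loss only at the hidden positions. This is a minimal numpy toy (shapes and the zero "prediction" are placeholders, not any surveyed model's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens: int, mask_ratio: float, rng) -> np.ndarray:
    """Boolean mask: True marks tokens hidden from the model."""
    num_masked = int(num_tokens * mask_ratio)
    perm = rng.permutation(num_tokens)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

tokens = rng.normal(size=(16, 8))   # 16 tokens, 8 dims each
mask = random_mask(16, 0.75, rng)   # hide 75% of the tokens

visible = tokens[~mask]             # what the model is allowed to see
# A model would predict the hidden tokens; the reconstruction loss is
# computed only on masked positions (MSE as a stand-in here):
prediction = np.zeros_like(tokens)  # placeholder for model output
loss = float(np.mean((prediction[mask] - tokens[mask]) ** 2))
```

The key design knob is `mask_ratio`: vision models typically mask far more aggressively (e.g. 75%) than language models, because images are spatially redundant.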

Masked Modeling in Computer Vision

In computer vision (CV), self-supervised techniques leverage both generative and discriminative objectives to learn from unlabeled visual data, and Masked Image Modeling (MIM) marks an evolution in this space. Models such as MAE and SimMIM have shown exceptional performance: MAE encodes only the visible patches with a Transformer and reconstructs the pixel values of the masked ones with a lightweight decoder, while SimMIM streamlines the pipeline by feeding both visible and masked patches to the encoder and regressing the missing pixels with a simple linear reconstruction head.
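The input-handling difference between the two designs can be sketched as follows. This is an illustrative numpy toy with arbitrary shapes, assuming the standard ViT-style patchification; it is not either model's actual code:

```python
import numpy as np

def patchify(img: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping p*p patches,
    flattened to vectors -- the token sequence MIM models operate on."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0
    patches = img.reshape(h // p, p, w // p, p, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))
tokens = patchify(img, 8)                       # (16, 192): 4x4 patch grid
mask = rng.permutation(len(tokens)) < int(0.75 * len(tokens))

# MAE-style: the encoder consumes only the visible 25% of patches;
# a decoder later reconstructs raw pixels at the masked positions.
encoder_input_mae = tokens[~mask]               # shape (4, 192)

# SimMIM-style: the encoder sees the full-length sequence, with masked
# patches replaced by a shared learnable [MASK] embedding (zeros here),
# and a linear head regresses the missing pixels.
mask_embedding = np.zeros(tokens.shape[1])
encoder_input_simmim = np.where(mask[:, None], mask_embedding, tokens)
```

Dropping masked patches from the encoder, as MAE does, is what makes high mask ratios cheap: the encoder processes only a quarter of the sequence.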

Innovations in MIM

Over time, various refinements of MIM have been explored. These include attention-guided masking that selects the most challenging parts of an image to hide, adversarial strategies that increase the difficulty of the reconstruction task, and context-aware masking that better preserves local image structure. Vector quantization (VQ) tokenizers have also been incorporated to provide discrete recovery targets, turning pixel regression into classification over codebook indices and further enriching what the model learns from reconstruction.
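A vector-quantized target reduces to a nearest-neighbor lookup in a codebook. The toy numpy sketch below uses a random codebook for illustration; in practice the codebook comes from a pretrained tokenizer (e.g. a discrete VAE):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook standing in for a learned VQ tokenizer's embeddings.
codebook = rng.normal(size=(64, 16))            # 64 codes, 16-dim each

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each patch feature to the index of its nearest code.
    These indices serve as discrete recovery targets for MIM."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

patch_features = rng.normal(size=(16, 16))      # 16 patch features
targets = quantize(patch_features, codebook)    # 16 token ids in [0, 64)
# Training then becomes classification: predict the code index of each
# masked patch via cross-entropy, rather than regressing raw pixels.
```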

Theoretical Foundations of MIM

Despite its empirical success, MIM's theoretical underpinnings are yet to be fully understood. Current interpretations are rooted in hierarchical latent variable models, contrastive learning comparisons, and concepts of information compression. However, these theoretical insights are often confined to specific cases or empirical observations, making generalization across modalities a challenge.

Applications and Extensions

MIM's applications extend to a range of downstream tasks in computer vision, including object detection, monocular depth estimation, and video representation learning. The framework has also been adapted to 3D point clouds and medical image analysis, and it plays a growing role in multimodal research that combines visual information with other modalities such as text and audio.

Future Directions

Moving forward, integrating MIM with multimodal approaches appears to be an essential direction. This could involve aligning different modalities, for instance through diffusion-based techniques, for tasks such as text-to-image generation. Moreover, extending MIM to higher-dimensional and multimodal data presents both technical challenges and exciting opportunities for advancing artificial intelligence research.

Conclusion

Masked Modeling, as a self-supervised learning framework, continues to evolve and grow within the field of AI. As researchers probe into its theoretical principles and push the boundaries of its application, MIM stands as a testament to the innovative spirit driving the continuous progression of learning algorithms.