Masked Modeling Framework
Masked Modeling is a self-supervised learning framework known for its ability to learn strong representations from unlabelled data. It works by hiding (masking) portions of the input during training and requiring the model to predict the missing content from what remains visible. The technique has produced strong results in domains such as computer vision and natural language processing, and its influence now extends across many data types and tasks.
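At a high level, and with notation that is ours rather than taken from any particular paper, the objective can be sketched as reconstructing the masked part of the input from the visible part:

```latex
% Generic masked-modeling objective (illustrative notation, not from a specific paper):
% x is the input, M a random set of masked positions, x_V the visible portion,
% x_M the masked portion, f_theta the model, and d(.,.) a reconstruction loss
% (e.g. mean squared error for pixels, cross-entropy for discrete tokens).
\mathcal{L}(\theta) = \mathbb{E}_{x,\,M}\!\left[\, d\!\left(f_\theta(x_V),\; x_M\right) \right]
```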
Masked Modeling in Computer Vision
In computer vision (CV), self-supervised learning techniques leverage generative and discriminative models to learn from unlabelled visual data. Masked Image Modeling (MIM) marks an evolution in this space. Models such as MAE and SimMIM have demonstrated strong performance: MAE uses an asymmetric Transformer encoder-decoder in which the encoder processes only the visible patches and a lightweight decoder reconstructs the pixel values of the masked ones, while SimMIM streamlines the pipeline by feeding both visible and masked patches to the encoder and predicting raw pixels with a simple linear head.
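The following is a minimal, self-contained sketch of MAE-style random masking and masked-patch reconstruction. The module name, the tiny linear encoder/decoder, and the hyperparameters are illustrative stand-ins, not the reference MAE implementation:

```python
import torch
import torch.nn as nn


class TinyMaskedAutoencoder(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Stand-ins for the full Transformer encoder/decoder used in practice.
        self.encoder = nn.Sequential(nn.Linear(patch_dim, patch_dim), nn.GELU())
        self.decoder = nn.Linear(patch_dim, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, patch_dim))

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) of flattened image patches.
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))

        # Randomly shuffle patch indices; keep the first `num_keep` as visible.
        noise = torch.rand(B, N, device=patches.device)
        ids_shuffle = noise.argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        ids_keep = ids_shuffle[:, :num_keep]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        # The encoder sees only the visible patches (the key MAE design choice).
        latent = self.encoder(visible)

        # Append mask tokens, restore the original patch order, and decode.
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        pred = self.decoder(full)

        # Binary mask: 1 where a patch was masked, 0 where it was visible.
        mask = torch.ones(B, N, device=patches.device)
        mask[:, :num_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)

        # Reconstruction loss is computed only on the masked patches.
        loss = ((pred - patches) ** 2).mean(dim=-1)
        loss = (loss * mask).sum() / mask.sum()
        return loss, pred, mask
```

A forward pass with `patches = torch.randn(2, 196, 768)` returns the scalar loss; in the actual MAE, the encoder and decoder are deep Transformer blocks and the pixel targets may additionally be normalized per patch.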
Innovations in MIM
Over time, various refinements of MIM have been explored. These include attention-guided strategies for selecting the most informative, and therefore hardest, regions of an image to mask, adversarial schemes that increase the difficulty of the reconstruction task, and context-aware masking that better exploits local image structure. Vector quantization (VQ) has also been incorporated, discretizing image content into compact visual tokens that serve as reconstruction targets and further strengthen what the model learns from the data.
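As a hedged sketch of one such refinement, the function below illustrates attention-guided masking: patches with the highest importance scores (for example, a pre-trained ViT's CLS-token attention) are preferentially masked so reconstruction focuses on informative regions. The function name and scoring source are illustrative, not taken from any specific paper:

```python
import torch


def attention_guided_mask(attn_scores: torch.Tensor, mask_ratio: float = 0.75):
    """attn_scores: (batch, num_patches) per-patch importance scores.
    Returns a boolean mask with True at positions to be masked."""
    B, N = attn_scores.shape
    num_mask = int(N * mask_ratio)

    # Rank patches by score and mask the top-scoring (hardest) ones.
    ids_sorted = attn_scores.argsort(dim=1, descending=True)
    mask = torch.zeros(B, N, dtype=torch.bool, device=attn_scores.device)
    rows = torch.arange(B, device=attn_scores.device).unsqueeze(1)
    mask[rows, ids_sorted[:, :num_mask]] = True
    return mask


# Example: random scores stand in for real attention maps.
scores = torch.rand(2, 196)
mask = attention_guided_mask(scores)  # (2, 196) boolean, ~75% True
```

In practice such schemes are often mixed with a degree of random masking so the model does not overfit to a fixed masking pattern.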
Theoretical Foundations of MIM
Despite its empirical success, MIM's theoretical underpinnings are not yet fully understood. Current interpretations draw on hierarchical latent variable models, comparisons with contrastive learning, and information-compression arguments. However, these insights are often confined to specific cases or empirical observations, which makes generalizing them across modalities a challenge.
Applications and Extensions
MIM has been applied to a range of downstream tasks in computer vision, including object detection, monocular depth estimation, video representation learning, and beyond. Its foundations have also been adapted to 3D point clouds and medical image analysis, and it has found a place in the growing body of multimodal research that combines visual information with other data types such as text and audio.
Future Directions
Moving forward, integrating MIM with multimodal approaches appears to be an essential direction. This could involve aligning different modalities, for example through diffusion-based techniques for tasks such as text-to-image generation. Extending MIM to higher-dimensional and multimodal data presents both technical challenges and exciting opportunities for advancing artificial intelligence research.
Conclusion
Masked Modeling, as a self-supervised learning framework, continues to evolve within the field of AI. As researchers probe its theoretical principles and push the boundaries of its applications, MIM stands as a testament to the innovative spirit driving the continuous progression of learning algorithms.