Exploring Target Representations for Masked Autoencoders

Published 8 Sep 2022 in cs.CV | (2209.03917v3)

Abstract: Masked autoencoders have become popular training paradigms for self-supervised visual representation learning. These models randomly mask a portion of the input and reconstruct the masked portion according to the target representations. In this paper, we first show that a careful choice of the target representation is unnecessary for learning good representations, since different targets tend to derive similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any efforts to carefully design target representations. Interestingly, we further explore using teachers of larger capacity, obtaining distilled students with remarkable transferring ability. On different tasks of classification, transfer learning, object detection, and semantic segmentation, the proposed method to perform masked knowledge distillation with bootstrapped teachers (dBOT) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, could motivate people to rethink the roles of target representations in pre-training masked autoencoders. The code and pre-trained models are publicly available at https://github.com/liuxingbin/dbot.

Summary

The paper demonstrates that the choice of target representations is non-essential in training masked autoencoders when using multi-stage distillation.
It introduces Masked Knowledge Distillation (MKD) with bootstrapped teachers (dBOT), which starts from a randomly initialized teacher and replaces it with the trained student at each successive distillation stage.
Quantitative results show ViT-B, ViT-L, and ViT-H models achieving 84.5%, 86.6%, and 87.4% top-1 accuracies on ImageNet-1K, outperforming several state-of-the-art techniques.

An Expert Review of "Exploring Target Representations for Masked Autoencoders"

The paper "Exploring Target Representations for Masked Autoencoders" examines whether the choice of target representation matters in self-supervised visual representation learning, specifically within the framework of Masked Autoencoders (MAEs). The authors introduce Masked Knowledge Distillation (MKD) as a formal extension of Masked Image Modeling (MIM), the training objective that underlies these autoencoders.

Core Findings and Methodology

The primary assertion of the paper is that meticulous selection of target representations is not essential when training MAEs. This conclusion rests on the empirical finding that different target representations (such as DINO, MAE, DeiT, DALL-E, and random initialization) lead to student models with remarkably similar performance across various downstream tasks after multi-stage distillation. The paper argues that this makes the costly pre-training of teachers with specific methods potentially superfluous.

Multi-Stage Masked Knowledge Distillation

To substantiate the claim that the choice of target representation matters little under staged training, the authors introduce a multi-stage distillation approach termed "dBOT" (masked knowledge distillation with bootstrapped teachers). A randomly initialized model serves as the initial teacher, and distillation proceeds over several stages, with the student trained in one stage becoming the teacher for the next. Under this iterative refinement, students distilled from different initial teachers, whether pre-trained or random, converge in performance and behavior.
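The stage loop described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' implementation: small linear mixing matrices stand in for ViT encoders, lists of scalars stand in for image patches, a plain mean squared error on masked positions stands in for the paper's loss, and the student is freshly re-initialized at every stage. All function names and hyperparameters (init_weights, train_stage, run_dbot, and so on) are hypothetical.

```python
import random


def init_weights(n, rng, scale):
    """Random n x n mixing matrix standing in for a network's parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


def forward(weights, patches):
    """Toy linear 'encoder': each output position mixes all input patches."""
    return [sum(w * p for w, p in zip(row, patches)) for row in weights]


def sample_masked_image(n, mask_ratio, rng):
    """Draw a toy image (n scalar patches) and a random list of masked positions."""
    patches = [rng.gauss(0.0, 1.0) for _ in range(n)]
    masked = sorted(rng.sample(range(n), int(n * mask_ratio)))
    return patches, masked


def hide(patches, masked):
    """Zero out masked patches; this is the view the student receives."""
    return [0.0 if i in masked else p for i, p in enumerate(patches)]


def masked_loss(student, teacher, patches, masked):
    """MSE between student and teacher features, on masked positions only."""
    s_out = forward(student, hide(patches, masked))  # student sees masked view
    t_out = forward(teacher, patches)                # teacher sees full image
    return sum((s_out[i] - t_out[i]) ** 2 for i in masked) / len(masked)


def evaluate(student, teacher, eval_set):
    """Average masked loss over a fixed set of (patches, masked) pairs."""
    return sum(masked_loss(student, teacher, p, m) for p, m in eval_set) / len(eval_set)


def train_stage(student, teacher, n, steps, lr, mask_ratio, rng):
    """One distillation stage: SGD on the student; the teacher stays frozen."""
    for _ in range(steps):
        patches, masked = sample_masked_image(n, mask_ratio, rng)
        visible = hide(patches, masked)
        s_out = forward(student, visible)
        t_out = forward(teacher, patches)
        for i in masked:  # only masked positions contribute to the loss
            err = s_out[i] - t_out[i]
            for j in range(n):
                # gradient of err**2 with respect to student[i][j]
                student[i][j] -= lr * 2.0 * err * visible[j]


def run_dbot(n=8, stages=3, steps=600, lr=0.02, mask_ratio=0.5, seed=0):
    """Multi-stage masked distillation with a bootstrapped teacher (toy version)."""
    rng = random.Random(seed)
    teacher = init_weights(n, rng, scale=0.3)  # initial teacher: random, untrained
    eval_set = [sample_masked_image(n, mask_ratio, rng) for _ in range(200)]
    history = []
    for _ in range(stages):
        # Assumption: the student starts from a fresh initialization each stage.
        student = init_weights(n, rng, scale=0.01)
        before = evaluate(student, teacher, eval_set)
        train_stage(student, teacher, n, steps, lr, mask_ratio, rng)
        after = evaluate(student, teacher, eval_set)
        history.append((before, after))
        teacher = student  # bootstrap: trained student becomes the next teacher
    return teacher, history
```

Calling run_dbot() returns the final teacher together with one (before, after) pair of masked losses per stage, so the effect of each stage can be inspected. The point of the sketch is the control flow: the teacher is never updated by gradient descent, only replaced by the previous stage's student.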

Performance and Impact

Quantitatively, the paper reports strong results for the proposed setup. For instance, ViT-B, ViT-L, and ViT-H models achieve top-1 accuracies of 84.5%, 86.6%, and 87.4%, respectively, on ImageNet-1K, matching or exceeding other state-of-the-art self-supervised methods. The method also outperforms previous self-supervised methods by nontrivial margins on transfer learning, object detection, and semantic segmentation, underscoring the effectiveness of MKD with bootstrapped teachers.

Future Implications

The findings have several implications for AI research and practice. First, they suggest a possible shift in how self-supervised models are developed: multi-stage distillation may offer a more efficient alternative to complex pre-training setups. Furthermore, by showing that a randomly initialized model can serve as an effective starting teacher, the research hints at cost savings in training high-capacity models.

Given the broad applicability of MAEs to visual recognition tasks and the growing demand for robust self-supervised learning frameworks, these insights could prompt a reevaluation of existing training pipelines. Future research could examine how much data is needed to reach state-of-the-art performance, particularly with datasets larger than ImageNet.

Conclusion

This work challenges established paradigms in visual representation learning by questioning the need for carefully chosen targets when training Masked Autoencoders, pointing toward more efficient and potentially less resource-intensive models. Its systematic treatment of multi-stage distillation suggests the method could meaningfully influence the development of self-supervised learning architectures, and it warrants further validation across a broader range of datasets and tasks.
