
Masked Autoencoders Enable Efficient Knowledge Distillers (2208.12256v2)

Published 25 Aug 2022 in cs.CV

Abstract: This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward propagate inputs through the first few layers, for obtaining intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with 95% masking ratio where merely TEN patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with just FOUR visible patches (98% masking ratio). The code and models are publicly available at https://github.com/UCSC-VLAA/DMAE.

Authors (8)
  1. Yutong Bai (32 papers)
  2. Zeyu Wang (137 papers)
  3. Junfei Xiao (17 papers)
  4. Chen Wei (72 papers)
  5. Huiyu Wang (38 papers)
  6. Alan Yuille (294 papers)
  7. Yuyin Zhou (92 papers)
  8. Cihang Xie (91 papers)
Citations (30)

Summary

Overview of "Masked Autoencoders Enable Efficient Knowledge Distillers"

The paper "Masked Autoencoders Enable Efficient Knowledge Distillers" presents an innovative approach to knowledge distillation, particularly focusing on leveraging Masked Autoencoders (MAE) to enhance the efficiency and effectiveness of knowledge distillation frameworks. This paper addresses the challenge of transferring knowledge from large, pre-trained models, often characterized by their cumbersome architectures, to smaller, more efficient student models. The research team provides a compelling case for the use of MAEs in this context, arguing for their computational efficiency and robustness in knowledge transfer tasks, without the necessity of high computational overhead typically associated with full model execution.

Key Methodological Insights

The proposed method diverges from traditional knowledge distillation, which aligns soft or hard logits between teacher and student models. Instead, it aligns intermediate feature maps, and it processes only the small subset of patches left visible by a high masking ratio, which keeps training computationally cheap. The strategy involves:

  1. Intermediate Feature Alignment: The core of the approach is minimizing the distance between intermediate feature maps of the teacher and student models (a minimal sketch of this combined objective follows the list). Because the teacher only needs to forward inputs through its first few layers, its computational cost is greatly reduced.
  2. Robustness to High Masking Ratios: The method remains effective even under extreme masking. With a 95% masking ratio, where only ten patches are visible during distillation, the distilled ViT-B still reaches 83.6% top-1 ImageNet accuracy, and an aggressive 98% masking ratio (four visible patches) still yields 82.4%.
  3. Implementation Efficiency: The experiments show substantial reductions in training cost while maintaining or improving accuracy. By distilling an MAE pre-trained ViT-L into a ViT-B, DMAE attains 84.0% top-1 ImageNet accuracy, 1.2% higher than directly distilling a fine-tuned ViT-L.
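
The PyTorch-style sketch below illustrates the shape of this objective: an MAE pixel-reconstruction loss computed on the masked patches, plus a distance between intermediate features of the partially executed teacher and the student, evaluated only on the visible tokens. It is a minimal sketch under stated assumptions rather than the authors' implementation: `student`, `teacher_layers`, and `proj` are hypothetical modules, mean-squared error stands in for whatever feature distance the paper uses, and the projection layer exists only because teacher and student widths differ (e.g., 1024 for ViT-L vs. 768 for ViT-B).

```python
import torch
import torch.nn.functional as F

def masked_distillation_loss(student, teacher_layers, proj, images,
                             mask_ratio=0.95, alpha=1.0):
    """Sketch of a masked-distillation objective (hypothetical module APIs).

    student        -- MAE-style student returning (reconstruction loss,
                      visible-token features, visible-token indices)
    teacher_layers -- first few transformer blocks of the frozen, pre-trained teacher
    proj           -- learned linear layer mapping student width to teacher width
    alpha          -- assumed weight balancing the two loss terms
    """
    # 1) Student MAE forward pass: encode only the visible patches and
    #    reconstruct the masked ones in pixel space.
    recon_loss, student_feats, visible_ids = student(images, mask_ratio=mask_ratio)

    # 2) Partially execute the frozen teacher: push the same visible patches
    #    through its first few layers to get intermediate feature maps.
    with torch.no_grad():
        teacher_feats = teacher_layers(images, visible_ids=visible_ids)

    # 3) Align intermediate features, projecting the student's width up to
    #    the teacher's before measuring the distance.
    distill_loss = F.mse_loss(proj(student_feats), teacher_feats)

    return recon_loss + alpha * distill_loss
```

Because the teacher runs only its first few blocks on a small fraction of the tokens, the distillation term adds little overhead on top of standard MAE pre-training.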

Experimental Results and Comparisons

The paper reports strong empirical results that highlight the performance gains achieved by the approach:

  • Comparison with Baselines: The DMAE framework outperforms several baselines, including MAE pre-training without distillation and logit-based distillation techniques, in both standard and computationally constrained training settings.
  • Scalability Across Model Sizes: DMAE is effective across model sizes, yielding gains when distilling into ViT-B, ViT-Small, and ViT-Tiny students.
  • Data Efficiency: The method excels when training data is limited, achieving notable gains over competing approaches in low-data regimes.

Implications and Future Directions

This research underscores the potential for enhancing AI models' efficiency through strategic application of masked autoencoding in knowledge distillation. The implications are twofold:

  1. Practical Applications: By significantly lowering computational costs without sacrificing accuracy, DMAE can be applied to real-world scenarios where computational resources are limited, thereby broadening access to high-performance AI models.
  2. Theoretical Advancements: The paper invites further exploration into the integration of masked learning paradigms in other areas of machine learning and artificial intelligence. Future research could expand on these foundations to explore broader applications and further optimize the distillation of knowledge in varied model architectures and problem domains.

In conclusion, this work advances the field by providing a robust, computationally efficient framework for knowledge distillation. It suggests new pathways for utilizing pre-trained models in scalable and accessible AI applications, opening avenues for continued research and practical deployment in artificial intelligence.
