Overview of Generic-to-Specific Distillation of Masked Autoencoders
The paper "Generic-to-Specific Distillation of Masked Autoencoders" by Wei Huang et al. presents a novel approach to improving the performance of lightweight Vision Transformers (ViTs) through a two-stage knowledge distillation method termed Generic-to-Specific Distillation (G2SD). This method addresses the challenge faced by small ViT models in benefiting from self-supervised pre-training mechanisms that have been successful with large ViT models. Large ViTs like ViT-Base have shown remarkable progress when pre-trained with techniques such as masked autoencoding. However, their smaller counterparts, such as ViT-Tiny and ViT-Small, have not experienced similar gains due to limited model capacity. The research introduces G2SD as a means to leverage the representational strength of larger, pre-trained models to enhance the performance and generalization capabilities of smaller ViTs.
Distillation Approach
The paper argues that conventional single-stage knowledge distillation, which focuses on task-specific learning, fails to capture the task-agnostic knowledge that is vital for generalization. To address this, G2SD proceeds in two stages:
- Generic Distillation: This stage transfers task-agnostic knowledge from a large, pre-trained ViT (the teacher) to a smaller ViT (the student). The student's decoder is trained to align its feature predictions with the teacher's hidden representations, equipping the student with task-agnostic features that improve its adaptability across downstream tasks.
- Specific Distillation: Here the focus shifts to aligning the student's output predictions with those of a teacher that has been fine-tuned for the target task, sharpening the task-specific predictions that determine downstream performance. (A minimal sketch of both objectives follows this list.)
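To make the two objectives concrete, here is a minimal PyTorch sketch of what the two training signals could look like. The module interfaces (student_encoder, student_decoder, and the teacher_mae.hidden_features helper) and the loss choices (smooth-L1 feature matching for the generic stage, temperature-scaled KL plus cross-entropy for the specific stage) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def generic_distillation_loss(student_encoder, student_decoder, teacher_mae, images, mask):
    """Stage 1 (generic): align the student decoder's feature predictions with
    the frozen MAE teacher's hidden representations for the same masked input."""
    with torch.no_grad():
        # Hypothetical helper: returns the teacher's hidden features for
        # every patch position under the given mask.
        target_feats = teacher_mae.hidden_features(images, mask)

    visible_tokens = student_encoder(images, mask)      # encode visible patches
    pred_feats = student_decoder(visible_tokens, mask)  # predict features for all patches

    # Feature alignment; smooth L1 is one reasonable choice for feature matching.
    return F.smooth_l1_loss(pred_feats, target_feats)


def specific_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=1.0):
    """Stage 2 (specific): match the fine-tuned, task-specific teacher's
    predictions while also fitting the ground-truth labels."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    return (1.0 - alpha) * ce + alpha * kd
```

In this reading, the generic stage runs on masked pre-training data with the MAE teacher frozen, and the specific stage runs on labeled task data with a fine-tuned teacher; the weighting alpha and temperature tau are placeholders rather than values from the paper.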
Experimental Evaluation
The paper evaluates G2SD on image classification, object detection, and semantic segmentation, using ImageNet-1k, MS COCO, and ADE20k, respectively. The results show notable gains:
- For image classification on ImageNet-1k, the ViT-Small student trained with G2SD retains 98.7% of the performance of its ViT-Base teacher.
- For object detection on MS COCO and semantic segmentation on ADE20k, the student reaches 98.1% and 99.3% of its teacher's performance, respectively.
- Notably, models distilled with G2SD surpass their counterparts in occlusion robustness and other robustness evaluations, reflecting improved generalization.
Insights and Implications
By transferring both task-agnostic and task-specific knowledge, G2SD sets a solid baseline for two-stage vision model distillation and points the way toward more efficient yet capable ViTs. Decoupling the two kinds of distillation opens up the possibility of lightweight models that stay competitive without requiring excessive computational resources.
Future Directions
The paper highlights the potential of masked autoencoders' learning paradigm for distillation, suggesting that similar ideas could extend to other fields such as speech and large language models. It also invites further exploration of how to balance generic and specific knowledge transfer to maximize the benefits of distillation across model architectures and tasks.
In conclusion, the G2SD method presents a significant step towards making small-scale ViT models viable for practical applications by harnessing the capabilities of larger models through an innovative distillation approach.