Masked Generative Distillation (2205.01529v2)

Published 3 May 2022 in cs.CV

Abstract: Knowledge distillation has been applied to various tasks successfully. The current distillation algorithm usually improves students' performance by imitating the output of the teacher. This paper shows that teachers can also improve students' representation power by guiding students' feature recovery. From this point of view, we propose Masked Generative Distillation (MGD), which is simple: we mask random pixels of the student's feature and force it to generate the teacher's full feature through a simple block. MGD is a truly general feature-based distillation method, which can be utilized on various tasks, including image classification, object detection, semantic segmentation and instance segmentation. We experiment on different models with extensive datasets and the results show that all the students achieve excellent improvements. Notably, we boost ResNet-18 from 69.90% to 71.69% ImageNet top-1 accuracy, RetinaNet with ResNet-50 backbone from 37.4 to 41.0 Boundingbox mAP, SOLO based on ResNet-50 from 33.1 to 36.2 Mask mAP and DeepLabV3 based on ResNet-18 from 73.20 to 76.02 mIoU. Our codes are available at https://github.com/yzd-v/MGD.

Authors (6)
  1. Zhendong Yang (10 papers)
  2. Zhe Li (210 papers)
  3. Mingqi Shao (6 papers)
  4. Dachuan Shi (8 papers)
  5. Zehuan Yuan (65 papers)
  6. Chun Yuan (127 papers)
Citations (135)

Summary

Overview of "Masked Generative Distillation"

The paper "Masked Generative Distillation" introduces an innovative approach to knowledge distillation, specifically focusing on improving the representational capabilities of student models through a novel technique named Masked Generative Distillation (MGD). This proposed method deviates from traditional knowledge distillation strategies, which predominantly focus on having the student models mimic the outputs of their teacher counterparts. Instead, MGD leverages feature masking and generation to enhance the representation power of student models. The paper provides substantial evidence of MGD's effectiveness across a variety of computer vision tasks, including image classification, object detection, semantic segmentation, and instance segmentation, demonstrating notable performance improvements in each domain.

Methodology

MGD is a feature-based distillation technique that trains the student to recover the teacher's complete feature map from a partial view of its own: random pixels of the student's feature map are masked, and a generative block must reconstruct the teacher's full features from what remains. Unlike conventional methods that make the student directly imitate the teacher's feature outputs, MGD frames distillation as reconstruction from limited information, which promotes stronger feature learning and greater robustness in the student model.

Concretely, a random binary mask zeroes out spatial positions in the student's feature map; the masked features are then passed through a simple generative block, composed of convolutional layers and a non-linear activation, whose output is trained to match the teacher's full feature map. Because the method operates only on intermediate feature maps, it transfers readily across different model architectures and tasks.
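To make the mechanism concrete, the sketch below shows how the masking and generation steps might be wired up in PyTorch. This is a minimal sketch based on the description above, not the authors' released implementation (see the GitHub link in the abstract); the class name MGDLoss, the hyperparameter lambda_mask (the fraction of masked positions), and the channel-alignment layer are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MGDLoss(nn.Module):
    """Sketch of a Masked Generative Distillation loss.

    Masks random spatial positions of the student's feature map and
    trains a small generative block to reconstruct the teacher's full
    feature map from what remains.
    """

    def __init__(self, student_channels: int, teacher_channels: int,
                 lambda_mask: float = 0.5):
        super().__init__()
        # lambda_mask: fraction of spatial positions to zero out (assumed name)
        self.lambda_mask = lambda_mask
        # 1x1 conv aligns channel counts when student and teacher differ
        self.align = (nn.Identity() if student_channels == teacher_channels
                      else nn.Conv2d(student_channels, teacher_channels, 1))
        # Simple generative block: 3x3 conv -> ReLU -> 3x3 conv
        self.generation = nn.Sequential(
            nn.Conv2d(teacher_channels, teacher_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, 3, padding=1),
        )

    def forward(self, feat_student: torch.Tensor,
                feat_teacher: torch.Tensor) -> torch.Tensor:
        n, _, h, w = feat_student.shape
        feat_student = self.align(feat_student)
        # Random binary mask over spatial positions, shared across channels;
        # a position survives with probability (1 - lambda_mask)
        keep = (torch.rand(n, 1, h, w, device=feat_student.device)
                > self.lambda_mask).float()
        masked = feat_student * keep
        # Generate the teacher's full feature map from the masked feature
        generated = self.generation(masked)
        return F.mse_loss(generated, feat_teacher)
```

In training, a loss like this would typically be added to the student's task loss with a weighting coefficient (e.g., total_loss = task_loss + alpha * mgd_loss, where alpha is an assumed name), with the teacher's features detached so that gradients update only the student and the generative block.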

Experimental Evaluation

Extensive experiments were conducted to assess the performance of MGD. The approach yielded significant improvements in various tasks:

  • Image Classification: MGD boosted ResNet-18 from 69.90% to 71.69% top-1 accuracy on ImageNet, improving accuracy while keeping the method simple.
  • Object Detection and Instance Segmentation: On the COCO dataset, MGD increased Average Precision (AP) scores. RetinaNet with a ResNet-50 backbone rose from 37.4 to 41.0 bounding-box mAP, and SOLO based on ResNet-50 improved from 33.1 to 36.2 Mask mAP.
  • Semantic Segmentation: On the Cityscapes dataset, MGD improved DeepLabV3 based on ResNet-18 from 73.20 to 76.02 mIoU.

Implications and Future Directions

The implications of this research suggest that MGD offers a flexible and generalizable approach to enhancing model performance across various computer vision tasks without being tied to specific network architectures. This adaptability positions MGD as a potentially valuable tool in scenarios where deploying large teacher models is impractical due to constraints on computational resources.

Future developments could explore extending MGD to other domains beyond computer vision and further optimizing the generative block's structure and hyperparameters to maximize performance gains. Additionally, investigating the synergy of MGD with other auxiliary training objectives or distillation methods could reveal further enhancements and broader applicability.

Overall, the paper introduces a practical and innovative approach to knowledge distillation, highlighting the benefits of mask-based generative learning in improving the efficacy of student models across diverse tasks and architectures.
