ViTKD: Practical Guidelines for ViT feature knowledge distillation (2209.02432v1)

Published 6 Sep 2022 in cs.CV

Abstract: Knowledge Distillation (KD) for Convolutional Neural Network (CNN) is extensively studied as a way to boost the performance of a small model. Recently, Vision Transformer (ViT) has achieved great success on many computer vision tasks and KD for ViT is also desired. However, besides the output logit-based KD, other feature-based KD methods for CNNs cannot be directly applied to ViT due to the huge structure gap. In this paper, we explore the way of feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT's feature distillation. Some of our findings are even opposite to the practices in the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD which brings consistent and considerable improvement to the student. On ImageNet-1k, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and the logit-based KD method are complementary and can be applied together directly. This combination can further improve the performance of the student. Specifically, the student DeiT-Tiny, Small, and Base achieve 77.78%, 83.59%, and 85.41%, respectively. The code is available at https://github.com/yzd-v/cls_KD.

Insights into ViTKD: A Comprehensive Approach to Feature Knowledge Distillation in Vision Transformers

The paper "ViTKD: Practical Guidelines for ViT Feature Knowledge Distillation" addresses a significant gap in the application of Knowledge Distillation (KD) to Vision Transformers (ViTs). While KD has been extensively utilized to enhance Convolutional Neural Networks (CNNs), its direct application to ViTs presents challenges due to structural differences. The authors of this paper propose and validate ViTKD—an innovative feature-based distillation framework tailored specifically for Vision Transformers.

Framework Overview and Methodological Advances

ViTKD offers a structured methodology for feature knowledge distillation in ViTs. Through a series of controlled experiments, the authors derive three guiding principles that steer the design of effective distillation strategies in this setting (a code sketch illustrating the resulting losses follows the list):

  1. Generation vs. Mimicking for Deep Layers: Generation methods are observed to outperform direct mimicking for the deep layers of ViT models. This is contrary to practice in CNN feature distillation, where direct mimicking is prevalent, and the disparity underscores the distinctive architecture and processing paths of ViTs compared to CNNs.
  2. Viability of Shallow Layer Distillation: Unlike CNNs, where shallow layer features are often considered too limited for effective distillation, ViTs benefit from mimicking in their shallow layers. This involves aligning the dimensions of student and teacher features, allowing the student model to emulate foundational attention patterns present in the early layers of the teacher model.
  3. Preference for FFN-Out over MHA-Out Features: The Feed-Forward Network (FFN) output features in the ViT architecture are identified as more suitable for knowledge transfer than the features obtained from Multi-Head Attention (MHA) outputs.
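
To make these guidelines concrete, the following is a minimal PyTorch-style sketch of the two feature losses they imply: a linear layer aligns the student's shallow FFN-out features with the teacher's for mimicking, and a small generative block reconstructs the teacher's deep features from partially masked student tokens. Module names, tensor shapes, and the mask ratio here are illustrative assumptions rather than the authors' released implementation (the official code is at https://github.com/yzd-v/cls_KD).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTKDFeatureLoss(nn.Module):
    """Sketch of the two-branch feature distillation described above.

    Shallow layers: mimic the teacher's FFN-out tokens after a linear
    projection that aligns the student's embedding dimension with the
    teacher's (guideline 2). Deep layer: randomly mask a fraction of the
    student tokens and let a small generative block reconstruct the
    teacher tokens (guideline 1). Names, shapes, and the mask ratio are
    illustrative assumptions, not the authors' exact implementation.
    """

    def __init__(self, dim_s: int, dim_t: int, mask_ratio: float = 0.5):
        super().__init__()
        # Linear projection for shallow-layer mimicking.
        self.align = nn.Linear(dim_s, dim_t)
        # Lightweight generator for deep-layer generation.
        self.generator = nn.Sequential(
            nn.Linear(dim_s, dim_t),
            nn.GELU(),
            nn.Linear(dim_t, dim_t),
        )
        self.mask_ratio = mask_ratio

    def forward(self, shallow_s, shallow_t, deep_s, deep_t):
        # shallow_s / deep_s: (B, N, dim_s) student FFN-out tokens
        # shallow_t / deep_t: (B, N, dim_t) teacher FFN-out tokens

        # 1) Mimicking loss on shallow features after dimension alignment.
        loss_mimic = F.mse_loss(self.align(shallow_s), shallow_t)

        # 2) Generation loss on the deep feature: zero out random tokens,
        #    then reconstruct the teacher tokens from the masked student tokens.
        B, N, _ = deep_s.shape
        mask = torch.rand(B, N, 1, device=deep_s.device) < self.mask_ratio
        masked_s = deep_s.masked_fill(mask, 0.0)
        loss_gen = F.mse_loss(self.generator(masked_s), deep_t)

        return loss_mimic + loss_gen
```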

Building upon these guidelines, the authors introduce ViTKD, which systematically incorporates both mimicking of shallow layers and generation methodologies for deeper layers. This multifaceted approach results in consistent and notable improvements across various ViT models, such as boosting the performance of DeiT-Tiny from a Top-1 accuracy of 74.42% to 76.06% using a DeiT III-Small teacher on ImageNet-1K. Notably, ViTKD can be directly integrated with logit-based methods, augmenting the student's performance further.
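
Because ViTKD operates purely on intermediate features, combining it with a logit-based method amounts to summing the respective losses. The sketch below shows one plausible combination using a standard temperature-scaled KL divergence for the logit term; the temperature and loss weights are illustrative assumptions, not values taken from the paper.

```python
import torch.nn.functional as F

def total_distillation_loss(logits_s, logits_t, labels, feat_loss,
                            T=1.0, alpha=1.0, beta=1.0):
    """Sketch of training a student with ViTKD plus a logit-based KD term.

    feat_loss is the ViTKD feature loss from the previous sketch; T, alpha,
    and beta are illustrative hyperparameters.
    """
    # Supervised cross-entropy on ground-truth labels.
    ce = F.cross_entropy(logits_s, labels)

    # Logit-based KD: KL divergence between softened teacher and student logits.
    kd = F.kl_div(
        F.log_softmax(logits_s / T, dim=-1),
        F.softmax(logits_t / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Total loss: supervised term + logit KD + ViTKD feature distillation.
    return ce + alpha * kd + beta * feat_loss
```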

Experimental Validation and Broader Implications

The experimental results highlight ViTKD's effectiveness not only on image classification but also on downstream tasks such as object detection on the COCO dataset. Models guided by ViTKD demonstrate increased efficacy in these applications, validating the value of enriched feature transfer through distillation.

However, the paper also emphasizes the importance of architectural congruency between teacher and student models. Distillation with cross-architecture teacher-student pairs showed marked performance drops, likely due to incongruent attention mechanisms and feature representations.

Future Directions

Future work can aim to refine the distillation strategies within ViTKD's framework, particularly the relatively simple mimicking and generation schemes used in the current design. Moreover, exploring cross-architecture feature knowledge transfer remains an open area with potential for novel insights.

Conclusion

In sum, ViTKD constitutes a sophisticated and comprehensive approach for distilling knowledge in Vision Transformers, yielding significant performance enhancements. This methodology underscores the importance of adapting KD techniques to the unique architectures of ViT models while highlighting potential for future exploration in broader and more complex architectures.

Authors (6)
  1. Zhendong Yang (10 papers)
  2. Zhe Li (210 papers)
  3. Ailing Zeng (58 papers)
  4. Zexian Li (11 papers)
  5. Chun Yuan (127 papers)
  6. Yu Li (378 papers)
Citations (41)