Cumulative Spatial Knowledge Distillation for Vision Transformers (2307.08500v1)

Published 17 Jul 2023 in cs.CV

Abstract: Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance, since the image-friendly local inductive bias of the CNN helps the ViT learn faster and better, but it also leads to two problems: (1) Network designs of CNN and ViT are completely different, which leads to different semantic levels of intermediate features, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from CNN limits the network convergence in the later training period since ViT's capability of integrating global information is suppressed by CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of ViT from the corresponding spatial responses of CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of CNN and increasingly emphasizes its importance during the training. Applying CKF leverages CNN's local inductive bias in the early training period and gives full play to ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of our CSKD. Code will be publicly available.

Authors (3)
  1. Borui Zhao (13 papers)
  2. Renjie Song (12 papers)
  3. Jiajun Liang (37 papers)
Citations (10)

Summary

  • The paper presents a CSKD method that transfers CNN spatial responses to ViTs, bridging the gap between differing feature representations.
  • It incorporates a Cumulative Knowledge Fusion module to gradually blend CNN insights, preserving early local inductive bias while enhancing global integration.
  • Extensive experiments on ImageNet-1k and downstream datasets demonstrate that CSKD significantly improves ViT performance.

The paper "Cumulative Spatial Knowledge Distillation for Vision Transformers" addresses the challenge of transferring knowledge from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs). The authors identify two main issues with the conventional knowledge distillation approach from CNNs to ViTs:

  1. Differing Network Designs: CNNs and ViTs have different architectural designs, which results in differing semantic levels of intermediate features. This discrepancy makes spatial-wise knowledge transfer methods, such as feature mimicking, inefficient when applying CNN-derived knowledge to ViTs.
  2. Local-Inductive Bias Limitations: While CNNs have a local inductive bias that's helpful in the initial learning phases, this bias can suppress the ViT's ability to integrate global information effectively during the later stages of training.

To overcome these challenges, the authors propose a method called Cumulative Spatial Knowledge Distillation (CSKD). CSKD changes how spatial knowledge is transferred: each patch token of the ViT is supervised by the CNN's response at the corresponding spatial location, rather than by mimicking intermediate features. Avoiding intermediate features sidesteps the mismatch in semantic levels and thus tackles the first issue.
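
To make the idea concrete, the following is a minimal sketch of what such a spatial-wise distillation loss could look like, assuming the CNN classifier is applied densely to its final feature map and the ViT head is applied to each patch token. The function and tensor names are illustrative, not taken from the paper's released code:

```python
import torch
import torch.nn.functional as F

def spatial_distillation_loss(vit_patch_logits, cnn_spatial_logits, tau=1.0):
    """Illustrative spatial-wise distillation loss (a sketch, not the authors' exact code).

    vit_patch_logits:   (B, N, C) logits from applying the ViT head to each patch token.
    cnn_spatial_logits: (B, N, C) logits from applying the CNN classifier to each
                        position of its final feature map, flattened to N positions.
    """
    B, N, C = vit_patch_logits.shape
    # Match each patch token to the teacher response at the same spatial location.
    student = F.log_softmax(vit_patch_logits.reshape(B * N, C) / tau, dim=-1)
    teacher = F.softmax(cnn_spatial_logits.reshape(B * N, C) / tau, dim=-1)
    # Standard knowledge-distillation scaling by tau^2, averaged over all positions.
    return F.kl_div(student, teacher, reduction="batchmean") * (tau ** 2)
```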

Moreover, CSKD incorporates a Cumulative Knowledge Fusion (CKF) module. The CKF module gradually increases the influence of CNN's global responses as the training progresses. This strategic modulation allows the method to exploit the local inductive bias of CNNs early in the training process while enabling ViTs to fully utilize their global integrative capabilities in later stages.
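
One way such a fusion target could be formed is sketched below, assuming a linear ramp of the blending weight over training epochs; the paper may use a different schedule, and both the names and the exact ramp here are assumptions:

```python
import torch

def cumulative_fusion_target(cnn_spatial_logits, cnn_global_logits, epoch, total_epochs):
    """Sketch of a CKF-style target under an assumed linear schedule.

    cnn_spatial_logits: (B, N, C) per-position teacher logits (local responses).
    cnn_global_logits:  (B, C)    teacher logits after global pooling (global response).
    """
    # Assumed linear ramp: alpha goes from 0 (pure local supervision) to 1 (pure global).
    alpha = min(1.0, epoch / max(1, total_epochs - 1))
    global_per_token = cnn_global_logits.unsqueeze(1).expand_as(cnn_spatial_logits)
    return (1.0 - alpha) * cnn_spatial_logits + alpha * global_per_token
```

In this sketch, the fused target would replace the per-position teacher logits in the loss above, so early epochs emphasize location-aligned local supervision while later epochs increasingly emphasize the CNN's global response.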

The paper reports extensive experiments on ImageNet-1k and various downstream datasets, demonstrating the superior performance of the CSKD approach. The authors plan to make their code publicly available, further contributing to the reproducibility and applicability of their proposed method.