Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation (2306.10687v1)

Published 19 Jun 2023 in cs.CV

Abstract: Deep neural networks have achieved remarkable performance on artificial intelligence tasks. The success of intelligent systems often relies on large-scale models with high computational complexity and storage costs. Over-parameterized networks are often easier to optimize and can achieve better performance, but they are challenging to deploy on resource-limited edge devices. Knowledge Distillation (KD) aims to optimize a lightweight network from the perspective of over-parameterized training. Traditional offline KD transfers knowledge from a cumbersome teacher to a small and fast student network. When a sizeable pre-trained teacher network is unavailable, online KD can improve a group of models through collaborative or mutual learning. Without needing extra models, Self-KD boosts a network by itself using attached auxiliary architectures. KD mainly involves two aspects: knowledge extraction and distillation strategies. Beyond these KD schemes, various KD algorithms are widely used in practical applications, such as multi-teacher KD, cross-modal KD, attention-based KD, data-free KD, and adversarial KD. This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms, as well as empirical studies on performance comparison. Finally, we discuss the open challenges of existing KD works and prospect future directions.
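
To make the knowledge categories in the title concrete, the sketch below illustrates the classic response-based form of offline KD, in which the student mimics the teacher's softened output logits. This is a minimal illustration assuming PyTorch; the function name response_kd_loss and the temperature/alpha hyperparameters are illustrative choices, not the survey's specific formulation.

    import torch
    import torch.nn.functional as F

    def response_kd_loss(student_logits, teacher_logits, labels,
                         temperature=4.0, alpha=0.5):
        # Soften both output distributions with a temperature so the teacher's
        # relative probabilities over non-target classes are exposed.
        soft_student = F.log_softmax(student_logits / temperature, dim=1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
        # KL term is scaled by T^2 to keep its gradient magnitude comparable
        # to the hard-label cross-entropy term.
        kd_term = F.kl_div(soft_student, soft_teacher,
                           reduction="batchmean") * temperature ** 2
        ce_term = F.cross_entropy(student_logits, labels)
        # Blend the distillation and supervised objectives.
        return alpha * kd_term + (1.0 - alpha) * ce_term

Feature-based and relation-based variants keep the same training setup but replace the logit-matching term with losses on intermediate feature maps or on pairwise relations between samples, respectively.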

Authors (4)
  1. Chuanguang Yang (36 papers)
  2. Xinqiang Yu (8 papers)
  3. Zhulin An (43 papers)
  4. Yongjun Xu (81 papers)
Citations (13)
