Reducing the Teacher-Student Gap via Spherical Knowledge Distillation (2010.07485v5)

Published 15 Oct 2020 in cs.LG and cs.CV

Abstract: Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one. Due to the limited capacity of the student, the student would underfit the teacher. Therefore, student performance would unexpectedly drop when distilling from an oversized teacher, termed the capacity gap problem. We investigate this problem by studying the gap in confidence between teacher and student. We find that the magnitude of confidence is not necessary for knowledge distillation and could harm the student's performance if the student is forced to learn it. We propose Spherical Knowledge Distillation to eliminate this gap explicitly, which eases the underfitting problem. We find that this novel knowledge representation can improve compact models with much larger teachers and is robust to temperature. We conducted experiments on both CIFAR100 and ImageNet and achieved significant improvements. Specifically, we train ResNet18 to 73.0% accuracy, a substantial improvement over the previous SOTA and on par with ResNet34, which is almost twice the student's size. The implementation has been shared at https://github.com/forjiuzhou/Spherical-Knowledge-Distillation.
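
The sketch below is a minimal, hypothetical illustration of the idea described in the abstract (the function name, the centering step, and the rescaling constant are assumptions, not the authors' code; see the linked repository for the official implementation): both teacher and student logits are projected onto a common sphere before the usual temperature-scaled KD loss, so the student matches the teacher's class distribution without having to reproduce its confidence magnitude.

```python
# Hypothetical sketch of a "spherical" KD loss, assuming logit normalization
# is how the confidence-magnitude gap is removed. Not the official code.
import torch
import torch.nn.functional as F

def spherical_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KD loss computed on magnitude-normalized logits.

    student_logits, teacher_logits: tensors of shape (batch, num_classes).
    The shared rescaling factor below is an illustrative choice.
    """
    # Center logits so normalization removes only the confidence scale,
    # not the relative class ranking.
    s = student_logits - student_logits.mean(dim=1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)

    # Project both onto the unit sphere, then rescale by a shared magnitude
    # (here the teacher's average logit norm) so the softmax is not too flat.
    scale = t.norm(dim=1, keepdim=True).mean().detach()
    s = F.normalize(s, dim=1) * scale
    t = F.normalize(t, dim=1) * scale

    # Standard temperature-scaled KL divergence between softened distributions.
    log_p_s = F.log_softmax(s / temperature, dim=1)
    p_t = F.softmax(t / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```

In this reading, the student is freed from fitting the teacher's absolute confidence, which the abstract identifies as the source of underfitting with oversized teachers, and the loss becomes less sensitive to the choice of temperature.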

Authors (6)
  1. Jia Guo (101 papers)
  2. Minghao Chen (37 papers)
  3. Yao Hu (106 papers)
  4. Chen Zhu (103 papers)
  5. Xiaofei He (70 papers)
  6. Deng Cai (181 papers)
Citations (6)