
Bayes Conditional Distribution Estimation for Knowledge Distillation Based on Conditional Mutual Information (2401.08732v2)

Published 16 Jan 2024 in cs.LG, cs.CV, cs.IT, and math.IT

Abstract: It is believed that in knowledge distillation (KD), the role of the teacher is to provide an estimate of the unknown Bayes conditional probability distribution (BCPD) to be used in training the student. Conventionally, this estimate is obtained by training the teacher with the maximum log-likelihood (MLL) method. To improve this estimate for KD, this paper introduces the concept of conditional mutual information (CMI) into the estimation of the BCPD and proposes a novel estimator called the maximum CMI (MCMI) method. Specifically, in MCMI estimation, both the log-likelihood and the CMI of the teacher are maximized simultaneously during teacher training. Through Eigen-CAM, it is further shown that maximizing the teacher's CMI value allows the teacher to capture more contextual information in an image cluster. Through a thorough set of experiments, we show that employing a teacher trained via MCMI estimation rather than MLL estimation in various state-of-the-art KD frameworks consistently increases the student's classification accuracy, with gains of up to 3.32%. This suggests that the teacher's BCPD estimate provided by the MCMI method is more accurate than that provided by the MLL method. In addition, such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings. Notably, the student's accuracy increases by up to 5.72% when only 5% of the training samples are available to the student (few-shot), and from 0% to as high as 84% for an omitted class (zero-shot). The code is available at https://github.com/iclr2024mcmi/ICLRMCMI.

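For intuition, below is a minimal sketch of what a joint log-likelihood/CMI teacher objective of this form could look like in PyTorch. This is not the authors' implementation (that is in the linked repository): the CMI term here is only a batch-level approximation, estimated as the average KL divergence between each sample's softmax output and the mean output (centroid) of its class, and `lambda_cmi` is a hypothetical weighting hyperparameter.

```python
import torch
import torch.nn.functional as F


def mcmi_teacher_loss(logits: torch.Tensor, targets: torch.Tensor,
                      num_classes: int, lambda_cmi: float = 0.1) -> torch.Tensor:
    """Cross-entropy minus a weighted CMI estimate: minimizing this jointly
    maximizes the teacher's log-likelihood and an approximate CMI."""
    # Maximizing log-likelihood == minimizing cross-entropy.
    ce = F.cross_entropy(logits, targets)

    # Teacher's estimate of the conditional distribution P(label | x).
    probs = F.softmax(logits, dim=1)

    # Batch-level CMI approximation: average KL divergence between each
    # sample's output distribution and the centroid of its class.
    cmi = logits.new_zeros(())
    classes_seen = 0
    for c in range(num_classes):
        mask = targets == c
        if mask.sum() < 2:
            continue
        p_c = probs[mask]
        centroid = p_c.mean(dim=0, keepdim=True)
        kl = (p_c * (p_c.clamp_min(1e-12).log()
                     - centroid.clamp_min(1e-12).log())).sum(dim=1).mean()
        cmi = cmi + kl
        classes_seen += 1
    if classes_seen > 0:
        cmi = cmi / classes_seen

    # Subtracting the CMI term means gradient descent maximizes it.
    return ce - lambda_cmi * cmi
```

In a KD pipeline, a teacher trained with an objective like this would then supply soft targets to the student exactly as in standard KD; only the teacher's training loss changes.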
Authors (4)
  1. Linfeng Ye (10 papers)
  2. Shayan Mohajer Hamidi (15 papers)
  3. Renhao Tan (4 papers)
  4. En-hui Yang (19 papers)
Citations (6)