ReffAKD: Resource-efficient Autoencoder-based Knowledge Distillation (2404.09886v1)

Published 15 Apr 2024 in cs.LG and cs.CV

Abstract: In this research, we propose an innovative method to boost Knowledge Distillation efficiency without the need for resource-heavy teacher models. Knowledge Distillation trains a smaller "student" model with guidance from a larger "teacher" model, which is computationally costly. However, the main benefit comes from the soft labels provided by the teacher, which help the student grasp nuanced class similarities. In our work, we propose an efficient method for generating these soft labels, thereby eliminating the need for a large teacher model. We employ a compact autoencoder to extract essential features and calculate similarity scores between different classes. Afterward, we apply the softmax function to these similarity scores to obtain a soft probability vector. This vector serves as valuable guidance during the training of the student model. Our extensive experiments on various datasets, including CIFAR-100, Tiny ImageNet, and Fashion MNIST, demonstrate the superior resource efficiency of our approach compared to traditional knowledge distillation methods that rely on large teacher models. Importantly, our approach consistently achieves similar or even superior model accuracy. We also perform a comparative study with various recently developed knowledge distillation techniques, showing that our approach achieves competitive performance while using significantly fewer resources. We also show that our approach can easily be added to any logit-based knowledge distillation method. This research contributes to making knowledge distillation more accessible and cost-effective for practical applications, making it a promising avenue for improving the efficiency of model training. The code for this work is available at https://github.com/JEKimLab/ReffAKD.
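
The abstract outlines the core pipeline: encode images with a compact autoencoder, compute similarity scores between classes, and pass them through a softmax to obtain soft labels that stand in for a teacher. The sketch below is a minimal illustration of that idea, not the authors' released code; the cosine-similarity choice, the temperature value, and the function name are assumptions made for clarity.

```python
# Minimal sketch (assumed implementation): build per-class soft labels from
# autoencoder embeddings via cosine similarity and a temperature-scaled softmax.
import torch
import torch.nn.functional as F

def class_soft_labels(class_features: torch.Tensor, T: float = 4.0) -> torch.Tensor:
    """class_features: (num_classes, d) mean encoder embedding of each class.
    Returns a (num_classes, num_classes) matrix whose row c is the soft-label
    distribution used as teacher-style guidance for samples of class c."""
    feats = F.normalize(class_features, dim=1)   # unit-norm embeddings
    sim = feats @ feats.t()                      # pairwise cosine similarity between classes
    return F.softmax(sim / T, dim=1)             # soften with temperature T

# Usage example: 100 CIFAR-100 classes with hypothetical 128-d encoder features.
soft = class_soft_labels(torch.randn(100, 128))
print(soft.shape)             # torch.Size([100, 100]); each row sums to 1
```

These soft-label rows can then be plugged into any logit-based distillation loss (e.g., KL divergence against the student's softened logits) in place of a large teacher's outputs, which is the resource saving the paper targets.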

Authors (2)
  1. Divyang Doshi (1 paper)
  2. Jung-Eun Kim (15 papers)