AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting (2405.08019v1)

Published 11 May 2024 in cs.LG and cs.AI

Abstract: Knowledge distillation, a widely used model compression technique, works by transferring knowledge from a cumbersome teacher model to a lightweight student model. The technique involves jointly optimizing the task-specific and knowledge distillation losses, each assigned a weight. Although these weights play a crucial role in the performance of the distillation process, current methods give equal weight to both losses, leading to suboptimal performance. In this paper, we propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning that adaptively weighs the losses at the instance level. The technique is based on the notion that sample difficulty increases with teacher loss. Our method follows a plug-and-play paradigm that can be applied on top of any task-specific and distillation objectives. Experiments show that our method performs better than the conventional knowledge distillation method and existing instance-level loss functions.
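The sketch below illustrates the general idea of instance-level adaptive loss weighting described in the abstract: per-sample weights between the task loss and the distillation loss are derived from the teacher's own loss, used as a difficulty proxy. The specific weighting function (an exponential of the negative teacher loss) and the cross-entropy/KL objectives are illustrative assumptions, not the paper's exact AdaKD formulation.

```python
# Minimal PyTorch sketch of instance-level adaptive loss weighting for
# knowledge distillation. The mapping from teacher loss to weight is an
# assumption made for illustration; AdaKD's exact scheme may differ.
import torch
import torch.nn.functional as F


def adaptive_kd_loss(student_logits, teacher_logits, targets, temperature=2.0):
    """Combine per-sample task and distillation losses with adaptive weights."""
    # Per-sample task (cross-entropy) loss for the student.
    task_loss = F.cross_entropy(student_logits, targets, reduction="none")

    # Per-sample distillation (KL) loss between temperature-softened outputs.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (temperature ** 2)

    # The teacher's own task loss serves as a proxy for sample difficulty.
    with torch.no_grad():
        teacher_loss = F.cross_entropy(teacher_logits, targets, reduction="none")
        # Easy samples (low teacher loss) lean on the distillation signal;
        # hard samples fall back to the ground-truth task loss.
        alpha = torch.exp(-teacher_loss)

    per_sample = alpha * kd_loss + (1.0 - alpha) * task_loss
    return per_sample.mean()


# Toy usage: a batch of 8 samples with 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = adaptive_kd_loss(student_logits, teacher_logits, targets)
loss.backward()
```

Because the weighting operates on per-sample loss values, it is plug-and-play in the sense the abstract describes: the cross-entropy and KL terms above could be swapped for any other task-specific and distillation objectives (e.g., CTC for ASR) without changing the weighting logic.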

Authors (5)
  1. Shreyan Ganguly (5 papers)
  2. Roshan Nayak (3 papers)
  3. Rakshith Rao (1 paper)
  4. Ujan Deb (1 paper)
  5. Prathosh AP (23 papers)