
Unraveling Key Factors of Knowledge Distillation (2312.08585v2)

Published 14 Dec 2023 in cs.CL and cs.LG

Abstract: Knowledge distillation, a technique for model compression and performance enhancement, has gained significant traction in Neural Machine Translation (NMT). However, existing research primarily focuses on empirical applications, and there is a lack of comprehensive understanding of how student model capacity, data complexity, and decoding strategies collectively influence distillation effectiveness. Addressing this gap, our study conducts an in-depth investigation into these factors, particularly focusing on their interplay in word-level and sequence-level distillation within NMT. Through extensive experimentation across datasets like IWSLT13 En$\rightarrow$Fr, IWSLT14 En$\rightarrow$De, and others, we empirically validate hypotheses related to the impact of these factors on knowledge distillation. Our research not only elucidates the significant influence of model capacity, data complexity, and decoding strategies on distillation effectiveness but also introduces a novel, optimized distillation approach. This approach, when applied to the IWSLT14 De$\rightarrow$En translation task, achieves state-of-the-art performance, demonstrating its practical efficacy in advancing the field of NMT.
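
For readers unfamiliar with the two distillation settings the abstract contrasts, here is a minimal sketch (not the paper's implementation) of how word-level and sequence-level knowledge distillation are typically set up in a PyTorch NMT pipeline. The tensor shapes, the temperature value, and the `teacher_model.generate` decoding helper are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def word_level_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Word-level KD: push the student's per-token output distribution toward
    the teacher's softened distribution with a KL-divergence term.

    Both logit tensors are assumed to have shape (batch, seq_len, vocab_size).
    """
    t = temperature
    vocab = student_logits.size(-1)
    # Flatten to (batch * seq_len, vocab) so 'batchmean' averages over tokens.
    student_log_probs = F.log_softmax(student_logits.reshape(-1, vocab) / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits.reshape(-1, vocab) / t, dim=-1)
    # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)


def sequence_level_kd_targets(teacher_model, src_batch, beam_size=5):
    """Sequence-level KD: decode the source sentences with the teacher (e.g.
    via beam search) and train the student on the teacher's translations
    instead of the reference translations. `teacher_model.generate` is a
    hypothetical helper standing in for whatever decoding API is available.
    """
    with torch.no_grad():
        return teacher_model.generate(src_batch, num_beams=beam_size)
```

In the word-level case the student sees the full distribution over the vocabulary at every target position; in the sequence-level case the teacher's decoded output replaces the training target, which is where the decoding strategy studied in the paper enters the picture.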

Authors (5)
  1. Jingxuan Wei
  2. Linzhuang Sun
  3. Xu Tan
  4. Bihui Yu
  5. Ruifeng Guo