Sinkhorn Distance Minimization for Knowledge Distillation (2402.17110v1)

Published 27 Feb 2024 in cs.LG and cs.CL

Abstract: Knowledge distillation (KD) has been widely adopted to compress LLMs. Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when there is little overlap between the teacher and student distributions. In this paper, we show that the KL, RKL, and JS divergences respectively suffer from mode-averaging, mode-collapsing, and mode-underestimation, which degrades logits-based KD on diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to provide a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by leveraging properties of the Sinkhorn metric, we can move beyond sample-wise KD, which restricts the perception of divergence to each individual teacher-student pair. Instead, we propose a batch-wise reformulation that captures the geometric intricacies of distributions across samples in high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art methods on LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
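For intuition, below is a minimal NumPy sketch of the entropy-regularized Sinkhorn iterations (Cuturi, 2013) that underlie the Sinkhorn distance, applied to a pair of toy teacher and student distributions. This is not the paper's SinKD loss or batch-wise formulation; the function name, the 0/1 ground cost, the toy probability vectors, and the epsilon and iteration-count values are illustrative assumptions.

```python
import numpy as np

def sinkhorn_distance(p, q, cost, epsilon=0.1, n_iters=50):
    """Entropy-regularized optimal transport cost between two discrete
    distributions p (shape n) and q (shape m) under a cost matrix (n x m).

    Generic Sinkhorn-Knopp matrix scaling; hyperparameters are illustrative.
    """
    K = np.exp(-cost / epsilon)   # Gibbs kernel from the ground cost
    u = np.ones_like(p)
    for _ in range(n_iters):      # alternating row/column scaling updates
        v = q / (K.T @ u)
        u = p / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)   # approximate transport plan
    return float(np.sum(plan * cost))    # transport cost under that plan

# Toy example: teacher vs. student softmax outputs over 4 classes,
# with a simple 0/1 cost between class indices (any ground metric works).
teacher = np.array([0.70, 0.20, 0.05, 0.05])
student = np.array([0.40, 0.30, 0.20, 0.10])
cost = 1.0 - np.eye(4)
print(sinkhorn_distance(teacher, student, cost))
```

The entropic regularization keeps the objective smooth and computable with simple matrix-vector scalings, which is what makes such a transport-based discrepancy usable as a differentiable KD loss; SinKD's batch-wise reformulation, as described in the abstract, further compares distributions across samples in a batch rather than within each teacher-student pair alone.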

Authors (10)
  1. Xiao Cui (11 papers)
  2. Yulei Qin (17 papers)
  3. Yuting Gao (25 papers)
  4. Enwei Zhang (9 papers)
  5. Zihan Xu (31 papers)
  6. Tong Wu (228 papers)
  7. Ke Li (722 papers)
  8. Xing Sun (93 papers)
  9. Wengang Zhou (153 papers)
  10. Houqiang Li (236 papers)
Citations (2)