Sinkhorn Distance Minimization for Knowledge Distillation (2402.17110v1)

Published 27 Feb 2024 in cs.LG and cs.CL

Abstract: Knowledge distillation (KD) has been widely adopted to compress LLMs. Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when there is little overlap between the teacher and student distributions. In this paper, we show that the KL, RKL, and JS divergences respectively suffer from mode-averaging, mode-collapsing, and mode-underestimation, which degrades logits-based KD on diverse NLP tasks. We propose Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to provide a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, by leveraging properties of the Sinkhorn metric, we can move beyond sample-wise KD, which restricts the perception of divergence to each individual teacher-student pair. Instead, we propose a batch-wise reformulation that captures the geometric intricacies of distributions across samples in high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art methods on LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
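For intuition, below is a minimal NumPy sketch of the entropy-regularized Sinkhorn iterations (Cuturi, 2013) that underlie the Sinkhorn distance, applied to a pair of toy teacher and student distributions. This is not the paper's SinKD loss or batch-wise formulation; the function name, the 0/1 ground cost, the toy probability vectors, and the epsilon and iteration-count values are illustrative assumptions.

```python
import numpy as np

def sinkhorn_distance(p, q, cost, epsilon=0.1, n_iters=50):
    """Entropy-regularized optimal transport cost between two discrete
    distributions p (shape n) and q (shape m) under a cost matrix (n x m).

    Generic Sinkhorn-Knopp matrix scaling; hyperparameters are illustrative.
    """
    K = np.exp(-cost / epsilon)   # Gibbs kernel from the ground cost
    u = np.ones_like(p)
    for _ in range(n_iters):      # alternating row/column scaling updates
        v = q / (K.T @ u)
        u = p / (K @ v)
    plan = np.diag(u) @ K @ np.diag(v)   # approximate transport plan
    return float(np.sum(plan * cost))    # transport cost under that plan

# Toy example: teacher vs. student softmax outputs over 4 classes,
# with a simple 0/1 cost between class indices (any ground metric works).
teacher = np.array([0.70, 0.20, 0.05, 0.05])
student = np.array([0.40, 0.30, 0.20, 0.10])
cost = 1.0 - np.eye(4)
print(sinkhorn_distance(teacher, student, cost))
```

The entropic regularization keeps the objective smooth and computable with simple matrix-vector scalings, which is what makes such a transport-based discrepancy usable as a differentiable KD loss; SinKD's batch-wise reformulation, as described in the abstract, further compares distributions across samples in a batch rather than within each teacher-student pair alone.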

Authors (10)
  1. Xiao Cui (11 papers)
  2. Yulei Qin (17 papers)
  3. Yuting Gao (25 papers)
  4. Enwei Zhang (9 papers)
  5. Zihan Xu (31 papers)
  6. Tong Wu (228 papers)
  7. Ke Li (722 papers)
  8. Xing Sun (93 papers)
  9. Wengang Zhou (153 papers)
  10. Houqiang Li (236 papers)
Citations (2)