
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation (2406.13114v2)

Published 19 Jun 2024 in cs.CL and cs.AI

Abstract: LLMs have significantly advanced various natural language processing tasks, but deploying them remains computationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to more compact student models. In particular, sequence-level KD, which distills rationale-based reasoning processes instead of merely final outcomes, shows great potential in enhancing students' reasoning capabilities. However, current methods struggle with sequence-level KD under long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting representative head-domain examples and synthesizing tail-domain examples, BalDistill achieves state-of-the-art performance across diverse long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
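
The abstract describes an iterative loop: within a fixed budget, select representative examples from head domains, have the teacher synthesize additional examples for tail domains, annotate everything with teacher rationales, and fine-tune the student stage by stage. The Python sketch below illustrates that loop under stated assumptions: the random selection heuristic, the synthesis prompt, and the `teacher` / `student.finetune` interfaces are illustrative placeholders, not the paper's actual procedures.

```python
# Minimal sketch of a multi-stage balanced distillation loop, based only on the
# abstract above. The sampling heuristics, the synthesis prompt, and the
# teacher/student interfaces are assumptions for illustration.
from collections import defaultdict
import random


def bal_distill(train_data, teacher, student, num_stages=4, budget_per_stage=1000):
    """Iteratively build a balanced distillation set under a fixed budget.

    train_data: list of dicts with keys 'domain' and 'input'.
    teacher:    callable mapping a prompt string to a generated string.
    student:    object exposing a finetune(dataset) routine (assumed interface).
    """
    # Group the raw pool by domain to expose the long-tailed distribution.
    by_domain = defaultdict(list)
    for ex in train_data:
        by_domain[ex["domain"]].append(ex)

    distill_set = []
    for stage in range(num_stages):
        per_domain = budget_per_stage // max(len(by_domain), 1)
        for domain, pool in by_domain.items():
            if len(pool) >= per_domain:
                # Head domain: pick a representative subset (placeholder: random).
                chosen = random.sample(pool, per_domain)
            else:
                # Tail domain: keep all real examples and ask the teacher to
                # synthesize extra ones until the per-domain quota is met.
                chosen = list(pool)
                while len(chosen) < per_domain:
                    seed = random.choice(pool)
                    chosen.append({
                        "domain": domain,
                        "input": teacher(
                            f"Write a new {domain} problem similar to: {seed['input']}"
                        ),
                    })
            # Teacher annotates every selected example with a rationale.
            for ex in chosen:
                ex["rationale"] = teacher(ex["input"])
            distill_set.extend(chosen)

        # Fine-tune the student on the accumulated balanced set after each stage.
        student.finetune(distill_set)

    return student
```

In the actual framework, head-domain selection and tail-domain synthesis follow the paper's criteria rather than the random placeholders above; the sketch only shows how per-stage balancing keeps every domain within the fixed budget.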

Authors (8)
  1. Yuhang Zhou (52 papers)
  2. Jing Zhu (50 papers)
  3. Paiheng Xu (14 papers)
  4. Xiaoyu Liu (138 papers)
  5. Xiyao Wang (26 papers)
  6. Danai Koutra (70 papers)
  7. Wei Ai (48 papers)
  8. Furong Huang (150 papers)
Citations (1)