
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm (2110.08190v4)

Published 15 Oct 2021 in cs.CL

Abstract: Conventional wisdom in pruning Transformer-based language models holds that pruning reduces model expressiveness and is therefore more likely to underfit than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address this overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks.
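
The abstract does not spell out the training loop, but the general pattern it describes, pruning during fine-tuning while distilling from the dense model, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the cubic sparsity schedule, the magnitude-pruning criterion, the loss weights (alpha, temperature), and the Hugging Face-style `.logits` interface are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn.functional as F

def cubic_sparsity_schedule(step, total_steps, final_sparsity=0.9):
    """Ramp sparsity from 0 to final_sparsity over fine-tuning (a common
    gradual-pruning choice; the paper's actual schedule may differ)."""
    t = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - t) ** 3)

def magnitude_prune_(model, sparsity):
    """Zero out the smallest-magnitude weights in each Linear layer.
    Illustrative unstructured magnitude pruning, applied in place."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            w = module.weight.data
            k = int(sparsity * w.numel())
            if k > 0:
                threshold = w.abs().flatten().kthvalue(k).values
                w.mul_((w.abs() > threshold).to(w.dtype))

def distillation_step(student, teacher, batch, labels, step, total_steps,
                      temperature=2.0, alpha=0.5):
    """One fine-tuning step: progressively prune the student, then combine
    the task loss with a KD loss against the frozen dense teacher."""
    magnitude_prune_(student, cubic_sparsity_schedule(step, total_steps))

    with torch.no_grad():
        teacher_logits = teacher(**batch).logits   # assumes an HF-style model output
    student_logits = student(**batch).logits

    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * task_loss + (1.0 - alpha) * kd_loss
```

The returned loss would be backpropagated through the student as usual; the distillation term is what the paper argues counteracts overfitting when sparsity is introduced at the fine-tuning stage.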

Authors (11)
  1. Shaoyi Huang (19 papers)
  2. Dongkuan Xu (43 papers)
  3. Ian E. H. Yen (8 papers)
  4. Yijue Wang (6 papers)
  5. Bingbing Li (24 papers)
  6. Shiyang Chen (23 papers)
  7. Mimi Xie (14 papers)
  8. Sanguthevar Rajasekaran (21 papers)
  9. Hang Liu (135 papers)
  10. Caiwen Ding (98 papers)
  11. Sung-En Chang (10 papers)
Citations (26)