Attention Temperature Matters in Abstractive Summarization Distillation (2106.03441v3)

Published 7 Jun 2021 in cs.CL

Abstract: Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling-based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling-based methods. We also find that both the pseudo labels and the summaries produced by our students are shorter and more abstractive. Our code is available at https://github.com/Shengqiang-Zhang/plate.
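The mechanism at the heart of the method is the attention temperature: the attention logits are divided by an extra factor τ before the softmax, so τ > 1 flattens the attention distribution while τ < 1 sharpens it. The sketch below shows this generic mechanism in plain PyTorch; the function name, tensor shapes, and the example value τ = 2.0 are illustrative assumptions rather than the paper's exact implementation (the authors' code is at the linked repository).

```python
import math
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, tau=1.0, mask=None):
    """Scaled dot-product attention with an extra temperature tau.

    tau > 1 softens (flattens) the attention distribution; tau < 1 sharpens it.
    Illustrative sketch of temperature-scaled attention, not the PLATE code.
    """
    d_k = q.size(-1)
    # Standard 1/sqrt(d_k) scaling, further divided by the temperature tau.
    scores = q @ k.transpose(-2, -1) / (math.sqrt(d_k) * tau)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Example: softened attention with a hypothetical tau = 2.0
q = k = v = torch.randn(1, 4, 8, 16)  # (batch, heads, seq_len, head_dim)
out, attn = attention_with_temperature(q, k, v, tau=2.0)
```

In the paper's setting, this temperature is manipulated when the teacher generates pseudo labels for the student; the specific temperature schedule for teacher and student follows the paper and is not reproduced here.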

Authors (4)
  1. Shengqiang Zhang (6 papers)
  2. Xingxing Zhang (65 papers)
  3. Hangbo Bao (17 papers)
  4. Furu Wei (291 papers)
Citations (17)