EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets (2101.00063v2)

Published 31 Dec 2020 in cs.CL and cs.AI

Abstract: Heavily overparameterized language models such as BERT, XLNet and T5 have achieved impressive success in many NLP tasks. However, their high model complexity requires enormous computation resources and extremely long training time for both pre-training and fine-tuning. Many works have studied model compression on large NLP models, but they focus only on reducing inference time while still requiring an expensive training process. Other works use extremely large batch sizes to shorten the pre-training time, at the expense of higher computational resource demands. In this paper, inspired by the Early-Bird Lottery Tickets recently studied for computer vision tasks, we propose EarlyBERT, a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models. By slimming the self-attention and fully-connected sub-layers inside a transformer, we are the first to identify structured winning tickets in the early stage of BERT training. We apply those tickets towards efficient BERT training, and conduct comprehensive pre-training and fine-tuning experiments on GLUE and SQuAD downstream tasks. Our results show that EarlyBERT achieves comparable performance to standard BERT, with 35-45% less training time. Code is available at https://github.com/VITA-Group/EarlyBERT.
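
The core mechanism described in the abstract, learning importance coefficients for attention heads and intermediate feed-forward neurons and then pruning the smallest ones early in training, can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration of that idea, not the authors' released implementation (see the linked repository for that); module names such as `SlimmableEncoderLayer`, the `l1_penalty` and `draw_ticket` helpers, and all hyperparameters are illustrative assumptions.

```python
# Minimal PyTorch sketch (assumed, not the paper's code) of the EarlyBERT idea:
# learnable slimming coefficients on attention heads and FFN neurons, an L1
# penalty that drives unimportant coefficients toward zero during the first
# training steps, and pruning of the smallest coefficients to obtain a
# structured "early-bird" ticket.

import torch
import torch.nn as nn


class SlimmableSelfAttention(nn.Module):
    """Multi-head self-attention with one learnable coefficient per head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.head_coef = nn.Parameter(torch.ones(n_heads))  # slimming coefficients

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, t, self.n_heads, self.head_dim)
        q, k, v = (z.view(shape).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v                                   # (b, heads, t, head_dim)
        heads = heads * self.head_coef.view(1, -1, 1, 1)   # scale heads before concat
        return self.proj(heads.transpose(1, 2).reshape(b, t, d))


class SlimmableEncoderLayer(nn.Module):
    """Transformer encoder layer with per-neuron coefficients on the FFN."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = SlimmableSelfAttention(d_model, n_heads)
        self.ff1, self.ff2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn_coef = nn.Parameter(torch.ones(d_ff))     # slimming coefficients

    def forward(self, x):
        x = self.norm1(x + self.attn(x))
        h = torch.relu(self.ff1(x)) * self.ffn_coef        # scale intermediate neurons
        return self.norm2(x + self.ff2(h))


def l1_penalty(model, lam=1e-4):
    """L1 regularization on slimming coefficients, added to the task loss."""
    return lam * sum(p.abs().sum() for n, p in model.named_parameters()
                     if n.endswith("head_coef") or n.endswith("ffn_coef"))


def draw_ticket(model, keep_ratio=0.5):
    """Binary masks keeping the largest-magnitude coefficients (the structured ticket)."""
    masks = {}
    for n, p in model.named_parameters():
        if n.endswith("head_coef") or n.endswith("ffn_coef"):
            k = max(1, int(keep_ratio * p.numel()))
            thresh = p.detach().abs().topk(k).values.min()
            masks[n] = (p.detach().abs() >= thresh).float()
    return masks


# Usage sketch: after each short training interval, draw a ticket and compare it
# with the previous one; once consecutive masks barely change, the early-bird
# ticket has emerged and the pruned model is trained for the remaining steps.
layer = SlimmableEncoderLayer()
x = torch.randn(2, 16, 768)
loss = layer(x).pow(2).mean() + l1_penalty(layer)
loss.backward()
ticket = draw_ticket(layer, keep_ratio=0.5)
```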

Authors (6)
  1. Xiaohan Chen (30 papers)
  2. Yu Cheng (354 papers)
  3. Shuohang Wang (69 papers)
  4. Zhe Gan (135 papers)
  5. Zhangyang Wang (375 papers)
  6. Jingjing Liu (139 papers)
Citations (92)
