On the Transformer Growth for Progressive BERT Training (2010.12562v3)

Published 23 Oct 2020 in cs.CL and cs.LG

Abstract: Due to the excessive cost of large-scale language model pre-training, considerable efforts have been made to train BERT progressively -- starting from an inferior but low-cost model and gradually growing it to increase the computational complexity. Our objective is to advance the understanding of Transformer growth and discover principles that guide progressive training. First, we find that, similar to network architecture search, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators that balance multiple dimensions (e.g., the depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively, while achieving comparable performance.
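
As a rough illustration of the compound-growth idea described in the abstract (growing depth, width, and input length together rather than one dimension at a time), here is a minimal Python sketch. The configuration values and the `compound_growth_schedule` helper are illustrative assumptions, not the paper's actual training code.

```python
# Minimal sketch of a compound growth schedule for progressive BERT training.
# All names and values here are hypothetical placeholders; the paper's method
# additionally studies which growth operator to use in each dimension.

from dataclasses import dataclass


@dataclass(frozen=True)
class GrowthConfig:
    num_layers: int    # depth
    hidden_size: int   # width (in practice, snapped to a multiple of the head count)
    max_seq_len: int   # input length


def compound_growth_schedule(start: GrowthConfig, target: GrowthConfig,
                             num_stages: int) -> list[GrowthConfig]:
    """Interpolate all dimensions jointly (compound growth) instead of
    growing a single dimension at a time."""
    stages = []
    for s in range(1, num_stages + 1):
        frac = s / num_stages
        stages.append(GrowthConfig(
            num_layers=round(start.num_layers + frac * (target.num_layers - start.num_layers)),
            hidden_size=round(start.hidden_size + frac * (target.hidden_size - start.hidden_size)),
            max_seq_len=round(start.max_seq_len + frac * (target.max_seq_len - start.max_seq_len)),
        ))
    return stages


if __name__ == "__main__":
    small = GrowthConfig(num_layers=3, hidden_size=256, max_seq_len=128)
    base = GrowthConfig(num_layers=12, hidden_size=768, max_seq_len=512)
    for i, cfg in enumerate(compound_growth_schedule(small, base, num_stages=3), 1):
        print(f"stage {i}: {cfg}")
```

In a real progressive-training pipeline, each new stage would initialize the larger model from the previous one via a growth operator (e.g., layer stacking for depth or parameter duplication for width); the paper's controlled comparisons are about which operator to choose in each dimension.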

Authors (6)
  1. Xiaotao Gu (32 papers)
  2. Liyuan Liu (49 papers)
  3. Hongkun Yu (17 papers)
  4. Jing Li (621 papers)
  5. Chen Chen (753 papers)
  6. Jiawei Han (263 papers)
Citations (45)