
Masked Structural Growth for 2x Faster Language Model Pre-training (2305.02869v3)

Published 4 May 2023 in cs.CL

Abstract: Accelerating LLM pre-training is a critical issue in present research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. In terms of growth schedule, the impact of each single dimension on a schedule's efficiency is under-explored by existing work. Regarding the growth operators, existing methods rely on the initialization of new weights to inherit knowledge, and achieve only non-strict function preservation, limiting further improvements in training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of LLMs while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
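The key mechanism behind MSG's strictly function-preserving operators is a mask attached to newly grown units: new weights can be initialized arbitrarily, because a zero-valued mask removes their contribution at the moment of growth and is then annealed toward one during continued training. The sketch below is illustrative only, not the authors' implementation (see the linked repository for the actual MSG code); the class name `MaskedGrownLinear`, the method `set_new_unit_mask`, and the linear annealing idea are assumptions made for this example, which covers only widening a single linear layer.

```python
import torch
import torch.nn as nn

class MaskedGrownLinear(nn.Module):
    """Widen a linear layer's output dimension while preserving its function.

    Newly added output units are multiplied by a mask initialized to 0, so
    the grown layer computes exactly the same function as before growth,
    regardless of how the new weights are initialized. The mask is then
    annealed toward 1 as training continues.
    """

    def __init__(self, old_layer: nn.Linear, new_out_features: int):
        super().__init__()
        old_out, in_features = old_layer.weight.shape
        assert new_out_features >= old_out
        self.old_out = old_out

        self.linear = nn.Linear(in_features, new_out_features)
        with torch.no_grad():
            # Copy the old weights; the new rows keep their arbitrary init.
            self.linear.weight[:old_out] = old_layer.weight
            self.linear.bias[:old_out] = old_layer.bias

        # Mask: 1 for pre-existing units, 0 for newly grown units.
        mask = torch.zeros(new_out_features)
        mask[:old_out] = 1.0
        self.register_buffer("mask", mask)

    def set_new_unit_mask(self, value: float) -> None:
        """Anneal the mask on the grown units from 0 toward 1."""
        with torch.no_grad():
            self.mask[self.old_out:] = value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.mask


# Strict function preservation at the moment of growth:
old = nn.Linear(8, 16)
grown = MaskedGrownLinear(old, new_out_features=32)
x = torch.randn(2, 8)
assert torch.allclose(old(x), grown(x)[:, :16], atol=1e-6)
assert torch.all(grown(x)[:, 16:] == 0)  # new units contribute nothing yet
```

The actual MSG operators extend this masking idea to all growth dimensions described in the paper (layer count, hidden size, FFN width, attention heads) and handle components such as layer normalization; the sketch above only illustrates the width case for a single layer.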

Authors (4)
  1. Yiqun Yao (14 papers)
  2. Zheng Zhang (488 papers)
  3. Jing Li (621 papers)
  4. Yequan Wang (44 papers)
Citations (13)