Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training (2405.15319v2)

Published 24 May 2024 in cs.CL and cs.AI

Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://LLM-stacking.github.io.

Authors (8)
  1. Wenyu Du (21 papers)
  2. Tongxu Luo (9 papers)
  3. Zihan Qiu (19 papers)
  4. Zeyu Huang (31 papers)
  5. Yikang Shen (62 papers)
  6. Reynold Cheng (31 papers)
  7. Yike Guo (144 papers)
  8. Jie Fu (229 papers)
Citations (4)

Summary

Efficient Pre-training of LLMs Using Depthwise Stacking Operator

This essay provides an in-depth analysis of a paper focusing on improving the efficiency of pre-training LLMs using a depthwise stacking growth operator, denoted as $G_{\text{stack}}$. The paper addresses three primary obstacles in the domain of model growth methods: the lack of comprehensive evaluation, untested scalability, and the absence of empirical guidelines.

Summary of Contributions

The paper systematically evaluates various model growth techniques and identifies that depthwise stacking ($G_{\text{stack}}$) significantly accelerates the pre-training process compared to training models from scratch. Key findings of the paper include:

  1. Comprehensive evaluation of model growth operators.
  2. Validation of the $G_{\text{stack}}$ operator's scalability.
  3. Establishment of guidelines for the practical application of $G_{\text{stack}}$ in LLM pre-training.

Evaluation of Model Growth Techniques

The authors categorize existing model growth techniques into four atomic growth operators: direct duplication ($G_{\text{direct}}$), learnable parameter expansion ($G_{\text{learn}}$), zero initialization ($G_{\text{zero}}$), and random initialization ($G_{\text{random}}$). Each operator is evaluated in both depthwise and widthwise growth directions under a standardized LLM pre-training setting. The evaluation reveals that depthwise stacking, $G_{\text{stack}}$, consistently outperforms the other operators across multiple benchmarks.
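To make the depthwise operator concrete, below is a minimal sketch of how a stacking step could be implemented. It is not the authors' released code; the function name `g_stack`, the PyTorch `nn.ModuleList` interface, and the growth-factor handling are illustrative assumptions.

```python
import copy
import torch.nn as nn

def g_stack(layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Depthwise stacking sketch: grow an L-layer model to (growth_factor * L)
    layers by duplicating the already-trained block of layers."""
    grown = []
    for _ in range(growth_factor):
        # Deep-copy each trained layer so the duplicates can diverge
        # from one another during continued pre-training.
        grown.extend(copy.deepcopy(layer) for layer in layers)
    return nn.ModuleList(grown)
```

Under this sketch, a small trained 6-layer model grown with a factor of 4 would yield a 24-layer model whose weights are initialized from the trained stack before pre-training continues.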

Scalability and Efficiency

The authors conduct extensive experiments to test the scalability of the $G_{\text{stack}}$ operator:

  1. Model Size Scaling: Experiments with model sizes up to 7B parameters and training data up to 300B tokens show that $G_{\text{stack}}$ maintains its efficiency, achieving a 54.5% speedup in pre-training for the 3B model and similar gains for larger models (see the worked speedup calculation below).
  2. Training Token Scaling: Pre-training a 410M-parameter LLM with 750B tokens demonstrates that $G_{\text{stack}}$ achieves continuous acceleration, indicating its potential for long-duration training tasks.
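As an illustration of how the reported speedups relate to token counts, the abstract's 7B example (300B tokens from scratch versus 194B tokens with $G_{\text{stack}}$ to reach the same loss) is consistent with measuring speedup as the extra tokens the scratch baseline needs relative to the grown model:

$$\text{speedup} = \frac{T_{\text{scratch}} - T_{G_{\text{stack}}}}{T_{G_{\text{stack}}}} = \frac{300\text{B} - 194\text{B}}{194\text{B}} \approx 54.6\%$$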

Practical Guidelines

The paper addresses the lack of empirical guidelines for model growth by estimating the optimal growth timing ($d$) and growth factor ($g$):

  1. Growth Timing ($d$): The authors fit a logarithmic function to determine the optimal $d$ based on model parameters and computational budget, generally finding that a value between 10B and 20B tokens optimizes efficiency (a sketch of such a fit follows this list).
  2. Growth Factor ($g$): Experiments suggest an optimal growth factor between 2 and 4, with a constant factor of 4 recommended for practical applications.
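The following is a hypothetical sketch of fitting a logarithmic guideline for growth timing; the sample compute budgets, timings, and the exact functional form are illustrative assumptions, not the paper's fitted coefficients or data.

```python
import numpy as np

# Illustrative (made-up) observations: compute budgets C in FLOPs and the
# growth timings d (in tokens) that worked best at each budget.
C = np.array([1e19, 1e20, 1e21, 1e22])
d = np.array([5e9, 10e9, 15e9, 20e9])

# Fit log10(d) = a * log10(C) + b, i.e. a straight line in log-log space.
a, b = np.polyfit(np.log10(C), np.log10(d), deg=1)

def suggested_growth_timing(compute_budget: float) -> float:
    """Predict a growth timing (in tokens) from the fitted guideline."""
    return 10 ** (a * np.log10(compute_budget) + b)

print(f"Suggested d at C = 5e21 FLOPs: {suggested_growth_timing(5e21):.3e} tokens")
```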

Implications and Future Research

The findings have significant implications for both the theoretical and practical aspects of LLM pre-training. The demonstrated scalability of the $G_{\text{stack}}$ operator suggests that this method can be effectively applied to very large models and extensive training datasets, which is critical as model sizes continue to grow.

Future research could focus on:

  1. Further Exploration of Growth Strategies: Investigate more sophisticated growth strategies beyond depthwise stacking to identify methods that could offer even greater efficiency.
  2. Longitudinal Studies: Conduct longer-term experiments with a wider range of model sizes and training data to solidify the practical guidelines and generalize findings.
  3. Function Preservation and Noise Introduction: Explore the role of function preservation in model growth, as initial findings indicate that controlled introduction of noise can sometimes improve performance.

Conclusion

This paper presents a thorough and systematic evaluation of model growth techniques, with a particular focus on the depthwise stacking operator $G_{\text{stack}}$. By addressing key obstacles in the efficient pre-training of LLMs, the authors provide valuable insights and practical guidelines that can significantly enhance the pre-training process, offering a noteworthy contribution to the field of generative AI and LLM research.