Model Merging in Pre-training of Large Language Models (2505.12082v3)

Published 17 May 2025 in cs.CL and cs.LG

Abstract: Model merging has emerged as a promising technique for enhancing LLMs, though its application in large-scale pre-training remains relatively unexplored. In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging.

Authors (26)
  1. Yunshui Li (18 papers)
  2. Yiyuan Ma (4 papers)
  3. Shen Yan (47 papers)
  4. Chaoyi Zhang (51 papers)
  5. Jing Liu (526 papers)
  6. Jianqiao Lu (20 papers)
  7. Ziwen Xu (16 papers)
  8. Mengzhao Chen (19 papers)
  9. Minrui Wang (2 papers)
  10. Shiyi Zhan (1 paper)
  11. Jin Ma (64 papers)
  12. Xunhao Lai (4 papers)
  13. Yao Luo (27 papers)
  14. Xingyan Bin (6 papers)
  15. Hongbin Ren (3 papers)
  16. Mingji Han (4 papers)
  17. Wenhao Hao (2 papers)
  18. Bairen Yi (4 papers)
  19. Lingjun Liu (13 papers)
  20. Bole Ma (6 papers)

Summary

The paper presents a comprehensive study of incorporating model merging techniques into the pre-training process of LLMs. Its key innovation is Pre-trained Model Averaging (PMA), a strategy that merges several intermediate checkpoints from a single pre-training trajectory to improve model performance, accelerate training, and enhance overall training stability.

Main Contributions:

  • Pre-trained Model Averaging (PMA):

The authors introduce PMA as a framework for merging checkpoints obtained during the stable phase of pre-training. By averaging weights from models trained with a constant learning rate (before, or in place of, the cosine decay phase), the merged model simulates the effect of annealing, yields performance comparable to annealed counterparts, and reduces computational cost.
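
A minimal sketch of this kind of checkpoint averaging is shown below (our illustration, not the authors' code); it assumes stable-phase checkpoints saved as PyTorch state dicts, with hypothetical file names.

```python
# Minimal sketch of PMA-style checkpoint averaging (illustrative, not the authors' code).
# Assumes checkpoints from the constant-learning-rate phase are saved as PyTorch
# state dicts; the file names used below are hypothetical.
import torch

def average_checkpoints(paths):
    """Equal-weight average of the parameters stored in several checkpoints."""
    merged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if merged is None:
            merged = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged[k] += v.float()
    return {k: v / len(paths) for k, v in merged.items()}

# Merge the last few stable-phase checkpoints and save the result for evaluation.
merged_state = average_checkpoints(["ckpt_step_100k.pt", "ckpt_step_110k.pt", "ckpt_step_120k.pt"])
torch.save(merged_state, "pma_merged.pt")
```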

  • Merging Strategies and Hyperparameter Analysis:
    • Simple Moving Average (SMA): All checkpoints are weighted equally.
    • Exponential Moving Average (EMA): Recent checkpoints are emphasized using exponentially decaying weights.
    • Weighted Moving Average (WMA): Checkpoints are assigned linearly increasing weights to prioritize later training stages.
    • Experiments show that all three methods improve performance; WMA tends to deliver the best results early in training, but as training stabilizes the differences become minimal. Ablation studies also identify suitable merging intervals (token-consumption gaps) and numbers of merged checkpoints across model sizes; a short sketch of the three weighting schemes follows this list.
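
The weighting schemes can be summarized as follows (illustrative only; the paper's exact normalizations and decay factor may differ):

```python
# Weight vectors for the three merging schemes; checkpoints are ordered oldest to newest.
# The decay factor and normalizations are illustrative, not the paper's exact choices.
import torch

def merging_weights(n, scheme="sma", alpha=0.9):
    if scheme == "sma":      # Simple Moving Average: equal weights
        w = torch.ones(n)
    elif scheme == "ema":    # Exponential Moving Average: emphasize recent checkpoints
        w = torch.tensor([alpha ** (n - 1 - i) for i in range(n)])
    elif scheme == "wma":    # Weighted Moving Average: linearly increasing weights
        w = torch.arange(1, n + 1, dtype=torch.float32)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()

def weighted_merge(state_dicts, weights):
    """Weighted average of a list of state dicts (ordered oldest to newest)."""
    keys = state_dicts[0].keys()
    return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts)) for k in keys}
```
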
  • Downstream Benefits and Training Stability:

Beyond improving pre-training performance, the paper shows that using merged weights to initialize downstream stages, namely continual training (CT) and supervised fine-tuning (SFT), leads to smoother gradient-norm trajectories and more stable training overall. This initialization scheme, termed PMA-init, is also demonstrated as a recovery strategy when a model encounters loss spikes or divergence, avoiding retraining from scratch.
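
As a hedged sketch (placeholder model and file names, not the authors' setup), PMA-init amounts to loading the merged weights before the downstream stage begins:

```python
# Hedged sketch of PMA-init: start a downstream stage (CT or SFT) from the merged
# weights instead of a single checkpoint. The tiny module and file names below are
# placeholders for the real LLM and its checkpoints.
import torch
from torch import nn

model = nn.Linear(1024, 1024)                       # stand-in for the actual network
merged_state = torch.load("pma_merged.pt", map_location="cpu")
model.load_state_dict(merged_state)                 # PMA-init: initialize from the merge

# Training then proceeds as usual; the same merged weights can also replace the
# parameters of a run that hit an unrecoverable loss spike, avoiding a restart from scratch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```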

  • Theoretical Insights:

A detailed mathematical analysis based on a second-order Taylor expansion of the loss function illustrates why merging can lead to lower loss. The paper shows that if the deviations from the optimal parameter set have complementary (negatively correlated) behaviors with respect to the Hessian of the loss function, then averaging these checkpoints can yield a model closer to the optimum.
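
In outline (our notation; the paper's exact formulation may differ), writing each checkpoint as a deviation from a nearby optimum and expanding the loss to second order gives:

```latex
% Sketch of the second-order argument (our notation; assumes the gradient vanishes at \theta^*).
% Checkpoints: \theta_i = \theta^* + \Delta_i,  merged model: \bar{\theta} = \tfrac{1}{N}\sum_i \theta_i,
% H = Hessian of the loss at \theta^*.
\mathcal{L}(\theta) \approx \mathcal{L}(\theta^*) + \tfrac{1}{2}\,(\theta - \theta^*)^\top H\,(\theta - \theta^*)

\mathcal{L}(\bar{\theta}) - \mathcal{L}(\theta^*)
  \approx \frac{1}{2N^2} \sum_{i,j} \Delta_i^\top H\, \Delta_j
  = \frac{1}{N}\underbrace{\frac{1}{2N}\sum_i \Delta_i^\top H\, \Delta_i}_{\text{avg. individual excess loss}}
  + \frac{1}{N^2}\underbrace{\sum_{i<j} \Delta_i^\top H\, \Delta_j}_{\text{cross terms}}
```

When the cross terms are small or negative, i.e., the deviations are negatively correlated under the Hessian, the merged model's excess loss falls below the average excess loss of the individual checkpoints, which matches the intuition described above.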

  • Extensive Empirical Validation:

The authors conducted experiments on both Mixture-of-Experts (MoE) and dense model architectures, with models ranging from hundreds of millions to over 100 billion parameters. Results across numerous open benchmarks (including MMLU, HumanEval, GSM8K, and more) consistently demonstrate that model merging improves performance. In addition, experiments comparing models trained with constant learning rates versus cosine annealing reveal that merging with constant rates can match or even surpass naturally annealed models.

Experimental Findings:

  • Pre-training Improvements:

Merging during the stable phase consistently improves downstream task performance—even for very large models. In some cases, early merged models match the performance of final-stage annealed models, allowing for more efficient validation cycles.

  • Optimal Hyperparameters:

Ablation studies confirm that the merging interval (the token gap between merged checkpoints) should scale with model size, with larger models benefiting from larger intervals, and that incorporating more checkpoints improves performance once training has converged, up to a point.

  • Downstream Initialization (PMA-init):

Using PMA-initialized weights for CT and SFT results in lower initial loss values and more stable gradient norms. In scenarios of catastrophic loss spikes, PMA-init has been effective in resuming and stabilizing the training process.

  • Mechanistic Insights:

Visualization of weight distributions and contour plots of performance metrics (like MMLU scores) provides an intuitive explanation: individual checkpoints explore different regions of the parameter space, and their complementary nature can lead the averaged model to reside in a region with higher performance.

Overall Impact and Practical Implications:

The paper not only offers empirical evidence for the benefits of model merging during pre-training but also provides detailed guidelines for selecting merging strategies and hyperparameters. By using PMA, practitioners can potentially avoid the expensive cosine annealing phase, accelerate model development, and improve the reliability and stability of the training process. This work is particularly valuable for teams working on large-scale LLM pre-training, as it reduces computational costs while maintaining or even enhancing model performance.

In summary, the paper makes a significant contribution by bridging the gap between post-training model merging methods and those applicable during large-scale pre-training. It equips researchers and practitioners with both theoretical insights and practical guidelines to effectively implement model merging, ultimately enabling more efficient and robust development of LLMs.
