Checkpoint Merging via Bayesian Optimization in LLM Pretraining (2403.19390v1)
Abstract: The rapid proliferation of LLMs such as GPT-4 and Gemini underscores the intense demand for resources during their training, which incurs substantial computational and environmental costs. To alleviate this, we propose checkpoint merging during LLM pretraining. The method merges LLM checkpoints that share a training trajectory and searches an extensive space for the best merging weight via Bayesian optimization. Through various experiments, we demonstrate that: (1) the proposed method can augment pretraining, yielding substantial benefits at minimal additional cost; (2) although it requires a held-out dataset, it still generalizes robustly across diverse domains, a pivotal property in pretraining.
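For concreteness, below is a minimal sketch (not the paper's exact pipeline) of the core idea: linearly interpolate two checkpoints from the same training run and search for the interpolation weight that minimizes loss on a held-out set via Gaussian-process Bayesian optimization. The toy linear models, the synthetic held-out data, and the use of scikit-optimize's `gp_minimize` are illustrative assumptions standing in for real LLM checkpoints, a validation corpus, and whichever BO implementation the authors used.

```python
# Minimal sketch of checkpoint merging with a Bayesian-optimized merging weight.
# Assumes PyTorch and scikit-optimize; toy models/data are placeholders for
# LLM checkpoints and a held-out pretraining validation set.
import torch
from skopt import gp_minimize

torch.manual_seed(0)

# Two "checkpoints" from a shared training trajectory (here: toy linear models).
model_a, model_b = torch.nn.Linear(8, 1), torch.nn.Linear(8, 1)
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()

# Held-out data standing in for a validation corpus.
x_val = torch.randn(128, 8)
y_val = torch.randn(128, 1)

def merge_state_dicts(alpha):
    # Element-wise interpolation: alpha * theta_A + (1 - alpha) * theta_B.
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

def held_out_loss(params):
    # Objective for BO: load the merged weights and score them on held-out data.
    alpha = params[0]
    merged = torch.nn.Linear(8, 1)
    merged.load_state_dict(merge_state_dicts(alpha))
    with torch.no_grad():
        return torch.nn.functional.mse_loss(merged(x_val), y_val).item()

# Gaussian-process Bayesian optimization over the single merging weight in [0, 1].
result = gp_minimize(held_out_loss, dimensions=[(0.0, 1.0)], n_calls=15, random_state=0)
print(f"best alpha = {result.x[0]:.3f}, held-out loss = {result.fun:.4f}")
```

In the paper's setting the objective would instead load the interpolated state dict into the pretrained model and measure perplexity (or task loss) on the held-out dataset; the BO loop over the merging weight is otherwise the same shape.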