
CoLLiE: Collaborative Training of Large Language Models in an Efficient Way (2312.00407v1)

Published 1 Dec 2023 in cs.CL

Abstract: LLMs are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of LLMs using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO and AdaLomo. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has proven superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at https://github.com/OpenLMLab/collie.

CoLLiE: Collaborative Training of LLMs in an Efficient Way

The paper introduces CoLLiE, a library designed to make collaborative training of LLMs efficient. As model sizes grow, so do their computational demands, making efficient use of hardware resources paramount. CoLLiE addresses this through 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and an array of optimizers including Lion, Adan, Sophia, LOMO, and AdaLomo.

Key Features and Contributions

  1. 3D Parallelism: CoLLiE combines tensor parallelism (TP), pipeline parallelism (PP), and ZeRO-3 data parallelism. This integrated approach makes it possible to train very large models by partitioning model states and distributing the workload across multiple GPUs.
  2. Parameter-Efficient Fine-Tuning: PEFT methods integrated into CoLLiE, such as LoRA and prompt tuning, train only a small subset of parameters, substantially reducing memory requirements.
  3. Optimizer Integration: The library ships with several optimizers tailored to LLM training that conserve memory and speed up convergence. A notable inclusion is the LOMO optimizer, which minimizes memory usage by retaining no optimizer states.
  4. FlashAttention: CoLLiE integrates FlashAttention to improve computational efficiency during training, significantly boosting throughput.
  5. Modular Design: CoLLiE's architecture is extensible and easy to customize, with a user-friendly configuration interface exposed through the CollieConfig class (a configuration sketch follows this list).
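
To make the configuration-driven design concrete, the sketch below shows how 3D parallelism, a memory-efficient optimizer, and a training run might be wired together around a single configuration object. Only the CollieConfig class name comes from the paper; the specific fields (tp_size, pp_size, dp_size), the Lomo optimizer class, the model wrapper, and the Trainer entry point are assumptions made for illustration and may differ from the library's actual API.

```python
# Hypothetical sketch of a CollieConfig-driven setup; names other than
# CollieConfig are assumptions, not the library's documented API.
from collie import CollieConfig, Trainer        # assumed top-level imports
from collie.models import LlamaForCausalLM      # assumed model wrapper
from collie.optim import Lomo                   # assumed LOMO implementation

config = CollieConfig.from_pretrained("huggyllama/llama-7b")
config.tp_size = 2    # tensor-parallel degree
config.pp_size = 2    # pipeline-parallel degree
config.dp_size = 2    # ZeRO-3 data-parallel degree
config.train_micro_batch_size = 4

model = LlamaForCausalLM.from_pretrained("huggyllama/llama-7b", config=config)
optimizer = Lomo(model, lr=3e-4)  # LOMO: fused update, no optimizer states

trainer = Trainer(model=model, optimizer=optimizer, config=config,
                  train_dataset=train_dataset)  # train_dataset: user-supplied
trainer.train()
```

The intent of such a configuration-object design is that changing parallelism degrees, the optimizer, or a PEFT method touches a few fields rather than the training loop itself.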

Performance Assessment

The numerical results in the paper illustrate CoLLiE's superior training efficiency across various dimensions:

  • Memory Requirements: The paper profiles GPU memory usage and finds substantial reductions when memory-efficient optimizers such as LOMO or PEFT methods are employed, with consumption dropping to approximately 2.1 times the size of the model parameters (a back-of-envelope estimate follows this list).
  • Throughput: Experiments show that CoLLiE achieves significant throughput advantages over prevalent solutions, particularly on hardware with communication bottlenecks, an advantage attributed largely to the combination of TP and PP.
  • Empirical Validation: By instruction-tuning LLaMA-65B with CoLLiE, the authors demonstrate notable improvements on tasks probing factual knowledge and instruction-following ability.
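
As a back-of-envelope illustration of why optimizer choice dominates the memory budget, the short script below estimates the model-state memory of a standard AdamW mixed-precision setup (fp16 weights and gradients plus fp32 master weights, momentum, and variance, roughly 16 bytes per parameter) against a LOMO-style setup that keeps no optimizer states (roughly 4 bytes per parameter, i.e. about twice the fp16 parameter footprint, in the same spirit as the roughly 2.1x figure above). The byte counts are textbook mixed-precision conventions, not numbers taken from the paper.

```python
# Back-of-envelope GPU memory estimate for model states only; activations,
# buffers, and fragmentation are ignored. Byte counts are standard
# mixed-precision conventions, not figures from the CoLLiE paper.

def model_state_gib(num_params: float, bytes_per_param: float) -> float:
    """Return model-state memory in GiB for a given per-parameter byte cost."""
    return num_params * bytes_per_param / 1024**3

ADAMW_BYTES = 2 + 2 + 4 + 4 + 4   # fp16 weights + fp16 grads + fp32 master,
                                  # momentum, and variance = 16 bytes/param
LOMO_BYTES = 2 + 2                # fp16 weights + fp16 grads, no optimizer states

for name, params in [("LLaMA-7B", 7e9), ("LLaMA-65B", 65e9)]:
    print(f"{name}: AdamW ~ {model_state_gib(params, ADAMW_BYTES):.0f} GiB, "
          f"LOMO-style ~ {model_state_gib(params, LOMO_BYTES):.0f} GiB")
```

Under ZeRO-3, TP, or PP these model states are further sharded across devices, which is how configurations like the ones profiled in the paper fit on realistic clusters.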

Implications and Future Work

CoLLiE has broad practical implications for NLP researchers and practitioners. By making the training of large models more efficient, it enables experimentation with larger models in resource-constrained environments. Directions for future work include fine-grained profiling of memory allocation and extending the empirical evaluations to a wider range of model scales and training methodologies.

Conclusion

CoLLiE presents a comprehensive solution to the challenges of training LLMs efficiently. With robust support for 3D parallelism, innovative fine-tuning methods, and a suite of novel optimizers, CoLLiE positions itself as a valuable tool for advancing the capabilities of LLMs in practical and efficient ways. By addressing both scalability and efficiency, CoLLiE opens avenues for significant contributions to the field of AI and machine learning.

References (38)
  1. Colossal-ai: A unified deep learning system for large-scale parallel training. CoRR, abs/2110.14883.
  2. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  3. Symbolic discovery of optimization algorithms. CoRR, abs/2302.06675.
  4. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
  5. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691.
  6. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS.
  7. Lmflow: An extensible toolkit for finetuning and inference of large foundation models. CoRR, abs/2306.12420.
  8. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mac. Intell., 5(3):220–235.
  9. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  10. Alpacafarm: A simulation framework for methods that learn from human feedback. CoRR, abs/2305.14387.
  11. The pile: An 800gb dataset of diverse text for language modeling. CoRR, abs/2101.00027.
  12. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  13. Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR.
  14. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  15. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 103–112.
  16. Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  17. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 3045–3059. Association for Computational Linguistics.
  18. Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4582–4597. Association for Computational Linguistics.
  19. Sophia: A scalable stochastic second-order optimizer for language model pre-training. CoRR, abs/2305.14342.
  20. Adalomo: Low-memory optimization with adaptive learning rate. CoRR, abs/2310.10195.
  21. Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782.
  22. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  23. Pipedream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27-30, 2019, pages 1–15. ACM.
  24. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035.
  25. Instruction tuning with GPT-4. CoRR, abs/2304.03277.
  26. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 20. IEEE/ACM.
  27. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 3505–3506. ACM.
  28. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100.
  29. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053.
  30. Moss: Training conversational language models from synthetic data. https://github.com/OpenLMLab/MOSS.
  31. A comparative study between full-parameter and lora-based fine-tuning on chinese instruction data for instruction following large language model. CoRR, abs/2304.08109.
  32. Challenging big-bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13003–13051. Association for Computational Linguistics.
  33. InternLM Team. 2023. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM.
  34. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  35. How far can camels go? exploring the state of instruction tuning on open resources. CoRR, abs/2306.04751.
  36. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  37. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. CoRR, abs/2208.06677.
  38. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068.
Authors (14)
  1. Kai Lv
  2. Shuo Zhang
  3. Tianle Gu
  4. Shuhao Xing
  5. Jiawei Hong
  6. Keyu Chen
  7. Xiaoran Liu
  8. Yuqing Yang
  9. Honglin Guo
  10. Tengxiao Liu
  11. Yu Sun
  12. Qipeng Guo
  13. Hang Yan
  14. Xipeng Qiu