The Future of Large Language Model Pre-training is Federated (2405.10853v3)

Published 17 May 2024 in cs.LG, cs.AI, and cs.DC

Abstract: Generative pre-trained LLMs have demonstrated impressive performance over a wide range of tasks, thanks to the unprecedented amount of data they have been trained on. As established scaling laws indicate, LLMs' future performance improvement depends on the amount of computing and data sources they can leverage for pre-training. Federated learning (FL) has the potential to unleash the majority of the planet's data and computational resources, which are underutilized by the data-center-focused training methodology of current LLM practice. Our work presents a robust, flexible, reproducible FL approach that enables large-scale collaboration across institutions to train LLMs. We propose a scalable deployment system called Photon to enable the investigation and development of this new training paradigm for LLM pre-training. We show that Photon can be used by organizations interested in collaborating with their private data sources and computational resources for pre-training LLMs with billions of parameters. This paradigm would mobilize more computational and data resources while matching or potentially exceeding centralized performance. We further show the effectiveness of the federated training scales with model size and present our approach for training billion-scale federated LLMs using limited resources. Thus far, we have used Photon to train LLM models to the size of 7B parameters and anticipate larger models being completed in the near future. Finally, we show that LLM training is highly resilient to the classical challenges of federated statistical and hardware heterogeneity. Furthermore, we show that convergence is robust to partial participation, opening the avenue for compute-efficient collaborative training. Photon will help data-rich actors to become the protagonists of LLMs pre-training instead of leaving the stage to compute-rich actors alone.

The Future of LLM Pre-training is Federated

The paper "The Future of LLM Pre-training is Federated" presents a pivotal shift in the paradigm of training LLMs by leveraging federated learning (FL). The authors propose that the most effective means of improving LLM performance is to move away from the current centralized, compute-intensive model training methodology and adopt a federated, collaborative approach. This shift aims to democratize access to LLM training by harnessing the underutilized data and computational resources distributed globally.

Core Contributions

The paper's principal contributions center on the development of a robust, flexible, and reproducible federated learning framework that facilitates training LLMs at a global scale:

  1. Federated Learning for LLMs: The authors present a federated approach to training LLMs, enabling collaborative utilization of data and computational resources across institutions. This method not only matches but potentially exceeds the performance of centralized training methodologies (a minimal round-level sketch follows this list).
  2. Scalability: The paper documents the successful training of LLMs of up to 1.3 billion parameters using the federated approach. This is the first recorded instance of generative pre-training of a billion-scale model within a heterogeneous federated setting.
  3. Communication Efficiency: The federated learning strategy significantly reduces communication overhead compared to traditional centralized methods, making it feasible for institutions with limited computing power and less powerful network infrastructures to participate.
  4. Broad Hardware Inclusivity: The technique accommodates participants with diverse hardware capabilities, ranging from powerful GPUs to standard cloud-based setups with single GPUs.
  5. Empirical Validation: Extensive experiments validate the model's efficacy and performance, demonstrating that larger federated models reach consensus more easily and efficiently compared to smaller ones.
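
This summary does not reproduce Photon's internals, but the overall shape of a communication round can be illustrated with a FedAvg-style sketch in PyTorch. Everything named below is an assumption for illustration: the `clients` objects, their `num_samples` and `train_locally` members, and the round structure are hypothetical stand-ins, not Photon's actual API.

```python
import copy


def federated_round(global_model, clients, local_steps=500):
    """One communication round of FedAvg-style LLM pre-training (sketch).

    Each client trains a copy of the current weights on its private shard
    for many local steps; the server then averages the resulting weights,
    weighted by client data size. Assumes all parameters and buffers are
    floating point, as is typical for transformer language models.
    """
    client_states, sample_counts = [], []
    for client in clients:
        local_model = copy.deepcopy(global_model)       # ship current weights once per round
        client.train_locally(local_model, local_steps)  # many local steps, no per-step sync
        client_states.append(local_model.state_dict())
        sample_counts.append(client.num_samples)

    total = float(sum(sample_counts))
    averaged = {
        key: sum((n / total) * state[key]
                 for state, n in zip(client_states, sample_counts))
        for key in client_states[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```

Because weights cross the network only once per round rather than once per optimizer step, the communication cost is amortized over hundreds of local steps, which is the source of the communication-efficiency claim above.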

Methodological Insights

The research explores several key dimensions of federated training:

  • Data and Model Parallelism: The federated approach leverages both data and model parallelism to distribute the training load across multiple nodes. This distribution reduces the memory load on individual GPUs and aligns with the scalability goals.
  • Local SGD: By employing local stochastic gradient descent (SGD), the federated framework avoids step-level gradient synchronization and reduces the communication burden. The results show that federating the optimization process helps align client models toward a global optimum more effectively (a client-side sketch follows this list).
  • Memory and Computation Management: Techniques such as activation checkpointing and CPU offloading are employed to manage the memory and computational requirements of the training process, making it accessible to a wider range of hardware configurations.
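
The client-side inner loop described above can likewise be sketched in a few lines of PyTorch. The `dataloader` of tokenized batches and the HuggingFace-style `labels`/`loss` interface are assumptions for illustration; CPU offloading is typically handled by the surrounding ZeRO/FSDP-style runtime rather than by user code, so only activation checkpointing appears here.

```python
def local_training(model, optimizer, dataloader, local_steps, device="cuda"):
    """Plain local SGD on one client's private shard (sketch).

    No activations or gradients leave the client; only the updated
    weights are returned to the server at the end of the round.
    """
    model.to(device).train()
    # Activation checkpointing trades recomputation for memory so that
    # billion-parameter blocks fit on a single commodity GPU
    # (HuggingFace-style models expose it via this method).
    if hasattr(model, "gradient_checkpointing_enable"):
        model.gradient_checkpointing_enable()

    data_iter = iter(dataloader)
    for _ in range(local_steps):
        try:
            batch = next(data_iter)
        except StopIteration:              # restart the shard if it is exhausted
            data_iter = iter(dataloader)
            batch = next(data_iter)

        input_ids = batch["input_ids"].to(device)
        loss = model(input_ids=input_ids, labels=input_ids).loss  # causal-LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    return model.state_dict()
```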

Experimental Validation

The authors conducted rigorous experiments using a variant of the C4 dataset, randomly split across eight clients (a partitioning sketch follows the findings below). Key findings include:

  • Larger Models Achieve Better Consensus: The research indicates that federated optimization improves convergence and performance as model size increases. For example, the convergence phase of a 1.3 billion parameter model proceeds much more rapidly than that of a smaller 75 million parameter model.
  • Performance Comparisons: When comparing the federated approach to centralized training, larger federated models demonstrated performance parity with centralized models, proving the feasibility and effectiveness of federated learning for large-scale LLMs.
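
For context on the experimental setup, an IID split of a C4-style corpus across eight clients can be produced with the HuggingFace `datasets` library roughly as follows; the dataset identifier, the 1% slice, and the helper name are illustrative assumptions rather than the paper's exact pipeline.

```python
from datasets import load_dataset

NUM_CLIENTS = 8


def make_client_shards(seed=0):
    """Randomly partition a C4-style corpus into eight IID client shards (sketch)."""
    corpus = load_dataset("allenai/c4", "en", split="train[:1%]")  # small slice for illustration
    corpus = corpus.shuffle(seed=seed)                             # random (IID) assignment
    return [corpus.shard(num_shards=NUM_CLIENTS, index=i) for i in range(NUM_CLIENTS)]
```

Each shard then plays the role of one client's private corpus in the federated rounds sketched earlier.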

Implications and Future Research

The implications of this research are profound:

  • Democratization of LLM Training: By enabling entities with significant data but limited computational resources to participate in LLM training, the federated approach democratizes access to high-quality LLMs.
  • Privacy-Preserving Data Utilization: FL inherently supports privacy-preserving techniques, making it possible to utilize sensitive data without compromising privacy.
  • Scalability and Data Sources: The ability to aggregate diverse data sources can enhance model generalization and reduce biases inherent in models trained on limited data sources.

Future research directions proposed in the paper include further optimizing the federated training framework, scaling up both the population of clients and the size of the models, and exploring the impact of data heterogeneity on model performance. Moreover, fine-tuning the federated models on established benchmark tasks will provide deeper insights into their utility across a broad range of applications.

Conclusion

This paper advances the field of LLM pre-training by introducing and validating a federated learning framework that democratizes access to model training capabilities. This framework leverages the untapped data and computational resources distributed worldwide, proving that a collaborative approach can match and potentially surpass the performance of centralized methodologies. The authors' rigorous empirical validation and thoughtful consideration of future work make a compelling case for federated learning as a sustainable and inclusive path forward in AI development.

Authors (11)
  1. Lorenzo Sani
  2. Alex Iacob
  3. Zeyu Cao
  4. Bill Marino
  5. Yan Gao
  6. Tomas Paulik
  7. Wanru Zhao
  8. William F. Shen
  9. Preslav Aleksandrov
  10. Xinchi Qiu
  11. Nicholas D. Lane