Asynchronous Local-SGD Training for Language Modeling (2401.09135v2)

Published 17 Jan 2024 in cs.LG and cs.CL

Abstract: Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of asynchronous Local-SGD for training LLMs; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.

Introduction to Asynchronous Local-SGD

LLMs have become central to advances in machine learning, particularly in natural language processing. Training them typically spreads the work across many devices taking synchronous update steps, which can be inefficient: every device must wait for communication and for the slowest worker before the next step can begin.

Understanding Local-SGD and Its Asynchronous Variant

Local Stochastic Gradient Descent (Local-SGD) mitigates the communication bottleneck in distributed training by letting each device take several gradient steps locally before synchronizing. Asynchronous Local-SGD goes a step further: each device updates the global model parameters as soon as it finishes its local steps, avoiding the idle time of the synchronous scheme. However, the paper finds that naive implementations of asynchronous Local-SGD converge more slowly per update than their synchronous counterpart, even though the global parameters are updated more frequently. The sketch below illustrates the difference between the two schemes.
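To make the contrast concrete, here is a minimal Python/NumPy sketch on a toy least-squares problem. It is not the authors' implementation; names such as local_sgd_steps, sync_round, and async_rounds are illustrative. The synchronous round waits for every worker and averages their deltas, while the asynchronous loop applies each worker's delta the moment it arrives, so a slow worker's delta may be computed from parameters that are already out of date.

```python
# Toy sketch contrasting synchronous and asynchronous Local-SGD.
# All names (local_sgd_steps, sync_round, async_rounds) are illustrative,
# not the paper's implementation.
import numpy as np

def local_sgd_steps(theta, data, lr=0.1, steps=8):
    """Run several local SGD steps on a toy squared loss and return the delta."""
    local = theta.copy()
    for x, y in data[:steps]:
        grad = 2 * (local @ x - y) * x      # gradient of (x . w - y)^2
        local -= lr * grad
    return local - theta                    # "outer gradient" sent to the server

def sync_round(theta, worker_data):
    """Synchronous Local-SGD: wait for every worker, then average the deltas."""
    deltas = [local_sgd_steps(theta, d) for d in worker_data]
    return theta + np.mean(deltas, axis=0)

def async_rounds(theta, worker_data, finish_order):
    """Asynchronous Local-SGD: apply each delta as soon as a worker finishes.
    A worker's delta is computed from the parameters it last pulled, so by the
    time it is applied the global parameters may already have moved (staleness)."""
    snapshots = {w: theta.copy() for w in range(len(worker_data))}
    for w in finish_order:                  # order in which workers complete
        delta = local_sgd_steps(snapshots[w], worker_data[w])
        theta = theta + delta               # immediate, unaveraged update
        snapshots[w] = theta.copy()         # worker restarts from newest params
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 4
    true_w = rng.normal(size=dim)
    data = [[(x, x @ true_w) for x in rng.normal(size=(32, dim))] for _ in range(3)]
    theta = np.zeros(dim)
    print("sync :", sync_round(theta, data))
    print("async:", async_rounds(theta, data, finish_order=[1, 0, 2, 1]))
```

In the asynchronous loop, worker 1 contributes twice: its second delta starts from newer parameters, while the deltas from workers 0 and 2 were computed against the initial snapshot and land on parameters that have already changed.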

Momentum and Heterogeneity in Asynchronous Training

The paper identifies a key issue in asynchronous Local-SGD: stale gradients interact badly with momentum. A gradient is stale when a worker computes its update against an older copy of the global parameters, which is unavoidable under asynchrony. Server-side momentum, which accelerates training by combining past gradients with the current one, compounds the problem when those incoming gradients are stale. To address this, the researchers propose two techniques: a Delayed Nesterov (DN) momentum update, which buffers incoming worker updates and applies the Nesterov momentum step only periodically, and Dynamic Local Updates (DyLU), which scales each worker's number of local steps to its computation speed. Together, these are designed to stabilize and improve asynchronous Local-SGD for LLMs; a rough sketch of both ideas follows.
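The following is a toy Python/NumPy illustration of the intent of the two fixes, not the paper's exact algorithm: the DelayedNesterovServer applies incoming deltas without momentum and folds them into the Nesterov momentum only every few server updates, and dynamic_local_steps assigns each worker local steps proportional to its measured speed. The class and function names, constants, and update details are assumptions made for illustration.

```python
# Sketch of Delayed Nesterov (DN) and Dynamic Local Updates (DyLU).
# The buffering constants and exact update rule here are simplifying
# assumptions, not the paper's algorithm.
import numpy as np

class DelayedNesterovServer:
    """Apply incoming (possibly stale) deltas without momentum, and refresh
    the Nesterov momentum only every `delay` server updates, so each stale
    delta is not individually accelerated."""
    def __init__(self, theta, lr=1.0, beta=0.9, delay=4):
        self.theta = theta.astype(float).copy()
        self.momentum = np.zeros_like(self.theta)
        self.buffer = np.zeros_like(self.theta)
        self.lr, self.beta, self.delay = lr, beta, delay
        self.step = 0

    def update(self, delta):
        # Apply the incoming delta right away, without any momentum term.
        self.theta += self.lr * delta
        self.buffer += delta
        self.step += 1
        if self.step % self.delay == 0:
            # Fold the averaged buffer into the momentum, and apply only the
            # momentum correction here.
            self.momentum = self.beta * self.momentum + self.buffer / self.delay
            self.theta += self.lr * self.beta * self.momentum
            self.buffer = np.zeros_like(self.theta)
        return self.theta

def dynamic_local_steps(worker_speeds, base_steps=64):
    """Dynamic Local Updates: give each worker a number of local steps
    proportional to its speed, so all workers finish a round at roughly
    the same wall-clock time."""
    fastest = max(worker_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in worker_speeds]

if __name__ == "__main__":
    server = DelayedNesterovServer(np.zeros(4))
    for _ in range(8):
        server.update(np.full(4, 0.1))        # pretend each worker sends delta 0.1
    print(server.theta)
    print(dynamic_local_steps([100, 50, 25]))  # -> [64, 32, 16]
```

The design intuition, as described in the paper, is that buffering decouples the frequency of parameter updates from the frequency of momentum updates, while speed-proportional local steps reduce how stale the slowest workers' contributions become.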

Experimenting with Novel Techniques

Extensive experiments show that DN and DyLU match synchronous Local-SGD in perplexity per update step and significantly surpass it in wall-clock time, evaluated with models of up to 150M parameters on the C4 dataset. The experiments also examine how the techniques cope with heterogeneous device speeds and with varying numbers of workers and model sizes, indicating that the methods are robust and have potential to scale.

Concluding Thoughts

In conclusion, asynchronous Local-SGD presents an attractive alternative for efficiently training LLMs across distributed systems. The paper contributes to this burgeoning domain by addressing key challenges and proposing viable solutions that have been empirically validated. The research opens doors to further enhancements in distributed learning, aiming for greater scalability and reduced training time without compromising the quality of LLMs.

Authors (8)
  1. Bo Liu
  2. Rachita Chhaparia
  3. Arthur Douillard
  4. Satyen Kale
  5. Andrei A. Rusu
  6. Jiajun Shen
  7. Arthur Szlam
  8. Marc'Aurelio Ranzato