
DiLoCo: Distributed Low-Communication Training of Language Models (2311.08105v3)

Published 14 Nov 2023 in cs.LG and cs.CL

Abstract: Large language models (LLMs) have become a critical component in many applications of machine learning. However, standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of LLMs on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum. On the widely used C4 dataset, we show that DiLoCo on 8 workers performs as well as fully synchronous optimization while communicating 500 times less. DiLoCo exhibits great robustness to the data distribution of each worker. It is also robust to resources becoming unavailable over time, and, conversely, it can seamlessly leverage resources that become available during training.

The paper introduces Distributed Low-Communication (DiLoCo) training, a distributed optimization algorithm designed for training LLMs on multiple, poorly connected clusters of devices. DiLoCo is presented as a solution to challenges associated with traditional LLM training, which requires a large number of tightly interconnected accelerators and presents engineering and infrastructure difficulties. DiLoCo is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum.

The authors address the difficulties of co-locating and tightly synchronizing a large number of accelerators by drawing inspiration from Federated Learning. The core idea involves $k$ workers operating on their own "island" of devices, each consuming a partition of the data and updating a model replica. These workers perform local computations and exchange gradients every $H$ steps to synchronize their model replicas. The paper posits that DiLoCo addresses these shortcomings by:

  • Reducing the number of co-located devices required for each worker.
  • Minimizing communication frequency between workers.
  • Accommodating heterogeneous devices across different islands.

The paper details the DiLoCo algorithm, which involves outer and inner optimization processes. The outer optimization (lines 1, 12, and 14 in Algorithm 1) consists of $T$ outer steps in which the outer gradients from each worker are gathered, averaged, and used by an outer optimizer to update a shared copy of the parameters. This shared copy is then re-dispatched to each local worker (line 3). Within each phase, each worker (line 3) performs its own inner optimization (lines 4 to 9) for $H$ steps using an inner optimizer. Each worker samples data from its own shard (line 5) and updates its own local copy of the parameters (line 8). The inner optimization consists of $H \gg 1$ steps. Communication across workers is minimal, occurring once every $H$ inner optimization steps. In total, a worker trains for $N = T \times H$ inner steps.
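
To make the structure concrete, below is a minimal PyTorch-style sketch of this two-level loop. It is an illustrative reconstruction, not the authors' implementation: the toy model, synthetic shards, and hyperparameter values are assumptions, while the overall flow (AdamW inner steps, averaged parameter deltas used as an outer gradient for a Nesterov momentum outer step) follows the algorithm described above.

```python
# Minimal DiLoCo-style training loop (illustrative sketch; toy model and data, not the paper's code).
import copy
import torch
import torch.nn as nn

def diloco_train(global_model, shards, T=4, H=10, outer_lr=0.7,
                 outer_momentum=0.9, inner_lr=1e-3):
    k = len(shards)                                      # number of workers/islands
    # Outer optimizer: Nesterov momentum applied to the shared parameters.
    outer_opt = torch.optim.SGD(global_model.parameters(), lr=outer_lr,
                                momentum=outer_momentum, nesterov=True)
    for _ in range(T):                                   # T outer steps
        deltas = [torch.zeros_like(p) for p in global_model.parameters()]
        for i in range(k):                               # each worker on its own island
            worker = copy.deepcopy(global_model)         # re-dispatch the shared parameters
            inner_opt = torch.optim.AdamW(worker.parameters(), lr=inner_lr)
            for _ in range(H):                           # H local steps, no communication
                x, y = shards[i]()                       # sample a batch from this worker's shard
                loss = nn.functional.mse_loss(worker(x), y)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            # Accumulate this worker's share of the averaged outer gradient.
            for d, p_global, p_local in zip(deltas, global_model.parameters(),
                                            worker.parameters()):
                d += (p_global.data - p_local.data) / k
        # Outer update: treat the averaged parameter delta as a gradient.
        outer_opt.zero_grad()
        for p, d in zip(global_model.parameters(), deltas):
            p.grad = d
        outer_opt.step()
    return global_model

# Tiny usage example: two workers with synthetic regression batches standing in for data shards.
torch.manual_seed(0)
model = nn.Linear(8, 1)
make_shard = lambda: (lambda: (torch.randn(32, 8), torch.randn(32, 1)))
diloco_train(model, shards=[make_shard(), make_shard()])
```

In this sketch the inner optimizer state is re-created at each phase and stays local to each worker; only the averaged parameter delta has to cross the slow network, once per outer step.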

Specifically, the inner optimizer ($\mathrm{InnerOpt}$) is AdamW and the outer optimizer ($\mathrm{OuterOpt}$) is Nesterov momentum. When $\mathrm{OuterOpt}$ is SGD, DiLoCo is equivalent to classical Federated Averaging. If the total number of outer optimization steps $T$ is further set to 1, then DiLoCo reduces to "souping". Finally, if the number of inner optimization steps $H$ is set to 1 and $\mathrm{InnerOpt}$ is SGD, DiLoCo is equivalent to large-batch training with data parallelism.
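
In the paper's notation, the outer gradient at outer step $t$ is the shared parameters minus the average of the workers' locally updated replicas, $\Delta^{(t)} = \theta^{(t-1)} - \frac{1}{k}\sum_{i=1}^{k}\theta_i^{(t)}$. An SGD outer step with learning rate 1 then yields $\theta^{(t)} = \theta^{(t-1)} - \Delta^{(t)} = \frac{1}{k}\sum_{i=1}^{k}\theta_i^{(t)}$, i.e., plain parameter averaging as in Federated Averaging, whereas Nesterov momentum additionally accumulates these averaged deltas across outer steps.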

The paper highlights that DiLoCo can be interpreted as a data parallelism method requiring very little communication, scaling to workers that are poorly connected, such as those in distant geographic regions.

The paper presents an empirical validation of DiLoCo on the C4 dataset. Three model sizes were considered, all decoder-only transformers adapted from the Chinchilla architecture. Experiments were conducted in both i.i.d. and non-i.i.d. settings. By default, training experiments start from a transformer LLM pretrained for 24,000 steps on the same training set. A sequence length of 1,024 tokens and a batch size of 512 were used.

The performance of DiLoCo (with $k=8$ replicas in the non-i.i.d. data setting) was evaluated with each worker performing $T=128$ outer steps of $H=500$ inner steps each (64,000 inner steps in total), starting from a model $\theta^{(0)}$ pretrained for 24,000 steps. This setup was compared against four baselines:

  1. A model trained from scratch for 88,000 steps.
  2. A model pretrained for 24,000 steps and finetuned for an additional 64,000 steps.
  3. A model pretrained for 24,000 steps and finetuned with an 8× larger batch size.
  4. A model trained with the standard batch size for 8× the number of updates.

The trade-offs between these baselines and DiLoCo were compared with respect to communication cost, training time, and the amount of compute used. The results indicated that DiLoCo does not increase training time, communicates 500× less than the second baseline, and achieves better generalization performance.
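
The communication ratio follows directly from the synchronization schedule; below is a small back-of-the-envelope check, under the illustrative simplification of counting one full parameter exchange per synchronization:

```python
# Rough communication-round arithmetic for the setup above (illustrative, not from the paper's code).
total_inner_steps = 64_000
H = 500                                              # inner steps between synchronizations
sync_rounds_fully_synchronous = total_inner_steps    # one gradient exchange per step
sync_rounds_diloco = total_inner_steps // H          # one outer-gradient exchange per H steps
print(sync_rounds_diloco)                                    # 128, i.e. T outer steps
print(sync_rounds_fully_synchronous // sync_rounds_diloco)   # 500x fewer communication rounds
```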

Extensive ablations were performed to understand DiLoCo's capabilities and limitations.

  • Number of Pretraining Steps: The impact of the number of pretraining steps on final generalization performance was examined in the non-i.i.d. data regime. Varying the number of pretraining steps showed that starting DiLoCo from checkpoints pretrained for fewer than 24k steps achieves a similar final perplexity, demonstrating the robustness of the approach. Performance was not degraded even when starting from a randomly initialized network.
  • Communication Frequency: The communication frequency was varied for a 150M transformer in the non-i.i.d. data regime, from $H=50$ steps to $H=2000$ steps. More frequent communication generally improved generalization performance. Communicating more frequently than $H=500$ steps led to diminishing returns, with only a mild performance degradation up to $H=1000$ steps. Based on these considerations, $H=500$ was chosen as a trade-off between generalization performance and communication cost.
  • i.i.d. vs non-i.i.d. data regimes: The effect of different data distributions on the convergence of DiLoCo was assessed. The non-i.i.d. setting was created by clustering the entire training set using $k$-Means on the pretrained model's last-layer features (a sketch of this sharding step appears after this list). DiLoCo with $k=8$ workers/shards was compared in non-i.i.d. and i.i.d. settings. Despite faster early convergence in the i.i.d. setting, the final generalization performance was comparable, demonstrating DiLoCo's robustness.
  • Number of replicas: The impact of the number of replicas/clusters was investigated. Increasing the number of replicas improved generalization performance, but with diminishing returns beyond 8 workers. This applied to both i.i.d. and non-i.i.d. settings.
  • Model size: Models of 60, 150, and 400 million parameters were trained in the non-i.i.d. data regime, with all workers starting from a model pretrained for 24,000 steps. A monotonic improvement in performance was observed as the model size increased.
  • Outer Optimizers: Various outer optimizers were tested, including SGD, Adam, and Nesterov momentum. The Nesterov momentum optimizer performed best. An outer learning rate of 0.7 with outer momentum of 0.9 was found to be very robust.
  • Adaptive compute pool: The performance of DiLoCo was explored when the amount of compute varied throughout training, simulating scenarios with preemptible machines or collaborative systems. The amount of compute was varied by changing the number of replicas used in an i.i.d. setting. The determining factor for the model's generalization ability was the total amount of compute given to DiLoCo, with robustness to how the budget was spread over time.
  • Asynchronous Communication: The inability to communicate, simulating worker reboots or network issues, was modeled by randomly dropping outer gradients with varying probabilities. Higher drop probabilities resulted in more unstable learning with transient spikes in perplexity. However, even with a 50% drop probability in the non-i.i.d. setting, the degradation of perplexity was only 2.1%.
  • Accelerating a single worker: DiLoCo applied to a single replica/cluster ($k=1$ but $H \gg 1$) improved both convergence speed and final generalization performance at no communication cost. Every $H=500$ inner steps, the single outer gradient was computed and the parameters were updated locally using the outer optimizer.
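
Below is a hedged sketch of how the non-i.i.d. shards referenced above could be built with $k$-Means. The feature matrix, function name, and use of scikit-learn are illustrative assumptions; the paper only specifies that the training set is clustered on the pretrained model's last-layer features.

```python
# Assign each training document to one of k shards by clustering document features (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def build_noniid_shards(doc_features: np.ndarray, k: int = 8, seed: int = 0):
    """doc_features: (num_docs, dim) array, e.g. last-layer features from a pretrained model."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(doc_features)
    return [np.where(labels == c)[0] for c in range(k)]   # document indices for each worker/shard

# Toy usage with random features standing in for real model activations.
rng = np.random.default_rng(0)
shards = build_noniid_shards(rng.normal(size=(1000, 64)), k=8)
print([len(s) for s in shards])                           # shard sizes are typically unbalanced
```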

The paper also discusses related work in distributed learning, specifically local SGD and federated learning, and linear mode connectivity. It contrasts DiLoCo with existing approaches, highlighting its unique combination of techniques and its ability to scale to larger models and more diverse settings.

The paper concludes by outlining limitations of the work and potential avenues for future research. These include:

  • Evaluating DiLoCo on other tasks and architectures.
  • Scaling DiLoCo to models with billions of parameters.
  • Extending DiLoCo to asynchronous settings with heterogeneous workers.
  • Improving the algorithm to better leverage additional compute.
  • Balancing wall-clock time efficiency with compute and data efficiency.
Authors (9)
  1. Arthur Douillard
  2. Qixuan Feng
  3. Andrei A. Rusu
  4. Rachita Chhaparia
  5. Yani Donchev
  6. Adhiguna Kuncoro
  7. Marc'Aurelio Ranzato
  8. Arthur Szlam
  9. Jiajun Shen
Citations (16)