The paper introduces Distributed Low-Communication (DiLoCo) training, a distributed optimization algorithm designed for training LLMs on multiple, poorly connected clusters of devices. DiLoCo is presented as a solution to challenges associated with traditional LLM training, which requires a large number of tightly interconnected accelerators and presents engineering and infrastructure difficulties. DiLoCo is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum.
The authors address the difficulties of co-locating and tightly synchronizing a large number of accelerators by drawing inspiration from Federated Learning. The core idea involves workers operating on their own "island" of devices, each consuming a partition of the data and updating a model replica. These workers perform local computations independently and exchange (outer) gradients only once every $H$ steps to synchronize their model replicas. The paper posits that DiLoCo addresses these shortcomings by:
- Reducing the number of co-located devices required for each worker.
- Minimizing communication frequency between workers.
- Accommodating heterogeneous devices across different islands.
The paper details the DiLoCo algorithm, which combines an outer and an inner optimization process. The outer optimization (lines 1, 12, and 14 in Algorithm 1) consists of $T$ outer steps in which the outer gradients from all workers are gathered, averaged, and used by an outer optimizer to update a shared copy of the parameters. This shared copy is then re-dispatched to each local worker (line 3). Within each phase, each worker (line 3) performs its own inner optimization (lines 4 to 9) for $H$ steps using an inner optimizer, sampling data from its own shard (line 5) and updating its own local copy of the parameters (line 8). Communication across workers is therefore minimal, occurring only once every $H$ inner optimization steps. In total, a worker trains for $T \times H$ inner steps.
Specifically, the inner optimizer (InnerOpt) is AdamW and the outer optimizer (OuterOpt) is Nesterov momentum. When OuterOpt is SGD, DiLoCo is equivalent to classical Federated Averaging. If the total number of outer optimization steps $T$ is further set to 1, DiLoCo reduces to "souping". Finally, if the number of inner optimization steps $H$ is set to 1 and InnerOpt is SGD, DiLoCo is equivalent to large-batch training with data parallelism.
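To make the two-level structure concrete, the following is a minimal PyTorch-style sketch of the outer/inner loop. It assumes a generic supervised loss, per-worker data iterators, and an illustrative inner learning rate; only the outer learning rate ($0.7$) and outer momentum ($0.9$) correspond to values reported in the paper, and the code is a sketch rather than the authors' implementation.

```python
import copy
import torch
import torch.nn as nn

def diloco_train(global_model: nn.Module, shards, outer_steps: int, inner_steps: int):
    """shards: one data iterator per worker, each yielding (inputs, targets) batches."""
    # Outer optimizer (OuterOpt): Nesterov momentum applied to the averaged outer gradient.
    outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                                momentum=0.9, nesterov=True)
    for _ in range(outer_steps):                      # T outer steps
        outer_grads = [torch.zeros_like(p) for p in global_model.parameters()]
        for shard in shards:                          # each worker runs on its own island of devices
            # Each worker starts the phase from the shared parameters.
            local_model = copy.deepcopy(global_model)
            # Inner optimizer (InnerOpt): AdamW; this learning rate is an assumed placeholder.
            inner_opt = torch.optim.AdamW(local_model.parameters(), lr=1e-3)
            for _ in range(inner_steps):              # H inner steps, no cross-worker communication
                inputs, targets = next(shard)
                loss = nn.functional.cross_entropy(local_model(inputs), targets)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            # This worker's outer gradient: shared parameters minus its local parameters.
            with torch.no_grad():
                for g, p_shared, p_local in zip(outer_grads, global_model.parameters(),
                                                local_model.parameters()):
                    g += (p_shared - p_local) / len(shards)   # running average over workers
        # One outer step on the shared copy using the averaged outer gradient.
        for p, g in zip(global_model.parameters(), outer_grads):
            p.grad = g
        outer_opt.step()
        outer_opt.zero_grad()
    return global_model
```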
The paper highlights that DiLoCo can be interpreted as a data parallelism method requiring very little communication, scaling to workers that are poorly connected, such as those in distant geographic regions.
The paper presents an empirical validation of DiLoCo on the C4 dataset. Three model sizes were considered, all decoder-only transformers adapted from the Chinchilla architecture. Experiments were conducted in both i.i.d. and non-i.i.d. settings. By default, training experiments start from a transformer LLM that has already been pretrained on the same training set; a fixed sequence length and a batch size of $512$ were used.
The performance of DiLoCo (with 8 replicas in the non-i.i.d. data setting) was evaluated with each worker performing $T \times H$ inner steps in total, starting from the pretrained model. This setup was compared against four baselines:
- A model trained from scratch for the same total number of steps.
- A model pretrained and then finetuned for the same number of additional steps with the standard batch size.
- A model pretrained and then finetuned with a larger batch size.
- A model trained with the standard batch size but for a larger number of updates.
The trade-offs between these baselines and DiLoCo were compared with respect to communication cost, training time, and the amount of compute used. The results indicated that DiLoCo does not increase training time, communicates less than the second baseline, and achieves better generalization performance.
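As a rough illustration of the communication trade-off, the back-of-the-envelope sketch below counts synchronization rounds under assumed values ($H = 500$, consistent with the paper's headline of communicating $500\times$ less, and a hypothetical per-worker step budget); it is illustrative accounting, not figures from the paper.

```python
# Hypothetical step budget per worker; only the ratio matters.
steps_per_worker = 100_000
H = 500  # inner steps between synchronizations in DiLoCo

# Fully synchronous data parallelism all-reduces gradients at every step,
# whereas DiLoCo only all-reduces outer gradients once every H steps.
sync_rounds_data_parallel = steps_per_worker
sync_rounds_diloco = steps_per_worker // H

print(sync_rounds_data_parallel // sync_rounds_diloco)  # -> 500x fewer synchronizations
```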
Extensive ablations were performed to understand DiLoCo's capabilities and limitations.
- Number of Pretraining Steps: The impact of the number of pretraining steps on final generalization performance in the non-i.i.d. data regime was examined. Varying this number showed that starting DiLoCo from fewer than 24k pretraining steps achieves a similar final perplexity, demonstrating the robustness of the approach; performance was not degraded even when starting from a randomly initialized network.
- Communication Frequency: The communication frequency $H$ was varied over a wide range for a 150M-parameter transformer in the non-i.i.d. data regime. More frequent communication generally improved generalization performance, but communicating more often than a certain period yielded diminishing returns, while communicating considerably less often caused only a mild performance degradation. Based on these considerations, $H = 500$ was chosen as a trade-off between generalization performance and communication cost.
- i.i.d. vs non-i.i.d. data regimes: The effect of different data distributions on the convergence of DiLoCo was assessed. The non-i.i.d. setting was created by clustering the entire training set with k-Means on the pretrained model's last-layer features (a minimal sketch of this sharding procedure is given after this list). DiLoCo with the same number of workers/shards was compared in the non-i.i.d. and i.i.d. settings. Despite faster early convergence in the i.i.d. setting, the final generalization performance was comparable, demonstrating DiLoCo's robustness.
- Number of replicas: The impact of the number of replicas/clusters was investigated. Increasing the number of replicas improved generalization performance, but with diminishing returns beyond 8 workers. This applied to both i.i.d. and non-i.i.d. settings.
- Model size: Models with 60, 150, and 400 million parameters were trained in the non-i.i.d. data regime, with all workers starting from the same pretrained model. A monotonic improvement in performance was observed as the model size increased.
- Outer Optimizers: Various outer optimizers were tested, including SGD, Adam, and Nesterov momentum. The Nesterov optimizer performed best, and the setting with an outer learning rate of $0.7$ and outer momentum of $0.9$ was found to be very robust.
- Adaptive compute pool: The performance of DiLoCo was explored when the amount of compute varied throughout training, simulating scenarios with preemptible machines or collaborative systems. The amount of compute was varied by changing the number of replicas used in an i.i.d. setting. The determining factor for the model's generalization ability was the total amount of compute given to DiLoCo, with robustness to how the budget was spread over time.
- Asynchronous Communication: The inability to communicate, simulating worker reboots or network issues, was modeled by randomly dropping outer gradients with varying probabilities. Higher drop probabilities resulted in more unstable learning with transient spikes in perplexity. However, even with a 50% drop probability in the non-i.i.d. setting, the degradation of perplexity was only 2.1%.
- Accelerating a single worker: DiLoCo applied to a single replica/cluster (one worker, but still $H > 1$ inner steps between outer updates) improved both convergence speed and final generalization performance at zero communication cost. Every $H$ inner steps, the single outer gradient was computed and the parameters were updated locally using the outer optimizer.
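As referenced in the i.i.d. vs non-i.i.d. item above, the following is a minimal sketch of how such a non-i.i.d. split could be constructed by clustering document features with k-Means. The feature matrix, feature dimension, and the choice of 8 shards are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_noniid_shards(features: np.ndarray, num_shards: int = 8):
    """features: (num_documents, d) array of last-layer features from a pretrained model.

    Returns one array of document indices per shard/worker."""
    labels = KMeans(n_clusters=num_shards, n_init=10, random_state=0).fit_predict(features)
    return [np.flatnonzero(labels == shard) for shard in range(num_shards)]

# Example with random features standing in for real pretrained-model activations.
shards = make_noniid_shards(np.random.randn(10_000, 256), num_shards=8)
```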
The paper also discusses related work in distributed learning, specifically local SGD and federated learning, and linear mode connectivity. It contrasts DiLoCo with existing approaches, highlighting its unique combination of techniques and its ability to scale to larger models and more diverse settings.
The paper concludes by outlining the limitations of the work and potential avenues for future research. These include:
- Evaluating DiLoCo on other tasks and architectures.
- Scaling DiLoCo to models with billions of parameters.
- Extending DiLoCo to asynchronous settings with heterogeneous workers.
- Improving the algorithm to better leverage additional compute.
- Balancing wall-clock time efficiency with compute and data efficiency.