Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch (2501.18512v1)

Published 30 Jan 2025 in cs.CL

Abstract: Training of LLMs is typically distributed across a large number of accelerators to reduce training time. Since internal states and parameter gradients need to be exchanged at each and every single gradient step, all devices need to be co-located using low-latency high-bandwidth communication links to support the required high volume of exchanged bits. Recently, distributed algorithms like DiLoCo have relaxed such co-location constraint: accelerators can be grouped into ``workers'', where synchronizations between workers only occur infrequently. This in turn means that workers can afford being connected by lower bandwidth communication links without affecting learning quality. However, in these methods, communication across workers still requires the same peak bandwidth as before, as the synchronizations require all parameters to be exchanged across all workers. In this paper, we improve DiLoCo in three ways. First, we synchronize only subsets of parameters in sequence, rather than all at once, which greatly reduces peak bandwidth. Second, we allow workers to continue training while synchronizing, which decreases wall clock time. Third, we quantize the data exchanged by workers, which further reduces bandwidth across workers. By properly combining these modifications, we show experimentally that we can distribute training of billion-scale parameters and reach similar quality as before, but reducing required bandwidth by two orders of magnitude.

PDF Abstract

The paper introduces Streaming DiLoCo, an enhanced distributed training algorithm for LLM that builds upon the original DiLoCo algorithm. Streaming DiLoCo addresses the limitations of conventional data-parallel training and DiLoCo by reducing peak bandwidth requirements, mitigating worker blocking, and decreasing the total amount of exchanged bits without compromising learning efficiency. The key idea is to synchronize only subsets of parameters in sequence, allow workers to continue training while synchronizing, and quantize the data exchanged by workers.

The authors identify the following three main contributions:

A parameter synchronization scheme where subsets of parameters are synchronized on a schedule, rather than synchronizing all parameters at once. This reduces the peak bandwidth required during synchronization.
A method to overlap worker computation with communication of synchronizations, which increases the tolerated latency of communication.
A low-precision communication strategy that compresses outer gradients to four bits per parameter without a significant loss of performance, thereby reducing the total amount of exchanged bits.

The algorithm, FedOpt, forms the basis for DiLoCo. FedOpt involves $M$ local replicas performing $H$ steps of inner independent optimizations on a subset of data. Every $H$ steps, each replica computes an outer gradient $\Delta_m^t = \theta_m^{(t-H)} - \theta_m^{(t)}$ which represents the change in parameter space, and communicates this to all other replicas. The communication results in each worker obtaining $\Delta^t = \nicefrac{1}{M} \sum_{m=1}^M \Delta^t_m$. This outer gradient is then applied to a set of outer parameters, $\theta_m^{(t-H)}$ , using an outer optimizer.

DiLoCo is an instantiation of FedOpt where the inner optimizer is Adam and the outer optimizer is SGD with Nesterov momentum.

Streaming DiLoCo shares a fragment $p$ of the outer gradient, $\Delta_{m,p}^{(t)}$ , more frequently. The network is partitioned into $P$ fragments made of several transformer blocks. The authors paper two fragment patterns:

Sequential, where each fragment comprises consecutive transformer blocks.
Strided, where each fragment is composed of interleaved transformer blocks.

In Streaming DiLoCo's inner optimization (lines 3-5 of Algorithm 2), the outer optimization (line 12) is performed per fragment. If fragment $p$ satisfies $t+t_p \mod H = 0$ where $t_p$ is a time offset fragment-dependent, then it is synchronized. Each fragment will always do $H$ steps before being synchronized. While Streaming DiLoCo communicates more often than DiLoCo given an equal $H$ , the peak communication is reduced by a factor of $\nicefrac{|p|}{L}$ with $|p|$ representing the size of a fragment in layers and $L$ representing the total number of layers.

To maximize the time spent on computation versus communication, the authors propose to overlap the communication of the outer gradient fragment with the inner optimization computation. This overlapping is done with a strictly positive $\tau$ . At the beginning of outer step $t+1$ , a new round of optimization is started rather than waiting for the communication of the fragment. After $\tau-1$ inner steps, the algorithm block-waits for the exchanged fragment, applies the outer optimizer on the previously synchronized fragment ( $\theta^{(t-\tau-H)}_{m,p^{(t-\tau)}}$ ), and merges it with the currently optimized fragment using a mixing factor $\alpha$ . With $\alpha=1$ , there is no communication between replicas. With $\alpha=0$ , any updates done in the first $\tau$ steps on fragment $p$ are discarded. With $\alpha=0.5$ , a uniform average between the local fragment parameters and the globally shared parameters is performed.

To reduce the total amount of bits exchanged, the authors use lower-precision in the outer gradients exchanged by workers, using up to 4 bits (1 sign bit, 3 exponent bits, and 0 mantissa bit) which they call E3M0. Although the outer gradients are exchanged using lower precision, the accumulation is performed in FP32 for stability.

The memory overhead of the Data-Parallel baseline is the parameters ( $1\times$ ) + Adam state ( $2\times$ ). DiLoCo's memory overhead is the parameters ( $1\times$ ), the Adam state ( $2\times$ ), the outer global parameters ( $1\times$ ), and the outer Nesterov state ( $1\times$ ).

A simulator was constructed to estimate the compute utilization of each method. This simulation is a DAG with four different types of nodes: forward, backward w.r.t activations and parameters, and outer gradients reduction. Each node represents a single layer. Therefore, the total number of nodes, for a single step, is $4 \times L - 1$ (because the backward w.r.t activations of the first layer is not needed). The simulation results suggest that Streaming DiLoCo improves the compute utilization of DiLoCo, and only overlapping communication with computation can reach nearly full compute utilization. Further, the required bandwidth can decrease as the model scale increases when overlapping communication with computation because a longer compute step time (forward {paper_content} backward) will provide more time to perform the synchronization across workers.

Scaling experiments were performed on C4, with models ranging from 35 million parameters to 4 billion parameters, all with a sequence length of $1{,}024$ . Data-Parallel, DiLoCo with $H=30$ inner steps, and Streaming DiLoCo with $H=30$ performed similarly. Streaming DiLoCo with more inner steps $H=100$ had slightly worse initial performance, but the loss improved proportionally better as the model scaled.

The authors compared a Data-Parallel baseline versus Streaming DiLoCo with overlapped FP4 communication on the Dolma dataset with a 1 billion parameter model and with a token budget of 25, 100, and 250 billion tokens using a sequence length of $2{,}048$ . Both methods performed similarly with respect to loss and accuracy on downstream tasks. For Data-Parallel, the amount of bits exchanged between non-colocated devices over the course of training was $400\times$ higher, and the peak bandwidth was reduced by $\frac{\text{num layers = } 24}{\text{fragment size = } 3} = 8\times$ .

Ablations were performed on a model of size 500 million parameters using the C4 dataset with a token budget of 11 billion. Based on ablation experiments, a fixed fragment size of 3 layers was chosen. A strided pattern was chosen because ML performance was slightly better for the fragment size considered, deeper networks with a small fragment size should benefit more from striding by spreading out up-to-date synchronized layers across the full depth of the network, and it slightly improves compute utilization. With respect to overlapping, the degradation in evaluation loss was negligible up to an overlap of 10 inner steps ( $< 0.2\%$ ).

The authors compared Streaming DiLoCo with and without the frozen pattern proposed by FedPart, reaching respectively on the C4 eval loss $3.2145$ and $2.6749$. Freezing the $18-3=15$ layers that won't be synchronized at the given round therefore results in a 20% increase of the evaluation loss.

The authors ablated two ways of compressing the outer gradients: either by setting to zero some values, or by lowering the precision. Interestingly, lowering the precision, from float32 to float4 did not affect the performance, while setting some values to zero was significantly worse.

The authors conclude that Streaming DiLoCo achieves similar ML performance as classical Data-Parallel training, while using $400\times$ less bandwidth, reducing the peak bandwidth, and allowing communication to have an ideal non-zero latency by overlapping it with computation.