Decoupled DiLoCo: Asynchronous Distributed Training That Refuses to Fail
This presentation explores Decoupled DiLoCo, a revolutionary approach to large-scale language model pre-training that abandons global synchronization in favor of asynchronous, fault-tolerant learning. By splitting the cluster into independent learners that synchronize through a minimum-quorum protocol, the system sustains 88% goodput under extreme hardware failure rates, far outperforming traditional elastic data-parallel training, while showing no measurable degradation in model quality. The talk reveals how this architecture unlocks resilient, bandwidth-efficient training across heterogeneous and geo-distributed compute resources.
When a single chip fails in traditional large language model training, the entire cluster grinds to a halt. Decoupled DiLoCo changes that equation completely, confining failures to isolated learners while the rest of the system keeps training.
The authors decompose the training cluster into independent learners that execute local optimization without any cross-learner coordination. Each learner periodically synchronizes parameter fragments with a central syncer using minimum-quorum logic, so a synchronization round can complete with only a subset of learners present.
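A minimal sketch of the learner side helps fix ideas. Everything here is illustrative rather than the paper's implementation: the name `local_optimize`, the plain-SGD inner step, and returning a parameter delta as the "fragment" are assumptions for exposition.

```python
import numpy as np

def local_optimize(params: np.ndarray, grad_fn, lr: float = 0.1,
                   local_steps: int = 50) -> np.ndarray:
    """Run `local_steps` of inner optimization with no cross-learner
    communication, then return the parameter delta for the syncer."""
    start = params.copy()
    for _ in range(local_steps):
        # Plain SGD stands in for whatever inner optimizer the learner runs.
        params = params - lr * grad_fn(params)
    return params - start  # the fragment shipped to the syncer

# Toy usage: gradient of 0.5 * ||p||^2 is p itself.
delta = local_optimize(np.ones(4), grad_fn=lambda p: p)
```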
Three core mechanisms define the protocol: a minimum quorum determines how many learners must report before a round can close, an adaptive grace window lets the syncer wait briefly for additional updates without blocking progress, and token-weighted averaging compensates for speed differences by weighting each learner's contribution by the tokens it processed.
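The interplay of the three mechanisms is easiest to see in code. The sketch below is a hypothetical illustration, not the paper's syncer: `close_round`, `token_weighted_average`, the in-process queue transport, and the `(tokens, delta)` tuple format are all assumptions.

```python
import queue
import time

def close_round(updates_q: queue.Queue, num_learners: int,
                min_quorum: int, grace_window: float) -> list:
    """Block until `min_quorum` learner updates arrive, then accept
    stragglers for at most `grace_window` seconds before closing."""
    updates = []
    while len(updates) < min_quorum:
        updates.append(updates_q.get())  # wait as long as needed for quorum
    deadline = time.monotonic() + grace_window
    while len(updates) < num_learners:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            updates.append(updates_q.get(timeout=remaining))
        except queue.Empty:
            break
    return updates

def token_weighted_average(updates):
    """Merge (tokens, delta) pairs, weighting each learner's delta by the
    number of tokens it processed during the round."""
    total_tokens = sum(tokens for tokens, _ in updates)
    return sum((tokens / total_tokens) * delta for tokens, delta in updates)
```

Because the first loop blocks only until quorum while the second is bounded by the deadline, a stalled learner can delay a round by at most the grace window.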
Under extreme simulated fault rates, Decoupled DiLoCo sustains 88 percent goodput compared to 58 percent for elastic data-parallel training. Remarkably, there is no statistically significant degradation in model quality on text and vision tasks, even under continuous hardware interruptions.
Heterogeneity experiments reveal full compute utilization regardless of learner speed variability. With a quorum of 1 and adaptive grace windows, the protocol achieves maximum availability and sidesteps the slowest-chip bottleneck entirely.
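Continuing the hypothetical sketch above, a quorum of 1 means the round closes around the fastest learner, with the grace window absorbing whatever else arrives in time. In this contrived example (reusing `close_round` and `token_weighted_average`, with made-up speeds and token counts), a failed third learner simply never reports, and the round completes anyway.

```python
import threading
import queue
import numpy as np

q: queue.Queue = queue.Queue()
# Two learners of different speeds report; a third has failed and never will.
threading.Timer(0.00, q.put, args=[(900, np.ones(4))]).start()  # fast learner
threading.Timer(0.05, q.put, args=[(400, np.ones(4))]).start()  # slow learner

updates = close_round(q, num_learners=3, min_quorum=1, grace_window=0.2)
merged = token_weighted_average(updates)
print(f"merged {len(updates)} of 3 learner updates")  # 2: the round never stalls
```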
By prioritizing availability and partition tolerance over strict consistency, Decoupled DiLoCo enables robust pre-training across geo-distributed, heterogeneous, and even space-based clusters. Explore the full paper and create your own video at EmergentMind.com.