Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices (2103.03239v4)

Published 4 Mar 2021 in cs.LG, cs.DC, and math.OC

Abstract: Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can utilize hundreds of computers via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce - an iterative averaging protocol that exponentially converges to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show 1.3x speedup for ResNet-50 training on ImageNet compared to competitive gossip-based strategies and 1.5x speedup when training ALBERT-large from scratch using preemptible compute nodes.

Authors (4)
  1. Max Ryabinin (29 papers)
  2. Eduard Gorbunov (65 papers)
  3. Vsevolod Plokhotnyuk (2 papers)
  4. Gennady Pekhimenko (52 papers)
Citations (26)

Summary

Moshpit SGD: Enhanced Decentralized Training

The paper "Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices" presents a novel decentralized averaging protocol, termed Moshpit All-Reduce, optimized for distributed optimization in environments with constrained and unreliable communication networks. This approach addresses the limitations inherent to traditional distributed training methods, which are frequently contingent upon specialized, high-speed networking hardware that is not feasible for several applications, such as federated learning and cloud-based distributed training on preemptible compute nodes.

Core Contributions

The paper makes three main contributions:

  1. Moshpit All-Reduce Protocol: A decentralized averaging protocol that enables large-scale neural network training on devices with unreliable communication. The protocol converges exponentially fast to the global average regardless of network topology, a notable property given the dynamic membership of decentralized networks (see the sketch after this list).
  2. Moshpit SGD Algorithm: Building on Moshpit All-Reduce, the authors propose Moshpit SGD, a distributed optimization algorithm that matches the convergence rates of Centralized Local SGD under realistic assumptions, including node failures and dynamic participation.
  3. Empirical Validation: Experiments on ResNet-50 (ImageNet) training and ALBERT-large pretraining show that Moshpit All-Reduce outperforms existing decentralized strategies, with reported speedups of 1.3x over gossip-based baselines and 1.5x when training on preemptible compute nodes.
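
To make the averaging scheme concrete, the sketch below illustrates the core idea behind Moshpit All-Reduce under simplifying assumptions: peers are indexed by coordinates in a d-dimensional grid, and in each round every group of peers sharing all but one coordinate averages its parameters. The averaging is simulated in-process with NumPy rather than over a real network, and the function and variable names are illustrative, not the authors' API.

    import numpy as np

    def moshpit_average(params, grid_shape):
        """Simulate Moshpit-style iterative group averaging.

        params     : array of shape (num_peers, dim) holding one parameter
                     vector per peer; num_peers must equal prod(grid_shape).
        grid_shape : tuple (m_1, ..., m_d) arranging peers on a d-dim grid.

        In round k, peers that share every grid coordinate except the k-th
        form a group and replace their vectors with the group mean. On a
        full grid, d rounds yield the exact global average at every peer.
        """
        num_peers, dim = params.shape
        assert num_peers == np.prod(grid_shape)
        # View the peers as a d-dimensional grid of parameter vectors.
        grid = params.reshape(*grid_shape, dim).astype(float)
        for axis in range(len(grid_shape)):
            # Average within each group along the current axis and
            # distribute the result back to every group member.
            group_mean = grid.mean(axis=axis, keepdims=True)
            grid = np.broadcast_to(group_mean, grid.shape).copy()
        return grid.reshape(num_peers, dim)

    # Example: 9 peers arranged on a 3x3 grid, 4-dimensional parameters.
    rng = np.random.default_rng(0)
    peers = rng.normal(size=(9, 4))
    averaged = moshpit_average(peers, (3, 3))
    # After 2 rounds every peer holds the global mean.
    assert np.allclose(averaged, peers.mean(axis=0))

In Moshpit SGD, each peer would run local SGD steps and periodically replace its parameters with such a group average; the real protocol additionally tolerates peers that drop out mid-round, which this in-process sketch does not model.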

Theoretical Insights

The theoretical analysis establishes strong convergence guarantees for Moshpit SGD. Unlike gossip-based algorithms, whose convergence rates depend heavily on the spectral properties of the communication graph, Moshpit All-Reduce converges to the global average at an exponential rate that does not depend on the spectral gap. This is captured in Theorem 4.3, which bounds the average distortion after each round independently of the network's spectral gap, suggesting scalability advantages over gossip-based approaches.
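
As a point of reference for why grid-structured averaging avoids spectral-gap dependence, the following informal statement (a simplified restatement in our own notation, assuming a full grid with no peer failures) captures the exact-averaging property behind the protocol:

    % Round k replaces each peer's vector with the mean of its group along axis k:
    \theta^{(k)}_{i_1,\dots,i_d}
        \;=\; \frac{1}{m} \sum_{j=1}^{m}
              \theta^{(k-1)}_{i_1,\dots,i_{k-1},\,j,\,i_{k+1},\dots,i_d}
    % Unrolling all d rounds over N = m^d peers gives the exact global average:
    \theta^{(d)}_{i_1,\dots,i_d}
        \;=\; \frac{1}{m^d} \sum_{j_1=1}^{m} \cdots \sum_{j_d=1}^{m}
              \theta^{(0)}_{j_1,\dots,j_d}

On a full grid, every peer therefore holds the exact global mean after d rounds with no dependence on a mixing rate; the paper's analysis addresses the realistic setting with partially filled groups and dropped peers, where the exponential convergence guarantee summarized above applies.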

Practical and Theoretical Implications

The proposed method has notable implications, both theoretical and practical:

  • Cost-Effective Training: Moshpit SGD allows for robust training on cheaper, less reliable hardware setups, providing a cost-efficient alternative to dedicated HPC infrastructure. This could democratize access to large-scale model training, widening participation across institutions with varying resources.
  • Scalability: The scalability of Moshpit All-Reduce stems from its ability to operate efficiently in heterogeneous environments without dedicated high-speed interconnects. This adaptability suggests promising applications in federated learning contexts, where data privacy dictates distributed training across numerous, disparate nodes.
  • Fault Tolerance: The approach is distinctively robust to node failures and network instability, making it suitable for environments with intermittent connectivity or varying computational power, such as volunteer computing contexts.

Future Directions

Potential future research directions indicated by the paper include exploring the interaction between Moshpit All-Reduce and communication compression techniques, expanding its application scope to collaborative network training, and refining the group arrangement mechanism to further improve performance. Integrating Moshpit All-Reduce with existing parameter servers and studying its interplay with gradient compression could also yield additional efficiency gains.

In conclusion, Moshpit SGD establishes a compelling method for decentralized training in unreliable environments, showcasing both empirical success and theoretical robustness. It alleviates key challenges in existing decentralized learning frameworks and sets a precedent for future innovations in scalable, cost-effective distributed training methodologies.
