Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices (2401.01728v2)
Abstract: Modern deep learning models, growing ever larger and more complex, have demonstrated exceptional generalization and accuracy thanks to training on huge datasets, and this trend is expected to continue. However, the increasing size of these models poses challenges for training, as traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular, resource-constrained, heterogeneous PCs connected across the internet to achieve favourable performance metrics. Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without requiring each node to host the entire model. These clusters engage in *Zero-Bubble Asynchronous Model Parallel* training, and a *Parallel Multi-Ring All-Reduce* method is employed to execute global parameter averaging across all clusters. We frame our asynchronous SGD loss function as a block-structured optimization problem with delayed updates and derive an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.
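The global parameter averaging step can be pictured with a minimal, single-process sketch of ring all-reduce, shown below. This is an illustrative simulation only, under assumed names (`ring_allreduce_average`, a single ring over in-memory arrays); it is not Ravnest's actual Parallel Multi-Ring All-Reduce implementation. Each simulated node splits its parameter vector into as many chunks as there are nodes, a reduce-scatter pass leaves each node with the global sum of one chunk, and an all-gather pass circulates those reduced chunks so every node ends up with the full average.

```python
# Minimal single-process sketch of ring all-reduce parameter averaging.
# Illustrative only: one ring over in-memory arrays, not Ravnest's
# Parallel Multi-Ring All-Reduce; all names here are assumptions.
import numpy as np

def ring_allreduce_average(node_params):
    """Average equal-length parameter vectors held by n simulated nodes."""
    n = len(node_params)
    # Each node splits its parameter vector into n chunks.
    chunks = [np.array_split(p.astype(np.float64), n) for p in node_params]

    # Reduce-scatter: after n - 1 steps, node i holds the global sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step - 1) % n  # chunk node i accumulates this step
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % n][c]

    # All-gather: circulate the fully reduced chunks around the ring so
    # every node ends up holding the complete summed vector.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n  # chunk node i overwrites this step
            chunks[i][c] = chunks[(i - 1) % n][c]

    # Divide by the number of nodes to turn the sums into averages.
    return [np.concatenate(node) / n for node in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = [rng.standard_normal(10) for _ in range(4)]  # 4 simulated clusters
    averaged = ring_allreduce_average(params)
    assert np.allclose(averaged[0], np.mean(params, axis=0))
```

Each of the two passes moves only a 1/n fraction of the parameters per step, which is why ring-style all-reduce is bandwidth-efficient for large messages. The sketch covers a single ring only; the paper's method runs multiple rings in parallel across clusters.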