Asynchronous Decentralized Parallel Stochastic Gradient Descent: A Summary
The paper on Asynchronous Decentralized Parallel Stochastic Gradient Descent (AD-PSGD) addresses key challenges in distributed machine learning, particularly efficiency and robustness in heterogeneous computational environments. Conventional distributed systems operate in either synchronous or centralized asynchronous modes. Synchronous methods, such as AllReduce-SGD, force every worker to wait at a synchronization barrier and therefore run at the pace of the slowest worker, which makes them inefficient in heterogeneous settings. Conversely, centralized asynchronous methods can suffer from congestion and poor scalability because all traffic flows through the parameter servers.
The core contribution of the paper is AD-PSGD, a novel algorithm that circumvents both problems through decentralized communication and asynchronous updates. AD-PSGD achieves the same convergence rate as classical SGD and exhibits linear speedup with respect to the number of workers.
Theoretical Foundations and Algorithmic Design
AD-PSGD rests on a solid theoretical foundation, extending the analysis of stochastic gradient descent (SGD) to a decentralized, asynchronous setting. The paper establishes that AD-PSGD maintains the same convergence rate as traditional SGD for non-convex objectives, achieving the bound O(1/√K), where K is the total number of updates. Because workers never wait for one another, idle time is reduced and throughput is enhanced.
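As a sketch, the guarantee has the following flavor; the notation (f for the objective, x̄_k for the average of the local models after k updates, K for the total number of updates) follows standard decentralized-SGD analyses and is an assumption about the paper's exact symbols, not a verbatim quotation.

```latex
% Schematic form of the non-convex convergence guarantee. Symbols are
% assumptions following standard decentralized-SGD notation:
%   f        -- the (non-convex) objective
%   \bar{x}_k -- average of the local models after k updates
%   K        -- total number of stochastic gradient updates
\[
  \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\left\| \nabla f(\bar{x}_k) \right\|^2
  \;=\; O\!\left(\frac{1}{\sqrt{K}}\right)
\]
```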
In AD-PSGD, each worker maintains a local model, computes gradients independently on its local data, and applies updates asynchronously, without waiting for any synchronization signal. Communication is decentralized: each worker averages its model only with a neighboring node in the communication graph, eliminating any central bottleneck. The spectral properties of the doubly stochastic mixing matrix, together with a bounded-staleness assumption on delayed gradients, ensure that the model still converges effectively.
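The per-step behavior can be illustrated with a minimal single-process simulation: one worker "wakes up," computes a stochastic gradient on its local model, and averages that model with a randomly chosen neighbor. This is a sketch of the update pattern only; the toy least-squares objective, ring topology, step size, and all variable names below are illustrative assumptions, not the paper's implementation.

```python
# Minimal single-process simulation of the AD-PSGD update pattern (a sketch,
# not the paper's concurrent implementation). Each step: pick a worker, compute
# a stochastic gradient at its local model, average the model with a random
# ring neighbor (one doubly stochastic mixing step), then apply the gradient.
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, lr, steps = 8, 10, 0.05, 2000

# Toy objective: f(x) = 0.5 * E[(a @ x - a @ x_star)^2] over random rows a.
A_true = rng.normal(size=(dim, dim))
x_star = rng.normal(size=dim)

models = [rng.normal(size=dim) for _ in range(n_workers)]

def stochastic_grad(x):
    """Noisy gradient of the toy objective (one random row of A per call)."""
    a = A_true[rng.integers(dim)]
    return a * (a @ x - a @ x_star)

for _ in range(steps):
    i = rng.integers(n_workers)                # an idle worker wakes up
    j = (i + rng.choice([-1, 1])) % n_workers  # random ring neighbor
    g = stochastic_grad(models[i])             # gradient at the local model
    avg = 0.5 * (models[i] + models[j])        # pairwise model averaging
    models[i] = avg - lr * g                   # local SGD step after mixing
    models[j] = avg

print("mean distance to optimum:",
      np.mean([np.linalg.norm(x - x_star) for x in models]))
```

In the actual algorithm, gradient computation and the averaging step run concurrently, so a gradient may be applied to a model that has been averaged since the gradient was computed; the bounded-staleness analysis covers exactly this case.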
Empirical Results
The empirical studies evaluate AD-PSGD's speedup, robustness, and convergence speed against state-of-the-art baselines such as AllReduce-SGD, D-PSGD, and EAMSGD. Experiments on different hardware setups, including an IBM S822LC cluster and an x86 cluster, demonstrate AD-PSGD's superior performance in heterogeneous environments. Notably, the algorithm outperforms the other methods when computational speeds or network conditions vary, highlighting its robustness. Results from training models such as ResNet-50 on ImageNet and VGG on CIFAR-10 show that under heterogeneous conditions AD-PSGD converges significantly faster than its peers, often by orders of magnitude.
Implications and Future Directions
AD-PSGD holds significant potential in practical distributed machine learning deployments, particularly within environments marked by variability in task execution speed and network stability. The methodological innovation allows for the efficient scaling of deep learning training across large numbers of GPUs without the traditional drawbacks of synchronization or parameter server congestion. The algorithm’s adaptability and performance suggest a robust framework for extending distributed learning tasks in research and commercial contexts involving high-dimensional data and complex model architectures.
Future research could explore further optimization of the communication topology to enhance convergence rates (a small illustration follows below). Integrating adaptive mechanisms that adjust worker interactions based on real-time measurements of network and compute conditions could further improve efficiency. Lastly, the development of enhanced protocols to ensure security and privacy in decentralized training environments remains a pertinent area for ongoing exploration.
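On the topology point, decentralized-SGD analyses commonly relate convergence speed to the spectral gap of the mixing matrix (a larger gap means faster consensus among workers). The sketch below compares two textbook topologies on that metric; the specific matrix constructions are illustrative and not taken from the paper.

```python
# Compare two communication topologies via the spectral gap of their
# doubly stochastic mixing matrices -- a standard proxy for how quickly
# decentralized averaging mixes information across workers.
import numpy as np

def spectral_gap(W):
    """1 - |second largest eigenvalue| of a symmetric doubly stochastic W."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return 1.0 - eig[-2]

n = 16

# Ring: each worker averages equally with itself and its two neighbors.
ring = np.zeros((n, n))
for i in range(n):
    ring[i, i] = 1 / 3
    ring[i, (i - 1) % n] = 1 / 3
    ring[i, (i + 1) % n] = 1 / 3

# Fully connected: uniform averaging over all workers.
full = np.full((n, n), 1 / n)

print(f"ring spectral gap:            {spectral_gap(ring):.4f}")
print(f"fully connected spectral gap: {spectral_gap(full):.4f}")
```

The fully connected graph mixes fastest but costs the most communication per step; a ring is cheap but mixes slowly. Topology optimization amounts to navigating this trade-off.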