Accelerating Gossip SGD with Periodic Global Averaging (2105.09080v1)

Published 19 May 2021 in cs.LG and cs.DC

Abstract: Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse networks where $1-\beta \to 0$, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging into Gossip SGD. Its transient stage, i.e., the iterations required to reach the asymptotic linear-speedup stage, improves from $\Omega(\beta^4 n^3/(1-\beta)^4)$ to $\Omega(\beta^4 n^3 H^4)$ for non-convex problems. The influence of network topology in Gossip-PGA can be controlled by the averaging period $H$. Its transient-stage complexity is also superior to Local SGD, which has order $\Omega(n^3 H^4)$. Empirical results of large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
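To make the algorithm concrete, the sketch below simulates Gossip-PGA in a single process: each node performs a local SGD step, then averages with its neighbors through a doubly-stochastic mixing matrix, and every $H$-th iteration performs an exact global average instead. This is a minimal illustrative sketch, not the authors' implementation; the names `gossip_pga`, `grad_fn`, `W`, and `H` are assumptions introduced here for clarity.

```python
# Single-process simulation of Gossip SGD with Periodic Global Averaging
# (hypothetical sketch). Each "node" keeps its own parameter copy.
import numpy as np

def gossip_pga(grad_fn, x0, W, n_nodes, lr=0.1, H=8, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    # x[i] is node i's local copy of the model parameters.
    x = np.tile(np.asarray(x0, dtype=float), (n_nodes, 1))
    for t in range(1, steps + 1):
        # Local stochastic gradient step on each node.
        for i in range(n_nodes):
            x[i] -= lr * grad_fn(x[i], rng)
        if t % H == 0:
            # Periodic global averaging: every node receives the exact mean.
            x[:] = x.mean(axis=0, keepdims=True)
        else:
            # Gossip step: each node averages only with its neighbors,
            # encoded by the doubly-stochastic mixing matrix W.
            x = W @ x
    return x.mean(axis=0)

# Example: a 4-node ring topology and a noisy quadratic objective.
n = 4
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

grad = lambda x, rng: 2.0 * (x - 1.0) + 0.1 * rng.standard_normal(x.shape)
print(gossip_pga(grad, x0=np.zeros(3), W=W, n_nodes=n))
```

The averaging period $H$ trades communication for topology dependence: smaller $H$ brings the method closer to parallel SGD, while larger $H$ keeps most steps as cheap neighbor-only gossip rounds.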

Citations (41)
