Variance Reduced Local SGD with Lower Communication Complexity (1912.12844v1)

Published 30 Dec 2019 in cs.LG, cs.DC, math.OC, and stat.ML

Abstract: To accelerate the training of machine learning models, distributed stochastic gradient descent (SGD) and its variants have been widely adopted, which apply multiple workers in parallel to speed up training. Among them, Local SGD has gained much attention due to its lower communication cost. Nevertheless, when the data distribution on workers is non-identical, Local SGD requires $O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its \emph{linear iteration speedup} property, where $T$ is the total number of iterations and $N$ is the number of workers. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity. Benefiting from eliminating the dependency on the gradient variance among workers, we theoretically prove that VRL-SGD achieves a \emph{linear iteration speedup} with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct experiments on three machine learning tasks, and the experimental results demonstrate that VRL-SGD performs impressively better than Local SGD when the data among workers are quite diverse.
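The abstract describes the general idea (Local SGD plus a per-worker variance-reduction term that cancels the gradient disagreement caused by non-identical data) but not the exact update rules. The sketch below is a minimal illustration of that idea using a SCAFFOLD-style control-variate correction as a stand-in; the quadratic objectives, hyperparameters, and correction recursion are assumptions for illustration, not the authors' VRL-SGD algorithm.

```python
import numpy as np

# Illustrative sketch: Local SGD with per-worker drift-correction terms on
# non-identical data. The correction recursion is SCAFFOLD-style and serves
# only as a stand-in for the variance-reduction idea described in the abstract.

rng = np.random.default_rng(0)
N, d, K, rounds, lr = 4, 10, 5, 50, 0.1  # workers, dimension, local steps, comm rounds, step size

# Non-identical data: each worker optimizes 0.5*||x - b_i||^2 with its own b_i.
targets = [rng.normal(size=d) + 3.0 * i for i in range(N)]

def stoch_grad(x, b):
    """Noisy gradient of the worker's local quadratic objective."""
    return (x - b) + 0.1 * rng.normal(size=d)

x_global = np.zeros(d)
c_local = [np.zeros(d) for _ in range(N)]  # per-worker correction terms
c_global = np.zeros(d)                     # average of the correction terms

for r in range(rounds):
    local_models, new_c = [], []
    for i in range(N):
        x = x_global.copy()
        for _ in range(K):
            # Local step with a correction that offsets the gap between the
            # worker's gradient and the (estimated) global gradient.
            x -= lr * (stoch_grad(x, targets[i]) - c_local[i] + c_global)
        local_models.append(x)
        # Refresh the worker's correction from its progress this round.
        new_c.append(c_local[i] - c_global + (x_global - x) / (K * lr))
    # Communication happens only once per round: average models and corrections.
    x_global = np.mean(local_models, axis=0)
    c_local = new_c
    c_global = np.mean(c_local, axis=0)

# The average objective is minimized at the mean of the targets.
print("distance to optimum:", float(np.linalg.norm(x_global - np.mean(targets, axis=0))))
```

The point of the correction terms is that, unlike plain Local SGD, the local iterates no longer drift toward each worker's own minimizer between synchronizations, which is what removes the dependence on the gradient variance among workers in the analysis.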

Authors (6)
  1. Xianfeng Liang (5 papers)
  2. Shuheng Shen (10 papers)
  3. Jingchang Liu (6 papers)
  4. Zhen Pan (53 papers)
  5. Enhong Chen (242 papers)
  6. Yifei Cheng (8 papers)
Citations (149)