Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning (1807.06629v3)

Published 17 Jul 2018 in math.OC, cs.DC, and cs.LG

Abstract: In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain the average, and updates each worker's local model with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, a large body of experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Authors (3)
  1. Hao Yu (195 papers)
  2. Sen Yang (191 papers)
  3. Shenghuo Zhu (29 papers)
Citations (568)

Summary

Overview of Parallel Restarted SGD with Model Averaging

The paper "Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning" by Hao Yu et al. addresses an efficiency issue in distributed training of deep neural networks, particularly regarding the communication overhead involved with parallel mini-batch SGD. The authors propose an alternative method called Parallel Restarted SGD (PR-SGD), which employs a model averaging strategy to reduce communication frequency while maintaining convergence rates.

Key Contributions

The paper provides a theoretical foundation for the use of model averaging, a method where individual models trained on parallel workers are periodically averaged, reducing communication overhead compared to traditional parallel mini-batch SGD. Despite previous empirical success, the underlying theoretical reasons for its efficacy were not well understood.
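
Written out explicitly (with $x_i$ denoting worker $i$'s local model and $N$ the number of workers; this notation is introduced here for illustration), the averaging step is

$$\bar{x} \;=\; \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad x_i \leftarrow \bar{x} \quad \text{for all } i,$$

after which every worker resumes local SGD from the common point $\bar{x}$.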

  1. Communication Reduction: PR-SGD significantly lowers the number of communication rounds by synchronizing only every $I$ iterations rather than at every step (a minimal sketch of this loop appears after the list). This approach is theoretically shown to achieve convergence properties similar to those of parallel mini-batch SGD.
  2. Convergence Analysis: The authors demonstrate that PR-SGD can achieve an $O(1/\sqrt{NT})$ convergence rate for non-convex optimization problems, such as those encountered in deep learning, given appropriate choices of the synchronization interval $I$.
  3. Linear Speedup: Theoretical analysis indicates that PR-SGD retains a linear speedup in terms of the number of workers $N$, a desirable property in distributed systems.
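
Below is a minimal, single-process sketch of the PR-SGD loop described in these contributions, simulating the $N$ workers sequentially. The function `pr_sgd`, the Gaussian gradient-noise model, and all hyperparameter values are illustrative placeholders rather than choices taken from the paper.

```python
import numpy as np

def pr_sgd(grad_fn, x0, num_workers=4, sync_interval=8, num_rounds=100, lr=0.05, seed=0):
    """PR-SGD sketch: each worker runs `sync_interval` local SGD steps,
    then all local models are averaged (one communication round)."""
    rng = np.random.default_rng(seed)
    # Every worker starts from the same initial point.
    workers = [x0.copy() for _ in range(num_workers)]
    for _ in range(num_rounds):
        # Local phase: independent stochastic gradient steps, no communication.
        for i in range(num_workers):
            for _ in range(sync_interval):
                noise = rng.normal(scale=0.1, size=x0.shape)  # stand-in for minibatch noise
                workers[i] -= lr * (grad_fn(workers[i]) + noise)
        # Communication phase: average the models and restart every worker from the average.
        avg = np.mean(workers, axis=0)
        workers = [avg.copy() for _ in range(num_workers)]
    return workers[0]

# Toy example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = pr_sgd(lambda x: x, x0=np.ones(10))
print(np.linalg.norm(x_final))  # should be close to 0
```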

Strong Numerical Results

  • The paper supports its claims with a robust theoretical analysis showing that by setting $I = O(T^{1/4}/N^{3/4})$, PR-SGD achieves the same convergence rate as parallel mini-batch SGD (the resulting communication saving is worked out after this list).
  • The numerical results presented in the experiments section validate the theoretical findings, demonstrating that PR-SGD maintains strong training loss and test accuracy without a proportional increase in communication cost.
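
As a rough consequence of that choice (constants and the paper's smoothness and bounded-variance assumptions are omitted here), the number of communication rounds over $T$ total iterations falls from $T$ for parallel mini-batch SGD to

$$\frac{T}{I} \;=\; \frac{T}{O\!\left(T^{1/4}/N^{3/4}\right)} \;=\; O\!\left(T^{3/4}N^{3/4}\right),$$

which is smaller than $T$ whenever $T$ grows faster than $N^3$, while the $O(1/\sqrt{NT})$ rate above is preserved.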

Implications and Future Directions

The results have significant implications for distributed deep learning systems, pointing toward more efficient training methods that conserve critical bandwidth resources. Given the limitations of communication in large-scale distributed systems, this methodology can substantially enhance the scalability of training architectures.

Future research may focus on further refining the synchronization interval and learning rates to optimize convergence and efficiency in more diverse settings. Additionally, exploring asynchronous or adaptive communication strategies could lead to further advancements in distributed optimization methods.

Conclusion

The paper presents a rigorous theoretical and empirical analysis of the model averaging technique in distributed deep learning. By providing an understanding of why and how model averaging reduces communication overhead while maintaining convergence performance, this work contributes an important insight into distributed machine learning practices. The framework established can be a foundation for future exploration into efficient distributed machine learning systems.