D$^2$: Decentralized Training over Decentralized Data
(1803.07068v2)
Published 19 Mar 2018 in cs.DC, cs.LG, and stat.ML
Abstract: While training a machine learning model using multiple workers, each of which collects data from their own data sources, it would be most useful when the data collected from different workers can be {\em unique} and {\em different}. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are {\em not too different}. In this paper, we ask the question: {\em Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers?} In this paper, we present D$^2$, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D$^2$ is a variance reduction extension of the standard D-PSGD algorithm, which improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$, where $\zeta^{2}$ denotes the variance among data on different workers. As a result, D$^2$ is robust to data variance among workers. We empirically evaluated D$^2$ on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D$^2$ significantly outperforms D-PSGD.
The paper introduces a variance reduction mechanism that improves D-PSGD performance on heterogeneous data.
It combines each worker's current local gradient update with information from the previous iteration to effectively reduce inter-worker variance.
Experiments show that D2 nearly matches centralized SGD performance in decentralized image classification tasks.
Overview of D2: Decentralized Training over Decentralized Data
This paper presents D2, a novel algorithm designed to enhance the efficacy of decentralized parallel stochastic gradient descent (D-PSGD) in scenarios where data distribution among workers exhibits substantial variability. Standard approaches in decentralized network training assume limited variance across datasets managed by different workers; however, D2 addresses the practical challenge where datasets are significantly divergent, a situation often encountered in real-world distributed machine learning tasks.
Core Contributions
D2 introduces a variance reduction mechanism that augments the conventional D-PSGD algorithm. This enhancement improves the convergence rate from $O\left({\sigma \over \sqrt{nT}} + {(n\zeta^2)^{\frac{1}{3}} \over T^{2/3}}\right)$ to $O\left({\sigma \over \sqrt{nT}}\right)$, effectively mitigating the performance degradation typically caused by high inter-worker data variance. Notably, $\zeta^2$ quantifies the variance among datasets maintained by different workers, while $\sigma^2$ pertains to the variance within each worker's local dataset.
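For reference, these quantities are typically defined as follows (a standard formulation consistent with the abstract; the paper's exact notation may differ slightly):
$$
\mathbb{E}_{\xi \sim \mathcal{D}_i}\left\|\nabla F_i(x;\xi) - \nabla f_i(x)\right\|^2 \le \sigma^2,
\qquad
\frac{1}{n}\sum_{i=1}^{n}\left\|\nabla f_i(x) - \nabla f(x)\right\|^2 \le \zeta^2,
$$
where $f_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x;\xi)$ is worker $i$'s local objective, $f = \frac{1}{n}\sum_{i=1}^n f_i$ is the global objective, and $\mathcal{D}_i$ is worker $i$'s local data distribution.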
Methodological Insights
The D2 algorithm integrates a variance reduction technique within the decentralized optimization framework. At each iteration, every worker computes a local stochastic gradient and combines it with the model and gradient from the previous iteration before averaging with its neighbors, so that the bias introduced by inter-worker data differences cancels out as the algorithm converges. This methodological innovation is crucial in minimizing the adverse impact of dataset heterogeneity across distributed networks.
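The toy sketch below illustrates the flavor of the recursion, assuming the matrix-form update $X_{t+1} = \left(2X_t - X_{t-1} - \gamma(G_t - G_{t-1})\right)W$ with a plain D-PSGD-style step at $t=0$; the quadratic objectives, ring topology, and hyperparameters are illustrative choices, not taken from the paper.

```python
# Minimal illustrative sketch of the D2 recursion on a toy problem with
# heterogeneous local quadratics, assuming the update form
#   X_{t+1} = (2 X_t - X_{t-1} - gamma * (G_t - G_{t-1})) W,
# initialized with X_1 = (X_0 - gamma * G_0) W. Names and the toy
# objective are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 5                      # number of workers, parameter dimension
T, gamma = 400, 0.05             # iterations, step size

# Ring-topology mixing matrix W (symmetric, doubly stochastic).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3
    W[i, (i - 1) % n] = 1 / 3
    W[i, (i + 1) % n] = 1 / 3

# Heterogeneous local objectives: f_i(x) = 0.5 * ||x - b_i||^2 with very
# different minimizers b_i per worker ("decentralized" data).
B = rng.normal(scale=5.0, size=(n, d))
x_star = B.mean(axis=0)          # minimizer of the global objective

def grads(X, noise=0.1):
    """Noisy gradients of the local quadratics (noise level plays the role of sigma)."""
    return (X - B) + noise * rng.normal(size=X.shape)

def run(method):
    X = np.zeros((n, d))
    X_prev, G_prev = None, None
    for t in range(T):
        G = grads(X)
        if method == "dpsgd" or t == 0:
            X_next = W @ (X - gamma * G)                          # D-PSGD-style step
        else:
            X_next = W @ (2 * X - X_prev - gamma * (G - G_prev))  # D2 step
        X_prev, G_prev, X = X, G, X_next
    # Mean per-worker distance to the global minimizer.
    return np.mean(np.linalg.norm(X - x_star, axis=1))

print("D-PSGD mean per-worker error:", run("dpsgd"))
print("D2     mean per-worker error:", run("d2"))
```

On this toy problem, the D-PSGD iterates settle at a step-size-dependent disagreement that grows with how far apart the local minimizers $b_i$ are, whereas the D2 recursion cancels this heterogeneity-induced bias and each worker's residual error is driven mainly by gradient noise.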
Empirical Validation
The empirical superiority of D2 is validated through image classification experiments where workers train models using a limited set of image labels. In these settings, D2 consistently outperforms traditional D-PSGD, nearing the performance levels of centralized SGD even amidst significant data variance.
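As a rough illustration of how such a "decentralized data" setting can be constructed, the snippet below shards a synthetic labeled dataset so that each worker holds only a couple of classes; the helper name and all parameters are hypothetical, and the paper's experiments use real image-classification data.

```python
# Illustrative label-partitioned ("decentralized") data split: each worker
# only sees a small subset of the class labels, producing high inter-worker
# variance. Synthetic labels stand in for a real image dataset.
import numpy as np

def shard_by_label(labels, n_workers, labels_per_worker, seed=0):
    """Assign example indices to workers so each worker holds only a few labels."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Each worker gets a small, randomly chosen label subset.
    worker_labels = [rng.choice(classes, size=labels_per_worker, replace=False)
                     for _ in range(n_workers)]
    shards = []
    for lbls in worker_labels:
        idx = np.flatnonzero(np.isin(labels, lbls))
        shards.append(rng.permutation(idx))
    return worker_labels, shards

# Example: 10 classes, 8 workers, 2 labels per worker.
labels = np.random.default_rng(1).integers(0, 10, size=10_000)
worker_labels, shards = shard_by_label(labels, n_workers=8, labels_per_worker=2)
for i, (lbls, idx) in enumerate(zip(worker_labels, shards)):
    print(f"worker {i}: labels {sorted(lbls.tolist())}, {len(idx)} examples")
```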
Theoretical and Practical Implications
Theoretically, this work advances our understanding of decentralized optimization by eliminating the previously required assumption of bounded (small) data variance across workers. Practically, the results imply that decentralized machine learning models can achieve faster convergence rates and higher performance without needing homogeneous datasets, a common limitation in real-world applications like federated learning, where latency and privacy concerns necessitate decentralized approaches.
Future Directions
Looking ahead, the principles of variance reduction in decentralized settings explored in D2 could extend to broader applications involving non-convex optimization and other forms of stochastic gradient descent. Additionally, exploring scalability and robustness in more complex network architectures could offer valuable insights for deploying machine learning solutions in diverse, large-scale environments.
In summary, the D2 algorithm represents a significant stride in decentralized algorithm design, offering both theoretical advances and practical applications in addressing data heterogeneity challenges in distributed machine learning frameworks.