
On the Convergence of FedAvg on Non-IID Data (1907.02189v4)

Published 4 Jul 2019 in stat.ML, cs.LG, and math.OC

Abstract: Federated learning enables a large amount of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (\texttt{FedAvg}) runs Stochastic Gradient Descent (SGD) in parallel on a small subset of the total devices and averages the sequences only once in a while. Despite its simplicity, it lacks theoretical guarantees under realistic settings. In this paper, we analyze the convergence of \texttt{FedAvg} on non-iid data and establish a convergence rate of $\mathcal{O}(\frac{1}{T})$ for strongly convex and smooth problems, where $T$ is the number of SGDs. Importantly, our bound demonstrates a trade-off between communication-efficiency and convergence rate. As user devices may be disconnected from the server, we relax the assumption of full device participation to partial device participation and study different averaging schemes; low device participation rate can be achieved without severely slowing down the learning. Our results indicate that heterogeneity of data slows down the convergence, which matches empirical observations. Furthermore, we provide a necessary condition for \texttt{FedAvg} on non-iid data: the learning rate $\eta$ must decay, even if full-gradient is used; otherwise, the solution will be $\Omega (\eta)$ away from the optimal.

Authors (5)
  1. Xiang Li (1003 papers)
  2. Kaixuan Huang (70 papers)
  3. Wenhao Yang (30 papers)
  4. Shusen Wang (35 papers)
  5. Zhihua Zhang (118 papers)
Citations (2,100)

Summary

Essay on the Convergence of FedAvg on Non-IID Data

The paper "On the Convergence of FedAvg on Non-IID Data" thoroughly investigates the convergence properties of the Federated Averaging (FedAvg) algorithm under non-IID data distributions. The paper is a significant contribution to the field of Federated Learning (FL), addressing the theoretical gaps and providing comprehensive analyses that outline both the strengths and potential limitations of FedAvg in practical applications.

Summary of Key Contributions

The paper's primary contributions include:

  1. Convergence Guarantees: Establishes a convergence rate of $\mathcal{O}(\frac{1}{T})$ for FedAvg in strongly convex and smooth settings, without IID assumptions or full device participation. The rate is demonstrated for two distinct sampling schemes, denoted Scheme I and Scheme II, whose effectiveness is compared (a minimal sketch of the FedAvg update appears after this list).
  2. Trade-off Analysis: Theoretical insights reveal a trade-off between communication efficiency and convergence rate, where neither an overly small nor an excessively large $E$ (the number of local SGD steps between communications) is optimal.
  3. Algorithmic Innovations: Provides new sampling and averaging schemes that ensure convergence. The paper suggests Scheme I, which shows better stability and performance under practical settings.
  4. Empirical Verification: Conducts experiments on both real and synthetic datasets, successfully verifying the theoretical findings. The experiments highlight the importance of appropriate sampling and averaging strategies for optimal algorithm performance.
  5. Necessity of Learning Rate Decay: Demonstrates that a decaying learning rate is essential for FedAvg's convergence in non-IID settings, even with E > 1 local updates. Fixed learning rates were shown to result in suboptimal convergence.
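
As a point of reference for these results, below is a minimal, self-contained sketch of the FedAvg loop under full device participation, written against a toy strongly convex problem. The quadratic local objectives, noise level, step counts, and decay constants are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal FedAvg sketch on a toy strongly convex problem (illustrative only).
# Device k holds its own quadratic objective f_k(w) = 0.5 * ||w - c_k||^2,
# so the problem is "non-IID" in the sense that the local optima c_k differ.
import numpy as np

rng = np.random.default_rng(0)
N, d, E, T = 10, 5, 5, 200            # devices, dimension, local steps, total SGD steps
centers = rng.normal(size=(N, d))      # distinct local optima -> heterogeneous objectives
w = np.zeros(d)                        # global model

def local_sgd(w, c, lr, steps):
    """Run `steps` noisy gradient steps on f_k(w) = 0.5 * ||w - c||^2."""
    for _ in range(steps):
        grad = (w - c) + 0.1 * rng.normal(size=w.shape)   # stochastic gradient
        w = w - lr * grad
    return w

for rnd in range(T // E):
    lr = 1.0 / (rnd * E + 10)          # decaying learning rate, as the analysis requires
    # Full participation: every device runs E local steps from the current global model.
    local_models = [local_sgd(w.copy(), centers[k], lr, E) for k in range(N)]
    w = np.mean(local_models, axis=0)  # the server averages the local models

# With equal weights, the global optimum of the toy problem is the mean of the centers.
print("distance to optimum:", np.linalg.norm(w - centers.mean(axis=0)))
```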

Analytical Approach

Notations and Preliminary Assumptions

The authors define clear notations and set preliminary assumptions that form the foundation of their analysis. They consider strongly convex and smooth functions, bounded variance of stochastic gradients, and bounded norms of these gradients. These assumptions are crucial to derive meaningful analytical results.
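
In standard form, these assumptions are typically stated as follows (paraphrased; the paper's exact constants and numbering are not reproduced here):

  • $L$-smoothness of each local objective $F_k$: $F_k(v) \le F_k(w) + \langle \nabla F_k(w), v - w \rangle + \frac{L}{2}\|v - w\|^2$ for all $v, w$.
  • $\mu$-strong convexity: $F_k(v) \ge F_k(w) + \langle \nabla F_k(w), v - w \rangle + \frac{\mu}{2}\|v - w\|^2$ for all $v, w$.
  • Bounded variance of the stochastic gradients on each device: $\mathbb{E}\|\nabla F_k(w; \xi_k) - \nabla F_k(w)\|^2 \le \sigma_k^2$.
  • Uniformly bounded stochastic gradients: $\mathbb{E}\|\nabla F_k(w; \xi_k)\|^2 \le G^2$.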

Theorems and Convergence Proofs

  1. Theorem on Full Device Participation (Theorem 1): This theorem demonstrates that the FedAvg algorithm, under full device participation, achieves a convergence rate of $\mathcal{O}(\frac{1}{T})$. The proof integrates lemmas bounding the variance of the stochastic gradients and the divergence of the local models.
  2. Theorems on Partial Device Participation (Theorem 2 and Theorem 3): These theorems extend the analysis to partial device participation, a more practical setting given computational and communication constraints. Scheme I (sampling with replacement) and Scheme II (sampling without replacement) are proposed and analyzed. Their convergence rates are rigorously derived, establishing that appropriately designed partial participation strategies can ensure effective and efficient learning (a sketch of the two schemes follows this list).
  3. Results on Learning Rate Decay (Theorem 4): An essential theoretical insight is that the learning rate must decay; otherwise, FedAvg risks converging to a point $\Omega(\eta)$ away from the optimum.
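
To illustrate how the two sampling schemes differ at the server, here is a rough sketch under the assumption that device $k$ holds a fraction $p_k$ of the total data; the function names are hypothetical, and the exact reweighting factors may differ in constants from those in Theorems 2 and 3.

```python
# Sketch of two device-sampling / aggregation strategies (weights paraphrased;
# the paper's exact scaling factors are not reproduced here).
import numpy as np

rng = np.random.default_rng(1)

def scheme_one(local_models, p, K):
    """Scheme I: sample K devices WITH replacement, proportional to the data
    fractions p_k, then average the sampled local models uniformly."""
    idx = rng.choice(len(local_models), size=K, replace=True, p=p)
    return np.mean([local_models[k] for k in idx], axis=0)

def scheme_two(local_models, p, K):
    """Scheme II: sample K devices uniformly WITHOUT replacement, then reweight
    the sampled models by p_k * N / K so the aggregate is unbiased."""
    N = len(local_models)
    idx = rng.choice(N, size=K, replace=False)
    return np.sum([p[k] * (N / K) * local_models[k] for k in idx], axis=0)

# Example: 10 devices with equal data fractions, 3 sampled per round.
models = [rng.normal(size=4) for _ in range(10)]
p = np.full(10, 0.1)
print(scheme_one(models, p, K=3))
print(scheme_two(models, p, K=3))
```

Both aggregates are unbiased estimates of the weighted average $\sum_k p_k w_k$, which is the property that lets the convergence analysis carry over to partial participation.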

Practical Implications and Future Directions

The theoretical results have several practical implications. They suggest that for efficient FL deployments:

  • The choice of $E$ must be balanced to optimize communication efficiency and convergence.
  • Partial participation is viable and can save communication costs without severely impacting convergence.
  • A decaying learning rate is necessary to avoid convergence to suboptimal solutions (see the schedule sketched after this list).
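
For the last point, the kind of schedule compatible with this analysis is a diminishing one, for example $\eta_t = \frac{\beta}{t + \gamma}$ for suitable constants $\beta, \gamma > 0$ (the exact constants in the paper's theorems are not reproduced here), so that $\eta_t \to 0$ while $\sum_t \eta_t$ diverges. With a fixed learning rate $\eta$, the necessary condition stated in the abstract implies the iterates remain $\Omega(\eta)$ away from the optimum no matter how long training runs.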

Future research could explore adaptive strategies for dynamically adjusting $E$ and $K$ (the number of participating devices) based on the real-time performance of the system. Additionally, extending these analyses to non-convex settings, which are prevalent in deep learning, is a potential direction for further research.

Empirical Validation

Empirical studies conducted on MNIST and two synthetic datasets adhered to rigorous experimental protocols. They validate the theoretical assertions that the novel averaging schemes contribute significantly to the stability and performance of FedAvg. Importantly, the experiments affirm that Schemes I and II provide robust convergence, with Scheme I demonstrating superior stability under practical heterogeneous data distributions.

Conclusion

This paper fundamentally advances our understanding of FedAvg in non-IID settings, establishing critical convergence guarantees and practical sampling schemes. Importantly, it bridges the gap between theoretical rigour and practical applicability in federated learning.

By meticulously proving the necessity of specific algorithmic choices, such as learning rate decay, and validating these choices empirically, the authors provide a solid foundation for future advancements in FL algorithms, ensuring that they can be reliably deployed in real-world applications.