On the Convergence of Local Descent Methods in Federated Learning (1910.14425v2)

Published 31 Oct 2019 in cs.LG, cs.DC, and stat.ML

Abstract: In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a. heterogeneous or non-i.i.d. data samples). In this paper, we generalize local stochastic and full gradient descent with periodic averaging, originally designed for homogeneous distributed optimization, to solve nonconvex optimization problems in federated learning. Although scant research is available on the effectiveness of local SGD in reducing the number of communication rounds in the homogeneous setting, its convergence and communication complexity in the heterogeneous setting are mostly demonstrated empirically and lack a thorough theoretical understanding. To bridge this gap, we demonstrate that, by properly analyzing the effect of unbiased gradients and the sampling scheme in the federated setting, under mild assumptions the implicit variance-reduction feature of local distributed methods generalizes to heterogeneous data shards and exhibits the best known convergence rates of the homogeneous setting, both for general nonconvex objectives and under the Polyak-Lojasiewicz (PL) condition (a generalization of strong convexity). Our theoretical results complement the recent empirical studies that demonstrate the applicability of local GD/SGD to federated learning. We also specialize the proposed local method for networked distributed optimization. To the best of our knowledge, the obtained convergence rates are the sharpest known to date for local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.

Convergence of Local Descent Methods in Federated Learning

The paper "On the Convergence of Local Descent Methods in Federated Learning" by Haddadpour and Mahdavi analyzes the convergence characteristics of local gradient descent (GD) and local stochastic gradient descent (SGD) methods in federated learning environments. Federated learning is characterized by distributed devices, each holding its own data, which may be drawn from different distributions, so the conventional i.i.d. (independent and identically distributed) assumption no longer holds. This heterogeneity creates challenges for optimization, particularly around communication efficiency and convergence.

Key Contributions

  1. Generalization to Non-i.i.d. Settings: The authors generalize the periodic averaging approach of local SGD, previously applied in homogeneous settings, to heterogeneous data shards found in federated learning. They theorize that even with such data diversity, local methods maintain convergence capabilities when parameters are appropriately set concerning gradient diversity.
  2. Convergence Analysis: The proposed algorithms' convergence rates are analyzed both for general non-convex objectives and for non-convex objectives satisfying the Polyak-Lojasiewicz (PL) condition. For instance:
    • Local GD: With suitable tuning of learning rates and the number of local updates, convergence asymptotically matches that of distributed GD without significant dependence on the number of devices.
    • Local SGD: For non-convex problems under the PL condition, they improve convergence rates from O(E^2/T) in the existing literature to O(1/(KT)), which permits more local updates per round and hence fewer communication rounds for the same target accuracy.
  3. Networked Settings: The paper extends to networked local SGD schemes where devices communicate locally with neighbors. Analyzing such decentralized communication pathways provides a realistic perspective in federated configurations.
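The local-update-plus-periodic-averaging scheme described above can be sketched in a few lines. The following is an illustrative NumPy toy on a simple quadratic objective of my own choosing, not the paper's algorithmic details or experiments:

```python
import numpy as np

def local_sgd(grad, shards, w0, lr=0.1, rounds=20, local_steps=5, rng=None):
    """Local SGD with periodic averaging: each device runs `local_steps`
    SGD steps on its own data shard, then all models are averaged."""
    rng = rng or np.random.default_rng(0)
    models = [w0.copy() for _ in shards]
    for _ in range(rounds):                      # communication rounds
        for w, shard in zip(models, shards):     # device-local updates
            for _ in range(local_steps):
                i = rng.integers(len(shard))
                w -= lr * grad(w, shard[i])      # stochastic gradient step
        avg = np.mean(models, axis=0)            # periodic averaging
        models = [avg.copy() for _ in shards]
    return models[0]
```

With heterogeneous shards, e.g. points clustered around different centers per device and the squared-distance loss f_i(w) = ||w - x||^2 (gradient 2(w - x)), the averaged model still approaches the global minimizer, the mean of all points, illustrating why periodic averaging tolerates non-i.i.d. shards under bounded gradient diversity.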

Numerical and Empirical Considerations

The theoretical convergence rates differ according to assumptions about gradient diversity. These results align with past work in demonstrating that local gradient methods, with their inherent variance reduction, are effective in federated environments, especially with careful parameter management. Remarkably, even without explicit variance reduction strategies, local SGD maintains competitive convergence rates against more complex methods.
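In the networked variant, global periodic averaging is replaced by local averaging with neighbors through a mixing (gossip) matrix. A minimal sketch of one gossip step, using my own notation rather than the paper's:

```python
import numpy as np

def gossip_round(models, W):
    """One gossip step: each node replaces its model with a weighted
    average of its neighbors' models. W is a doubly stochastic mixing
    matrix with W[i, j] > 0 only if nodes i and j are linked."""
    X = np.stack(models)        # shape (n_nodes, dim)
    return list(W @ X)

# Example: a ring of 4 nodes, each averaging itself with its two neighbors.
W = np.array([[0.5 , 0.25, 0.  , 0.25],
              [0.25, 0.5 , 0.25, 0.  ],
              [0.  , 0.25, 0.5 , 0.25],
              [0.25, 0.  , 0.25, 0.5 ]])
```

Repeated gossip rounds drive all nodes toward the network-wide average (consensus), which is what lets decentralized local SGD mimic the behavior of the centralized periodic-averaging scheme.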

Assumptions and Implications

A few assumptions are threaded throughout the analysis:

  • Gradient Diversity Boundedness: The boundedness assumption of gradient diversity is crucial, without which convergence may not be guaranteed, necessitating careful empirical justification.
  • Sampling Scheme: In the centralized setting, how devices are selected each round plays a pivotal role in convergence, whereas it matters less in decentralized models, where repeated neighbor communication provides redundancy.
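The bounded gradient diversity assumption can be made concrete with the diversity measure of Yin et al., which I use here as an assumed proxy for the paper's quantity: the ratio of the sum of squared per-device gradient norms to the squared norm of the aggregated gradient. It equals 1/n when all devices agree and grows as gradients conflict:

```python
import numpy as np

def gradient_diversity(grads):
    """Ratio sum_i ||g_i||^2 / ||sum_i g_i||^2 over per-device gradients.
    Small (1/n) for identical gradients, large for conflicting ones."""
    g = np.stack(grads)               # shape (n_devices, dim)
    total = g.sum(axis=0)
    return float((g ** 2).sum() / (total @ total))
```

Monitoring such a quantity empirically is one way to justify the boundedness assumption on a given federated dataset.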

Practical and Theoretical Implications

The theoretical insights help quantify the permissible level of asynchrony and the communication frequencies in federated setups, bolstering the broader adoption of federated learning in real-world distributed systems. Moreover, the emphasis on strategies adaptive to gradient diversity opens avenues for dynamic federated learning systems that balance privacy considerations against computational demands.
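A back-of-the-envelope reading of the O(1/(KT)) rate shows how it translates into communication savings. This ignores constants and the paper's exact conditions, so treat it as an illustration rather than the paper's bound:

```python
def rounds_needed(eps, n_devices, local_steps):
    """If error decays like 1/(K*T), reaching accuracy eps needs roughly
    T = 1/(K*eps) total local iterations; with E local steps per round,
    only R = T/E communication rounds are required."""
    total_iters = 1.0 / (n_devices * eps)   # T such that 1/(K*T) <= eps
    return total_iters / local_steps        # R = T / E
```

For example, with K = 100 devices, E = 10 local steps, and a target accuracy of 1e-4, only about 10 communication rounds are needed, versus 100 with a single local step, which is the sense in which more local computation buys fewer synchronizations.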

Future Directions

Given these findings, future work might involve:

  • Extensive empirical testing to verify the robustness of these theoretical models across a variety of federations and data distributions.
  • Exploring adaptive synchronization techniques to dynamically alter update strategies, thus potentially further reducing communication overhead.
  • Consideration of fairness and privacy impacts on federated learning convergence and efficiency.

Overall, this paper provides a crucial step towards understanding and improving federated learning's practical challenges, particularly around convergence and efficiency in variable data conditions. It delivers a foundational direction for leveraging local descent methods' potential while cautioning about careful parameter tuning reflective of gradient diversity nuances.

Authors (2)
  1. Farzin Haddadpour (14 papers)
  2. Mehrdad Mahdavi (50 papers)
Citations (258)