
Convergence Analysis of Federated Optimization

Updated 25 February 2026
  • The paper outlines convergence guarantees for synchronous, asynchronous, and proximal federated optimization methods under non-IID data and system heterogeneity.
  • A comprehensive analysis shows that local gradient drift, client selection bias, and staleness critically affect convergence rates and necessitate tailored algorithmic adjustments.
  • Empirical and theoretical insights reveal that adaptive, momentum-based, and hybrid strategies can achieve near-linear convergence under appropriate constraints.

Federated optimization refers to a collection of decentralized algorithms that solve empirical risk minimization or related objectives by leveraging local computation and coordinating updates across clients via a central server. Convergence analysis in this domain addresses the rate and robustness with which such algorithms approach a stationary point or minimizer of the global objective, under significant challenges such as non-IID data, system heterogeneity, partial/asynchronous participation, communication constraints, and synchronization delays. The literature has produced a diverse set of algorithmic paradigms—synchronous and asynchronous, proximal and momentum-accelerated, adaptive, and hybrid approaches—each with its own convergence guarantees, assumptions, and trade-offs.

1. Fundamental Models and Assumptions

Federated optimization is typically framed as the minimization of a global objective over a parameter x \in \mathbb{R}^d,

F(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{z\sim \mathcal{D}_i}[f(x;z)],

where device i possesses data sampled from its local distribution \mathcal{D}_i (Xie et al., 2019). Analyses often rely on conventional assumptions:

  • L-smoothness: f or F has L-Lipschitz gradients.
  • (Strong/weak) convexity: f(x) + (\mu/2)\|x\|^2 is convex for \mu \ge 0, permitting treatment of convex and certain nonconvex problems.
  • Bounded stochastic gradients: \|\nabla f(x;z)\|^2 \le V_1 and \|\nabla g_{x'}(x;z)\|^2 \le V_2.
  • Participation and local work constraints: bounded staleness K and local epoch imbalance \delta = H_{\max}/H_{\min}.
  • Data/statistical heterogeneity: quantified via parameters such as bounded dissimilarity B_\epsilon or the local-global gap \gamma (Li et al., 2018, Cho et al., 2020).

Non-IID data, variable participation, and unbalanced local computation all result in local gradient drift and slow or even divergent behavior unless suitably controlled by the method or included in the theoretical analysis.
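
As a concrete illustration of the setup above, the global objective can be instantiated with a squared loss and synthetic non-IID client data (a hypothetical sketch, not code from any cited paper; all names and parameters here are illustrative):

```python
# Sketch of F(x) = (1/n) * sum_i E_{z~D_i}[f(x; z)] with a squared loss.
# Each client i draws features from a shifted distribution D_i (non-IID).
import numpy as np

rng = np.random.default_rng(0)
n_clients, d = 4, 3
clients = []
for i in range(n_clients):
    A = rng.normal(loc=i, scale=1.0, size=(20, d))  # client-specific shift -> non-IID
    b = rng.normal(size=20)
    clients.append((A, b))

def global_objective(x):
    """F(x): uniform average of the n local empirical risks."""
    return np.mean([0.5 * np.mean((A @ x - b) ** 2) for A, b in clients])
```

In a real federated system the server never evaluates this average directly; each client only ever touches its own (A, b).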

2. Synchronous, Asynchronous, and Proximal Paradigms

Synchronous Approaches: FedAvg, FedProx

FedAvg averages local SGD iterates under fixed or random client participation, typically without explicit mitigation of client drift. Its canonical convergence rate for \mu-strongly convex, smooth f, full participation, and a constant step size is sublinear; with partial participation and non-IID data, local heterogeneity leads to an error floor or a decelerated rate (Li et al., 2018).
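
A minimal FedAvg round can be sketched as follows (illustrative Python assuming a quadratic local loss; the cohort size, step counts, and learning rate are arbitrary choices, not values from the cited analyses):

```python
# One synchronous FedAvg round: sample a cohort, run K local SGD steps on
# each sampled client, and average the resulting local iterates.
import numpy as np

def local_sgd(x_global, A, b, lr=0.01, K=5):
    """K local gradient steps on the client's least-squares risk."""
    x = x_global.copy()
    for _ in range(K):
        x -= lr * (A.T @ (A @ x - b) / len(b))
    return x

def fedavg_round(x_global, clients, rng, cohort=2):
    """Sample a cohort uniformly and average its local iterates."""
    idx = rng.choice(len(clients), size=cohort, replace=False)
    return np.mean([local_sgd(x_global, *clients[i]) for i in idx], axis=0)

def global_objective(x, clients):
    return np.mean([0.5 * np.mean((A @ x - b) ** 2) for A, b in clients])

# Synthetic clients sharing a common signal x_true plus noise.
rng = np.random.default_rng(1)
x_true = np.ones(3)
clients = []
for _ in range(4):
    A = rng.normal(size=(20, 3))
    clients.append((A, A @ x_true + 0.1 * rng.normal(size=20)))

x = np.zeros(3)
f0 = global_objective(x, clients)
for _ in range(50):
    x = fedavg_round(x, clients, rng)
```

On this near-IID toy problem the averaged iterate approaches x_true; with strongly heterogeneous clients the same loop exhibits the drift and error floor discussed above.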

FedProx introduces a local proximal term:

w_k^{t+1} \approx \arg\min_w \{ F_k(w) + \frac{\mu}{2}\|w-w^t\|^2 \},

allowing for variable inexactness while explicitly controlling local updates (Li et al., 2018). Theorem 1 shows that, under L-smoothness, a Hessian lower bound L_-, and bounded dissimilarity B, one obtains per-round functional descent

\mathbb{E}[f(w^{t+1})] \le f(w^t) - \rho\|\nabla f(w^t)\|^2,

with rate parameter \rho determined jointly by \mu, \gamma (inexactness), B, and K (cohort size). Stationarity bounds and, in the convex case, O(1/t) convergence to optimality hold as long as local errors and dissimilarity are not too large. The proximal anchor is critical in preventing divergence in regimes where FedAvg fails (Li et al., 2018, Yuan et al., 2022).
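
The inexact local proximal solve can be sketched as plain gradient descent on the regularized objective (an illustration only; a quadratic F_k and these hyperparameters are assumptions, not the method's prescribed solver):

```python
# FedProx local update (sketch): approximately solve
#   argmin_w F_k(w) + (mu/2) * ||w - w_t||^2
# by gradient descent; the mu-term anchors w to the current global model w_t,
# which is what prevents local drift under heterogeneous data.
import numpy as np

def fedprox_local(w_t, A, b, mu=0.1, lr=0.05, steps=20):
    w = w_t.copy()
    for _ in range(steps):
        grad_fk = A.T @ (A @ w - b) / len(b)  # gradient of the local risk F_k
        w -= lr * (grad_fk + mu * (w - w_t))  # proximal pull toward w_t
    return w
```

Stopping after a fixed number of steps is exactly the "variable inexactness" the analysis permits: the solution only needs to be a \gamma-inexact minimizer.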

Asynchronous Optimization: FedAsync

FedAsync allows clients to operate without strict synchronization. Each device i pulls a possibly stale global model x_t, runs SGD on a proximally regularized local objective g_{x_t}(x;z), and sends its update back to the server. The server applies the update with a staleness-corrected weight \alpha_t = \alpha\, s(t-\tau) (Xie et al., 2019). Under L-smoothness, weak convexity, bounded staleness K, and suitable parameters, the minimal expected gradient norm achieves near-linear reduction in the number of global updates:

\min_{0 \le t < T} \mathbb{E}[\|\nabla F(x_t)\|^2] \le \frac{F(x_0) - F(x_T)}{\alpha\gamma\epsilon T H_{\min}} + \text{(staleness and variance terms)},

with explicit dependence on local work imbalance and staleness. Empirically, FedAsync converges quickly even under moderate delays, and adaptive mixing weights greatly stabilize updates in the presence of staleness (Xie et al., 2019).
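
The server-side mixing rule is simple to sketch. The polynomial damping s(\delta) = (1+\delta)^{-a} used below is one common choice of staleness function; the exponent a and mixing rate \alpha are illustrative values:

```python
# FedAsync server update (sketch): a client update trained from the round-tau
# model arrives at round t; its weight is alpha_t = alpha * s(t - tau), so
# staler updates move the global model less.
import numpy as np

def staleness_weight(alpha, t, tau, a=0.5):
    """alpha_t = alpha * s(t - tau) with polynomial damping s(d) = (1+d)^(-a)."""
    return alpha * (1.0 + (t - tau)) ** (-a)

def server_update(x_t, x_client, alpha, t, tau):
    """Mix the (possibly stale) client model into the global model."""
    alpha_t = staleness_weight(alpha, t, tau)
    return (1.0 - alpha_t) * x_t + alpha_t * x_client
```

A fresh update (t = tau) receives the full weight \alpha, while an update delayed by five rounds is damped substantially, which is the stabilization effect noted above.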

Proximal and Stability-Based Extensions

Recent work extends the FedProx analysis via algorithmic stability, weakening or removing local gradient dissimilarity assumptions and covering non-smooth, weakly convex settings and minibatching (Yuan et al., 2022). For example, the minimax bound

\mathbb{E}\|\nabla \bar{R}(w_{t^*})\|^2 \lesssim (L\Delta_{\text{erm}} + G^2) \max\{ T^{-2/3}, (TI)^{-1/2} \},

where I is the participation or minibatch size, holds for L-smooth, G-Lipschitz losses without any (B,0)-local gradient dissimilarity (LGD) control. \epsilon-stationarity is reached in O(I^{-1}\epsilon^{-2}) rounds, providing linear speedup in the number of clients/minibatch size.

3. Partial Participation, Client Selection, and Participation Bias

In large-scale federated networks, full participation is impractical; only a subset participates each round.

Unbiased selection (FedAvg-style) corresponds to sampling each client with probability proportional to its data size p_k. If selection is unbiased, the expected aggregate gradient remains unbiased and there is no participation-induced asymptotic bias (Cho et al., 2020).

Biased selection (e.g., “power-of-choice” strategies that select clients with higher local loss) accelerates practical convergence but introduces a solution bias quantified by

\mathrm{Bias} = \frac{8L}{3\mu}(\bar\rho - 1),

where \bar\rho measures the selection skew. The main result for strongly convex F states that the optimization error decays as O(1/T) up to this bias floor, which is absent in the unbiased case (Cho et al., 2020).

Selection strategies thus entail a speed/bias trade-off: unbiased selection guarantees a vanishing error, while power-of-choice offers faster empirical reduction at the cost of a nonzero error floor. Empirical evidence suggests moderate bias yields up to 3× acceleration with only a small accuracy degradation.
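
A power-of-choice round can be sketched as follows (an illustration of the selection rule described above, with hypothetical names and candidate/cohort sizes):

```python
# Power-of-choice selection (sketch): draw a candidate set of d clients with
# probability proportional to data size, then keep the m candidates with the
# highest current local loss -- the biased strategy discussed above.
import numpy as np

def power_of_choice(local_losses, data_sizes, m=2, d=4, rng=None):
    rng = rng or np.random.default_rng()
    p = np.asarray(data_sizes, dtype=float)
    p /= p.sum()
    candidates = rng.choice(len(local_losses), size=d, replace=False, p=p)
    # Rank candidates by local loss, descending, and keep the top m.
    ranked = sorted(candidates, key=lambda k: local_losses[k], reverse=True)
    return ranked[:m]
```

Setting d equal to the number of clients recovers pure greedy (maximum-bias) selection, while d = m recovers unbiased size-proportional sampling, so d interpolates along the speed/bias trade-off.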

4. Acceleration, Adaptivity, and Hybrid Methods

Accelerated Momentum and Operator Splitting

Classical momentum and operator-splitting frameworks admit rigorous convergence analysis in specific regimes:

  • FedAvg+MaSS/Nesterov achieves exponential convergence under interpolation conditions (full participation and zero empirical risk): contraction of the form

\mathbb{E}\|w_t - w^*\|^2 \le C(1-\alpha)^t,

with \alpha \approx 1/\sqrt{\kappa_m \tilde{\kappa}_m} reflecting local and statistical condition numbers (Qu et al., 2020).

  • FedSplit (operator splitting) ensures exactness (fixed points are optimal) under consensus constraints, and achieves linear contraction with rate 1 - 2/(\sqrt{\kappa}+1) for strongly convex, smooth objectives (Pathak et al., 2020). Even with inexact local prox solves, the residual is controlled and rapid convergence is retained.
  • FedHybrid allows each client to independently select gradient-type or Newton-type updates within a primal-dual algorithm, guaranteeing Q-linear convergence uniformly over such local heterogeneity (Niu et al., 2021). The contraction factor is determined by the global strong convexity and the minimal local step sizes.

Adaptive Federated Optimization

Adaptive federated methods such as FedAdagrad, FedAdam, and FedYogi control the heterogeneity-induced drift inherent to unbalanced client participation and non-IID data. Standard FedAvg has a heterogeneity penalty scaling as O(\zeta^2 K) (where K is the number of local steps per round), while FedAdagrad reduces this to O(\zeta^2 \sqrt{K/T}), and FedAdam/FedYogi eliminate the K-dependence altogether (or incur merely \log T drift for Yogi). Adaptive optimizers therefore permit larger K (i.e., less frequent communication) without loss in convergence rate in heterogeneous regimes (Reddi et al., 2020).
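
The common structure of these methods is a server-side adaptive step applied to the averaged client delta, treated as a pseudo-gradient. The sketch below follows that pattern; the hyperparameter values are illustrative defaults, not the tuned values from the cited paper:

```python
# Server-side adaptive update (sketch): given the averaged client delta
# (mean of client_model - server_model, a pseudo-gradient in the descent
# direction), apply an Adagrad/Adam/Yogi-style step. Only the second-moment
# update v differs across the three variants.
import numpy as np

def fedopt_step(x, delta, m, v, eta=0.1, b1=0.9, b2=0.99, eps=1e-3, rule="adam"):
    m = b1 * m + (1 - b1) * delta                      # first moment
    if rule == "adagrad":
        v = v + delta ** 2                             # accumulate
    elif rule == "adam":
        v = b2 * v + (1 - b2) * delta ** 2             # exponential average
    elif rule == "yogi":
        v = v - (1 - b2) * np.sign(v - delta ** 2) * delta ** 2  # sign-controlled
    x = x + eta * m / (np.sqrt(v) + eps)               # adaptive server step
    return x, m, v
```

Because delta already points in the descent direction, the server adds (rather than subtracts) the preconditioned momentum term; the \sqrt{v} denominator is what damps the heterogeneity-induced drift discussed above.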

5. Specialized Regimes: Asynchrony, Resource Constraints, Dropout, and Robustness

  • Asynchronous optimization demonstrates that staleness (with bounded K) slows convergence, but not catastrophically: FedAsync matches single-threaded SGD in the small-delay regime and is typically superior to synchronous FedAvg when delays are moderate (Xie et al., 2019).
  • Latency- and resource-aware variants, such as FedDrop or SemiFL, account for the communication/computation trade-off by dynamically adjusting per-device dropout rates, bandwidth allocation, or by hybridizing FL with centralized learning modes. Dropout inflates gradient variance by a factor of \gamma/(1-\gamma) for dropout rate \gamma, resulting in a slower O(1/\sqrt{T}) rate, with a clear latency-convergence balancing principle (Xie et al., 2024, Zheng et al., 2023).
  • Robust and adversarial settings (e.g., federated adversarial learning) provide convergence guarantees even for non-convex minimax problems with non-IID data, provided suitable over-parameterization and careful coupling of local and global gradients (Li et al., 2022, Sharma et al., 2022). In these settings, local update bias and global stationarity require more advanced arguments, yet several works now achieve order-optimal sample complexity and linear speedup in the number of participants.
  • Cyclic and hierarchical participation models show that, under certain grouping and selection schemes (cyclic availability), one can outperform classic FedAvg's O(1/T) rate, reaching O(1/T^2) rates as in incremental gradient methods (Cho et al., 2023). Hierarchical federated optimization introduces additional consensus errors, tightly characterized in layered split models, and motivates joint communication-system optimization (Lin et al., 2024).

6. Methodological Tools and Convergence Proof Techniques

Modern convergence analyses for federated optimization employ a range of methodological tools:

  • Descent lemmas based on L-smoothness and (weak) convexity, coupled with telescoping sums over rounds.
  • Stability-based arguments to remove reliance on unrealistic data homogeneity (e.g., algorithmic stability controlling generalization error for local-prox solutions) (Yuan et al., 2022).
  • Operator-splitting and primal-dual theory to guarantee exactness and fast rates even with heterogeneous deterministic or stochastic updates (Pathak et al., 2020, Niu et al., 2021).
  • Lyapunov and potential functions coupling primal and dual variables, as in hybrid primal-dual frameworks, yielding system-wide Q-linear contraction.

Proofs frequently require explicit handling of:

  • Gradient drift from local updates on statistically distinct data,
  • Staleness from asynchrony or random/biased/partial participation,
  • Inexactness due to adaptive methods or resource-constrained local solvers,
  • Bias floor induced by non-uniform sampling or selection strategies.

7. Empirical Corroboration and Practical Implications

Extensive empirical evaluations reinforce the theoretical findings:

  • FedAsync: Outperforms FedAvg in low-delay settings and is robust up to moderate staleness (Xie et al., 2019).
  • FedProx: Stabilizes training and prevents divergence under severe data and system heterogeneity, unlike FedAvg (Li et al., 2018).
  • Bias-aware selection: Power-of-choice methods empirically deliver up to 3× faster convergence, with a bias floor aligned with theoretical predictions (Cho et al., 2020).
  • Proximal and adaptive methods: Minibatch and participation-size speedups are confirmed across multiple datasets (Yuan et al., 2022, Reddi et al., 2020).
  • Dropout and resource allocation: Joint optimization of dropout and communication parameters yields practical minimization of total training time subject to system constraints (Xie et al., 2024, Zheng et al., 2023).

In summary, convergence analysis of federated optimization identifies the precise interplay between algorithmic design, heterogeneity, participation, and resource constraints. Theoretical results now rigorously characterize the trade-offs and attainable rates across nearly all popular paradigms, empowering practitioners to match system designs to statistical and computational realities.
