
FedAvg Convergence on Non-IID Data

Updated 4 February 2026
  • The paper shows that non-IID data induces client drift, penalizing per-round descent and establishing an asymptotic bias floor in FedAvg.
  • It explores algorithmic modifications—like dynamic weighting, momentum, and selective client participation—that counteract heterogeneity-induced convergence slowdowns.
  • Empirical evidence indicates that strategies such as overparameterized models and balanced local epochs can nearly match centralized training performance under non-IID conditions.

Federated Averaging (FedAvg) is the canonical optimization protocol for cooperative model training under data locality constraints. The central challenge in federated settings is that client data distributions are typically non-IID—statistically heterogeneous—which introduces client drift and impedes global convergence. The convergence of FedAvg in such regimes has been the focus of extensive theoretical and empirical research, revealing inherent limitations, algorithmic modifications to counter heterogeneity-induced drift, and asymptotic regimes where FedAvg can nearly match centralized training.

1. Formal Problem Setting, Assumptions, and Source of Non-IID Effects

Federated optimization targets the minimization of a global objective of the general form F(w) = \sum_{k=1}^N p_k F_k(w), with \sum_{k=1}^N p_k = 1, where each F_k(w) encodes the local empirical or population risk on client k, and p_k is a weighting proportional to local data volume. Non-IIDness refers to the case where the distributions underlying the F_k are non-identical, causing disparate gradient signals across clients.
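As a concrete toy instance of this objective, the sketch below builds F as a data-volume-weighted sum of per-client quadratic risks; the quadratic form, centers, and data sizes are illustrative assumptions, not from the source.

```python
import numpy as np

# Toy instance of the global objective F(w) = sum_k p_k F_k(w).
# Each client k holds a quadratic stand-in risk F_k(w) = 0.5 * (w - c_k)^2;
# the centers and sample counts below are hypothetical.
centers = np.array([-2.0, 0.5, 3.0])      # per-client optima c_k
data_sizes = np.array([100, 300, 600])    # local sample counts n_k

p = data_sizes / data_sizes.sum()         # weights p_k proportional to data volume

def F_k(w, k):
    """Local risk on client k."""
    return 0.5 * (w - centers[k]) ** 2

def F(w):
    """Global objective: the p-weighted average of local risks."""
    return sum(p[k] * F_k(w, k) for k in range(len(p)))

# For weighted quadratics with equal curvature, the global minimizer
# is the p-weighted mean of the c_k.
w_star = p @ centers
```

Heterogeneity here means the c_k differ: each client's local minimizer disagrees with the global one, which is exactly what produces divergent gradient signals.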

Standard theoretical frameworks for analyzing FedAvg convergence in this regime assume:

  • L-smoothness: Each F_k (and hence F) satisfies \|\nabla F_k(w) - \nabla F_k(w')\| \leq L\|w - w'\|.
  • Strong convexity (\mu): F is often assumed to be \mu-strongly convex to enable O(1/T) convergence guarantees, but results also exist for general convex and nonconvex F.
  • Stochastic gradient noise: \mathbb{E}[g_k^t] = \nabla F_k(w_t), \quad \mathbb{E}\|g_k^t - \nabla F_k(w_t)\|^2 \leq \sigma_k^2.

A central heterogeneity parameter is the gradient dissimilarity, or client-drift, term \Gamma(t) := \sum_{k=1}^N p_k \|\nabla F_k(w_t)\|^2 - \|\nabla F(w_t)\|^2, which vanishes in the IID limit and dominates the convergence penalty as heterogeneity increases (Muhebwa et al., 26 May 2025, Li et al., 2019).
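The drift term \Gamma(t) can be computed directly from per-client gradients; a minimal sketch (the function name and array shapes are my own):

```python
import numpy as np

def gradient_dissimilarity(grads, p):
    """Client-drift term Gamma = sum_k p_k ||grad_k||^2 - ||sum_k p_k grad_k||^2.

    grads: (N, d) array of per-client gradients at the current iterate.
    p:     (N,) aggregation weights summing to 1.
    By Jensen's inequality the result is nonnegative, and it vanishes
    exactly when all client gradients coincide (the IID limit).
    """
    grads = np.asarray(grads, dtype=float)
    p = np.asarray(p, dtype=float)
    global_grad = p @ grads                    # nabla F(w) = sum_k p_k nabla F_k(w)
    per_client = p @ (grads ** 2).sum(axis=1)  # sum_k p_k ||nabla F_k(w)||^2
    return per_client - (global_grad ** 2).sum()
```

Identical gradients give Gamma = 0; disagreeing gradients give Gamma > 0, and the gap grows with the spread of the local gradient directions.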

2. Convergence Analysis of Standard FedAvg on Heterogeneous Data

The basic FedAvg iteration for round t is w_{t+1} = w_t - \eta_t \sum_{k=1}^N p_k g_k^t, where each g_k^t is the stochastic update accumulated over E local SGD steps on client k (see Section 3). The resulting descent per round is penalized by \Gamma(t), which can dominate under severe heterogeneity.
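A minimal runnable sketch of one FedAvg round with E local SGD steps per client (the quadratic loss, client data, and helper names are illustrative assumptions):

```python
import numpy as np

def local_sgd(w, grad_fn, data, eta, E, rng):
    """Run E local SGD steps on one client, starting from the global model w."""
    w = w.copy()
    for _ in range(E):
        x = data[rng.integers(len(data))]   # sample one local example
        w -= eta * grad_fn(w, x)
    return w

def fedavg_round(w_global, clients, grad_fn, p, eta, E, rng):
    """One FedAvg round: broadcast w, run local SGD, average with weights p_k."""
    locals_ = [local_sgd(w_global, grad_fn, data, eta, E, rng) for data in clients]
    return sum(pk * wk for pk, wk in zip(p, locals_))

# Toy problem: loss 0.5 * (w - x)^2, so grad_fn(w, x) = w - x (hypothetical).
rng = np.random.default_rng(0)
clients = [np.full(50, -1.0), np.full(50, 2.0)]   # two heterogeneous clients
p, eta, E = [0.5, 0.5], 0.1, 5

w = np.zeros(1)
for _ in range(100):
    w = fedavg_round(w, clients, lambda w, x: w - x, p, eta, E, rng)
# With equal local curvatures, the averaged iterates settle at the
# weighted mean of the client optima (0.5 here).
```

Averaging the locally trained models is equivalent to the update above when g_k^t denotes client k's accumulated local gradient.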

  • For fixed \eta > 0 and E > 1, the algorithm admits a bias floor: FedAvg provably does not converge to the exact minimizer, with excess error lower bounded by \Omega(\eta(E-1)).
  • To reach error \epsilon, the number of communication rounds scales as O(1/\epsilon) under only smoothness, even with decaying stepsizes (Zhang et al., 2020).
  • If neither gradient norms nor dissimilarity are controlled, FedAvg can diverge even on simple nonconvex or misaligned loss functions.

Heterogeneity-Driven Error Floor:

Several works formalize that even in the limit T \to \infty, the limiting error admits a residual O(\sum_k p_k \delta_k^2), where \delta_k = \sup_w \|\nabla F_k(w) - \nabla F(w)\| measures gradient divergence (Casella et al., 2023, Li et al., 2019, Wu et al., 2020).
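This error floor can be reproduced on a two-client toy problem (all constants below are hypothetical): with a fixed step size and E > 1 local steps, FedAvg's fixed point sits strictly away from the true minimizer when local curvatures differ.

```python
import numpy as np

# Two clients with quadratic losses F_k(w) = 0.5 * a_k * (w - c_k)^2 and
# differing curvatures a_k. With a *fixed* step size and E > 1 local steps,
# FedAvg's fixed point is biased away from the global minimizer w*.
a = np.array([1.0, 4.0])      # local curvatures
c = np.array([0.0, 1.0])      # local optima
p = np.array([0.5, 0.5])      # aggregation weights
eta, E, rounds = 0.1, 5, 200

w_star = (p * a * c).sum() / (p * a).sum()   # exact global minimizer (= 0.8)

w = 0.0
for _ in range(rounds):
    locals_ = []
    for k in range(2):
        wk = w
        for _ in range(E):                   # E full-batch local GD steps
            wk -= eta * a[k] * (wk - c[k])
        locals_.append(wk)
    w = float(p @ np.array(locals_))

bias = abs(w - w_star)                       # strictly positive: the error floor
```

With E = 1 the same loop converges to w* exactly; the bias appears only with multiple local steps, consistent with the \Omega(\eta(E-1)) lower bound above.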

Practical Implications:

  • Communication vs. computation trade-off: too large E amplifies drift; too small E increases required communication (Li et al., 2019, Casella et al., 2023).
  • Oscillation and instability can occur when E is large under severe non-IIDness.

3. Empirical Characterization and the Role of Statistical Heterogeneity

FedAvg’s empirical convergence is sensitive to the type and degree of non-IID data partitioning. Several benchmark studies (Casella et al., 2023, Wu et al., 2020) report:

  • Mild heterogeneity (e.g., Dirichlet partitions with moderate label skew): only moderate slowdowns in convergence; increasing the number of local epochs E yields a substantial reduction in the communication rounds needed to achieve a target accuracy.
  • Severe heterogeneity (e.g., single class per client, pathological label splits): substantial slowdowns and lower asymptotic accuracy; excessive E can cause oscillatory or divergent behavior, exacerbated drift, or outright failure to reach target performance within practical timeframes.

Experimental data (below, excerpted from (Wu et al., 2020) and (Casella et al., 2023)) illustrate this phenomenon:

| Setting                              | MNIST (95% target) | FashionMNIST (80% target) |
|--------------------------------------|--------------------|---------------------------|
| 5 IID + 5 one-class non-IID (FedAvg) | 133                | 222                       |
| 6 IID + 4 one-class non-IID (FedAvg) | 99                 | 167                       |

Rounds to reach the accuracy threshold under FedAvg; increasing non-IIDness increases the number of rounds required.

4. Algorithmic Modifications for Improving Convergence in Non-IID Settings

Numerous approaches have been developed to mitigate FedAvg’s heterogeneity-induced slowdowns:

Kuramoto-FedAvg (phase-based dynamic aggregation):

  • Aggregation weights are assigned dynamically according to client phase alignment:

\rho_k^t = \frac{\sin(\bar\theta^t - \theta_k^t)}{\sum_{j=1}^N \sin(\bar\theta^t - \theta_j^t)}

where \theta_k^t is the angle between client k's update and the mean update direction.

  • Kuramoto-FedAvg provably shrinks the drift penalty to \Gamma_{\text{Kur}}(t) < \Gamma(t), tightening the standard FedAvg per-round descent and reducing required communication rounds.
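The alignment angle \theta_k can be computed directly from raw update vectors; a minimal sketch (the function name and eps guard are my own):

```python
import numpy as np

def update_angles(updates, eps=1e-12):
    """theta_k: angle (radians) between each client update and the mean update.

    updates: (N, d) array of client update vectors for the current round.
    This is the alignment statistic underlying angle-based aggregation rules;
    eps guards against division by zero for degenerate updates.
    """
    mean_dir = updates.mean(axis=0)
    cos = updates @ mean_dir / (
        np.linalg.norm(updates, axis=1) * np.linalg.norm(mean_dir) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Small angles indicate clients pulling in the consensus direction; large angles flag drifting clients whose weight an alignment-aware rule would reduce.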
FedAdp-style angle-adaptive weighting:

  • Aggregation weights \widetilde\psi_i(t) are modulated per client based on the (smoothed) angle between local and global gradients:

\widetilde\psi_i(t) \propto D_i \exp(f(\widetilde\theta_i(t)))

where D_i is a data-volume weight for client i and f is a decreasing, Gompertz-like nonlinear mapping of client alignment.

  • The per-round decrease is provably improved: \sum_i \rho_i(t)\,\widetilde\psi_i(t) \ge \sum_i \rho_i(t)\,\psi_i.
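A sketch of the angle-adaptive weight computation; the specific Gompertz-style map and the sharpness constant alpha = 5 follow a common FedAdp formulation but should be treated as assumptions here:

```python
import numpy as np

def adaptive_weights(theta, data_sizes, alpha=5.0):
    """Angle-based adaptive aggregation weights (FedAdp-style sketch).

    theta:      (N,) smoothed angles between local and global gradients.
    data_sizes: (N,) local data volumes D_i.
    f is a decreasing Gompertz-like map, so well-aligned clients
    (small angle) receive exponentially larger weight.
    """
    f = alpha * (1.0 - np.exp(-np.exp(-alpha * (theta - 1.0))))
    w = data_sizes * np.exp(f)
    return w / w.sum()
```

The exponential in the softmax-like normalization makes the weighting sharply selective: a misaligned client is down-weighted rather than excluded outright.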
FedPNS-style probabilistic node selection:

  • Adverse clients (identified via a negative inner product between their local update and the global gradient) are demoted via reduced sampling probability in subsequent rounds.
  • This reduces weight divergence and improves convergence bounds over plain FedAvg.
Communication-compressed FedAvg:

  • Applies SNR-constrained compression with error feedback, yet still achieves an O(1/\sqrt{mKT} + 1/T) convergence rate.
  • Data heterogeneity enters via a variance-like term \sigma_g^2, but is controlled exactly as in uncompressed FedAvg.
Momentum-augmented FedAvg (Cheng et al., 2023):

  • Adding heavy-ball momentum stabilizes local trajectories and obviates the need for explicit heterogeneity bounds.
  • Achieves optimal O(1/\sqrt{NKR}) rates with a constant stepsize, even under extreme drift.
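One common way to incorporate momentum is at the server, treating the averaged client delta as a pseudo-gradient; the sketch below shows this variant, whose defaults and placement (server-side rather than local) are my assumptions, not necessarily the cited scheme:

```python
import numpy as np

def fedavg_momentum_round(w, m, client_updates, p, eta=1.0, beta=0.9):
    """One round of FedAvg with heavy-ball momentum at the server.

    client_updates: (N, d) array of model deltas (w_k - w) returned by clients.
    The weighted-average delta acts as a pseudo-gradient; the accumulator m
    smooths it across rounds, damping round-to-round drift oscillations.
    """
    avg_delta = p @ client_updates      # weighted average of client deltas
    m = beta * m + avg_delta            # heavy-ball accumulator
    w = w + eta * m
    return w, m
```

Because m averages pseudo-gradients over many rounds, transient disagreement between clients is smoothed out instead of being applied directly to the global model.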

5. Asymptotic Regimes: Overparameterization and the Vanishing Effect of Heterogeneity

Recent theory shows that in the overparameterized regime (very wide neural networks), the effect of data heterogeneity on FedAvg convergence—and generalization—diminishes polynomially in the width (Jian et al., 18 Aug 2025):

  • With fully-connected or convolutional networks of width n \to \infty (the infinite-width neural tangent kernel regime), the model-divergence term \zeta = O(n^{-1/2}) quantifies per-round drift.
  • In the limit, both local and global models linearize, training tracks centralized GD exactly, and FedAvg generalizes identically to pooled training for matched total update steps.
  • Empirically, the performance gap between IID and extreme non-IID settings vanishes as n increases, confirmed on MNIST/CIFAR-10 across various architectures.

6. Fundamental Limits and Failure Modes

FedAvg's convergence on non-IID data is fundamentally limited by:

  • Uncontrolled drift: Without either bounded gradient norms or decaying stepsize, iterates may diverge, even for trivial non-IID constructions (Zhang et al., 2020).
  • Irreducible bias: The heterogeneity-driven term induces an asymptotic accuracy floor unless additional structure (e.g. strong convexity, bounded drift, overparameterization) is present.
  • Lower bounds: No protocol following the Computation-Then-Aggregation (CTA) template can break the O(1/\epsilon) communication barrier without extra problem structure (Zhang et al., 2020).

Mitigation: Adaptive weighting, node selection, regularization (e.g. FedProx/FedCurv), momentum, or operating in the overparameterized limit can alleviate, but not entirely eliminate, these effects unless strong assumptions are met.

7. Practical Guidelines for Federated Optimization under Heterogeneity

  • Step-size scheduling: Use decaying stepsizes whenever drift or heterogeneity cannot be tightly bounded. Fixed stepsizes can be salvaged with momentum (Cheng et al., 2023).
  • Local epochs (E): Moderate values (E \approx 5–10) balance communication and drift penalties. Excessive E can amplify divergence in severe non-IID settings (Li et al., 2019, Casella et al., 2023).
  • Client selection and aggregation weighting: Favor strategies that dynamically suppress misaligned or adverse updates (Kuramoto-FedAvg, FedPNS, FedAdp).
  • Overparameterization: Widening networks quantitatively mitigates data heterogeneity penalties, recovering centralized performance (Jian et al., 18 Aug 2025).
  • Algorithmic stability: In settings with uncontrolled drift or high label skew, consider algorithmic extensions such as FedProx, SCAFFOLD, or primal-dual schemes.
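For instance, the FedProx local update adds a proximal pull toward the last global model; a minimal sketch (the step size and mu below are hypothetical defaults):

```python
import numpy as np

def fedprox_local_step(w_local, w_global, grad, eta=0.1, mu=0.01):
    """One local SGD step with the FedProx proximal term (mu/2)||w - w_global||^2.

    grad is the stochastic gradient of the local loss at w_local. The added
    term mu * (w_local - w_global) pulls local iterates back toward the last
    global model, directly limiting client drift over the E local steps.
    """
    return w_local - eta * (grad + mu * (w_local - w_global))
```

Setting mu = 0 recovers plain FedAvg's local step; larger mu trades local progress for tighter drift control.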

In summary, while FedAvg admits O(1/T) or O(1/\sqrt{mKT}) rates under smoothness and convexity assumptions, its convergence in non-IID regimes is systematically hampered by heterogeneity-induced drift. Recent innovations—phase- or angle-based dynamic aggregation, momentum, and overparameterized model scaling—demonstrate both theoretically and empirically that the drift penalty can be sharply reduced, but not entirely eliminated except in asymptotic regimes (Muhebwa et al., 26 May 2025, Wu et al., 2020, Jian et al., 18 Aug 2025, Cheng et al., 2023).
