FedAvg Convergence on Non-IID Data
- The paper shows that non-IID data induces client drift, penalizing per-round descent and establishing an asymptotic bias floor in FedAvg.
- It explores algorithmic modifications—like dynamic weighting, momentum, and selective client participation—that counteract heterogeneity-induced convergence slowdowns.
- Empirical evidence indicates that strategies such as overparameterized models and balanced local epochs can nearly match centralized training performance under non-IID conditions.
Federated Averaging (FedAvg) is the canonical optimization protocol for cooperative model training under data locality constraints. The central challenge in federated settings is that client data distributions are typically non-IID—statistically heterogeneous—which introduces client drift and impedes global convergence. The convergence of FedAvg in such regimes has been the focus of extensive theoretical and empirical research, revealing inherent limitations, algorithmic modifications to counter heterogeneity-induced drift, and asymptotic regimes where FedAvg can nearly match centralized training.
1. Formal Problem Setting, Assumptions, and Source of Non-IID Effects
Federated optimization targets the minimization of a global objective of the general form $\min_x F(x) = \sum_{k=1}^N p_k F_k(x)$, where each $F_k$ encodes the local empirical or population risk on client $k$, and $p_k \ge 0$ (with $\sum_k p_k = 1$) is a weighting proportional to local data volume. Non-IIDness refers to the case where the distributions underlying the $F_k$ are non-identical, causing disparate gradient signals across clients.
Standard theoretical frameworks for analyzing FedAvg convergence in this regime assume:
- $L$-smoothness: Each $F_k$, and hence $F$, satisfies $\|\nabla F_k(x) - \nabla F_k(y)\| \le L\|x - y\|$ for all $x, y$.
- Strong convexity ($\mu > 0$): $F$ is often assumed to be $\mu$-strongly convex to enable convergence guarantees, but results also exist for general convex and nonconvex $F$.
- Stochastic gradient noise: local stochastic gradients $g_k$ are unbiased, $\mathbb{E}[g_k(x)] = \nabla F_k(x)$, with bounded variance $\mathbb{E}\|g_k(x) - \nabla F_k(x)\|^2 \le \sigma^2$.
A central heterogeneity parameter is the gradient dissimilarity or client drift term $\zeta^2 = \sum_k p_k \|\nabla F_k(x) - \nabla F(x)\|^2$, which vanishes in the IID limit and dominates the convergence penalty as heterogeneity increases (Muhebwa et al., 26 May 2025, Li et al., 2019).
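The drift term can be evaluated numerically from per-client gradients; a minimal sketch (the uniform-weight default is an assumption for illustration):

```python
import numpy as np

def gradient_dissimilarity(client_grads, weights=None):
    """Weighted average squared deviation of client gradients from the
    global (weighted-mean) gradient: vanishes in the IID limit."""
    g = np.asarray(client_grads, dtype=float)
    n = g.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    g_bar = w @ g                                # global gradient
    return float(w @ np.sum((g - g_bar) ** 2, axis=1))

# Identical client gradients (IID limit) -> zero dissimilarity.
print(gradient_dissimilarity([[1.0, 2.0], [1.0, 2.0]]))  # 0.0
# Opposing gradients -> nonzero drift term.
print(gradient_dissimilarity([[1.0, 0.0], [-1.0, 0.0]]))
```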
2. Convergence Analysis of Standard FedAvg on Heterogeneous Data
The basic FedAvg iteration for round $t$: each participating client $k$ initializes $x_k^{(t,0)} = x_t$, performs $E$ local SGD steps $x_k^{(t,j+1)} = x_k^{(t,j)} - \eta_t\, g_k(x_k^{(t,j)})$, and the server aggregates $x_{t+1} = \sum_k p_k\, x_k^{(t,E)}$.
Main Recurrence (per-round expected descent) (Muhebwa et al., 26 May 2025, Li et al., 2019): up to higher-order terms,
$\mathbb{E}[F(x_{t+1})] \le F(x_t) - \frac{\eta_t}{2}\|\nabla F(x_t)\|^2 + \mathcal{O}(\eta_t^2 E^2 \zeta^2)$.
Thus, the descent per round is penalized by the drift term $\mathcal{O}(\eta_t^2 E^2 \zeta^2)$, which can dominate under severe heterogeneity.
Strongly convex case, decaying stepsize (Li et al., 2019): with $\eta_t = \Theta(1/t)$,
$\mathbb{E}[F(x_T)] - F^* = \mathcal{O}\!\left(\frac{E^2 \zeta^2 + \sigma^2}{\mu T}\right)$,
where $E$ is the number of local SGD steps per round (see Section 3).
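As a concrete illustration of the iteration and of the decaying stepsize, here is a minimal FedAvg simulation on two heterogeneous quadratic client losses (the quadratic objectives and all hyperparameters are illustrative assumptions, not drawn from the cited analyses):

```python
import numpy as np

def fedavg(client_grad_fns, x0, rounds=50, local_steps=5, lr0=0.5, weights=None):
    """Minimal FedAvg: each round, every client runs `local_steps` of
    (full-)gradient descent from the current global model; the server
    averages the resulting local models. Uses a decaying stepsize
    eta_t = lr0 / (1 + t), as in the strongly convex analysis."""
    x = np.asarray(x0, dtype=float)
    n = len(client_grad_fns)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    for t in range(rounds):
        eta = lr0 / (1.0 + t)
        local_models = []
        for grad in client_grad_fns:
            xi = x.copy()
            for _ in range(local_steps):
                xi -= eta * grad(xi)
            local_models.append(xi)
        x = w @ np.asarray(local_models)         # weighted server aggregation
    return x

# Two heterogeneous quadratics with minima at 0 and 4;
# the global optimum of the averaged loss is x = 2.
clients = [lambda x: x - 0.0, lambda x: x - 4.0]
print(fedavg(clients, np.array([10.0]), rounds=200))  # close to 2
```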
Lower Bounds, Fixed Stepsize, and Communication Cost (Li et al., 2019, Zhang et al., 2020):
- For a fixed stepsize $\eta$ and $E > 1$, the algorithm admits a bias floor: FedAvg provably does not converge to the exact minimizer, with excess error lower bounded by $\Omega(\eta(E-1))$.
- To reach error $\epsilon$, the number of communication rounds scales as $\Omega(1/\epsilon)$ under only smoothness, even with decaying stepsizes (Zhang et al., 2020).
- If neither gradient norms nor dissimilarity are controlled, FedAvg can diverge even on simple nonconvex or misaligned loss functions.
Heterogeneity-Driven Error Floor:
Several works formalize that even in the limit $T \to \infty$, the limiting error retains a residual proportional to gradient-divergence measures such as $\zeta^2$ (Casella et al., 2023, Li et al., 2019, Wu et al., 2020).
Practical Implications:
- Communication vs. computation trade-off: too large an $E$ amplifies drift, while too small an $E$ increases the required communication (Li et al., 2019, Casella et al., 2023).
- Oscillation and instability can occur when $E$ is large under severe non-IIDness.
3. Empirical Characterization and the Role of Statistical Heterogeneity
FedAvg’s empirical convergence is sensitive to the type and degree of non-IID data partitioning. Several benchmark studies (Casella et al., 2023, Wu et al., 2020) report:
- Mild heterogeneity (e.g., Dirichlet partitioning, moderate label skew): only moderate slowdowns in convergence; increasing the number of local epochs $E$ yields a substantial reduction in the communication rounds needed to achieve a target accuracy.
- Severe heterogeneity (e.g., a single class per client, pathological label splits): substantial slowdowns and lower asymptotic accuracy; excessive $E$ can cause oscillatory or divergent behavior, exacerbated drift, or outright failure to reach target performance within practical timeframes.
Experimental data (below, excerpted from (Wu et al., 2020) and (Casella et al., 2023)) illustrate this phenomenon:
| Setting | MNIST (rounds to 95%) | FashionMNIST (rounds to 80%) |
|---|---|---|
| 5 IID + 5 1-class non-IID (FedAvg) | 133 | 222 |
| 6 IID + 4 1-class non-IID (FedAvg) | 99 | 167 |
Rounds required to reach the accuracy threshold under FedAvg; greater non-IIDness increases the number of rounds required.
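Benchmark partitions of this kind are commonly generated via Dirichlet label skew; a minimal sketch (the routine and its parameter names are illustrative, not taken from the cited studies):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients with Dirichlet(alpha) label
    skew. Small alpha -> severe non-IID (each client dominated by a few
    classes); large alpha -> near-IID shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Split this class's samples according to sampled Dirichlet proportions.
        props = rng.dirichlet(np.full(n_clients, alpha))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards

# 10 classes, 100 samples each, split over 5 clients with heavy skew.
parts = dirichlet_partition(np.repeat(np.arange(10), 100), n_clients=5, alpha=0.1)
print([len(p) for p in parts])  # uneven, label-skewed shard sizes
```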
4. Algorithmic Modifications for Improving Convergence in Non-IID Settings
Numerous approaches have been developed to mitigate FedAvg’s heterogeneity-induced slowdowns:
a. Synchronization-based weighting: Kuramoto-FedAvg (Muhebwa et al., 26 May 2025)
- Aggregation weights are assigned dynamically according to client phase alignment: the weight of client $k$ increases with the alignment $\cos\theta_k$, where $\theta_k$ is the angle between the client's update and the mean update direction.
- Kuramoto-FedAvg provably shrinks the drift penalty in the per-round descent recurrence, tightening the standard FedAvg bound and reducing the number of required communication rounds.
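The alignment-weighting idea can be sketched as follows; this is an illustrative softmax-over-cosine rule, not the exact Kuramoto-FedAvg update, and the `sharpness` temperature is an assumed parameter:

```python
import numpy as np

def alignment_weights(updates, sharpness=5.0):
    """Up-weight clients whose updates point along the mean update
    direction; suppress misaligned clients via a softmax over cosine
    alignment. Illustrative only, not the published Kuramoto rule."""
    u = np.asarray(updates, dtype=float)
    mean = u.mean(axis=0)
    cos = (u @ mean) / (np.linalg.norm(u, axis=1) * np.linalg.norm(mean) + 1e-12)
    w = np.exp(sharpness * cos)
    return w / w.sum()

# Two well-aligned clients and one pointing the opposite way.
w = alignment_weights([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
print(w)  # third weight is much smaller than the first two
```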
b. Angle-based adaptive weighting: FedAdp (Wu et al., 2020)
- Aggregation weights are modulated per client based on the (smoothed) angle $\tilde\theta_k$ between the local and global gradients: $w_k \propto e^{f(\tilde\theta_k)}$, where $f$ is a decreasing nonlinear (Gompertz-like) mapping of client misalignment.
- The per-round decrease of the global loss is provably improved relative to uniform FedAvg weighting.
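A sketch of the Gompertz-style mapping follows; the steepness constant and the exact functional form here are assumptions for illustration rather than the paper's calibrated settings:

```python
import numpy as np

def gompertz_weights(angles, alpha=5.0):
    """FedAdp-style sketch: map each client's (smoothed) angle theta_k
    between local and global gradients through a decreasing Gompertz-like
    curve, then normalize via softmax. `alpha` (steepness) is an assumed
    hyperparameter."""
    theta = np.asarray(angles, dtype=float)
    # Near alpha for small angles (well-aligned), decaying for large angles.
    f = alpha * (1.0 - np.exp(-np.exp(-alpha * (theta - 1.0))))
    w = np.exp(f)
    return w / w.sum()

# Angles in radians: a larger angle (worse alignment) gets a smaller weight.
w = gompertz_weights([0.1, 0.5, 1.4])
print(w)
```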
c. Selective client participation: Probabilistic Node Selection (FedPNS) (Wu et al., 2021)
- Adverse clients (identified via negative inner-product with global gradient) are demoted via reduced sampling probability in subsequent rounds.
- Reduces weight divergence and improves convergence bounds over FedAvg.
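The demotion mechanism can be sketched as a simple probability update; the `demote` factor and renormalization are assumptions for illustration, not FedPNS's exact rule:

```python
import numpy as np

def update_sampling_probs(probs, client_grads, global_grad, demote=0.5):
    """FedPNS-style sketch: clients whose gradient has a negative inner
    product with the global gradient are 'adverse' and have their sampling
    probability scaled down by `demote`; probabilities are renormalized."""
    p = np.asarray(probs, dtype=float).copy()
    g = np.asarray(client_grads, dtype=float)
    adverse = (g @ np.asarray(global_grad, dtype=float)) < 0.0
    p[adverse] *= demote
    return p / p.sum()

p = update_sampling_probs([0.25, 0.25, 0.25, 0.25],
                          [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 1.0]],
                          global_grad=[1.0, 0.0])
print(p)  # the third (adverse) client's probability is reduced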
d. Communication-efficient variants: CFedAvg (Yang et al., 2021)
- Applies SNR-constrained compression with error feedback, yet matches the convergence-rate order of uncompressed FedAvg.
- Data heterogeneity enters via a variance-like gradient-divergence term, but it is controlled just as in uncompressed FedAvg.
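The error-feedback mechanism underlying such compressed variants can be sketched generically; this uses a plain top-k compressor for illustration, not CFedAvg's exact SNR-constrained scheme:

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compress_with_error_feedback(update, memory, k):
    """Generic error-feedback step: compress (update + carried-over error)
    and remember what the compressor dropped for the next round, so the
    compression error does not accumulate."""
    corrected = update + memory
    sent = topk_compress(corrected, k)
    new_memory = corrected - sent        # residual fed back next round
    return sent, new_memory

mem = np.zeros(4)
sent, mem = compress_with_error_feedback(np.array([0.1, -2.0, 0.3, 1.5]), mem, k=2)
print(sent, mem)  # large entries sent now; small ones deferred via memory
```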
e. Momentum-augmented protocols: FedAvg-M (Cheng et al., 2023)
- Adding heavy-ball momentum stabilizes local trajectories and obviates the need for explicit heterogeneity bounds.
- Achieves optimal convergence rates with a constant stepsize, even under extreme drift.
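A minimal sketch of a momentum-augmented server step follows; the specific coefficients and the server-side placement of the momentum buffer are illustrative assumptions:

```python
import numpy as np

def fedavg_m_server_step(x, avg_update, velocity, lr=1.0, beta=0.9):
    """Heavy-ball momentum at the server (FedAvg-M-style sketch): the
    averaged client update is folded into a velocity buffer, smoothing
    round-to-round oscillations caused by client drift."""
    velocity = beta * velocity + avg_update
    x = x - lr * velocity
    return x, velocity

x, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(3):
    # Stand-in for the averaged client update (proportional to x here).
    x, v = fedavg_m_server_step(x, avg_update=0.1 * x, velocity=v)
print(x)
```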
5. Asymptotic Regimes: Overparameterization and the Vanishing Effect of Heterogeneity
Recent theory shows that in the overparameterized regime (very wide neural networks), the effect of data heterogeneity on FedAvg convergence and generalization diminishes polynomially in the network width $m$ (Jian et al., 18 Aug 2025):
- With fully-connected or convolutional networks of width $m$ (approaching the infinite-width neural tangent kernel regime), the divergence between local and global models quantifies per-round drift and decays polynomially in $m$.
- In the $m \to \infty$ limit, both local and global models linearize, training tracks centralized GD exactly, and FedAvg generalizes identically to pooled training for a matched number of total update steps.
- Empirically, the performance gap between IID and extreme non-IID settings vanishes as $m$ increases, confirmed across MNIST/CIFAR-10 and various architectures.
6. Fundamental Limits and Failure Modes
FedAvg's convergence on non-IID data is fundamentally limited by:
- Uncontrolled drift: Without either bounded gradient norms or decaying stepsize, iterates may diverge, even for trivial non-IID constructions (Zhang et al., 2020).
- Irreducible bias: The heterogeneity-driven term $\zeta^2$ induces an asymptotic accuracy floor unless additional structure (e.g., strong convexity, bounded drift, overparameterization) is present.
- Lower bounds: No protocol following the Computation-Then-Aggregation (CTA) template can break the $\Omega(1/\epsilon)$ communication barrier without extra problem structure (Zhang et al., 2020).
Mitigation: Adaptive weighting, node selection, regularization (e.g. FedProx/FedCurv), momentum, or operating in the overparameterized limit can alleviate, but not entirely eliminate, these effects unless strong assumptions are met.
7. Practical Guidelines for Federated Optimization under Heterogeneity
- Step-size scheduling: Use decaying stepsizes whenever drift or heterogeneity cannot be tightly bounded. Fixed stepsizes can be salvaged with momentum (Cheng et al., 2023).
- Local epochs ($E$): Moderate values balance communication costs against drift penalties. Excessive $E$ can amplify divergence in severe non-IID settings (Li et al., 2019, Casella et al., 2023).
- Client selection and aggregation weighting: Favor strategies that dynamically suppress misaligned or adverse updates (Kuramoto-FedAvg, FedPNS, FedAdp).
- Overparameterization: Widening networks quantitatively mitigates data heterogeneity penalties, recovering centralized performance (Jian et al., 18 Aug 2025).
- Algorithmic stability: In settings with uncontrolled drift or high label skew, consider algorithmic extensions such as FedProx, SCAFFOLD, or primal-dual schemes.
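The proximal idea behind extensions like FedProx can be sketched at the level of a single client's local solver; the proximal coefficient `mu` and the quadratic client loss are illustrative assumptions:

```python
import numpy as np

def fedprox_local_sgd(grad_fn, x_global, steps=10, lr=0.05, mu=0.1):
    """FedProx-style local solver sketch: add a proximal term
    (mu/2)||x - x_global||^2 to the client loss, so local steps are
    pulled back toward the global model and client drift is damped."""
    x_global = np.asarray(x_global, dtype=float)
    x = x_global.copy()
    for _ in range(steps):
        x -= lr * (grad_fn(x) + mu * (x - x_global))
    return x

# A client whose own minimum (x = 4) is far from the global model (x = 0):
local = fedprox_local_sgd(lambda x: x - 4.0, np.zeros(1), mu=1.0)
plain = fedprox_local_sgd(lambda x: x - 4.0, np.zeros(1), mu=0.0)
print(local, plain)  # the proximal run stays closer to the global model
```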
In summary, while FedAvg admits $\mathcal{O}(1/T)$ (strongly convex) or $\mathcal{O}(1/\sqrt{T})$ (nonconvex) rates under smoothness and convexity assumptions, its convergence in non-IID regimes is systematically hampered by heterogeneity-induced drift. Recent innovations (phase- or angle-based dynamic aggregation, momentum, and overparameterized model scaling) demonstrate both theoretically and empirically that the drift penalty can be sharply reduced, but not entirely eliminated except in asymptotic regimes (Muhebwa et al., 26 May 2025, Wu et al., 2020, Jian et al., 18 Aug 2025, Cheng et al., 2023).