FedAvg Convergence on Non-IID Data
- The paper shows that non-IID data induces client drift, penalizing per-round descent and establishing an asymptotic bias floor in FedAvg.
- It explores algorithmic modifications—like dynamic weighting, momentum, and selective client participation—that counteract heterogeneity-induced convergence slowdowns.
- Empirical evidence indicates that strategies such as overparameterized models and balanced local epochs can nearly match centralized training performance under non-IID conditions.
Federated Averaging (FedAvg) is the canonical optimization protocol for cooperative model training under data locality constraints. The central challenge in federated settings is that client data distributions are typically non-IID—statistically heterogeneous—which introduces client drift and impedes global convergence. The convergence of FedAvg in such regimes has been the focus of extensive theoretical and empirical research, revealing inherent limitations, algorithmic modifications to counter heterogeneity-induced drift, and asymptotic regimes where FedAvg can nearly match centralized training.
1. Formal Problem Setting, Assumptions, and Source of Non-IID Effects
Federated optimization targets the minimization of a global objective of the general form $\min_x F(x) = \sum_{k=1}^N p_k F_k(x)$, where each $F_k$ encodes the local empirical or population risk on client $k$, and $p_k \ge 0$ (with $\sum_k p_k = 1$) is a weighting proportional to local data volume. Non-IIDness refers to the case where the distributions underlying the $F_k$ are non-identical, causing disparate gradient signals across clients.
Standard theoretical frameworks for analyzing FedAvg convergence in this regime assume:
- $L$-smoothness: Each $F_k$, and hence $F$, satisfies $\|\nabla F_k(x) - \nabla F_k(y)\| \le L\|x - y\|$ for all $x, y$.
- Strong convexity ($\mu > 0$): $F$ is often assumed to be $\mu$-strongly convex to enable convergence guarantees, but results also exist for general convex and nonconvex $F$.
- Stochastic gradient noise: local stochastic gradients $g_k$ are unbiased, $\mathbb{E}[g_k(x)] = \nabla F_k(x)$, with bounded variance $\mathbb{E}\|g_k(x) - \nabla F_k(x)\|^2 \le \sigma^2$.
A central heterogeneity parameter is the gradient dissimilarity or client drift term $\zeta^2 = \sum_k p_k \|\nabla F_k(x) - \nabla F(x)\|^2$, which vanishes in the IID limit and dominates the convergence penalty as heterogeneity increases (Muhebwa et al., 26 May 2025, Li et al., 2019).
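The drift term can be evaluated numerically from per-client gradients; a minimal sketch (the uniform-weight default is an assumption for illustration):

```python
import numpy as np

def gradient_dissimilarity(client_grads, weights=None):
    """Weighted average squared deviation of client gradients from the
    global (weighted-mean) gradient: vanishes in the IID limit."""
    g = np.asarray(client_grads, dtype=float)
    n = g.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    g_bar = w @ g                                # global gradient
    return float(w @ np.sum((g - g_bar) ** 2, axis=1))

# Identical client gradients (IID limit) -> zero dissimilarity.
print(gradient_dissimilarity([[1.0, 2.0], [1.0, 2.0]]))  # 0.0
# Opposing gradients -> nonzero drift term.
print(gradient_dissimilarity([[1.0, 0.0], [-1.0, 0.0]]))
```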
2. Convergence Analysis of Standard FedAvg on Heterogeneous Data
The basic FedAvg iteration for round $t$: each participating client $k$ initializes $x_k^{(t,0)} = x_t$, performs $E$ local SGD steps $x_k^{(t,j+1)} = x_k^{(t,j)} - \eta_t\, g_k(x_k^{(t,j)})$, and the server aggregates $x_{t+1} = \sum_k p_k\, x_k^{(t,E)}$.
Main Recurrence (per-round expected descent) (Muhebwa et al., 26 May 2025, Li et al., 2019): up to higher-order terms,
$\mathbb{E}[F(x_{t+1})] \le F(x_t) - \frac{\eta_t}{2}\|\nabla F(x_t)\|^2 + \mathcal{O}(\eta_t^2 E^2 \zeta^2)$.
Thus, the descent per round is penalized by the drift term $\mathcal{O}(\eta_t^2 E^2 \zeta^2)$, which can dominate under severe heterogeneity.
Strongly convex case, decaying stepsize (Li et al., 2019): with $\eta_t = \Theta(1/t)$,
$\mathbb{E}[F(x_T)] - F^* = \mathcal{O}\!\left(\frac{E^2 \zeta^2 + \sigma^2}{\mu T}\right)$,
where $E$ is the number of local SGD steps per round (see Section 3).
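As a concrete illustration of the iteration and of the decaying stepsize, here is a minimal FedAvg simulation on two heterogeneous quadratic client losses (the quadratic objectives and all hyperparameters are illustrative assumptions, not drawn from the cited analyses):

```python
import numpy as np

def fedavg(client_grad_fns, x0, rounds=50, local_steps=5, lr0=0.5, weights=None):
    """Minimal FedAvg: each round, every client runs `local_steps` of
    (full-)gradient descent from the current global model; the server
    averages the resulting local models. Uses a decaying stepsize
    eta_t = lr0 / (1 + t), as in the strongly convex analysis."""
    x = np.asarray(x0, dtype=float)
    n = len(client_grad_fns)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    for t in range(rounds):
        eta = lr0 / (1.0 + t)
        local_models = []
        for grad in client_grad_fns:
            xi = x.copy()
            for _ in range(local_steps):
                xi -= eta * grad(xi)
            local_models.append(xi)
        x = w @ np.asarray(local_models)         # weighted server aggregation
    return x

# Two heterogeneous quadratics with minima at 0 and 4;
# the global optimum of the averaged loss is x = 2.
clients = [lambda x: x - 0.0, lambda x: x - 4.0]
print(fedavg(clients, np.array([10.0]), rounds=200))  # close to 2
```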
Lower Bounds, Fixed Stepsize, and Communication Cost (Li et al., 2019, Zhang et al., 2020):
- For a fixed stepsize $\eta$ and $E > 1$, the algorithm admits a bias floor: FedAvg provably does not converge to the exact minimizer, with excess error lower bounded by $\Omega(\eta(E-1))$.
- To reach error $\epsilon$, the number of communication rounds scales as $\Omega(1/\epsilon)$ under only smoothness, even with decaying stepsizes (Zhang et al., 2020).
- If neither gradient norms nor dissimilarity are controlled, FedAvg can diverge even on simple nonconvex or misaligned loss functions.
Heterogeneity-Driven Error Floor:
Several works formalize that even in the limit $T \to \infty$, the limiting error retains a residual proportional to gradient-divergence measures such as $\zeta^2$ (Casella et al., 2023, Li et al., 2019, Wu et al., 2020).
Practical Implications:
- Communication vs. computation trade-off: too large an $E$ amplifies drift, while too small an $E$ increases the required communication (Li et al., 2019, Casella et al., 2023).
- Oscillation and instability can occur when $E$ is large under severe non-IIDness.
3. Empirical Characterization and the Role of Statistical Heterogeneity
FedAvg’s empirical convergence is sensitive to the type and degree of non-IID data partitioning. Several benchmark studies (Casella et al., 2023, Wu et al., 2020) report:
- Mild heterogeneity (e.g., Dirichlet partitioning, moderate label skew): only moderate slowdowns in convergence; increasing the number of local epochs $E$ yields a substantial reduction in the communication rounds needed to achieve a target accuracy.
- Severe heterogeneity (e.g., a single class per client, pathological label splits): substantial slowdowns and lower asymptotic accuracy; excessive $E$ can cause oscillatory or divergent behavior, exacerbated drift, or outright failure to reach target performance within practical timeframes.
Experimental data (below, excerpted from (Wu et al., 2020) and (Casella et al., 2023)) illustrate this phenomenon:
| Setting | MNIST (rounds to 95%) | FashionMNIST (rounds to 80%) |
|---|---|---|
| 5 IID + 5 1-class non-IID (FedAvg) | 133 | 222 |
| 6 IID + 4 1-class non-IID (FedAvg) | 99 | 167 |
Rounds required to reach the accuracy threshold under FedAvg; greater non-IIDness increases the number of rounds required.
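Benchmark partitions of this kind are commonly generated via Dirichlet label skew; a minimal sketch (the routine and its parameter names are illustrative, not taken from the cited studies):

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Partition sample indices across clients with Dirichlet(alpha) label
    skew. Small alpha -> severe non-IID (each client dominated by a few
    classes); large alpha -> near-IID shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Split this class's samples according to sampled Dirichlet proportions.
        props = rng.dirichlet(np.full(n_clients, alpha))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards

# 10 classes, 100 samples each, split over 5 clients with heavy skew.
parts = dirichlet_partition(np.repeat(np.arange(10), 100), n_clients=5, alpha=0.1)
print([len(p) for p in parts])  # uneven, label-skewed shard sizes
```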
4. Algorithmic Modifications for Improving Convergence in Non-IID Settings
Numerous approaches have been developed to mitigate FedAvg’s heterogeneity-induced slowdowns:
a. Synchronization-based weighting: Kuramoto-FedAvg (Muhebwa et al., 26 May 2025)
- Aggregation weights are assigned dynamically according to client phase alignment: the weight of client $k$ increases with the alignment $\cos\theta_k$, where $\theta_k$ is the angle between the client's update and the mean update direction.
- Kuramoto-FedAvg provably shrinks the drift penalty in the per-round descent recurrence, tightening the standard FedAvg bound and reducing the number of required communication rounds.
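The alignment-weighting idea can be sketched as follows; this is an illustrative softmax-over-cosine rule, not the exact Kuramoto-FedAvg update, and the `sharpness` temperature is an assumed parameter:

```python
import numpy as np

def alignment_weights(updates, sharpness=5.0):
    """Up-weight clients whose updates point along the mean update
    direction; suppress misaligned clients via a softmax over cosine
    alignment. Illustrative only, not the published Kuramoto rule."""
    u = np.asarray(updates, dtype=float)
    mean = u.mean(axis=0)
    cos = (u @ mean) / (np.linalg.norm(u, axis=1) * np.linalg.norm(mean) + 1e-12)
    w = np.exp(sharpness * cos)
    return w / w.sum()

# Two well-aligned clients and one pointing the opposite way.
w = alignment_weights([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
print(w)  # third weight is much smaller than the first two
```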
b. Angle-based adaptive weighting: FedAdp (Wu et al., 2020)
- Aggregation weights are modulated per client based on the (smoothed) angle $\tilde\theta_k$ between the local and global gradients: $w_k \propto e^{f(\tilde\theta_k)}$, where $f$ is a decreasing nonlinear (Gompertz-like) mapping of client misalignment.
- The per-round decrease of the global loss is provably improved relative to uniform FedAvg weighting.
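A sketch of the Gompertz-style mapping follows; the steepness constant and the exact functional form here are assumptions for illustration rather than the paper's calibrated settings:

```python
import numpy as np

def gompertz_weights(angles, alpha=5.0):
    """FedAdp-style sketch: map each client's (smoothed) angle theta_k
    between local and global gradients through a decreasing Gompertz-like
    curve, then normalize via softmax. `alpha` (steepness) is an assumed
    hyperparameter."""
    theta = np.asarray(angles, dtype=float)
    # Near alpha for small angles (well-aligned), decaying for large angles.
    f = alpha * (1.0 - np.exp(-np.exp(-alpha * (theta - 1.0))))
    w = np.exp(f)
    return w / w.sum()

# Angles in radians: a larger angle (worse alignment) gets a smaller weight.
w = gompertz_weights([0.1, 0.5, 1.4])
print(w)
```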
c. Selective client participation: Probabilistic Node Selection (FedPNS) (Wu et al., 2021)
- Adverse clients (identified via negative inner-product with global gradient) are demoted via reduced sampling probability in subsequent rounds.
- Reduces weight divergence and improves convergence bounds over FedAvg.
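The demotion mechanism can be sketched as a simple probability update; the `demote` factor and renormalization are assumptions for illustration, not FedPNS's exact rule:

```python
import numpy as np

def update_sampling_probs(probs, client_grads, global_grad, demote=0.5):
    """FedPNS-style sketch: clients whose gradient has a negative inner
    product with the global gradient are 'adverse' and have their sampling
    probability scaled down by `demote`; probabilities are renormalized."""
    p = np.asarray(probs, dtype=float).copy()
    g = np.asarray(client_grads, dtype=float)
    adverse = (g @ np.asarray(global_grad, dtype=float)) < 0.0
    p[adverse] *= demote
    return p / p.sum()

p = update_sampling_probs([0.25, 0.25, 0.25, 0.25],
                          [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [0.0, 1.0]],
                          global_grad=[1.0, 0.0])
print(p)  # the third (adverse) client's probability is reduced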
d. Communication-efficient variants: CFedAvg (Yang et al., 2021)
- Applies SNR-constrained compression with error feedback, yet matches the convergence-rate order of uncompressed FedAvg.
- Data heterogeneity enters via a variance-like gradient-divergence term, but it is controlled just as in uncompressed FedAvg.
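The error-feedback mechanism underlying such compressed variants can be sketched generically; this uses a plain top-k compressor for illustration, not CFedAvg's exact SNR-constrained scheme:

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def compress_with_error_feedback(update, memory, k):
    """Generic error-feedback step: compress (update + carried-over error)
    and remember what the compressor dropped for the next round, so the
    compression error does not accumulate."""
    corrected = update + memory
    sent = topk_compress(corrected, k)
    new_memory = corrected - sent        # residual fed back next round
    return sent, new_memory

mem = np.zeros(4)
sent, mem = compress_with_error_feedback(np.array([0.1, -2.0, 0.3, 1.5]), mem, k=2)
print(sent, mem)  # large entries sent now; small ones deferred via memory
```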
e. Momentum-augmented protocols: FedAvg-M (Cheng et al., 2023)
- Adding heavy-ball momentum stabilizes local trajectories and obviates the need for explicit heterogeneity bounds.
- Achieves optimal convergence rates with a constant stepsize, even under extreme drift.
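A minimal sketch of a momentum-augmented server step follows; the specific coefficients and the server-side placement of the momentum buffer are illustrative assumptions:

```python
import numpy as np

def fedavg_m_server_step(x, avg_update, velocity, lr=1.0, beta=0.9):
    """Heavy-ball momentum at the server (FedAvg-M-style sketch): the
    averaged client update is folded into a velocity buffer, smoothing
    round-to-round oscillations caused by client drift."""
    velocity = beta * velocity + avg_update
    x = x - lr * velocity
    return x, velocity

x, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(3):
    # Stand-in for the averaged client update (proportional to x here).
    x, v = fedavg_m_server_step(x, avg_update=0.1 * x, velocity=v)
print(x)
```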
5. Asymptotic Regimes: Overparameterization and the Vanishing Effect of Heterogeneity
Recent theory shows that in the overparameterized regime (very wide neural networks), the effect of data heterogeneity on FedAvg convergence and generalization diminishes polynomially in the network width $m$ (Jian et al., 18 Aug 2025):
- With fully-connected or convolutional networks of width $m$ (approaching the infinite-width neural tangent kernel regime), the divergence between local and global models quantifies per-round drift and decays polynomially in $m$.
- In the $m \to \infty$ limit, both local and global models linearize, training tracks centralized GD exactly, and FedAvg generalizes identically to pooled training for a matched number of total update steps.
- Empirically, the performance gap between IID and extreme non-IID settings vanishes as $m$ increases, confirmed across MNIST/CIFAR-10 and various architectures.
6. Fundamental Limits and Failure Modes
FedAvg's convergence on non-IID data is fundamentally limited by:
- Uncontrolled drift: Without either bounded gradient norms or decaying stepsize, iterates may diverge, even for trivial non-IID constructions (Zhang et al., 2020).
- Irreducible bias: The heterogeneity-driven term $\zeta^2$ induces an asymptotic accuracy floor unless additional structure (e.g., strong convexity, bounded drift, overparameterization) is present.
- Lower bounds: No protocol following the Computation-Then-Aggregation (CTA) template can break the $\Omega(1/\epsilon)$ communication barrier without extra problem structure (Zhang et al., 2020).
Mitigation: Adaptive weighting, node selection, regularization (e.g. FedProx/FedCurv), momentum, or operating in the overparameterized limit can alleviate, but not entirely eliminate, these effects unless strong assumptions are met.
7. Practical Guidelines for Federated Optimization under Heterogeneity
- Step-size scheduling: Use decaying stepsizes whenever drift or heterogeneity cannot be tightly bounded. Fixed stepsizes can be salvaged with momentum (Cheng et al., 2023).
- Local epochs ($E$): Moderate values balance communication costs against drift penalties. Excessive $E$ can amplify divergence in severe non-IID settings (Li et al., 2019, Casella et al., 2023).
- Client selection and aggregation weighting: Favor strategies that dynamically suppress misaligned or adverse updates (Kuramoto-FedAvg, FedPNS, FedAdp).
- Overparameterization: Widening networks quantitatively mitigates data heterogeneity penalties, recovering centralized performance (Jian et al., 18 Aug 2025).
- Algorithmic stability: In settings with uncontrolled drift or high label skew, consider algorithmic extensions such as FedProx, SCAFFOLD, or primal-dual schemes.
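The proximal idea behind extensions like FedProx can be sketched at the level of a single client's local solver; the proximal coefficient `mu` and the quadratic client loss are illustrative assumptions:

```python
import numpy as np

def fedprox_local_sgd(grad_fn, x_global, steps=10, lr=0.05, mu=0.1):
    """FedProx-style local solver sketch: add a proximal term
    (mu/2)||x - x_global||^2 to the client loss, so local steps are
    pulled back toward the global model and client drift is damped."""
    x_global = np.asarray(x_global, dtype=float)
    x = x_global.copy()
    for _ in range(steps):
        x -= lr * (grad_fn(x) + mu * (x - x_global))
    return x

# A client whose own minimum (x = 4) is far from the global model (x = 0):
local = fedprox_local_sgd(lambda x: x - 4.0, np.zeros(1), mu=1.0)
plain = fedprox_local_sgd(lambda x: x - 4.0, np.zeros(1), mu=0.0)
print(local, plain)  # the proximal run stays closer to the global model
```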
In summary, while FedAvg admits $\mathcal{O}(1/T)$ (strongly convex) or $\mathcal{O}(1/\sqrt{T})$ (nonconvex) rates under smoothness and convexity assumptions, its convergence in non-IID regimes is systematically hampered by heterogeneity-induced drift. Recent innovations (phase- or angle-based dynamic aggregation, momentum, and overparameterized model scaling) demonstrate both theoretically and empirically that the drift penalty can be sharply reduced, but not entirely eliminated except in asymptotic regimes (Muhebwa et al., 26 May 2025, Wu et al., 2020, Jian et al., 18 Aug 2025, Cheng et al., 2023).