FedAvg Algorithm in Federated Learning
- The FedAvg algorithm is a federated learning method that averages local model updates to obtain a global model without sharing raw data.
- It uses synchronous rounds of local SGD with periodic server aggregation, effectively addressing challenges of non-IID data and partial client participation.
- Empirical results demonstrate FedAvg’s robustness and minimal tuning requirements, establishing it as a practical baseline in diverse applications.
Federated Averaging (FedAvg) is a central algorithm in federated learning (FL), enabling the collaborative training of models across distributed clients without data sharing. Its design is based on interleaving local optimization with periodic, server-mediated aggregation, thus reducing both communication frequency and privacy risk compared to centralized alternatives. FedAvg’s simplicity and effectiveness, even under heterogeneous data distributions and partial client participation, have made it the de facto baseline for practical and theoretical study in FL.
1. Algorithm Definition and Update Rule
FedAvg operates in synchronous communication rounds. Let $N$ denote the number of clients, each with a local objective $F_i$ that is $L$-smooth and $\mu$-strongly convex. Define the global objective
$$F(x) = \sum_{i=1}^{N} p_i F_i(x),$$
where $p_i$ is the weight for client $i$, often set proportional to client dataset size.
The FedAvg protocol is as follows (Wang et al., 2022):
- Server step:
- Broadcast the global model $x^{(t)}$ to all clients.
- Local optimization (client $i$, in parallel):
- Initialize $x_i^{(t,0)} = x^{(t)}$.
- For $k = 0, \dots, K-1$:
$$x_i^{(t,k+1)} = x_i^{(t,k)} - \eta \, g_i\big(x_i^{(t,k)}; \xi_i^{(t,k)}\big),$$
where $\xi_i^{(t,k)}$ indexes local sampling (e.g., a minibatch). - Return the local update $\Delta_i^{(t)} = x_i^{(t,K)} - x^{(t)}$ to the server.
- Aggregation:
- Update the global model:
$$x^{(t+1)} = x^{(t)} + \sum_{i=1}^{N} p_i \, \Delta_i^{(t)},$$
often with $\sum_{i=1}^{N} p_i = 1$.
This paradigm extends to partial participation, variable weights, and non-IID data settings, with practical implementations further incorporating stochastic selection of participating clients and local data sampling (Li et al., 2019, Lee et al., 27 Feb 2025).
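The round structure above, including partial participation, can be sketched on toy quadratic clients. The objectives, step size, and participation rate below are illustrative assumptions, not settings from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: client i holds F_i(x) = 0.5 * ||x - c_i||^2,
# so grad F_i(x) = x - c_i and the global optimum is the mean of the c_i.
num_clients, dim = 10, 5
centers = rng.normal(size=(num_clients, dim))

def local_grad(i, x):
    return x - centers[i]

def fedavg_round(x_global, eta=0.1, local_steps=5, participation=0.5):
    """One synchronous FedAvg round with partial client participation."""
    sampled = rng.choice(num_clients, size=int(participation * num_clients),
                         replace=False)
    deltas = []
    for i in sampled:
        x = x_global.copy()                  # broadcast x^{(t)}
        for _ in range(local_steps):         # K local gradient steps
            x -= eta * local_grad(i, x)
        deltas.append(x - x_global)          # Delta_i = x_i^{(t,K)} - x^{(t)}
    # Aggregate: uniform average over the participating clients
    return x_global + np.mean(deltas, axis=0)

x = np.zeros(dim)
for t in range(200):
    x = fedavg_round(x)
# Distance to the global optimum (mean of client centers); with a fixed
# step size and client sampling, x settles in a neighborhood of it.
err = np.linalg.norm(x - centers.mean(axis=0))
print(err)
```

With partial participation and a constant step size, the iterate fluctuates around the optimum rather than converging exactly, which previews the step-size-decay discussion below.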
2. Classical and Refined Convergence Guarantees
Traditional analyses of FedAvg focus on convergence rates under varying assumptions about smoothness, convexity, data heterogeneity, and participation.
Classical analysis (bounded gradient dissimilarity):
Define the gradient dissimilarity by
$$\bar{\zeta}^2 = \sup_x \frac{1}{N} \sum_{i=1}^{N} \big\| \nabla F_i(x) - \nabla F(x) \big\|^2.$$
Under this assumption, the convergence of FedAvg for $\mu$-strongly convex and $L$-smooth objectives, with gradient-noise variance $\sigma^2$, $K$ local steps, and $T$ rounds, is given by (Wang et al., 2022):
$$\mathbb{E}\big[F(x^{(T)})\big] - F(x^*) \le \tilde{O}\!\left( \frac{\sigma^2}{\mu N K T} + \frac{L \sigma^2}{\mu^2 K T^2} + \frac{L \bar{\zeta}^2}{\mu^2 T^2} \right).$$
Consequently, large $\bar{\zeta}^2$ predicts slow convergence, particularly under non-IID client distributions (Li et al., 2019).
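The dissimilarity measure can be sketched numerically on hypothetical quadratic clients (objectives and constants below are illustrative). For quadratics with identical curvature, $\nabla F_i(x) - \nabla F(x)$ is independent of $x$, so the supremum is attained everywhere:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic clients F_i(x) = 0.5 * ||x - c_i||^2.
num_clients, dim = 8, 3
centers = rng.normal(size=(num_clients, dim))

def client_grads(x):
    return x[None, :] - centers          # grad F_i(x) = x - c_i

def grad_dissimilarity(x):
    """zeta^2(x) = (1/N) * sum_i ||grad F_i(x) - grad F(x)||^2."""
    g = client_grads(x)
    g_bar = g.mean(axis=0)               # gradient of the uniform global objective
    return np.mean(np.sum((g - g_bar) ** 2, axis=1))

# For these quadratics, grad F_i(x) - grad F(x) = mean(c) - c_i for all x,
# so the dissimilarity is constant in x.
z0 = grad_dissimilarity(np.zeros(dim))
z1 = grad_dissimilarity(rng.normal(size=dim))
print(z0, z1)
```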
Refined analysis (average drift at optimum):
Research has shown the "gradient dissimilarity" bound is overly pessimistic for practical data heterogeneity (Wang et al., 2022). Instead, the crucial quantity is the average drift at the optimum,
$$\rho_* = \Big\| \frac{1}{N} \sum_{i=1}^{N} h_i(x^*) \Big\|,$$
where $h_i(x^*)$ denotes client $i$'s expected local-update drift started from $x^*$, and $x^*$ is the global optimum. Empirically, $\rho_*$ is orders of magnitude smaller than naive gradient-dissimilarity bounds, and convergence rates in practical settings are often not penalized by heterogeneity as long as $\rho_* \approx 0$.
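The gap between the two quantities can be illustrated on a hypothetical two-client quadratic example (objectives and step size are illustrative assumptions): the gradient dissimilarity at the optimum is large, yet the local drifts cancel exactly by symmetry:

```python
import numpy as np

# Two hypothetical clients with opposite optima: F_1(x) = 0.5 * ||x - c||^2,
# F_2(x) = 0.5 * ||x + c||^2. The global optimum is x* = 0, and the
# gradient dissimilarity at x* equals ||c||^2 > 0, yet the drifts cancel.
c = np.array([3.0, -1.0])
centers = np.stack([c, -c])

def local_update(center, x, eta=0.1, K=50):
    for _ in range(K):
        x = x - eta * (x - center)       # K gradient-descent steps on one client
    return x

x_star = np.zeros(2)
drifts = [local_update(ci, x_star.copy()) - x_star for ci in centers]
avg_drift = np.linalg.norm(np.mean(drifts, axis=0))
dissimilarity = np.mean(
    [np.linalg.norm(ci - centers.mean(0)) ** 2 for ci in centers])
print(avg_drift, dissimilarity)   # drift ~ 0 despite large dissimilarity
```

Each client drifts far from $x^*$ individually (toward $\pm c$), so per-client divergence bounds are loose; only the average matters for the aggregated update.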
Empirical validation:
Analysis on real benchmarks (e.g., FEMNIST, StackOverflow) reveals that despite substantial "gradient dissimilarity," the average drift at the optimum remains nearly zero even for substantial local update intervals ($K$ up to $100$), explaining FedAvg's empirical robustness in practice (Wang et al., 2022).
3. Impact of Data Heterogeneity and Participation
The effect of non-IID data distributions and partial participation is central in understanding FedAvg’s behavior.
- Non-IID effects: For heterogeneous clients, FedAvg’s ability to reach the global optimum depends primarily on the aggregate behavior of the local update trajectories, not the worst-case divergence between client gradients. If the average drift at optimum is negligible, then the convergence rate closely matches the homogeneous (IID) setting (Wang et al., 2022).
- Partial participation: FedAvg retains $O(1/T)$ convergence (where $T$ is the number of effective gradient steps) even when only a fraction of clients participate in each round, with an overhead term that scales with the number of local steps $E$ and the number of clients $K$ selected per round. Large $E$ accelerates convergence up to an optimal value, beyond which local drift dominates (Li et al., 2019).
- Necessity of step-size decay: Without decaying learning rates, FedAvg may converge only to a neighborhood of the optimum whose radius scales with the step size $\eta$, even for strongly convex, smooth losses. A diminishing $\eta_t$ is necessary to eliminate this bias (Li et al., 2019).
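The bias and its removal can be demonstrated on hypothetical 1-D quadratic clients with different curvatures (all constants below are illustrative). With $E > 1$ local steps and a fixed step size, even deterministic FedAvg stalls at a surrogate fixed point away from the true optimum; a diminishing schedule drives it to the optimum:

```python
import numpy as np

# Two hypothetical 1-D clients with different curvatures:
# F_i(x) = 0.5 * h_i * (x - c_i)^2, global optimum x* = sum(h*c) / sum(h).
h = np.array([1.0, 4.0])
c = np.array([0.0, 1.0])
x_star = np.sum(h * c) / np.sum(h)            # = 0.8

def fedavg(eta_schedule, rounds=2000, E=10):
    x = 0.0
    for t in range(rounds):
        eta = eta_schedule(t)
        finals = []
        for hi, ci in zip(h, c):
            xi = x
            for _ in range(E):                # E deterministic local GD steps
                xi -= eta * hi * (xi - ci)
            finals.append(xi)
        x = np.mean(finals)                   # uniform aggregation
    return x

x_const = fedavg(lambda t: 0.2)                    # fixed step size: biased
x_decay = fedavg(lambda t: 0.2 / (1 + 0.1 * t))    # diminishing: bias vanishes
print(abs(x_const - x_star), abs(x_decay - x_star))
```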
4. Continuous-Time and Stochastic Dynamics Perspective
Recent work has formalized FedAvg's dynamics in the continuous-time domain using stochastic differential equations (SDEs), yielding deeper understanding of optimization and generalization properties (Overman et al., 31 Jan 2025).
- SDE approximation: In the limit of small step sizes and many clients, the global FedAvg dynamic is governed by an Itô SDE of the form
$$dX_t = b(X_t)\, dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t,$$
where $b$ captures the effective drift and $\Sigma$ is the instantaneous covariance of the aggregated updates.
- Normal approximation: The aggregate local updates can be approximated as Gaussian under mild moment bounds and a Lyapunov CLT, rigorously justifying the SDE scaling limit.
- Bias-variance trade-off: In closed-form solutions for quadratic client objectives, prolonged local intervals (large $K$) introduce both increased bias (drift away from the true minimizer) and larger variance (favoring wider, potentially better-generalizing minima). The optimal communication frequency strikes a balance between these effects (Overman et al., 31 Jan 2025).
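The effect of the local interval can be observed numerically on hypothetical noisy quadratic clients (the objectives, noise level, and step budget below are illustrative assumptions, not the paper's closed-form setting). Holding the total number of local gradient steps fixed, a longer interval $K$ shifts the mean of the final iterate away from the true minimizer:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical noisy quadratic clients: F_i(x) = 0.5 * h_i * (x - c_i)^2,
# with additive N(0, sigma^2) gradient noise.  Total local steps are
# matched across settings; only the communication interval K differs.
h = np.array([1.0, 4.0])
c = np.array([0.0, 1.0])
sigma = 1.0
x_star = np.sum(h * c) / np.sum(h)

def run(K, total_steps=400, eta=0.05):
    x = 0.0
    for _ in range(total_steps // K):
        finals = []
        for hi, ci in zip(h, c):
            xi = x
            for _ in range(K):                       # K noisy local SGD steps
                g = hi * (xi - ci) + sigma * rng.normal()
                xi -= eta * g
            finals.append(xi)
        x = np.mean(finals)                          # uniform aggregation
    return x

samples = [np.array([run(K) for _ in range(300)]) for K in (1, 20)]
bias = [abs(s.mean() - x_star) for s in samples]
var = [s.var() for s in samples]
print("bias (K=1, K=20):", bias)
print("var  (K=1, K=20):", var)
```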
5. Empirical Performance and Practical Guidelines
Extensive empirical evaluation demonstrates that FedAvg is remarkably robust and stable under a wide range of data, models, and hyperparameters (Lee et al., 27 Feb 2025):
- Stability: Across diverse tasks, models (e.g., ViT, CNNs), and hyperparameters (learning rate $\eta$, local epochs $E$, batch size $B$), FedAvg achieves near-constant accuracy and low run-to-run variance.
- Comparative performance: On medical imaging tasks such as blood cell and skin lesion classification, FedAvg matches or outperforms advanced FL algorithms (FedProx, FedSAM, FedDyn, etc.), with differences typically within 0.5% (Lee et al., 27 Feb 2025).
- Minimal tuning: With default choices of learning rate, local epoch count, and batch size, and 10% client participation per round, FedAvg reaches near-optimal performance without the need for extensive hyperparameter sweeps.
A concise guideline for typical deployments would be:
- Use the default learning rate and 1–5 local epochs.
- For medical or resource-limited settings, prioritize FedAvg as the robust baseline.
- Only consider more complex algorithms if the application demands the last fraction of a percent in accuracy or faces extreme distributional challenges.
6. Algorithmic Extensions and Theoretical Directions
FedAvg’s pseudo-gradient structure has inspired numerous extensions:
- Learnable aggregation weights: Auto-FedAvg replaces the static weights $p_i$ with dynamically learned weights $\alpha_i$, optimized via gradient descent on an outer held-out loss, demonstrating improved cross-client generalization, especially on non-IID data (Xia et al., 2021).
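A rough sketch of this idea, with softmax-parameterized weights tuned by finite-difference descent on a synthetic held-out loss. Everything below (the client models, the validation objective, the tuning loop) is an illustrative assumption, not Auto-FedAvg's actual procedure:

```python
import numpy as np

# Sketch: aggregation weights alpha = softmax(theta) are tuned to reduce a
# held-out validation loss, instead of being fixed a priori.
rng = np.random.default_rng(3)
client_models = rng.normal(size=(4, 3))          # local models from 4 clients
x_val = np.array([1.0, 0.0, -1.0])               # synthetic held-out target

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def val_loss(theta):
    alpha = softmax(theta)
    global_model = alpha @ client_models         # weighted aggregation
    return 0.5 * np.sum((global_model - x_val) ** 2)

theta = np.zeros(4)                              # start from uniform weights
for _ in range(1000):                            # finite-difference descent
    grad = np.array([
        (val_loss(theta + 1e-5 * np.eye(4)[j]) - val_loss(theta)) / 1e-5
        for j in range(4)])
    theta -= 0.2 * grad

loss_uniform = val_loss(np.zeros(4))             # static uniform weights
loss_learned = val_loss(theta)                   # learned weights
print(loss_uniform, loss_learned)
```

In a real system the outer gradient would be computed analytically or by backpropagation through the aggregation; finite differences are used here only to keep the sketch dependency-free.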
- Bias correction mechanisms: When the participation rates are non-uniform and time-varying, standard FedAvg is biased and fails to minimize the global loss. The FedPBC variant corrects for this by delaying model broadcasts and enabling implicit mixing among active clients, restoring convergence guarantees (Xiang et al., 2023).
- Accelerated variants: Extensions such as Federated Accelerated SGD (FedAc) achieve better scaling with the synchronization interval $K$, providing provable acceleration (e.g., requiring fewer synchronization rounds than local SGD to attain linear speedup in the number of clients $M$), particularly when the objective is third-order smooth. The trade-off is between acceleration (large $K$) and stability (small $K$ to limit client drift) (Yuan et al., 2020, Yuan, 2024).
- Representation learning: FedAvg with fine-tuning can achieve significant improvements in transfer and generalization by learning low-dimensional representations that rapidly adapt to new client distributions (Collins et al., 2022).
- Masking and invariance: FedGMA applies coordinate-wise AND-masks to gradient aggregations to enforce invariant directions across clients, thereby mitigating the “sewn-together” optima effect seen in heterogeneity and improving out-of-distribution generalization (Tenison et al., 2021).
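A minimal sketch of a coordinate-wise sign-agreement (AND) mask in the spirit of FedGMA; the agreement threshold and the masked-mean aggregation below are illustrative choices, not the paper's exact rule:

```python
import numpy as np

def and_mask_aggregate(grads, tau=0.8):
    """Keep a coordinate of the averaged gradient only when at least a
    fraction tau of clients agree on its sign; zero it out otherwise."""
    grads = np.asarray(grads)                    # shape (N, d)
    signs = np.sign(grads)
    agreement = np.abs(signs.sum(axis=0)) / len(grads)
    mask = (agreement >= tau).astype(grads.dtype)
    return mask * grads.mean(axis=0)

# Three hypothetical client gradients: coordinates 0 and 2 have consistent
# signs across clients; coordinate 1 does not and is masked to zero.
grads = [np.array([ 1.0,  2.0, -3.0]),
         np.array([ 2.0, -1.0, -1.0]),
         np.array([ 0.5, -2.0, -2.0])]
out = and_mask_aggregate(grads)
print(out)   # only sign-consistent coordinates survive
```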
7. Controversies and Open Problems
While the practical success of FedAvg is well documented, several theoretical and practical issues are active research areas:
- Optimality under extreme heterogeneity: Although average drift at optimum may explain why FedAvg works in practice, there are settings (especially with strong pathologies, e.g., rare classes, adversarial sampling, or extreme communication failure) where performance deteriorates or strong bias is introduced (Xiang et al., 2023).
- Objective inconsistency and fine-tuning: The global FedAvg update does not, in general, correspond to any single client’s optimum when data are highly heterogeneous, which can result in suboptimal generalization on unseen clients or tasks (Collins et al., 2022).
- Quantifying acceleration limits: The interplay between local step size, communication frequency, and stochasticity produces non-trivial trade-offs, particularly in non-convex regimes and as the number of clients increases (Yuan, 2024, Yuan et al., 2020).
- Beyond smooth convexity: Recent analyses have extended convergence results to non-convex and weakly quasi-convex regimes, but general characterization remains incomplete, especially for deep networks and in the presence of heavy-tailed or corrupted data (Overman et al., 31 Jan 2025).
References:
- "On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data" (Wang et al., 2022)
- "Continuous-Time Analysis of Federated Averaging" (Overman et al., 31 Jan 2025)
- "Revisit the Stability of Vanilla Federated Learning Under Diverse Conditions" (Lee et al., 27 Feb 2025)
- "On the Convergence of FedAvg on Non-IID Data" (Li et al., 2019)
- "Auto-FedAvg: Learnable Federated Averaging for Multi-Institutional Medical Image Segmentation" (Xia et al., 2021)
- "FedAvg with Fine Tuning: Local Updates Lead to Representation Learning" (Collins et al., 2022)
- "Towards Bias Correction of FedAvg over Nonuniform and Time-Varying Communications" (Xiang et al., 2023)
- "On Principled Local Optimization Methods for Federated Learning" (Yuan, 2024)
- "A Unified Linear Speedup Analysis of Federated Averaging and Nesterov FedAvg" (Qu et al., 2020)
- "Federated Accelerated Stochastic Gradient Descent" (Yuan et al., 2020)
- "Gradient Masked Federated Optimization" (Tenison et al., 2021)
- "A Comparative Evaluation of FedAvg and Per-FedAvg Algorithms for Dirichlet Distributed Heterogeneous Data" (Reguieg et al., 2023)