FederatedAveraging Algorithm
- FederatedAveraging is a decentralized optimization protocol that aggregates local updates to solve global empirical risk minimization without centralizing data.
- It alternates rounds of local stochastic gradient descent with weighted server aggregation, effectively managing non-IID data and communication constraints.
- Various adaptations, such as FedGMA, FedCostWAvg, and FedAWARE, refine FedAvg to enhance convergence, robustness, and performance in real-world federated settings.
FederatedAveraging (FedAvg) is the canonical algorithm underpinning most federated learning systems, providing a scalable, communication-efficient protocol for decentralized model optimization over data distributed across many clients. FedAvg alternates rounds of local stochastic optimization with a global aggregation step, enabling joint training without centralizing raw data and tolerating data heterogeneity. This article provides a detailed, rigorous exposition of FedAvg and its principal variants, covering the mathematical formalism, convergence theory in diverse settings, and adaptations that enhance performance or robustness.
1. Mathematical Formulation and Protocol
The FedAvg protocol targets the federated empirical risk minimization problem over $K$ clients, each holding local data $\mathcal{D}_k$ with $n_k = |\mathcal{D}_k|$ samples:
$$\min_{w \in \mathbb{R}^d} F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{\xi \in \mathcal{D}_k} \ell(w; \xi), \qquad n = \sum_{k=1}^{K} n_k,$$
where $\ell$ is the task-specific loss function. FedAvg proceeds in synchronous communication rounds. In each round $t$:
- The server samples a subset $S_t$ of clients and broadcasts the current global model $w_t$.
- Each client $k \in S_t$ initializes $w_{t,0}^k \leftarrow w_t$ and performs $E$ steps of local (stochastic) gradient descent: $w_{t,i+1}^k = w_{t,i}^k - \eta_{\mathrm{loc}} \, \nabla F_k(w_{t,i}^k)$ for $i = 0, \dots, E-1$.
- The client returns $w_{t,E}^k$ to the server.
- The server aggregates the updates via data-size-weighted averaging: $w_{t+1} = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j} \, w_{t,E}^k$.
This process is captured in the classical pseudocode form (McMahan et al., 2016, Chen et al., 2019, Tenison et al., 2021):
```
Input: initial model w₀, rounds T, local steps E, local step size η_loc
for t = 0, …, T−1:
    Server samples clients S_t
    for each client k ∈ S_t (in parallel):
        w_{t,0}^k ← w_t
        for i = 0, …, E−1:
            compute g_{t,i}^k = ∇F_k(w_{t,i}^k)
            w_{t,i+1}^k = w_{t,i}^k − η_loc · g_{t,i}^k
        return w_{t,E}^k to the server
    Server aggregates: w_{t+1} = ∑_{k∈S_t} (n_k / ∑_{j∈S_t} n_j) · w_{t,E}^k
```
Key hyperparameters are the number of local update steps $E$, the client learning rate $\eta_{\mathrm{loc}}$, the client subsampling fraction, and the aggregation weights (default: proportional to $n_k$).
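For concreteness, the following is a minimal, self-contained NumPy sketch of one FedAvg round under the protocol above; the gradient oracle, the synthetic least-squares clients, and all hyperparameter values are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def fedavg_round(w_global, clients, local_grad, E=5, eta_loc=0.05):
    """One synchronous FedAvg round: E local SGD steps per sampled client,
    then data-size-weighted averaging on the server (full participation here)."""
    updates, sizes = [], []
    for client_data, n_k in clients:
        w = w_global.copy()                      # w_{t,0}^k <- w_t
        for _ in range(E):                       # local gradient steps
            w = w - eta_loc * local_grad(w, client_data)
        updates.append(w)                        # client returns w_{t,E}^k
        sizes.append(n_k)
    weights = np.asarray(sizes, dtype=float)
    weights /= weights.sum()                     # n_k / sum_j n_j
    return sum(a * u for a, u in zip(weights, updates))

# Toy example: each client holds a least-squares objective F_k(w) = ||A_k w - b_k||^2 / (2 n_k).
rng = np.random.default_rng(0)

def make_client(n_k, d=10):
    A = rng.normal(size=(n_k, d))
    b = A @ rng.normal(size=d)
    return (A, b), n_k

def ls_grad(w, data):
    A, b = data
    return A.T @ (A @ w - b) / len(b)

clients = [make_client(n) for n in (50, 120, 80)]
w = np.zeros(10)
for t in range(100):                             # T communication rounds
    w = fedavg_round(w, clients, ls_grad)
```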
2. Convergence Analysis and Robustness
FedAvg’s convergence properties have been rigorously analyzed under convex, nonconvex, and heterogeneous data settings. For convex objectives and smooth losses, under appropriate bounded-variance and Lipschitz assumptions, FedAvg converges to the minimizer of a weighted sum of the local objectives (Herlock et al., 14 Jul 2025, Li et al., 2022, McMahan et al., 2016). Crucially, intermittent availability and non-uniform participation are handled in the agnostic FedAvg framework, which provides a convergence guarantee without knowledge or estimation of client participation distributions (Herlock et al., 14 Jul 2025).
When local data are non-IID and the number of local steps $E$ is large, model drift increases and convergence can plateau or, in severe cases, diverge (McMahan et al., 2016). For semi-smooth (ReLU) networks, robust convergence holds under relaxed “semi-smooth” and “semi-Lipschitz” conditions (Li et al., 2022). The global and local step sizes must be chosen to bound drift, and the communication-computation tradeoff is typically managed by tuning $E$ within a small range and decaying it over time if instability is observed.
In the continuous-time limit, the dynamics of FedAvg correspond to a stochastic differential equation. This yields compact proofs of global convergence and exposes bias–variance trade-offs, notably showing that increasing the number of local steps $E$ improves generalization (escaping sharp minima by variance injection) but increases the steady-state bias away from the true minimizer (Overman et al., 31 Jan 2025).
3. Variants, Extensions, and Adaptive Aggregation
Numerous FedAvg variants have been proposed to address specific challenges:
- FedGMA (Gradient-Masked FedAvg): Incorporates an AND-mask on the aggregated client gradients, zeroing out coordinates where fewer than a $\tau$-fraction of clients agree on sign (see the sketch after this list). This modification addresses non-IID-induced “sewn optima” and improves out-of-distribution generalization (Tenison et al., 2021).
- FedCostWAvg: Weights aggregation not only by the size of client data but also by the achieved local cost reduction (cost-drop ratio), adaptively emphasizing clients making substantial progress (Mächler et al., 2021). This yields consistently higher accuracy and faster convergence, especially in heterogeneous environments.
- FedAU: Addresses unknown participation frequencies by estimating client-specific optimal weights online from empirical participation intervals. Rigorous analysis shows that FedAU can provably converge to the optimum even without prior knowledge of participation, matching or exceeding memory-heavy baselines (Wang et al., 2023).
- FedAWARE: Maximizes client gradient-diversity at each round via adaptive reweighting, seeking the convex combination of client updates that least conflicts in direction. Theoretical bounds show that this approach directly mitigates statistical and system heterogeneity and accelerates convergence under non-IID splits (Zeng et al., 2023).
- Precision-Weighted FedAvg: Aggregates client updates with inverse-variance (precision) weights, computed from empirical gradient variances. This downweights noisy or unreliable updates, substantially boosting test accuracy and reliability in non-IID settings, with notable convergence speedups observed empirically (Reyes et al., 2021).
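As referenced in the FedGMA item above, here is a minimal sketch of the AND-mask idea, assuming client gradients are stacked as rows of an array and `tau` is the sign-agreement threshold; it illustrates the masking mechanism rather than reproducing the reference implementation.

```python
import numpy as np

def gradient_and_mask(client_grads, tau=0.8):
    """Illustrative FedGMA-style AND-mask over client gradient signs.

    client_grads : array of shape (K, d), one local gradient (or update) per client
    tau          : required sign-agreement score per coordinate, in [0, 1]
    Returns the masked average gradient: coordinates whose net sign agreement
    across clients falls below tau are zeroed out before averaging.
    """
    client_grads = np.asarray(client_grads, dtype=float)
    signs = np.sign(client_grads)               # entries in {-1, 0, +1}
    agreement = np.abs(signs.mean(axis=0))      # net sign agreement per coordinate
    mask = (agreement >= tau).astype(float)
    return mask * client_grads.mean(axis=0)

# Example: 3 clients, 4 parameters; only coordinates with consistent signs survive.
g = np.array([[ 0.9, -0.2,  0.4,  0.1],
              [ 1.1,  0.3,  0.5, -0.2],
              [ 0.8, -0.1,  0.6,  0.3]])
print(gradient_and_mask(g, tau=0.8))   # keeps coordinates 0 and 2, zeros the rest
```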
The following table summarizes the weighting schemes and main ideas:
| Variant | Aggregation Weights | Key Mechanism |
|---|---|---|
| FedAvg | $n_k / \sum_{j} n_j$ | Data-size-weighted averaging |
| PW-Fed | Inverse of empirical gradient variance | Downweights noisy client updates |
| FedCostWAvg | $\alpha \cdot$ sample ratio $+ (1-\alpha) \cdot$ cost-drop ratio | Rewards clients making more progress |
| FedAU | Online estimate of client weights (from participation intervals) | Compensates unknown participation |
| FedGMA | AND-mask over gradient signs | Filters out non-consensus coordinates |
| FedAWARE | Adaptive: maximizes gradient diversity | Reduces aggregation conflict |
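As an illustration of two of these weighting schemes, here is a short sketch, assuming per-client empirical gradient variances and per-round local costs are available; the function names, the mixing coefficient `alpha`, and the normalization choices are assumptions for exposition, not the papers' reference code.

```python
import numpy as np

def precision_weights(grad_variances, eps=1e-12):
    """Precision-weighted aggregation: weight each client by the inverse of its
    empirical gradient variance, then normalize to sum to one."""
    w = 1.0 / (np.asarray(grad_variances, dtype=float) + eps)
    return w / w.sum()

def cost_weighted_avg_weights(n_samples, cost_prev, cost_curr, alpha=0.5):
    """FedCostWAvg-style weights: mix the sample ratio with a normalized
    cost-drop ratio (previous cost / current cost); alpha balances the two."""
    n = np.asarray(n_samples, dtype=float)
    drop = np.asarray(cost_prev, dtype=float) / np.asarray(cost_curr, dtype=float)
    return alpha * n / n.sum() + (1.0 - alpha) * drop / drop.sum()

# Example for three clients:
print(precision_weights([0.1, 0.5, 2.0]))
print(cost_weighted_avg_weights([50, 120, 80],
                                cost_prev=[1.00, 0.90, 1.20],
                                cost_curr=[0.80, 0.85, 0.60]))
```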
4. Communication, Computation, and System Design
FedAvg’s principal motivation is to reduce communication by shifting computation to clients. Performing multiple local SGD steps per round enables a substantial reduction in the number of communication rounds compared to synchronous SGD (McMahan et al., 2016, Chen et al., 2019). However, a large $E$ increases model drift between clients, so in practical implementations $E$ is kept small and model divergence is monitored.
System parameters such as the number of clients per round, local batch size, and optimizer choice can be flexibly adjusted. Large-scale deployments (e.g., language model training on mobile devices) have validated the practical scalability, with reported settings of roughly 500 clients per round, client learning rates around 0.5, and a server learning rate of 1.0 (Chen et al., 2019).
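A hypothetical configuration sketch capturing these reported settings; the field names, structure, and the batch size are assumptions, not values from the cited deployment.

```python
from dataclasses import dataclass

@dataclass
class FedAvgRoundConfig:
    """Illustrative per-round configuration, loosely based on the reported
    large-scale settings above; field names and the batch size are hypothetical."""
    clients_per_round: int = 500   # reported cohort size per round
    local_steps: int = 1           # E, kept small to limit client drift
    local_batch_size: int = 16     # assumed value, for illustration only
    client_lr: float = 0.5         # reported client-side learning rate
    server_lr: float = 1.0         # reported server-side learning rate

config = FedAvgRoundConfig()
```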
Extensions to the classical protocol include semi-synchronous orchestration, joint optimization of resource allocation (bandwidth scheduling), and non-blocking aggregation to minimize wall-clock time under straggler effects (You et al., 2022).
5. Theoretical Bias and Advanced Analysis
Recent work decomposes the limiting bias of FedAvg into two distinct sources: a deterministic client heterogeneity term and a noise-induced bias from stochastic gradients. Under a constant step size $\eta$ and $E$ local steps, for a strongly convex and smooth objective, the stationary bias admits an expansion of the form (Mangold et al., 2 Dec 2024):
$$\bar{w}_\eta - w^* = \eta \left( B_{\mathrm{het}} + B_{\mathrm{noise}} \right) + O(\eta^2),$$
where $B_{\mathrm{het}}$ captures client dissimilarity and $B_{\mathrm{noise}}$ matches the classical SGD noise bias. The Richardson-Romberg (RR) extrapolation technique can remove the first-order term by combining iterates from runs at two step sizes, yielding an $O(\eta^2)$ bias overall, at the cost of doubling per-client computation (Mangold et al., 2 Dec 2024).
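A minimal sketch of the Richardson-Romberg idea under the expansion above: run FedAvg at step sizes $\eta$ and $2\eta$ and combine the resulting averaged iterates so the first-order bias cancels. The runner `run_fedavg` is a hypothetical callable returning the averaged final iterate of a full training run.

```python
def richardson_romberg_fedavg(run_fedavg, eta, **run_kwargs):
    """Richardson-Romberg extrapolation over the FedAvg step size.

    If the (averaged) limiting iterate satisfies w(eta) ~ w* + eta * B + O(eta^2),
    then 2 * w(eta) - w(2 * eta) cancels the first-order bias term, leaving an
    O(eta^2) bias, at the price of running FedAvg twice.
    """
    w_eta = run_fedavg(eta=eta, **run_kwargs)        # run at step size eta
    w_2eta = run_fedavg(eta=2 * eta, **run_kwargs)   # run at step size 2*eta
    return 2 * w_eta - w_2eta                        # first-order bias cancels
```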
6. Applications, Empirical Results, and Practical Recommendations
FedAvg has demonstrated robust convergence and strong empirical performance across a wide range of domains, including deep image models, sequence models, and language models trained on mobile devices (McMahan et al., 2016, Chen et al., 2019). Detailed ablations confirm that communication reductions of an order of magnitude or more are achievable without sacrificing accuracy, provided hyperparameters are well-tuned and local divergence is monitored.
Multiple extensions, such as blockwise partial averaging (Lee et al., 2022), hybrid cost-based aggregation (Mächler et al., 2021), and loss-weighted softmax aggregation (Mansour et al., 2022), yield measurable advantages in regimes with large client heterogeneity or varying participation. These variants are competitive with or outperform FedAvg in public federated benchmarks, as verified in the MICCAI FETS challenge and diverse academic datasets.
In summary, FedAvg remains a foundational algorithm in federated learning, with a mature theoretical framework, numerous robustifications, and an expanding suite of adaptive modifications that enhance performance under the practical challenges of distributed, heterogeneous data and intermittent participation.