
Federated Stochastic Gradient Descent

Updated 30 December 2025
  • Federated Stochastic Gradient Descent is a distributed extension of classical SGD that trains models over multiple clients without sharing raw data.
  • It leverages partial client participation and stale gradient reuse to create implicit momentum, balancing convergence speed with communication efficiency.
  • Algorithmic variants address challenges like data heterogeneity, personalization, and Byzantine resilience, offering practical trade-offs for real-world federated systems.

Federated Stochastic Gradient Descent (FedSGD) is a core algorithmic primitive within federated learning, where a population of autonomous clients cooperatively train a centralized or decentralized statistical model without direct exchange of their raw data. FedSGD extends classical stochastic gradient descent to distributed, heterogeneous, and communication-constrained regimes, introducing new phenomena absent from conventional (i.e., datacenter or synchronous) SGD. Recent research has elucidated its intrinsic properties, convergence guarantees, system-level trade-offs, and algorithmic variants designed for heterogeneity, privacy, and resilience.

1. Core Algorithmic Structure and Self-Induced Momentum

In the canonical FedSGD setup, $K$ clients each possess a local dataset $\mathcal{D}_k$ of size $n_k$. The global learning objective is

$$\min_{w\in\mathbb{R}^d} f(w) = \sum_{k=1}^K p_k\,f_k(w),\quad p_k = \frac{n_k}{\sum_j n_j},\quad f_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i)$$

During each communication round:

  1. The server samples a subset $S_t$ (size $N$) of clients and broadcasts $w^t$.
  2. Each selected client computes a stochastic gradient estimate $g^t_k$ (averaged over $H$ sampled minibatch gradients) and returns it to the server.
  3. The server aggregates all client gradients, with each non-participating client's contribution fixed at its previous value $g^{t-1}_k$:

$$g^t_k = g^{t-1}_k \quad \text{for } k \notin S_t, \qquad w^{t+1} = w^t - \eta \sum_{k=1}^K p_k\, g^t_k$$

This pattern introduces, at the global update level, an implicit "self-induced momentum" effect. Writing $G^t = \sum_{k=1}^K p_k\, g^t_k$ for the aggregated gradient, the update can be expressed (in expectation over the sampling) in momentum form:

$$w^{t+1} = w^t - \eta\, G^t, \qquad G^t \approx (1-\beta)\,\hat{g}^t + \beta\, G^{t-1},$$

where $\hat{g}^t$ denotes the freshly computed gradients and the momentum coefficient $\beta = 1 - N/K$ arises from reuse of stale gradients under partial client participation. This unification of stale-gradient effects and momentum establishes a precise quantitative link: federated SGD with random sampling induces momentum proportional to the fraction of unqueried clients (Yang et al., 2022).
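The round structure above can be sketched in a few lines. The following toy simulation is an illustrative setup (quadratic client losses, uniform weights $p_k = 1/K$, arbitrary parameter values, not the cited paper's protocol): sampled clients refresh their gradients, the rest contribute stale ones, and the aggregate still drives the global loss down.

```python
import numpy as np

# Toy FedSGD with stale-gradient reuse. Illustrative assumption: client k's
# loss is the quadratic f_k(w) = 0.5 * ||w - c_k||^2, so its gradient is
# simply w - c_k. Non-sampled clients keep contributing their last reported
# gradient, which acts like a momentum term at the server.

rng = np.random.default_rng(0)
K, N, d, eta, T = 20, 5, 3, 0.1, 200
centers = rng.normal(loc=2.0, size=(K, d))   # per-client optima c_k
w = np.zeros(d)                              # global model w^t
g = np.zeros((K, d))                         # last reported gradient per client

def global_loss(w):
    return 0.5 * np.mean(np.sum((w - centers) ** 2, axis=1))

losses = [global_loss(w)]
for t in range(T):
    sampled = rng.choice(K, size=N, replace=False)   # the subset S_t
    g[sampled] = w - centers[sampled]                # fresh gradients of f_k
    w = w - eta * g.mean(axis=0)                     # aggregate fresh + stale
    losses.append(global_loss(w))

print(losses[0], losses[-1])   # the global loss decreases toward its optimum
```

Despite a quarter of the gradients being stale on average, the iterate converges to the minimizer of the average loss, the mean of the $c_k$.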

2. Convergence Theory and Impact of System Bias

FedSGD, under suitable conditions (local $L$-Lipschitz smoothness, bounded gradient variance $\sigma^2$, and a gradient coherence parameter), admits sublinear $O(1/\sqrt{T})$ convergence for nonconvex objectives, with constants that degrade as staleness grows: a smaller $N$ (number of participants per round) increases staleness and slows learning. The staleness of each client's contribution is geometrically distributed with mean $K/N$, establishing a direct trade-off between communication cost per round and the implicit momentum injected into the descent dynamics. The optimal choice of $N$ depends on the available communication budget, bandwidth limits, and client selection strategies (Yang et al., 2022).
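The geometric-staleness claim can be checked numerically. The sketch below assumes uniform sampling of $N$ of $K$ clients without replacement each round, independently across rounds (illustrative parameter values, not taken from the cited work):

```python
import numpy as np

# Numerical check: under per-round uniform sampling of N of K clients, the
# age of a client's last reported gradient, observed at the moment it is
# refreshed, is geometrically distributed with mean K / N.

rng = np.random.default_rng(1)
K, N, T = 20, 5, 100_000
last_seen = np.zeros(K, dtype=int)   # round at which each client last reported
ages = []
for t in range(1, T + 1):
    sampled = rng.choice(K, size=N, replace=False)
    ages.extend(t - last_seen[sampled])   # staleness observed at refresh time
    last_seen[sampled] = t

print(np.mean(ages))   # close to K / N = 4
```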

3. Algorithmic Variants: Heterogeneity, Personalization, and Robustness

Multiple algorithmic extensions have been developed to address data heterogeneity, statistical drift, and system vulnerabilities:

  • Depersonalized Federated SGD: To handle non-IID data, FedDeper alternates two SGD loops per client—one optimizing the local personalized loss, one a surrogate with a penalization term that subtracts personal drift. This mechanism reduces update variance and accelerates convergence while ensuring each round's update is less affected by outlying client distributions. The empirical results demonstrate improved test accuracy and faster convergence compared to FedAvg and other baselines, with the depersonalization step yielding marked benefits under low client sampling rates (Zhou et al., 2022).
  • Personalized Exact Federated SGD: Exploiting parameter decomposition into a shared global block $w$ and per-client parameters $\theta_k$, PFLEGO orchestrates unbiased SGD over both sets by having clients perform local updates on $\theta_k$ and communicate only gradients relevant to $w$. This stratification achieves optimal test accuracy in personalized regimes (e.g., Omniglot, CIFAR-10) and lowers both computation and communication per round (Nikoloutsopoulos et al., 2022).
  • Byzantine Resilient FedSGD: The two-time-scale local SGD method combines fast updates of stochastic gradient estimates with slow parameter iteration, and introduces robust aggregation via comparative elimination (excluding the furthest $f$ out of $N$ client results). This scheme achieves exact convergence under standard $2f$-redundancy and polylogarithmic communication complexity, substantially improving robustness over previous Byzantine-resilient approaches, which could guarantee only approximate stationarity (Dutta et al., 2024).
  • Compression and Quantization: Algorithms such as GDCI and Stochastic-Sign SGD investigate the impact of compressing local iterates or quantizing gradients before aggregation. Their theoretical analyses show that unbiased random compression slows convergence only up to a bounded error neighborhood whose size scales with the compression variance, with precise bit-level trade-offs. Stochastic-Sign SGD, specifically, delivers 32x compression versus full-precision (32-bit) SGD and incorporates noise-based differential privacy and Byzantine tolerance in a unified manner (Khaled et al., 2019, Jin et al., 2020).
  • Variance Reduction and Acceleration: Extensions incorporating local SVRG steps (FedAvg-SVRG) or momentum-based acceleration (FedAc) can improve the attainable convergence rate or reduce the number of synchronization rounds required relative to FedAvg, especially under strong convexity or higher-order smoothness (Rostami et al., 2022, Yuan et al., 2020).
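The comparative-elimination aggregation idea above can be sketched as follows. The ranking criterion used here (distance to the coordinate-wise mean) is an illustrative assumption, not necessarily the exact rule of the cited paper; it conveys the core mechanism of discarding the $f$ most extreme client results before averaging.

```python
import numpy as np

# Sketch of comparative-elimination style robust aggregation: drop the f
# client vectors furthest from the crowd, then average what remains.

def comparative_elimination(grads, f):
    """grads: (N, d) array of client gradients; f: number of results to drop."""
    center = grads.mean(axis=0)
    dists = np.linalg.norm(grads - center, axis=1)
    keep = np.argsort(dists)[: len(grads) - f]   # the N - f closest vectors
    return grads[keep].mean(axis=0)

honest = np.ones((8, 2))             # eight honest clients agree on (1, 1)
byzantine = np.full((2, 2), 100.0)   # two adversarial outliers
agg = comparative_elimination(np.vstack([honest, byzantine]), f=2)
print(agg)   # close to (1, 1): the outliers are excluded from the average
```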

4. State-Dependent Parameters, Trade-offs, and System Design

Key parameters (the number of local steps $H$, the learning rate $\eta$, the number of participants per round $N$, and the structure of aggregation) directly impact both convergence and resource demands.

Parameter | Impact on Convergence | Impact on System Cost
$N$ (clients/round) | Larger $N$ reduces staleness, lessens implicit momentum, accelerates convergence | Increases per-round communication
$H$ (local steps) | Increases computation, can reduce stochastic gradient variance | Potentially raises per-round client load
Compression ratio | High compression slows convergence in proportion to the compression variance parameter | Reduces uplink bandwidth cost

Careful system design must balance trade-offs among client participation, gradient staleness, communication cost, and heterogeneity-induced drift. Increasing $H$ enhances local compute efficiency but can worsen straggler effects or lead to local overfitting. Incorporating adaptive learning rates, penalization terms, and variance reduction (e.g., through mean-field mechanisms) further supports stable operation under a variety of real-world constraints (Yang et al., 2022, Yuan et al., 2023).
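The compression trade-off in the table above can be made concrete with an unbiased stochastic-sign quantizer. The exact form below (one sign bit per coordinate with a shared scale $B$) is an illustrative assumption rather than the precise scheme from the cited papers:

```python
import numpy as np

# Illustrative unbiased stochastic sign quantization: each coordinate is sent
# as a single sign bit plus a shared scale B, with P(+1) = (1 + x_i / B) / 2,
# so that E[B * q_i] = x_i whenever |x_i| <= B. One bit versus a 32-bit float
# per coordinate gives 32x compression.

rng = np.random.default_rng(2)

def stochastic_sign(x, B):
    p_plus = (1.0 + np.clip(x, -B, B) / B) / 2.0
    return np.where(rng.random(x.shape) < p_plus, 1.0, -1.0)

x = rng.uniform(-1, 1, size=100_000)
est = 1.0 * stochastic_sign(x, B=1.0)   # dequantized estimate B * q
print(np.mean(est - x))   # near 0: the quantizer is unbiased on average
```

Each coordinate individually is very coarse ($\pm B$), but the error averages out across coordinates, clients, and rounds, which is why such schemes reach a bounded neighborhood rather than diverging.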

5. Comparative Empirical Results and Application Domains

FedSGD and its extensions have been empirically validated in diverse settings:

  • In hospital resource prediction, decentralized FedSGD on an empirical network graph achieved lower mean-squared error (MSE) in hospital length-of-stay prediction compared to FedAvg, with test MSE 1.354 versus ∼1.8–1.9 for FedAvg (Balik, 2024).
  • For classification under high non-IID (heterogeneous) splits, methods such as PFLEGO outperform both FedAvg and prior personalized methods in top-1 accuracy, notably yielding ~2–5% improvements for highly personalized tasks (Nikoloutsopoulos et al., 2022).
  • For robust federated optimization, resilient two-time-scale SGD matches the convergence rate of full-batch SGD, tolerates adversarial clients, and requires no extra communication relative to standard FedSGD (Dutta et al., 2024).
  • Communication-efficient schemes such as Stochastic-Sign SGD achieve accuracy similar to DP-FedSGD at a fraction of bandwidth costs, supporting both local differential privacy and Byzantine robustness (Jin et al., 2020).

6. Theoretical Advances and Future Directions

Recent theoretical analyses have unified the roles of staleness, implicit momentum, data heterogeneity, and variance in federated stochastic optimization. There is a precise characterization of how communication constraints manifest as momentum effects, how personalization and depersonalization shape convergence under non-IID data, and how quantized communication induces bounded steady-state error neighborhoods.

Future directions include:

  • Systematic integration of variance-reduction and momentum on arbitrary network graphs.
  • Adaptive mechanisms for online tuning of the participation rate $N$ and the local step size.
  • Unified methods achieving privacy, robustness, and statistical efficiency in the presence of unreliable communication, partial participation, and adversarial agents.
  • Expanded theoretical guarantees under relaxed assumptions (nonconvexity, unbounded heterogeneity).
  • Empirical benchmarking in open federated environments (e.g., mobile networks, cross-institutional collaborations).

These developments continue to position FedSGD—as both an algorithmic template and theoretical object—at the nexus of federated optimization research (Yang et al., 2022, Zhou et al., 2022, Dutta et al., 2024, Balik, 2024, Nikoloutsopoulos et al., 2022, Jin et al., 2020, Konečný, 2017, Rostami et al., 2022, Yuan et al., 2023, Khaled et al., 2019, Yuan et al., 2020).
