Decaying Local SGD: Adaptive Distributed Training
- Decaying Local SGD is a distributed stochastic optimization technique that adaptively reduces local update frequency or step-size to control model drift and enhance convergence.
- It employs dynamic decay schedules for both step-size and communication intervals to balance early computational gains with later synchronization for improved stability.
- Empirical and theoretical results show that these decay strategies boost scalability, generalization, and convergence rates while mitigating drift in heterogeneous and nonconvex settings.
Decaying Local SGD is a class of distributed stochastic optimization algorithms in which the frequency, magnitude, or duration of local updates—i.e., the number of local gradient steps or local learning rates—are adaptively reduced ("decayed") during training. This strategy generalizes classic Local SGD, where parallel workers perform several independent SGD steps before synchronizing, by introducing dynamic schedules that adjust local computation or communication intervals as training progresses. Decaying Local SGD encompasses approaches arising from communication-efficient distributed learning, nonconvex optimization, and federated learning, and is intricately linked to advances in scalability, generalization, and theoretical optimality.
1. Foundations and Motivation
The canonical Local SGD algorithm is motivated by the need to mitigate communication bottlenecks in large-scale distributed settings, particularly when mini-batch SGD's communication-to-computation ratio limits scalability (Stich, 2018). In Local SGD, each worker performs $H$ local SGD steps before synchronizing parameters by averaging. However, if $H$ is fixed and large, model divergence (or "drift") across workers accumulates, potentially degrading convergence or generalization, especially in heterogeneous-data or decentralized scenarios.
Decaying Local SGD introduces mechanisms to adaptively reduce either the step size or the number of local steps over time. The main motivations include:
- Preserving communication efficiency at scale while mitigating the drift incurred by prolonged local training.
- Leveraging large local updates early in training (when optimization is far from the optimum) and reducing local intervals later to ensure better model consensus and reduced fixed-point error.
- Achieving stronger generalization, sharper convergence rates, and optimal time complexity in both convex and nonconvex settings (Fradin et al., 27 Sep 2025).
This paradigm is empirically and theoretically explored in various research threads, notably via decaying learning rates (Stich, 2018, Basu et al., 2019), dynamically shrinking local update intervals (Li et al., 2019), adaptive communication schedules (Spiridonoff et al., 2021), and explicit decaying step-size schedules (Fradin et al., 27 Sep 2025).
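As a point of reference for the schedules discussed next, the following is a minimal sketch of Local SGD with both the local period and the step size decayed over rounds. The synthetic least-squares objective, worker count, and schedule constants are illustrative assumptions, not settings taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem shared by all workers (illustrative only).
A = rng.normal(size=(256, 10))
b = rng.normal(size=256)

def minibatch_grad(x, idx):
    """Mini-batch gradient of the loss 0.5/|idx| * ||A[idx] x - b[idx]||^2."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

N_WORKERS, ROUNDS = 8, 50
x_global = np.zeros(10)

for r in range(ROUNDS):
    H = max(1, 16 // (1 + r // 10))      # local steps per round, decayed as training progresses
    eta = 0.1 / (1 + 0.1 * r)            # step size, polynomially decayed
    local_models = []
    for _ in range(N_WORKERS):
        x = x_global.copy()
        for _ in range(H):               # H independent local SGD steps
            idx = rng.integers(0, 256, size=8)
            x -= eta * minibatch_grad(x, idx)
        local_models.append(x)
    x_global = np.mean(local_models, axis=0)   # synchronize by parameter averaging
```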
2. Algorithmic Principles and Schedules
Decaying Local SGD variants are characterized by the interplay of the following elements (a code sketch of these schedule families follows the list):
- Step-size Decay: The local (and often global) SGD step size is decreased according to a schedule; for example, a shifted polynomial decay $\eta_t = \Theta\!\bigl(\tfrac{1}{\mu(a + t)}\bigr)$ for strongly convex losses, with the shift $a$ on the order of $\max\{\kappa, H\}$ (Stich, 2018). Similar polynomial or geometric decays appear in deep learning and linear regression (Wu et al., 2021, Basu et al., 2019).
- Decaying Local Steps or Communication Intervals: The number of local updates $H$ between synchronizations, i.e., the communication interval, is dynamically reduced. In decentralized or federated settings this can be formalized stage-wise, e.g., by repeatedly halving the period, $H_{j+1} = \lceil H_j / 2 \rceil$, or continuously via an explicit formula for the interval following the $i$-th communication (Spiridonoff et al., 2021).
- Decaying Number of Local Steps in FL: In federated settings, the number of local steps per client per round, $K_r$, is reduced as a function of elapsed time, progress, or loss decrease, e.g., proportionally to the remaining loss, $K_r = \bigl\lceil K_0 \, f(\mathbf{x}_r) / f(\mathbf{x}_0) \bigr\rceil$, where $f(\mathbf{x}_r)$ is the current objective value (Mills et al., 2023).
- Decaying Communication Frequency: Communication becomes less frequent as iterations increase; for example, synchronization occurs only at times $t_1 < t_2 < \dots$ whose gaps $t_{i+1} - t_i$ grow with the communication index $i$ (Spiridonoff et al., 2021).
- Adaptive Local Step-size Within Rounds: The most recent advancement is to assign a (possibly much larger) step size to early local updates within a local computation block and then decay it as the local index increases, e.g., $\eta_k = \frac{c\,\gamma}{k}$ for the $k$-th local step ($k = 1, \dots, H$), with $\gamma$ the global step-size and $c$ a scaling parameter (Fradin et al., 27 Sep 2025).
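The sketch below collects one concrete instance of each schedule family listed above. The constants and exact functional forms are hedged illustrations consistent with the descriptions in this list, not the precise rules from the cited papers.

```python
import math

def polynomial_step_size(t, a=1.0, shift=100):
    """Shifted polynomial step-size decay, eta_t proportional to 1/(t + shift)."""
    return a / (t + shift)

def halving_local_steps(H0, stage):
    """Stage-wise shrinking local-update period: halve H at each stage, never below 1."""
    return max(1, H0 // (2 ** stage))

def growing_comm_gap(i, c=2):
    """Inter-communication interval after the i-th synchronization grows with i,
    so the communication frequency decays over time."""
    return c * (i + 1)

def loss_proportional_local_steps(K0, f_current, f_initial):
    """Federated-style decay of local steps, proportional to the remaining loss."""
    return max(1, math.ceil(K0 * f_current / f_initial))

def within_round_step_size(gamma, k, c=8.0):
    """Large step for early local updates, decayed with the local index k >= 1."""
    return c * gamma / k
```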
3. Convergence Results and Theoretical Analysis
Strongly Convex Regime
For smooth, strongly convex problems, Local SGD with decaying step size and communication period $H$ achieves the same optimal rate as mini-batch SGD run for $T$ total steps on $N$ workers, provided that $H$ is not too large; the key is to set the communication period to $H = O\!\bigl(\sqrt{T/N}\bigr)$ (Stich, 2018).
Decaying the local step size or the communication period ensures that the error incurred by infrequent synchronization ("drift") does not dominate asymptotic convergence. The error term due to local divergence is controlled as a higher-order term scaling with $\eta_t^2 H^2$, and the overall error can match the bias-variance trade-off of mini-batch SGD under suitable decay policies (Stich, 2018, Basu et al., 2019).
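In schematic form, writing $N$ for the number of workers, $T$ for the iteration budget, $\mu$ for the strong-convexity constant, $\sigma^2$ for the gradient-noise variance, and $G$ for a gradient-norm bound, and omitting constants and condition-number factors, the trade-off behind this choice can be summarized as:

```latex
\mathbb{E}\bigl[f(\bar{x}_T)\bigr] - f^\star
  \;\lesssim\;
  \underbrace{\frac{\sigma^2}{\mu N T}}_{\text{mini-batch variance term}}
  \;+\;
  \underbrace{\frac{G^2 H^2}{\mu T^2}}_{\text{local-drift term}}
```

Choosing $H = O\!\bigl(\sqrt{T/N}\bigr)$ keeps the drift term of the same or lower order as the variance term, which is the sense in which the decayed schedule matches mini-batch SGD.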
Nonconvex Settings and Time Complexity
Recent work reveals that, under classical time-complexity metrics (counting both the time spent on local gradient computations and the time spent on communication), standard Local SGD is suboptimal. Decaying Local SGD closes this gap by using substantially larger step sizes for the earliest local updates and decaying them across local steps, achieving a time complexity that is optimal up to constants and logarithmic factors (Fradin et al., 27 Sep 2025). The key corrections are an adaptive aggregation scaling and the use of adaptive, decaying local step sizes.
Federated and Decentralized Learning
For problems with data heterogeneity, decaying the number of local steps or the step size helps balance client drift against communication savings. When client objectives have bounded second- or third-order heterogeneity, decaying Local SGD can break the classical communication barrier in certain parameter regimes and deliver improved generalization (Murata et al., 2021, Patel et al., 19 May 2024).
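The role of heterogeneity can be seen on a toy problem with two quadratic clients whose minimizers differ: with a long fixed local period, the averaged iterate settles near the unweighted average of the client minimizers rather than the minimizer of the average objective, while a shrinking local period removes most of this fixed-point error. The objectives, step size, and halving schedule below are illustrative assumptions.

```python
import numpy as np

# Two heterogeneous "clients": quadratics with different curvatures and minimizers.
curv = np.array([1.0, 4.0])
mins = np.array([0.0, 1.0])
x_star = np.sum(curv * mins) / np.sum(curv)        # minimizer of the average objective (0.8)

def run(local_steps_schedule, rounds=200, eta=0.05):
    x = 0.0
    for r in range(rounds):
        H = local_steps_schedule(r)
        locals_ = []
        for a, m in zip(curv, mins):
            xi = x
            for _ in range(H):
                xi -= eta * a * (xi - m)           # deterministic local gradient step
            locals_.append(xi)
        x = float(np.mean(locals_))                # synchronize by averaging
    return x

fixed_large = run(lambda r: 64)                    # long local phases: noticeable drift bias
decaying = run(lambda r: max(1, 64 // (2 ** (r // 25))))   # halving schedule shrinks the bias
print(f"target {x_star:.3f}  fixed-H {fixed_large:.3f}  decaying-H {decaying:.3f}")
```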
4. Communication Efficiency and Trade-offs
Decaying Local SGD directly optimizes the trade-off between communication rounds and convergence:
| Variant / type | Communication reduction | Potential trade-off |
|---|---|---|
| Constant $H$ | $O(T/H)$ rounds; fixed drift | May stall if $H$ too large |
| Decaying $H$ / step size | Early: fewer rounds; late: more communication | Gradually mitigates drift |
| Adaptive step size | Larger steps early reduce rounds | Variance controlled by decay |
| Decaying intervals (Spiridonoff et al., 2021) | $O(N)$ communications (independent of $T$) | Linear speedup with minimal communication |
Empirical studies show that increasing $H$ (more local updates between synchronizations) initially accelerates training due to reduced communication, but can degrade generalization and convergence quality as model drift accumulates (Ortiz et al., 2021). Decaying $H$ ensures that, as training nears convergence, more frequent synchronization removes the accumulated error.
In decentralized networks, a “shrinking local update” schedule that halves the local computation period repeatedly ensures that residual error remains bounded and convergence quality is retained (Li et al., 2019).
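As a concrete reading of the table above, the snippet below counts synchronization rounds over a fixed iteration budget for a constant period versus a halving schedule; the budget and constants are arbitrary.

```python
def rounds_constant(T, H):
    """Synchronizations when the local period H is fixed."""
    return T // H

def rounds_halving(T, H0, stage_len):
    """Synchronizations when H is halved every `stage_len` iterations (never below 1)."""
    t, num_rounds = 0, 0
    while t < T:
        H = max(1, H0 // (2 ** (t // stage_len)))
        t += H
        num_rounds += 1
    return num_rounds

T = 100_000
print(rounds_constant(T, 64))          # ~1,562 rounds, constant drift throughout
print(rounds_halving(T, 64, 25_000))   # more rounds overall, concentrated late in training
```

The halving schedule spends its extra communication late in training, exactly where accumulated drift would otherwise dominate, mirroring the "early: fewer rounds; late: more communication" row of the table.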
5. Generalization and Practical Implications
Decaying Local SGD schedules confer advantages beyond communication savings:
- Implicit Regularization and Generalization Enhancement: Analytical results based on stochastic differential equation (SDE) approximations show that, at small learning rates and with sufficient training, repeated local updates induce a stronger drift along minimizer manifolds, biasing toward flatter solutions that generalize better (Gu et al., 2023). Post-local SGD (switching to local updates after an initial phase) matches or improves generalization compared to large-batch synchronized SGD (Lin et al., 2018).
- Federated Learning: Decaying the number of local SGD steps $K$ reduces computation and wall-clock time in federated settings while controlling client drift. In non-IID regimes, decaying $K$ per round delivers improved convergence and generalization with reduced local computation (sometimes less than 10% of fixed-$K$ baselines) (Mills et al., 2023); see the sketch after this list.
- Critical Batch Size and Decaying Learning Rate: The existence of an optimal batch size that minimizes stochastic first-order oracle complexity under decaying learning rates justifies using decay schedules for both learning rate and local batch size, ensuring lowest cumulative computational cost (Imaizumi et al., 23 Feb 2024).
- Extensibility: Recent work demonstrates that time-optimal decaying local step-size rules can be integrated into asynchronous and other local-update-based methods to achieve optimality in a broad class of distributed optimization algorithms (Fradin et al., 27 Sep 2025).
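A hedged sketch of a federated round loop with a loss-proportional local-step schedule in this spirit follows; the toy non-IID clients, the ceiling rule, and all constants are illustrative assumptions rather than the exact schedule of the cited work.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_client(shift):
    """Toy non-IID client: least-squares data generated around a shifted parameter."""
    A = rng.normal(size=(64, 5))
    b = A @ (shift * np.ones(5)) + 0.1 * rng.normal(size=64)
    return A, b

clients = [make_client(s) for s in (0.8, 1.0, 1.2, 1.4)]

def global_loss(x):
    return float(np.mean([np.mean((A @ x - b) ** 2) for A, b in clients]))

x = np.zeros(5)
K0, f0, eta = 32, global_loss(x), 0.01

for r in range(30):
    K = max(1, int(np.ceil(K0 * global_loss(x) / f0)))   # fewer local steps as the loss shrinks
    updates = []
    for A, b in clients:
        xi = x.copy()
        for _ in range(K):                                 # K local SGD steps on this client
            idx = rng.integers(0, 64, size=8)
            xi -= eta * 2 * A[idx].T @ (A[idx] @ xi - b[idx]) / len(idx)
        updates.append(xi)
    x = np.mean(updates, axis=0)                           # FedAvg-style aggregation
    if r % 10 == 0:
        print(f"round {r:02d}  K={K:3d}  loss={global_loss(x):.4f}")
```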
6. Limitations, Open Problems, and Model Assumptions
While decaying Local SGD methods offer significant advantages, theoretical results highlight key boundaries:
- Model Drift and Fixed-Point Error: Under only first-order data heterogeneity, a non-vanishing error proportional to local update length persists, limiting the efficacy of extended local updates. Higher-order smoothness assumptions (bounded Hessian difference, Lipschitz Hessian) materially improve the theoretical guarantees (Patel et al., 19 May 2024).
- Parameter Sensitivity: Improper schedule tuning (decay too fast or slow, or switching to local SGD at high learning rates) can degrade performance, both in accuracy and stability (Lin et al., 2018, Ortiz et al., 2021).
- Heterogeneity Challenges: While decaying strategies mitigate drift, their superiority over mini-batch SGD is pronounced only in regimes of low second- and third-order heterogeneity (Murata et al., 2021, Patel et al., 19 May 2024).
Open problems include:
- Delineating the precise problem class where Local SGD (with decaying schedules) strictly outperforms mini-batch SGD.
- Designing adaptively tuned, data-dependent decay rules that balance communication, computation, and generalization (Stich, 2018, Li et al., 2019).
- Extending optimal time-complexity results and adaptive step-size frameworks to highly nonconvex and non-IID scenarios (Fradin et al., 27 Sep 2025, Patel et al., 19 May 2024).
7. Extensions, Related Methods, and Impact
A spectrum of recent algorithmic advances builds upon the principles of decaying Local SGD:
- Compression and Sparsification: Qsparse-Local-SGD combines local updates with quantization and sparsification, using error compensation and decaying learning rates to retain optimal convergence despite lossy communication (Basu et al., 2019); a minimal error-feedback sketch follows this list.
- Overlap and Asynchronous Methods: Overlap Local SGD and asynchronous Local SGD with adaptive local update control reduce idle time and better utilize compute resources, with convergence guarantees comparable to synchronous approaches (Wang et al., 2020, Liu et al., 17 Jan 2024).
- Second Order Exploitation: Theoretical work establishes that local (and specifically decaying) SGD schemes implicitly leverage second-order information, exhibiting update projections that accelerate convergence in directions with low curvature—sometimes even approaching Newton-like behavior (Pan et al., 2023).
- Practical System-Level Benefits: Empirical studies confirm that decaying local update schedules are essential for scalability, especially at exascale or under resource heterogeneity (Ortiz et al., 2021, Avidor et al., 2022).
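To make the error-compensation mechanism concrete, the following is a minimal sketch of top-k sparsification with an error-feedback buffer. Real compressed local-update methods, including Qsparse-Local-SGD, combine this with quantization, local steps, and careful step-size decay; the sparsity level and buffer handling here are arbitrary illustrations.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedback:
    """Error compensation: re-inject what compression discarded into the next update."""
    def __init__(self, dim):
        self.residual = np.zeros(dim)

    def compress(self, update, k):
        corrected = update + self.residual        # add back previously dropped mass
        compressed = topk_compress(corrected, k)
        self.residual = corrected - compressed    # remember what was dropped this time
        return compressed

# Usage: each worker compresses its local model delta before communicating it.
ef = ErrorFeedback(dim=10)
delta = np.random.default_rng(2).normal(size=10)
sent = ef.compress(delta, k=3)                    # only 3 of 10 coordinates transmitted
```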
In sum, decaying Local SGD constitutes a versatile and theoretically grounded family of methods that reconcile the trade-offs between communication efficiency, computational cost, generalization, and scalability in large-scale distributed learning systems. Its further development and analysis continue to reshape the practice and theory of distributed optimization.