Error Feedback in Optimization

Updated 4 July 2026

Error Feedback is a residual memory mechanism that stores and reinjects compression errors, enhancing convergence in distributed and federated optimization systems.
It employs various compressors, including biased and contractive types, to balance error accumulation with gradient updates, ensuring stable optimization dynamics.
Recent advancements like EF21 and federated variants integrate momentum and variance reduction, broadening EF’s applicability and improving performance in complex settings.

Error Feedback (EF), also called error compensation, denotes a family of stateful correction mechanisms in which the discrepancy between an intended vector and its compressed, quantized, clipped, or otherwise distorted surrogate is stored and re-injected into later iterations. In contemporary optimization literature, EF is most closely associated with communication-efficient distributed and federated learning under biased or contractive compressors such as Top-$K$, but related architectures also appear in $\Delta\Sigma$ quantization, compressed second-order preconditioners, and differentially private SGD. Across these domains, EF is best understood not as a single algorithm but as a residual-memory principle whose mathematical behavior depends strongly on smoothness, stochasticity, constraints, and the structure of the underlying distortion operator (Karimireddy et al., 2019, Ohno et al., 2016).

1. Canonical mechanism and virtual-iterate interpretation

A standard EF recursion in smooth unconstrained optimization has the form

$p_t=\gamma g_t+e_t,\qquad \Delta_t=\mathcal C(p_t),\qquad x_{t+1}=x_t-\Delta_t,\qquad e_{t+1}=p_t-\Delta_t.$

Here $g_t$ is a gradient or stochastic gradient, $\mathcal C$ is a compressor, and $e_t$ is the residual memory. The central idea is that compression error is not discarded: whatever is lost when $p_t$ is mapped to $\Delta_t$ becomes the next residual. In distributed variants, the same principle is applied workerwise and then averaged at the server; in two-sided schemes such as dist-EF-SGD, both workers and server maintain residual variables, and the recursion includes the factor $\eta_{t-1}/\eta_t$ when learning rates vary (Gao et al., 3 Oct 2025, Phuong et al., 2020).

The classical analytical device is the virtual iterate

$\tilde x_t:=x_t-e_t.$

Under the additive update above,

$\tilde x_{t+1}=x_{t+1}-e_{t+1}=(x_t-\Delta_t)-(p_t-\Delta_t)=x_t-e_t-\gamma g_t=\tilde x_t-\gamma g_t.$

Thus the hidden sequence $\tilde x_t$ follows the same recursion as uncompressed SGD. This identity explains why EF is effective in smooth unconstrained settings: the actual iterate is a perturbed version of an exact first-order method, and the perturbation is the residual itself. Much of the later EF literature either strengthens this virtual-iterate view or studies precisely where it breaks.
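The recursion and the virtual-iterate identity can be checked numerically. The following is a minimal sketch, assuming a Top-$K$ compressor, a deterministic gradient, and a small synthetic least-squares problem; the helper names (`top_k`, `ef_gd`), the problem data, and the stepsize are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_gd(grad, x0, gamma, k, steps):
    """Run the EF recursion and track the gap to the virtual iterate."""
    x, e = x0.copy(), np.zeros_like(x0)
    x_virtual = x0.copy()              # uncompressed steps using gradients taken at x
    max_gap = 0.0
    for _ in range(steps):
        g = grad(x)
        p = gamma * g + e              # p_t = gamma * g_t + e_t
        delta = top_k(p, k)            # Delta_t = C(p_t)
        x = x - delta                  # x_{t+1} = x_t - Delta_t
        e = p - delta                  # e_{t+1} = p_t - Delta_t
        x_virtual = x_virtual - gamma * g
        max_gap = max(max_gap, float(np.linalg.norm((x - e) - x_virtual)))
    return x, e, max_gap

# Hypothetical test problem: f(x) = 0.5 * mean((A x - b)^2).
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
loss = lambda x: 0.5 * np.mean((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b) / len(b)

L = np.linalg.eigvalsh(A.T @ A / len(b)).max()   # smoothness constant
x0 = np.zeros(10)
x, e, max_gap = ef_gd(grad, x0, gamma=0.05 / L, k=2, steps=3000)
loss_start, loss_end = loss(x0), loss(x)
```

Here `max_gap` measures how far $x_t-e_t$ drifts from the uncompressed recursion driven by the same gradients; up to floating-point error it should stay at zero, which is exactly the identity derived above.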

2. Compression models, bias correction, and classical stabilization results

The EF literature distinguishes unbiased compressors from contractive or biased compressors. One common formulation defines an unbiased compressor by

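$\mathbb E[\mathcal C(x)]=x,\qquad \mathbb E\|\mathcal C(x)-x\|^2\le \omega\|x\|^2,\qquad \omega\ge 0,$

and a contractive compressor by

$\mathbb E\|\mathcal C(x)-x\|^2\le (1-\alpha)\|x\|^2,\qquad \alpha\in(0,1].$

Equivalent notations appear as $\mathbb U(\omega)$ for unbiased compression and $\mathbb C(\delta)$ for biased contractive compression with

$\mathbb E\|\mathcal C(x)-x\|^2\le \left(1-\tfrac{1}{\delta}\right)\|x\|^2,\qquad \delta\ge 1.$

Top-$K$ is the canonical biased example: in dimension $d$ it is contractive, with $\alpha=K/d$ or equivalently $\delta=d/K$, but it is not unbiased (Horváth et al., 2020).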
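The two compressor families can be compared empirically. The sketch below assumes the standard contraction bound $\|\mathcal C(x)-x\|^2\le(1-K/d)\|x\|^2$ for Top-$K$ and uses scaled Rand-$K$ as a representative unbiased compressor; both helpers and the random test vectors are illustrative assumptions.

```python
import numpy as np

def top_k(v, k):
    """Biased, contractive: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Unbiased: keep k random coordinates, scaled by d/k so E[C(v)] = v."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

rng = np.random.default_rng(1)
d, k = 50, 5

# Worst observed ratio ||C(x) - x||^2 / ||x||^2 for Top-K over random inputs.
worst_ratio = 0.0
for _ in range(1000):
    x = rng.standard_normal(d)
    ratio = np.sum((top_k(x, k) - x) ** 2) / np.sum(x ** 2)
    worst_ratio = max(worst_ratio, ratio)

# Relative bias: Top-K is deterministic, so its error is pure bias;
# the Rand-K bias is estimated by averaging many independent draws.
x = rng.standard_normal(d)
topk_bias = np.linalg.norm(top_k(x, k) - x) / np.linalg.norm(x)
randk_mean = np.mean([rand_k(x, k, rng) for _ in range(20000)], axis=0)
randk_bias = np.linalg.norm(randk_mean - x) / np.linalg.norm(x)
```

The ratio for Top-$K$ never exceeds $1-K/d$, while its bias does not average out; the Rand-$K$ average approaches $x$ at the cost of a large per-draw variance.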

The original motivation for EF was that naive biased compression can be unstable. Sign-based and Top-$K$-type methods can fail even on simple convex or strongly convex problems, and the 2019 analysis of signSGD showed explicit counterexamples in which the objective increases in expectation or the iterates remain trapped away from the optimum. In that work, EF-SGD with arbitrary compression was shown to achieve the same convergence rate as SGD without additional assumptions, and the smooth nonconvex bound places compression only in a higher-order term (Karimireddy et al., 2019). This established the now-standard viewpoint that the main pathology is not compression per se, but uncorrected biased compression.

A related conclusion appears in later distributed analyses: biased compression alone is not stable in general, whereas EF converts persistent directional distortion into a delayed-transmission effect. This is the historical core of EF’s role in communication-efficient optimization.
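A small numerical illustration of this instability, assuming a three-worker strongly convex construction of the kind used in the biased-compression literature (it is not one of the signSGD counterexamples cited above). The vectors `a`, the stepsize, and the starting point are assumptions chosen so that every worker's largest gradient coordinate points away from the optimum.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Worker i holds f_i(x) = <a_i, x>^2 + 0.25 * ||x||^2; the minimizer is x = 0.
a = np.array([[-3.0, 2.0, 2.0],
              [2.0, -3.0, 2.0],
              [2.0, 2.0, -3.0]])

def local_grad(i, x):
    return 2.0 * (a[i] @ x) * a[i] + 0.5 * x

def run(compress, gamma=0.05, steps=100):
    """Distributed gradient descent with a per-worker compressor, no error feedback."""
    x = np.ones(3)
    for _ in range(steps):
        avg = np.mean([compress(local_grad(i, x)) for i in range(3)], axis=0)
        x = x - gamma * avg
    return float(np.linalg.norm(x))

norm_start = float(np.linalg.norm(np.ones(3)))
norm_gd = run(lambda g: g)               # uncompressed baseline
norm_top1 = run(lambda g: top_k(g, 1))   # naive Top-1, residual discarded
```

On the diagonal $x=t(1,1,1)$ each worker's gradient is a permutation of $t(-5.5, 4.5, 4.5)$, so Top-1 keeps only the negative entry and the averaged step moves away from the origin, while the exact average gradient $\tfrac{7}{6}x$ points toward it.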

3. Modern formulations: EF21, unified bias-variance correction, momentum, and sharper heterogeneity theory

A major reformulation of EF is EF21, which replaces explicit residual accumulation by a recursive gradient estimator. With superscripts indexing iterations, worker $i$ maintains a gradient estimate $g_i^t$ and updates

$x^{t+1}=x^t-\gamma g^t,\qquad g^t=\frac{1}{n}\sum_{i=1}^n g_i^t,$

$g_i^{t+1}=g_i^t+\mathcal C\big(\nabla f_i(x^{t+1})-g_i^t\big).$

The compressor is applied not to the full gradient but to the innovation relative to the current local estimate. This yields a one-step contraction recursion for $\|g_i^t-\nabla f_i(x^t)\|^2$, and under standard smoothness and lower-boundedness assumptions EF21 attains an $\mathcal O(1/T)$ nonconvex rate and a linear rate under the Polyak–Łojasiewicz condition, without bounded-gradient assumptions and without auxiliary unbiased compressors (Richtárik et al., 2021).
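The EF21 idea of compressing the innovation between the fresh local gradient and the stored estimate can be sketched as follows. This is a minimal illustration, assuming Top-$K$ compression, exact local gradients, synthetic least-squares objectives, and a conservative hand-picked stepsize; none of these choices come from the cited paper.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(2)
n, d, k, m = 4, 20, 4, 30

# Hypothetical local objectives f_i(x) = 0.5 * mean((A_i x - b_i)^2).
A = [rng.standard_normal((m, d)) for _ in range(n)]
b = [rng.standard_normal(m) for _ in range(n)]
local_grad = lambda i, x: A[i].T @ (A[i] @ x - b[i]) / m
full_grad = lambda x: sum(local_grad(i, x) for i in range(n)) / n

L_max = max(np.linalg.eigvalsh(Ai.T @ Ai / m).max() for Ai in A)
gamma = 0.05 / L_max                     # assumed conservative stepsize

x = np.zeros(d)
g_local = [local_grad(i, x) for i in range(n)]   # start from exact local gradients
grad_norm_start = float(np.linalg.norm(full_grad(x)))

for _ in range(5000):
    g = sum(g_local) / n                 # server averages the local estimates
    x = x - gamma * g                    # model step with the aggregated estimate
    for i in range(n):
        # Each worker transmits only the compressed innovation.
        innovation = local_grad(i, x) - g_local[i]
        g_local[i] = g_local[i] + top_k(innovation, k)

grad_norm_end = float(np.linalg.norm(full_grad(x)))
```

Unlike the residual form, no error vector is stored: the state is the estimate `g_local[i]` itself, and in this sketch only 4 of 20 coordinates per worker change in each round.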

The 2021 extension paper showed that the EF21 mechanism supports a broad algorithmic envelope: partial participation, stochastic approximation, variance reduction, proximal composite objectives, heavy-ball momentum, and bidirectional compression all admit explicit convergence theory within the same Markov-compressor framework. In particular, EF21-PAGE supplies a variance-reduced finite-sum analogue, EF21-Prox covers composite objectives $f(x)+r(x)$ with proximal updates, and EF21-BC compresses both uplink and downlink while improving substantially over earlier bidirectional EF analyses (Fatkhullin et al., 2021).

A different line, EF-BV, places EF and DIANA-style variance reduction inside a single compressor class $\mathbb C(\eta,\omega)$ defined by separate bias and variance controls: $\|\mathbb E[\mathcal C(x)]-x\|\le \eta\|x\|$ and $\mathbb E\|\mathcal C(x)-\mathbb E[\mathcal C(x)]\|^2\le \omega\|x\|^2$. Its update

$d_i^t=\mathcal C_i^t\big(\nabla f_i(x^t)-h_i^t\big),\qquad h_i^{t+1}=h_i^t+\lambda d_i^t,\qquad g^{t+1}=\frac{1}{n}\sum_{i=1}^n\big(h_i^t+\nu d_i^t\big),\qquad x^{t+1}=x^t-\gamma g^{t+1}$

recovers DIANA when $\nu=1$ and EF21 when $\nu=\lambda$. The key consequence is that biased compressors and averaging-induced variance reduction need not be treated as disjoint worlds (Condat et al., 2022).

Two later refinements are especially relevant. First, adding Polyak momentum before compression in EF21-SGDM and EF21-SGD2M removes the large-batch requirement that earlier stochastic EF21 analyses needed in the nonconvex regime. The resulting stochastic term attains the desired asymptotic behavior, and the paper’s central message is explicit: momentum provably improves error feedback (Fatkhullin et al., 2023). Second, the 2024 “Reloaded” analysis showed that EF21’s heterogeneity dependence can be improved from the quadratic mean

$L_{\mathrm{QM}}=\sqrt{\frac{1}{n}\sum_{i=1}^n L_i^2}$

to the arithmetic mean

$L_{\mathrm{AM}}=\frac{1}{n}\sum_{i=1}^n L_i$

of local smoothness constants, via a weighted Lyapunov analysis. This is a strict improvement and can be substantial in heterogeneous regimes (Richtárik et al., 2024).
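A quick arithmetic check shows how far apart the quadratic and arithmetic means of local smoothness constants can be. The values below are hypothetical and chosen only to mimic a heterogeneous regime in which one worker is much less smooth than the others.

```python
import numpy as np

# Hypothetical local smoothness constants L_i for four workers.
L_local = np.array([1.0, 1.0, 1.0, 100.0])

L_am = float(L_local.mean())                    # arithmetic mean of the L_i
L_qm = float(np.sqrt(np.mean(L_local ** 2)))    # quadratic mean of the L_i
ratio = L_qm / L_am
```

For these values the arithmetic mean is 25.75 and the quadratic mean is just above 50, so a stepsize bound that depends on the former is roughly twice as permissive. The two means coincide only when all $L_i$ are equal.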

4. Critiques, impossibility results, and setting-dependent limitations

EF is not uniformly dominant across settings, and a substantial body of work has made that point sharply. One influential critique argues that EF is not the right default mechanism for contractive biased compressors because any contractive compressor $\mathcal C_1$ can be converted into an induced unbiased compressor

$\mathcal C_{\mathrm{ind}}(x)=\mathcal C_1(x)+\mathcal C_2\big(x-\mathcal C_1(x)\big),$

where $\mathcal C_2$ is an unbiased compressor applied to the residual, after which standard unbiased-compression methods can be used. In that framework, the induced approach has reduced persistent memory, better communication-complexity guarantees, weaker assumptions, and immediate compatibility with DIANA-style variance reduction and partial participation. The same paper also states that if the compressor is already unbiased, adding EF is generally a bad idea and can hurt empirically (Horváth et al., 2020).
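One way to build such an induced compressor is to send the contractive output plus an unbiased compression of what it missed, that is, $\mathcal C_1(x)+\mathcal C_2(x-\mathcal C_1(x))$. The sketch below assumes Top-$K$ as $\mathcal C_1$ and scaled Rand-$K$ as $\mathcal C_2$; these specific choices, the helper names, and the sample sizes are illustrative assumptions.

```python
import numpy as np

def top_k(v, k):
    """Contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def rand_k(v, k, rng):
    """Unbiased compressor: k random coordinates scaled by d/k."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def induced(v, k, rng):
    """Top-K output plus an unbiased Rand-K compression of the Top-K residual."""
    c1 = top_k(v, k)
    return c1 + rand_k(v - c1, k, rng)

rng = np.random.default_rng(3)
d, k, trials = 50, 5, 10000
x = rng.standard_normal(d)

ind = np.array([induced(x, k, rng) for _ in range(trials)])
rnd = np.array([rand_k(x, k, rng) for _ in range(trials)])

ind_bias = np.linalg.norm(ind.mean(axis=0) - x) / np.linalg.norm(x)
ind_mse = float(np.mean(np.sum((ind - x) ** 2, axis=1)))
rnd_mse = float(np.mean(np.sum((rnd - x) ** 2, axis=1)))
```

The induced compressor is unbiased because the Rand-$K$ term averages to the Top-$K$ residual, and its error is smaller than that of plain Rand-$K$ because the random part only has to cover the residual energy. The price is that each message carries two compressed vectors instead of one.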

Another critique is structural rather than comparative. In composite optimization with

$\min_{x}\; f(x)+\psi(x),$

where $f$ is smooth and $\psi$ is a possibly non-smooth regularizer or constraint indicator, the classical EF virtual-iterate argument no longer survives the nonlinear proximal map. The paper “Composite Optimization with Error Feedback: the Dual Averaging Approach” states that vanilla EF is not the right abstraction once a non-smooth regularizer or constraints are present, because the composite update destroys the additive structure that underpins standard EF analysis. Its remedy is to move error control into dual accumulation via Dual Averaging + EControl rather than in the primal step itself (Gao et al., 3 Oct 2025).

The nonsmooth constrained convex regime produces an even sharper separation among EF variants. In Safe-EF, compressed subgradient descent can stall on a simple nonsmooth example, and EF21 can diverge on the same type of nonsmooth problem, whereas an EF14-style residual mechanism combined with a feasibility-preserving switching rule attains the order of the paper’s lower bound in the unidirectional case (Islamov et al., 9 May 2025). Thus “EF” is not a monolithic concept: gradient-tracking EF21, classical residual EF14, and composite-aware dual schemes behave differently outside the smooth unconstrained setting.

A further limitation concerns analysis quality. The 2020 revisit of dist-EF-SGD showed that a widely cited convergence proof under arbitrary learning-rate schedules was mathematically invalid because the memory term scales with $\eta_{t-1}/\eta_t$. The corrected theorem makes the error bound explicitly depend on learning-rate history, recovering validity but not the original schedule-independent lemma (Phuong et al., 2020).

Finally, the 2025 tight worst-case analysis of first-order methods with error feedback found that in the single-agent deterministic $L$-smooth $\mu$-strongly convex setting with a deterministic contractive compressor, EF and EF21 have exactly the same optimal worst-case rate and the same optimal stepsize, and both are strictly worse than compressed gradient descent. The conclusion is intentionally narrow, but it is decisive within that model: error feedback is not “compression for free” in every regime (Thomsen et al., 5 Jun 2025).

5. Beyond gradient uplink compression: quantization, preconditioners, and differential privacy

Outside distributed gradient communication, EF appears in several technically distinct roles. In quantization theory, a $\Delta\Sigma$ modulator can be modeled as a static uniform quantizer with an error-feedback filter that shapes the spectrum of the quantization noise. The 2016 rate-distortion analysis derived a one-parameter form for the amplitude response of the optimal EF filter, together with the resulting decay of the achievable MSE. In this setting EF is a noise-shaping device rather than an optimizer correction mechanism (Ohno et al., 2016).
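The noise-shaping role can be illustrated with the simplest error-feedback quantizer. The sketch below assumes a first-order structure in which the previous quantization error is subtracted from the next input; this is a textbook special case used for illustration, not the optimal filter analyzed in the cited paper, and the constant input is an arbitrary choice.

```python
import numpy as np

def quantize(u, step=1.0):
    """Static uniform quantizer with output levels spaced `step` apart."""
    return step * np.round(u / step)

def error_feedback_quantizer(x, step=1.0):
    """First-order error feedback: correct each input by the previous error."""
    y = np.empty_like(x)
    e = 0.0
    for n, xn in enumerate(x):
        u = xn - e                  # input corrected by the stored error
        y[n] = quantize(u, step)
        e = y[n] - u                # quantization error carried to the next sample
    return y

x = np.full(1000, 0.3)              # constant input lying between two levels
y_plain = quantize(x)               # memoryless quantizer: always outputs 0
y_ef = error_feedback_quantizer(x)

# Largest accumulated error: grows without bound for the plain quantizer,
# stays within half a step with error feedback.
drift_plain = float(np.abs(np.cumsum(y_plain - x)).max())
drift_ef = float(np.abs(np.cumsum(y_ef - x)).max())
mean_ef = float(y_ef.mean())
```

With feedback, the output satisfies $y_n=x_n+e_n-e_{n-1}$, so the accumulated error telescopes to the current quantization error. This is the same residual-memory principle as in the optimization setting, applied to a signal instead of a gradient.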

In second-order optimization, EF can be applied to the internal state of a preconditioner rather than to communicated gradients. EFCP compresses the gradient history fed into full-matrix preconditioners such as M-FAC and GGT through

$a_t=g_t+e_t,\qquad c_t=\mathcal C(a_t),\qquad e_{t+1}=a_t-c_t,$

and then builds the preconditioner from the compressed history $c_t, c_{t-1}, \dots$. The empirical claim is strong: full-matrix preconditioners can be compressed to up to $99\%$ sparsity without accuracy loss, with one to two orders of magnitude memory savings in practice. The paper is equally clear that it does not prove a new convergence theorem for compressed full-matrix M-FAC or GGT (Modoranu et al., 2023).

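In differentially private optimization, DiceSGD uses clipped EF to remove the constant clipping bias of DPSGD-GC. For a minibatch $\mathcal B_t$ of size $B$ with per-sample gradients $g_t^i$ and clipping thresholds $C_1, C_2$, its update direction is

$v_t=\frac{1}{B}\sum_{i\in\mathcal B_t}\operatorname{clip}(g_t^i,C_1)+\operatorname{clip}(e_t,C_2),$

followed by, with Gaussian privacy noise $w_t$,

$x_{t+1}=x_t-\eta\,(v_t+w_t),\qquad e_{t+1}=e_t+\frac{1}{B}\sum_{i\in\mathcal B_t}g_t^i-v_t.$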

The fixed-point argument shows that, unlike standard clipped DP-SGD, DiceSGD does not shift the stationary condition to the clipped-gradient field. The accompanying Rényi-DP analysis is algorithm-specific because the residual is a hidden, nonprivatized state (Zhang et al., 2023).
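The fixed-point effect can be seen on a one-dimensional toy problem. The sketch below is an illustration in the spirit of clipped error feedback, not the DiceSGD algorithm itself: it assumes full-batch gradients, omits the privacy noise entirely (so it provides no privacy), and uses a residual update of the form "residual plus mean gradient minus transmitted direction"; the data `a`, thresholds, and stepsize are all assumptions.

```python
import numpy as np

# Per-sample losses f_i(x) = 0.5 * (x - a_i)^2; the true minimizer is mean(a) = 1.
a = np.array([-1.0, -1.0, 5.0])
C1, C2, eta, steps = 1.0, 1.0, 0.1, 2000

def per_sample_grads(x):
    return x - a

# Clipped gradient descent without error feedback.
x_clip = 0.0
for _ in range(steps):
    g = per_sample_grads(x_clip)
    x_clip -= eta * np.clip(g, -C1, C1).mean()

# Clipped gradients plus a clipped residual that stores what clipping removed.
x_ef, e = 0.0, 0.0
for _ in range(steps):
    g = per_sample_grads(x_ef)
    v = np.clip(g, -C1, C1).mean() + np.clip(e, -C2, C2)
    x_ef -= eta * v
    e = e + g.mean() - v          # residual: true mean gradient minus what was used

x_true = float(a.mean())
```

Without feedback the iterate settles where the clipped gradients cancel, here at $-0.5$, because the outlying sample is permanently down-weighted. With feedback a stationary point requires both $v=0$ and a vanishing mean gradient, so the iterate settles at the true minimizer.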

Recent work has pushed EF into explicitly federated regimes with local steps, heterogeneity, and partial participation. Safe-EF addresses nonsmooth constrained convex optimization with a global functional constraint by switching between objective and constraint subgradients according to whether the current iterate satisfies the constraint up to a tolerance, while still using an EF14-style residual that stores whatever the compressor discards. Its analysis again relies on a virtual iterate that absorbs the residual. The paper states a deterministic guarantee under bidirectional compression, and the unidirectional rate matches the paper’s lower bound up to constants (Islamov et al., 9 May 2025).

A 2026 refinement, SA-PEF, modifies EF for non-IID federated learning with local SGD and biased uplink compression by introducing step-ahead partial error feedback. Each client first previews a fraction of its residual, then performs local SGD, and finally compresses the resulting update. The method recovers EF at one extreme of the preview fraction and step-ahead EF at the other. Its defining theoretical quantity is a residual contraction factor that is strictly smaller than the EF baseline over a range of preview fractions, with an explicit optimum. The nonconvex convergence rate matches standard Fed-SGD up to constant factors, while the smaller contraction factor explains the empirically faster early training phase relative to EF (Redie et al., 28 Jan 2026).

Taken together, these works establish a general pattern. EF remains a central tool for biased compression, but its useful form is regime-dependent. Classical residual EF is robust in some nonsmooth constrained settings; gradient-tracking EF21 is powerful in smooth problems and admits many extensions; dual-averaging or safe-switching variants become necessary once proximal structure or safety constraints enter; and recent federated refinements modify the residual pathway itself to improve transient behavior under non-IID local training. In current literature, the most precise summary is therefore not that EF is universally optimal, but that EF is a residual-correction design principle whose success depends on matching the error pathway to the geometry of the optimization problem.