
Error Feedback in Optimization

Updated 4 July 2026
  • Error Feedback is a residual memory mechanism that stores and reinjects compression errors, enhancing convergence in distributed and federated optimization systems.
  • It employs various compressors, including biased and contractive types, to balance error accumulation with gradient updates, ensuring stable optimization dynamics.
  • Recent advancements like EF21 and federated variants integrate momentum and variance reduction, broadening EF’s applicability and improving performance in complex settings.

Error Feedback (EF), also called error compensation, denotes a family of stateful correction mechanisms in which the discrepancy between an intended vector and its compressed, quantized, clipped, or otherwise distorted surrogate is stored and re-injected into later iterations. In contemporary optimization literature, EF is most closely associated with communication-efficient distributed and federated learning under biased or contractive compressors such as Top-$K$, but related architectures also appear in $\Delta\Sigma$ quantization, compressed second-order preconditioners, and differentially private SGD. Across these domains, EF is best understood not as a single algorithm but as a residual-memory principle whose mathematical behavior depends strongly on smoothness, stochasticity, constraints, and the structure of the underlying distortion operator (Karimireddy et al., 2019; Ohno et al., 2016).

1. Canonical mechanism and virtual-iterate interpretation

A standard EF recursion in smooth unconstrained optimization has the form

$$p_t=\gamma g_t+e_t,\qquad \Delta_t=\mathcal C(p_t),\qquad x_{t+1}=x_t-\Delta_t,\qquad e_{t+1}=p_t-\Delta_t.$$

Here $g_t$ is a gradient or stochastic gradient, $\gamma$ is the step size, $\mathcal C$ is a compressor, and $e_t$ is the residual memory. The central idea is that compression error is not discarded: whatever is lost when $p_t$ is mapped to $\Delta_t$ becomes the next residual. In distributed variants, the same principle is applied workerwise and then averaged at the server; in two-sided schemes such as dist-EF-SGD, both workers and server maintain residual variables, and the recursion includes the factor $\eta_{t-1}/\eta_t$ when learning rates vary (Gao et al., 3 Oct 2025; Phuong et al., 2020).

The classical analytical device is the virtual iterate

$$\tilde x_t:=x_t-e_t.$$

Under the additive update above,

$$\tilde x_{t+1}=x_{t+1}-e_{t+1}=(x_t-\Delta_t)-(p_t-\Delta_t)=x_t-e_t-\gamma g_t=\tilde x_t-\gamma g_t.$$

Thus the hidden sequence $\tilde x_t$ follows the same recursion as uncompressed SGD, with the gradient $g_t$ still evaluated at the actual iterate $x_t$. This identity explains why EF is effective in smooth unconstrained settings: the actual iterate is a perturbed version of an exact first-order method, and the perturbation is the residual itself. Much of the later EF literature either strengthens this virtual-iterate view or studies precisely where it breaks.
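The recursion and the virtual-iterate identity can be checked numerically. The following is a minimal sketch, not a reference implementation: the quadratic objective, the step size, and the helper names `top_k` and `ef_sgd` are assumptions made for illustration, and Top-$K$ stands in for a generic compressor $\mathcal C$.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_sgd(grad, x0, gamma, k, steps):
    """Run the EF recursion; return all iterates x_t and residuals e_t."""
    x, e = x0.copy(), np.zeros_like(x0)
    xs, es = [x.copy()], [e.copy()]
    for _ in range(steps):
        p = gamma * grad(x) + e   # p_t = gamma * g_t + e_t
        delta = top_k(p, k)       # Delta_t = C(p_t)
        x = x - delta             # x_{t+1} = x_t - Delta_t
        e = p - delta             # e_{t+1} = p_t - Delta_t
        xs.append(x.copy())
        es.append(e.copy())
    return np.array(xs), np.array(es)

# Toy problem (an assumption for illustration): f(x) = 0.5 * ||x - b||^2.
b = np.array([1.0, -2.0, 3.0, 0.5])
grad = lambda x: x - b
gamma = 0.1
xs, es = ef_sgd(grad, np.zeros(4), gamma, k=1, steps=200)

# Virtual iterate x~_t = x_t - e_t should satisfy x~_{t+1} = x~_t - gamma * g_t,
# where g_t is the gradient at the actual iterate x_t.
virtual = xs - es
grads = np.array([grad(x) for x in xs[:-1]])
identity_gap = np.max(np.abs(virtual[1:] - (virtual[:-1] - gamma * grads)))
print("max deviation from the virtual-iterate identity:", identity_gap)
```

The deviation is at the level of floating-point rounding, which reflects that the identity is algebraic and holds for any compressor, not only Top-$K$.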

2. Compression models, bias correction, and classical stabilization results

The EF literature distinguishes unbiased compressors from contractive or biased compressors. One common formulation defines an unbiased compressor by

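$$\mathbb E[\mathcal C(x)]=x,\qquad \mathbb E\|\mathcal C(x)-x\|^2\le \omega\|x\|^2,$$

and a contractive compressor by

$$\mathbb E\|\mathcal C(x)-x\|^2\le (1-\alpha)\|x\|^2,\qquad \alpha\in(0,1].$$

Equivalent notations appear as $\mathbb U(\zeta)$ for unbiased compression and $\mathbb C(\delta)$ for biased contractive compression with

$$\mathbb E\|\mathcal C(x)-x\|^2\le \left(1-\frac1\delta\right)\|x\|^2.$$

Top-$K$ in dimension $d$ is the canonical biased example: it is contractive, with $\alpha=K/d$ or equivalently $\delta=d/K$, but it is not unbiased (Horváth et al., 2020).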
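A small numerical check makes the two compressor classes concrete. This is a sketch under stated assumptions: scaled Rand-$K$ (keep $K$ uniformly chosen coordinates and multiply by $d/K$) is used as the unbiased example and is not taken from the text above, and the helper names are hypothetical. Expectations are computed exactly by enumerating all coordinate subsets rather than by sampling.

```python
import itertools
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k_outcomes(x, k):
    """All equally likely outputs of scaled Rand-K (keep k coords, scale by d/k)."""
    d = x.size
    outs = []
    for idx in itertools.combinations(range(d), k):
        out = np.zeros_like(x)
        out[list(idx)] = x[list(idx)] * d / k
        outs.append(out)
    return np.array(outs)

rng = np.random.default_rng(0)
d, k = 6, 2
x = rng.standard_normal(d)

# Top-K: deterministic and contractive with constant k/d.
topk_err = np.sum((top_k(x, k) - x) ** 2)
topk_bound = (1 - k / d) * np.sum(x ** 2)

# Scaled Rand-K: the exact mean over all subsets equals x (unbiased),
# and the exact mean squared error equals (d/k - 1) * ||x||^2.
outs = rand_k_outcomes(x, k)
randk_mean = outs.mean(axis=0)
randk_mse = np.mean(np.sum((outs - x) ** 2, axis=1))

print("Top-K error vs. bound:", topk_err, "<=", topk_bound)
print("Rand-K mean equals x:", np.allclose(randk_mean, x))
print("Rand-K MSE vs. (d/k - 1)||x||^2:", randk_mse, (d / k - 1) * np.sum(x ** 2))
```

Top-$K$ satisfies the contraction bound but returns a vector different from $x$ every time, so its bias never averages out; Rand-$K$ has larger error per call but zero bias.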

The original motivation for EF was that naive biased compression can be unstable. Sign-based and Top-$K$-type methods can fail even on simple convex or strongly convex problems, and the 2019 analysis of signSGD showed explicit counterexamples in which the objective increases in expectation or the iterates remain trapped away from the optimum. In that work, EF-SGD with arbitrary compression was shown to achieve the same convergence rate as SGD without additional assumptions, and the smooth nonconvex bound places compression only in a higher-order term (Karimireddy et al., 2019). This established the now-standard viewpoint that the main pathology is not compression per se, but uncorrected biased compression.

A related conclusion appears in later distributed analyses: biased compression alone is not stable in general, whereas EF converts persistent directional distortion into a delayed-transmission effect. This is the historical core of EF’s role in communication-efficient optimization.

3. Modern formulations: EF21, unified bias-variance correction, momentum, and sharper heterogeneity theory

A major reformulation of EF is EF21, which replaces explicit residual accumulation by a recursive gradient estimator. Worker $i$ maintains $g_i^t$ (superscripts index iterations) and updates

$$g_i^{t+1}=g_i^t+\mathcal C\big(\nabla f_i(x^{t+1})-g_i^t\big),$$

$$x^{t+1}=x^t-\gamma g^t,\qquad g^t=\frac1n\sum_{i=1}^n g_i^t.$$

The compressor is applied not to the full gradient but to the innovation relative to the current local estimate. This yields a one-step contraction recursion for $\|g_i^t-\nabla f_i(x^t)\|^2$, and under standard smoothness and lower-boundedness assumptions EF21 attains an $\mathcal O(1/T)$ nonconvex rate and a linear rate under the Polyak–Łojasiewicz condition, without bounded-gradient assumptions and without auxiliary unbiased compressors (Richtárik et al., 2021).
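The EF21 recursion, as it is commonly stated ($g_i^{t+1}=g_i^t+\mathcal C(\nabla f_i(x^{t+1})-g_i^t)$ on each worker, with the server stepping along the average of the $g_i^t$), can be sketched on a toy problem. Everything specific here is an assumption for illustration: the heterogeneous quadratics, the initialization $g_i^0=\mathcal C(\nabla f_i(x^0))$, the step size, and the function names.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21(grads, x0, gamma, k, steps):
    """grads is a list of per-worker gradient functions."""
    x = x0.copy()
    # Assumed initialization: each worker compresses its first gradient.
    g_local = [top_k(g(x), k) for g in grads]
    grad_norms = []
    for _ in range(steps):
        g_bar = np.mean(g_local, axis=0)   # server average of the estimators
        x = x - gamma * g_bar              # model step
        # Each worker compresses only the innovation relative to its estimator.
        g_local = [gi + top_k(g(x) - gi, k) for gi, g in zip(g_local, grads)]
        full_grad = np.mean([g(x) for g in grads], axis=0)
        grad_norms.append(np.linalg.norm(full_grad))
    return x, np.array(grad_norms)

# Toy heterogeneous workers (an assumption): f_i(x) = 0.5 * ||x - b_i||^2.
rng = np.random.default_rng(0)
bs = rng.standard_normal((4, 10))
grads = [lambda x, b=b: x - b for b in bs]

x_final, grad_norms = ef21(grads, np.zeros(10), gamma=0.05, k=2, steps=2000)
print("final gradient norm:", grad_norms[-1])
```

Because each worker transmits only $\mathcal C$ of the innovation, the estimators $g_i^t$ track the local gradients as the iterates settle, which is why no separate residual buffer is needed in this formulation.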

The 2021 extension paper showed that the EF21 mechanism supports a broad algorithmic envelope: partial participation, stochastic approximation, variance reduction, proximal composite objectives, heavy-ball momentum, and bidirectional compression all admit explicit convergence theory within the same Markov-compressor framework. In particular, EF21-PAGE supplies a variance-reduced finite-sum analogue, EF21-Prox covers composite objectives with proximal updates, and EF21-BC compresses both uplink and downlink while improving substantially over earlier bidirectional EF analyses (Fatkhullin et al., 2021).

A different line, EF-BV, places EF and DIANA-style variance reduction inside a single compressor class $\mathbb C(\eta,\omega)$ defined by separate bias and variance controls:

$$\|\mathbb E[\mathcal C(x)]-x\|\le \eta\|x\|,\qquad \mathbb E\|\mathcal C(x)-\mathbb E[\mathcal C(x)]\|^2\le \omega\|x\|^2.$$

Its update

$$d_i^t=\mathcal C_i^t\big(\nabla f_i(x^t)-h_i^t\big),\qquad h_i^{t+1}=h_i^t+\lambda d_i^t,\qquad g^{t+1}=\frac1n\sum_{i=1}^n\big(h_i^t+\nu d_i^t\big)$$

recovers DIANA when $\nu=1$ and EF21 when $\nu=\lambda$. The key consequence is that biased compressors and averaging-induced variance reduction need not be treated as disjoint worlds (Condat et al., 2022).

Two later refinements are especially relevant. First, adding Polyak momentum before compression in EF21-SGDM and EF21-SGD2M removes the large-batch requirement that earlier stochastic EF21 analyses needed in the nonconvex regime. The resulting stochastic term asymptotically matches the desired rate, and the paper’s central message is explicit: momentum provably improves error feedback (Fatkhullin et al., 2023). Second, the 2024 “Reloaded” analysis showed that EF21’s heterogeneity dependence can be improved from the quadratic mean

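$$L_{\mathrm{QM}}=\sqrt{\frac1n\sum_{i=1}^n L_i^2}$$

to the arithmetic mean

$$L_{\mathrm{AM}}=\frac1n\sum_{i=1}^n L_i$$

of local smoothness constants, via a weighted Lyapunov analysis. Because $L_{\mathrm{AM}}\le L_{\mathrm{QM}}$, with equality only when all $L_i$ coincide, this is a strict improvement whenever the workers differ and can be substantial in heterogeneous regimes (Richtárik et al., 2024).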
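A short worked example shows how far apart the two means can be. The smoothness constants below are made up for illustration: three well-conditioned workers and one outlier.

```python
import numpy as np

# Hypothetical local smoothness constants L_i for four workers.
L = np.array([1.0, 1.0, 1.0, 100.0])

L_am = L.mean()                    # arithmetic mean: (1 + 1 + 1 + 100) / 4
L_qm = np.sqrt(np.mean(L ** 2))    # quadratic mean: sqrt((1 + 1 + 1 + 10000) / 4)

print("arithmetic mean:", L_am)    # 25.75
print("quadratic mean:", L_qm)     # about 50.0
print("ratio QM / AM:", L_qm / L_am)
```

With a single outlier the quadratic mean is roughly twice the arithmetic mean, so a bound that depends on the latter permits a correspondingly larger step size in this toy configuration.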

4. Critiques, impossibility results, and setting-dependent limitations

EF is not uniformly dominant across settings, and a substantial body of work has made that point sharply. One influential critique argues that EF is not the right default mechanism for contractive biased compressors because any contractive compressor $\mathcal C_1$ can be converted into an induced unbiased compressor

$$\mathcal C_{\mathrm{ind}}(x)=\mathcal C_1(x)+\mathcal C_2\big(x-\mathcal C_1(x)\big),$$

where $\mathcal C_2$ is an unbiased compressor applied to the part that $\mathcal C_1$ discards, after which standard unbiased-compression methods can be used. In that framework, the induced approach has reduced persistent memory, better communication-complexity guarantees, weaker assumptions, and immediate compatibility with DIANA-style variance reduction and partial participation. The same paper also states that if the compressor is already unbiased, adding EF is generally a bad idea and can hurt empirically (Horváth et al., 2020).
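The induced-compressor construction can be illustrated by composing a biased compressor with an unbiased one applied to what the first leaves behind. This is a sketch under assumptions: Top-$K$ plays the contractive compressor, scaled Rand-$K$ plays the unbiased one, and the function names are hypothetical. The expectation is computed exactly by enumerating every Rand-$K$ outcome.

```python
import itertools
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def induced_outcomes(x, k_top, k_rand):
    """Enumerate all equally likely outputs of C1(x) + C2(x - C1(x))."""
    d = x.size
    c1 = top_k(x, k_top)          # biased, contractive part
    residual = x - c1             # what Top-K discarded
    outs = []
    for idx in itertools.combinations(range(d), k_rand):
        c2 = np.zeros_like(x)
        c2[list(idx)] = residual[list(idx)] * d / k_rand  # scaled Rand-K
        outs.append(c1 + c2)
    return np.array(outs)

rng = np.random.default_rng(1)
x = rng.standard_normal(6)

outs = induced_outcomes(x, k_top=2, k_rand=1)
induced_mean = outs.mean(axis=0)
topk_bias = np.linalg.norm(top_k(x, 2) - x)

print("bias of Top-K alone:", topk_bias)
print("induced compressor is unbiased:", np.allclose(induced_mean, x))
```

The second compressor is unbiased on the residual, so its mean restores exactly what Top-$K$ removed; no state has to be carried between iterations, which is the "reduced persistent memory" point made above.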

Another critique is structural rather than comparative. In composite optimization with

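$$\min_x\; f(x)+\psi(x),\qquad f \text{ smooth},\quad \psi \text{ a possibly non-smooth regularizer or constraint indicator},$$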

the classical EF virtual-iterate argument no longer survives the nonlinear proximal map. The paper “Composite Optimization with Error Feedback: the Dual Averaging Approach” states that vanilla EF is not the right abstraction once a non-smooth regularizer or constraints are present, because the composite update destroys the additive structure that underpins standard EF analysis. Its remedy is to move error control into dual accumulation via Dual Averaging + EControl rather than in the primal step itself (Gao et al., 3 Oct 2025).

The nonsmooth constrained convex regime produces an even sharper separation among EF variants. In the Safe-EF analysis, compressed subgradient descent can stall on a nonsmooth example, and EF21 can diverge on the same type of nonsmooth problem, whereas an EF14-style residual mechanism combined with a feasibility-preserving switching rule attains the order of the lower bound in the unidirectional case (Islamov et al., 9 May 2025). Thus “EF” is not a monolithic concept: gradient-tracking EF21, classical residual EF14, and composite-aware dual schemes behave differently outside the smooth unconstrained setting.

A further limitation concerns analysis quality. The 2020 revisit of dist-EF-SGD showed that a widely cited convergence proof under arbitrary learning-rate schedules was mathematically invalid because the memory term scales with $\eta_{t-1}/\eta_t$. The corrected theorem makes the error bound explicitly depend on learning-rate history, recovering validity but not the original schedule-independent lemma (Phuong et al., 2020).

Finally, the 2025 tight worst-case analysis of first-order methods with error feedback found that in the single-agent deterministic $L$-smooth $\mu$-strongly convex setting with a deterministic contractive compressor, EF and EF21 have exactly the same optimal worst-case rate and the same optimal stepsize, and both are strictly worse than compressed gradient descent. The conclusion is intentionally narrow, but it is decisive within that model: error feedback is not “compression for free” in every regime (Thomsen et al., 5 Jun 2025).

5. Error feedback beyond distributed gradient communication

Outside distributed gradient communication, EF appears in several technically distinct roles. In quantization theory, a $\Delta\Sigma$ modulator can be modeled as a static uniform quantizer with an EF filter. The 2016 rate-distortion analysis showed that the optimal EF-filter amplitude has a one-parameter form, which in turn determines how the achievable MSE decays. In this setting EF is a noise-shaping device rather than an optimizer correction mechanism (Ohno et al., 2016).

In second-order optimization, EF can be applied to the internal state of a preconditioner rather than to communicated gradients. EFCP compresses the gradient history fed into full-matrix preconditioners such as M-FAC and GGT through an error-feedback recursion and then builds the preconditioner from the compressed history. The empirical claim is strong: full-matrix preconditioners can be heavily sparsified without accuracy loss, with one to two orders of magnitude memory savings in practice. The paper is equally clear that it does not prove a new convergence theorem for compressed full-matrix M-FAC or GGT (Modoranu et al., 2023).

In differentially private optimization, DiceSGD uses clipped EF to remove the constant clipping bias of DPSGD-GC: the update direction is formed with clipping, and the residual is then updated to retain what clipping removed. The fixed-point argument shows that, unlike standard clipped DP-SGD, DiceSGD does not shift the stationary condition to the clipped-gradient field. The accompanying Rényi-DP analysis is algorithm-specific because the residual is a hidden, nonprivatized state (Zhang et al., 2023).

6. Federated learning, constraints, and recent residual refinements

Recent work has pushed EF into explicitly federated regimes with local steps, heterogeneity, and partial participation. Safe-EF addresses nonsmooth constrained convex optimization with a global constraint by switching between objective and constraint subgradients according to a test on the current constraint value, while still using an EF14-style residual. The analysis again proceeds through a virtual iterate. It yields a deterministic guarantee under bidirectional compression, and the unidirectional rate matches the paper’s lower bound up to constants (Islamov et al., 9 May 2025).

A 2026 refinement, SA-PEF, modifies EF for non-IID federated learning with local SGD and biased uplink compression by introducing step-ahead partial error feedback. Each client first previews a fraction of its residual, then performs local SGD, and finally compresses its update. The method recovers both EF and step-ahead EF as special cases of the preview fraction. Its defining theoretical quantity is the residual contraction factor, which is strictly smaller than the EF baseline over a range of preview fractions and has an explicit optimum. The nonconvex convergence rate matches standard Fed-SGD up to constant factors, while the smaller contraction factor explains the empirically faster early training phase relative to EF (Redie et al., 28 Jan 2026).

Taken together, these works establish a general pattern. EF remains a central tool for biased compression, but its useful form is regime-dependent. Classical residual EF is robust in some nonsmooth constrained settings; gradient-tracking EF21 is powerful in smooth problems and admits many extensions; dual-averaging or safe-switching variants become necessary once proximal structure or safety constraints enter; and recent federated refinements modify the residual pathway itself to improve transient behavior under non-IID local training. In current literature, the most precise summary is therefore not that EF is universally optimal, but that EF is a residual-correction design principle whose success depends on matching the error pathway to the geometry of the optimization problem.
