
Entropy-Based Gradient Filtering

Updated 26 April 2026
  • Entropy-based gradient filtering is a set of techniques that use information-theoretic entropy measures to adapt, modify, or compress gradients for improved optimization and robustness.
  • Methods such as GMEE, Entropy-SGD, and entropy-guided compression employ tailored entropy metrics to handle noise, favor flat minima, and reduce communication overhead in distributed systems.
  • Empirical studies show benefits like enhanced noise resilience, accelerated convergence, and significant communication savings, while also highlighting trade-offs in computational cost and parameter tuning.

Entropy-based gradient filtering encompasses a family of methodologies that use information-theoretic entropy measures to guide, modify, or attenuate gradients in order to improve optimization, stability, or communication efficiency. These techniques exploit entropy in various operational forms, including as an optimization cost function, a metric for adaptively filtering or compressing gradients, or an objective for regularization. Entropy-based filtering arises in diverse domains: adaptive system identification with robust error measures, distributed deep learning, numerical PDEs, and non-convex optimization. Variants include minimization of error entropy for adaptive filtering, using local free entropy to drive SGD into wide minima, enforcing discrete minimum entropy principles in conservation law solvers, and entropy-driven compression of gradients during large-scale distributed LLM training.

1. Generalized Minimum Error Entropy Filtering

A core development in entropy-based filtering is the Generalized Minimum Error Entropy (GMEE) adaptive filtering framework, which extends the standard Minimum Error Entropy (MEE) criterion by employing a generalized Gaussian density (GGD) kernel within the information potential cost. Given error samples $e_i = d_i - \mathbf{w}^T \mathbf{u}_i$ over a window of length $L$, the GMEE cost is formulated as

$$J_{\mathrm{GMEE}}(\mathbf w) = \frac{1}{L^2} \sum_{i=n}^{n+L-1}\sum_{j=n}^{n+L-1} G_{p,\beta}(e_i - e_j),$$

where $G_{p,\beta}(x) = \frac{p}{2\beta\Gamma(1/p)} \exp(-|x/\beta|^p)$, with tunable shape $p>0$ and scale $\beta>0$. Because $J_{\mathrm{GMEE}}$ is an information potential, minimizing the error entropy corresponds to maximizing it, and gradient ascent with respect to the weights produces the adaptive filtering update:

$$\mathbf w_{n+1} = \mathbf w_n + \mu \frac{p}{\beta^p L^2} \sum_{i,j} G_{p,\beta}(e_i-e_j)\,|e_i-e_j|^{p-1} \operatorname{sign}(e_i-e_j)\,(\mathbf{u}_i - \mathbf{u}_j).$$

For $p=2$ (Gaussian kernel), GMEE recovers the classic MEE. For $p<2$ (super-Gaussian), the filter is highly robust to impulsive outliers; for $p>2$ (sub-Gaussian), it is more attuned to fine variations and converges quickly in noise-limited regimes. GMEE demonstrates superior or equal steady-state excess mean-square error (EMSE) relative to MSE and maximum-correntropy criteria for a broad range of noise distributions. The computational complexity of the "full double-sum" GMEE update scales as $O(L^2)$ multiplications and additions per iteration, but quantized variants (QGMEE) reduce this overhead. Practical success has been observed in acoustic echo cancellation, with 3–5 dB improvement in echo return loss enhancement (ERLE) over NLMS and correntropy-based filters (He et al., 2021).
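The update above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ggd_kernel` and `gmee_step`, the step size, and the toy data are all hypothetical, and each iteration draws a fresh window rather than sliding one over a signal.

```python
import math
import random

def ggd_kernel(x, p, beta):
    """Generalized Gaussian density kernel G_{p,beta}(x)."""
    return p / (2.0 * beta * math.gamma(1.0 / p)) * math.exp(-abs(x / beta) ** p)

def gmee_step(w, U, d, mu, p, beta):
    """One full double-sum GMEE update over a window.

    w: list of M weights; U: list of L input vectors; d: list of L desired outputs.
    """
    L, M = len(U), len(w)
    # Errors e_i = d_i - w^T u_i over the window
    e = [d[i] - sum(w[k] * U[i][k] for k in range(M)) for i in range(L)]
    grad = [0.0] * M
    for i in range(L):
        for j in range(L):
            diff = e[i] - e[j]
            if diff == 0.0:
                continue  # sign(0) = 0, so this pair contributes nothing
            g = (ggd_kernel(diff, p, beta) * abs(diff) ** (p - 1)
                 * math.copysign(1.0, diff))
            for k in range(M):
                grad[k] += g * (U[i][k] - U[j][k])
    scale = mu * p / (beta ** p * L * L)
    return [w[k] + scale * grad[k] for k in range(M)]

if __name__ == "__main__":
    # Toy system identification with occasional impulsive outliers (assumed data).
    rng = random.Random(0)
    w_true = [0.5, -0.3]
    w = [0.0, 0.0]
    for _ in range(300):
        U = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(8)]
        d = []
        for u in U:
            noise = 10.0 * rng.gauss(0, 1) if rng.random() < 0.05 else 0.01 * rng.gauss(0, 1)
            d.append(sum(a * b for a, b in zip(w_true, u)) + noise)
        w = gmee_step(w, U, d, mu=0.5, p=1.5, beta=1.0)
    print(w)
```

The nested loops make the $O(L^2)$ cost per update explicit. Pairs with identical errors are skipped because their contribution is zero, which also avoids evaluating $|e_i - e_j|^{p-1}$ at zero when $p < 1$.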

2. Entropy-based Gradient Filtering in SGD (Entropy-SGD)

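Entropy-SGD reframes gradient-based optimization by targeting a local free-entropy objective instead of the raw empirical loss. The key object is the local free entropy centered at $\mathbf{w}$, parameterized by scope $\gamma$ and inverse temperature $\beta$:

$$F(\mathbf{w}; \gamma, \beta) = \log \int \exp\!\left(-\beta f(\mathbf{w}') - \frac{\beta\gamma}{2}\,\|\mathbf{w} - \mathbf{w}'\|^2\right) d\mathbf{w}',$$

where $f$ denotes the empirical loss. The gradient of $F$ is

$$\nabla_{\mathbf{w}} F(\mathbf{w}; \gamma, \beta) = -\beta\gamma\left(\mathbf{w} - \mathbb{E}_{\mathbf{w}' \sim P}\left[\mathbf{w}'\right]\right),$$

where $P(\mathbf{w}'; \mathbf{w}, \gamma, \beta) \propto \exp\!\left(-\beta f(\mathbf{w}') - \frac{\beta\gamma}{2}\|\mathbf{w} - \mathbf{w}'\|^2\right)$ is a local Gibbs measure centered at $\mathbf{w}$. In high-dimensional settings, this expectation is approximated efficiently by a short run of Stochastic Gradient Langevin Dynamics (SGLD). The outer update moves $\mathbf{w}$ toward the Polyak average of the SGLD trajectory, effectively filtering the gradient to prioritize flat minima. This biases optimization toward wide valleys in the loss landscape, leading to solutions with better uniform stability and generalization, as proved via smoothing and uniform stability theorems and observed empirically across diverse network architectures (Chaudhari et al., 2016).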
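The inner SGLD loop and the outer update toward the trajectory average can be sketched for a scalar parameter as follows. This is an illustrative sketch only: the function name `entropy_sgd_step`, the exponential averaging weight `alpha`, the inverse temperature being fixed to 1, and all hyperparameter values are assumptions made for the example rather than details taken from the text above.

```python
import math
import random

def entropy_sgd_step(w, grad_f, gamma, eta, sgld_steps, sgld_lr,
                     noise_scale, alpha, rng):
    """One outer Entropy-SGD-style update for a scalar parameter w.

    grad_f: callable returning the loss gradient at a point.
    gamma: scope of the local entropy; eta: outer step size.
    """
    w_prime = w  # SGLD iterate, started at the current weights
    mu = w       # running (exponential) average of the SGLD trajectory
    for _ in range(sgld_steps):
        # Gradient of f(w') + (gamma / 2) * (w - w')^2 with respect to w'
        g = grad_f(w_prime) - gamma * (w - w_prime)
        w_prime += (-sgld_lr * g
                    + math.sqrt(sgld_lr) * noise_scale * rng.gauss(0.0, 1.0))
        mu = (1.0 - alpha) * mu + alpha * w_prime
    # gamma * (w - mu) estimates the negative local-entropy gradient,
    # so this step pulls w toward the averaged SGLD iterate.
    return w - eta * gamma * (w - mu)

if __name__ == "__main__":
    # Toy quadratic loss f(w) = w^2 / 2 with gradient w (assumed example).
    rng = random.Random(0)
    w = 2.0
    for _ in range(200):
        w = entropy_sgd_step(w, lambda x: x, gamma=1.0, eta=0.5,
                             sgld_steps=20, sgld_lr=0.05,
                             noise_scale=0.01, alpha=0.25, rng=rng)
    print(w)
```

A quadratic loss has a single minimum, so this toy run only demonstrates the mechanics of the two nested loops; it does not demonstrate the preference for wide minima, which requires a loss with multiple basins.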

3. Entropy-Guided Adaptive Filtering in Numerical PDEs

For shock capturing in discontinuous spectral element methods (DSEM), entropy-based adaptive filtering ensures robust solution quality by enforcing strict entropy and positivity constraints. The approach operates in modal space: for a solution expanded in a modal basis $\{\psi_i\}$, each modal coefficient $\hat{u}_i$ is filtered by

$$\tilde{u}_i = \exp\!\left(-\zeta\, p_i^2\right) \hat{u}_i,$$

where $p_i$ is the mode polynomial order and $\zeta$ the filter strength. The filter strength $\zeta$ is adaptively chosen using a local discrete minimum entropy principle: on each element $\Omega_k$, the filtered field $\tilde{\mathbf{u}}$ must satisfy positivity (e.g., $\rho > 0$ and $P > 0$ for density and pressure) and

$$\sigma(\tilde{\mathbf{u}}(\mathbf{x})) \ge \sigma_{\min}$$

at all quadrature points, where $\sigma$ is a convex numerical entropy and $\sigma_{\min}$ is the minimum among $\Omega_k$ and all face-adjacent neighbors. The strength $\zeta$ is determined per stage by scalar root-finding (bisection), keeping the cost low and activating the filter only where needed. On hyperbolic and mixed hyperbolic-parabolic problems (e.g., Euler, Navier–Stokes), this filter achieves shock resolution, robust enforcement of positivity, and preservation of high-order accuracy in smooth regions. The implementation is element-local and parallelizable, making it suitable for large unstructured meshes (Dzanic et al., 2022).
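The bisection search for the filter strength can be illustrated on a one-dimensional toy problem. The sketch below makes several assumptions that are not taken from the text: a Legendre modal basis on $[-1, 1]$, an exponential filter $\exp(-\zeta k^2)$ on mode $k$, a single scalar lower bound `u_min` standing in for the positivity and entropy constraints, and equally spaced sample points standing in for quadrature points. All function names are hypothetical.

```python
import math

def legendre(k, x):
    """Evaluate the Legendre polynomial P_k(x) by the three-term recurrence."""
    p_prev, p = 1.0, x
    if k == 0:
        return p_prev
    for n in range(1, k):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p

def filtered_values(coeffs, zeta, points):
    """Values of the filtered expansion; mode k is damped by exp(-zeta * k^2)."""
    return [sum(math.exp(-zeta * k * k) * c * legendre(k, x)
                for k, c in enumerate(coeffs))
            for x in points]

def satisfies(coeffs, zeta, points, u_min):
    """Check the stand-in constraint u >= u_min at every sample point."""
    return min(filtered_values(coeffs, zeta, points)) >= u_min

def find_zeta(coeffs, points, u_min, zeta_max=50.0, iters=40):
    """Smallest filter strength found by bisection that meets the constraint."""
    if satisfies(coeffs, 0.0, points, u_min):
        return 0.0  # constraint already holds, so the filter stays inactive
    if not satisfies(coeffs, zeta_max, points, u_min):
        raise ValueError("constraint not satisfiable: element mean is below u_min")
    lo, hi = 0.0, zeta_max  # invariant: lo violates, hi satisfies
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if satisfies(coeffs, mid, points, u_min):
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    points = [-1.0, -0.5, 0.0, 0.5, 1.0]
    print(find_zeta([1.0, 0.2], points, 0.1))  # smooth element
    print(find_zeta([1.0, 1.5], points, 0.1))  # element that undershoots
```

Because the zeroth mode is never damped, the element mean is preserved, and the search returns zero whenever the unfiltered solution already meets the bound. That is how the filter stays inactive in smooth regions in this toy setting.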

4. Entropy-Driven Dynamic Gradient Compression in Large-Scale Distributed Training

In large-scale distributed LLM training, communication of gradients is a bottleneck. EDGC (Entropy-driven Dynamic Gradient Compression) exploits the empirical entropy of gradient distributions to adapt the compression rate. The workflow involves efficiently estimating the gradient entropy via down-sampling and combining this estimate with a theoretical model (using the Marchenko–Pastur law and properties of normal entropy) that links the entropy change $\Delta H$ to the rank $r$ required for PowerSGD-based low-rank gradient compression, via a function that maps rank to estimated compression error.

The rank $r$ is dynamically updated across communication windows, subject to a maximum change rate per window, global rank bounds, and alignment across pipeline stages. When gradient entropy drops (tighter, more compressible distributions), $r$ is reduced, increasing communication efficiency. Experiments on multi-billion-parameter models (GPT2-2.5B, GPT2-12.1B) showed up to 45% communication reduction and 14–16% end-to-end speedup, with maintained LLM accuracy. No compression is applied during early training ("warm-up") until gradients stabilize (Yi et al., 13 Nov 2025).
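The control loop described here (estimate entropy from a subsample, then adjust the rank within limits) might be sketched as below. The entropy estimator assumes Gaussian-distributed gradient entries, and the rank rule `rank * exp(sensitivity * delta_h)` is a hypothetical placeholder chosen only to show the clamping logic. It is not the EDGC entropy-to-rank model, and all names and default values are assumptions.

```python
import math
import random

def gaussian_entropy_estimate(grad, sample_size, rng):
    """Differential entropy (nats) from a random subsample, assuming Gaussian entries."""
    sample = rng.sample(grad, min(sample_size, len(grad)))
    mean = sum(sample) / len(sample)
    var = sum((g - mean) ** 2 for g in sample) / len(sample)
    var = max(var, 1e-30)  # guard against log(0) for constant gradients
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def update_rank(rank, delta_h, r_min, r_max, max_step, sensitivity=1.0):
    """Hypothetical rank rule: scale by exp(sensitivity * delta_h), then clamp.

    The per-window change is limited to max_step, and the result is kept
    within the global bounds [r_min, r_max].
    """
    target = rank * math.exp(sensitivity * delta_h)
    step = max(-max_step, min(max_step, target - rank))
    return int(max(r_min, min(r_max, round(rank + step))))

if __name__ == "__main__":
    rng = random.Random(0)
    rank, prev_h = 32, None
    for scale in (1.0, 0.8, 0.5, 0.3):  # gradients tightening over training
        grad = [scale * rng.gauss(0.0, 1.0) for _ in range(10000)]
        h = gaussian_entropy_estimate(grad, 500, rng)
        if prev_h is not None:
            rank = update_rank(rank, h - prev_h, r_min=4, r_max=64, max_step=8)
        prev_h = h
        print(scale, rank)
```

Subsampling keeps the entropy estimate cheap relative to the gradient size. The step limit prevents abrupt rank changes between communication windows. A real system would also need the warm-up period and the cross-stage alignment mentioned above, which this sketch omits.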

5. Comparative Performance and Practical Implications

Entropy-based gradient filtering methodologies are characterized by their adaptivity to distributional properties of error signals, gradients, or physical fields. The choice of entropy metric and its operationalization (e.g., GGD shape parameter, local entropy scope, rank-adaptive compression) directly impacts the robustness and efficiency of the filtering process.

| Methodology | Entropy Role | Key Application Domains |
|---|---|---|
| GMEE adaptive filtering (He et al., 2021) | Error entropy (GGD) | Adaptive filtering, echo cancellation |
| Entropy-SGD (Chaudhari et al., 2016) | Local free entropy | Deep learning, generalization |
| Positivity-preserving filter (Dzanic et al., 2022) | Discrete field entropy | Numerical PDEs |
| EDGC (Yi et al., 13 Nov 2025) | Gradient distribution entropy | Distributed training, LLMs |

Performance gains occur most saliently when the entropy measure aligns with the underlying distributional features of the target domain: heavy-tailed noise (GMEE), flat minima (Entropy-SGD), local shocks/discontinuities (DSEM), or dynamic compressibility of gradients (EDGC). A plausible implication is that further generalization of entropy measures to model-specific priors could yield even greater benefits but requires careful calibration to task and dataset geometry.

6. Limitations and Trade-offs

While entropy-based gradient filtering offers robustness, adaptivity, and efficiency, trade-offs exist. For GMEE and similar methods, the $O(L^2)$ computational cost of the full double sum can be high, but quantized approximations mitigate this. In distributed LLM training, early-stage gradients may have high entropy and resist compression, necessitating a warm-up phase with minimal filtering. Excessively aggressive compression (too low a rank $r$) can harm convergence or perplexity; the adaptive entropy control in EDGC is designed to avoid this. All approaches depend on accurate, low-bias entropy estimation, requiring careful parameterization of down-sampling or bisection schemes.

7. Connections and Broader Impact

Entropy-based gradient filtering unites error-entropy criteria in adaptive filtering, entropy-regularized deep optimization, physically-motivated field filters for conservation laws, and adaptive communication schemes in distributed training under the unifying principle of information content modulation. This theme is consistent with broader trends toward information-theoretic regularization, robust optimization, and resource-efficient large-scale learning. Establishing formal bridges between these applied settings is an ongoing research direction, with potential implications for the design of universally adaptive, entropy-aware algorithms across signal processing, scientific computing, and AI systems (He et al., 2021, Chaudhari et al., 2016, Dzanic et al., 2022, Yi et al., 13 Nov 2025).
