Dynamic EMA Proxy Overview
- Dynamic EMA Proxy is an adaptive framework that uses exponentially weighted moving averages to smooth, predict, and track parameters in noisy, shifting systems.
- It employs dynamic decay rates and structural enhancements to overcome the limitations of static EMA, achieving rapid adaptation with long-term convergence.
- Applications span deep learning weight averaging, federated learning, and adaptive filtering, offering enhanced robustness and generalization in challenging environments.
A dynamic EMA proxy is a model, estimator, or training process that adaptively leverages exponentially weighted moving averages (EMA) as a means of smoothing, prediction, or parameter tracking in systems exposed to noise, non-stationarity, or distributional shift. Modern research explores both foundational improvements to the EMA mechanism itself—addressing limitations such as residual noise, lack of strong convergence, or inadequate domain adaptation—as well as practical architectures employing dynamic, learned, or proxy EMA models to enhance robustness, generalization, calibration, and adaptivity in real-world machine learning systems.
1. Principles of Exponential Moving Average and Dynamic Proxies
The classical exponential moving average is defined recursively for an observed process $(x_n)$ as
$\tau^{\mathrm{ema}}_{n+1} = \gamma\, \tau^{\mathrm{ema}}_{n} + (1 - \gamma)\, x_{n+1},$
where $\gamma \in (0,1)$ is the smoothing factor. EMA assigns higher weight to recent observations, yielding estimates that are more responsive to new data but may retain residual noise when observations are subject to persistent stochasticity.
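As a minimal illustration of the recursion (a sketch only, not tied to any particular implementation from the cited works):

```python
def ema_update(ema: float, x_new: float, gamma: float = 0.99) -> float:
    """One step of the classical EMA recursion: ema <- gamma * ema + (1 - gamma) * x_new."""
    return gamma * ema + (1.0 - gamma) * x_new

# Example: smooth a noisy stream of observations.
# In practice the EMA is often initialized at the first observation to reduce startup bias.
ema = 1.2
for x in [0.8, 1.1, 0.9, 1.05]:
    ema = ema_update(ema, x, gamma=0.9)
```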
A dynamic EMA proxy generalizes this concept by introducing adaptivity or structural enhancements to overcome the static EMA's shortcomings. This may include:
- Time-varying decay rates (e.g., $p$-EMA, where the weight on the most recent observation decays subharmonically: $\alpha_n \propto n^{-p}$ for $1/2 < p < 1$), ensuring noise vanishes asymptotically and yielding provable almost sure convergence under mild mixing assumptions (2505.10605).
- EMA applied to model weights in deep learning to provide early regularization, better generalization, and increased robustness to label noise or distribution shifts; in this context, the EMA weights act as a "proxy" for the noisy, high-variance SGD-backed model state (2411.18704).
- EMA-proxied teacher-student, domain generalization, and adaptive filtering architectures in federated, semi-supervised, or online learning (2301.10114, 2404.13992).
This broader dynamic EMA proxy framework thus encompasses not just a single mathematical rule, but a family of adaptive algorithms where smoothing parameters, update targets, or architectural roles are actively modulated to meet specific performance, robustness, or convergence goals.
2. Stochastic Convergence and Theoretical Guarantees
A key limitation of fixed-decay EMA is its inability to guarantee strong stochastic convergence: the constant weight on new samples means that, even as $n \to \infty$, the variance of the estimator plateaus rather than vanishing. This is problematic for statistical consistency and denoising.
The $p$-EMA adaptation addresses this by letting the weight placed on the newest observation decay with the sample index,
$\alpha_n \propto n^{-p},$
where $1/2 < p < 1$. The update becomes
$\tau^{\mathrm{ema}}_{n+1} = (1 - \alpha_{n+1})\, \tau^{\mathrm{ema}}_{n} + \alpha_{n+1}\, x_{n+1}.$
One can show that under standard mixing/autocorrelation conditions on the observed process, the $p$-EMA estimator converges almost surely to the stationary mean, thanks to the vanishing influence of new observations (2505.10605).
This contrasts with classical EMA, where the weight on the most recent observation is
$\alpha_n = 1 - \gamma,$
which is constant; among the classical estimators, only arithmetic means (weight $1/n$ on the newest sample) have vanishing final weights and thus guarantee strong law of large numbers convergence.
Dynamic EMA proxies with such time- or item-adaptive decay allow algorithms to combine rapid early adaptation with long-term statistical reliability, a property crucial for online stochastic optimization, step-size adaptation in SGD, and adaptive filtering.
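A minimal sketch of this contrast, assuming a weight schedule $\alpha_n \propto n^{-p}$ (the exact schedule and constants in the cited work may differ):

```python
import random

def fixed_ema(xs, alpha=0.1):
    """Classical EMA: constant weight alpha on each new sample (estimator variance plateaus)."""
    est = xs[0]
    for x in xs[1:]:
        est = (1 - alpha) * est + alpha * x
    return est

def subharmonic_ema(xs, p=0.7):
    """Dynamic EMA proxy: the weight on the n-th sample decays like n**(-p), 1/2 < p < 1,
    so the influence of each new noisy observation vanishes and the estimate converges."""
    est = xs[0]
    for n, x in enumerate(xs[1:], start=2):
        alpha_n = n ** (-p)
        est = (1 - alpha_n) * est + alpha_n * x
    return est

# Noisy stationary stream with mean 1.0: the subharmonic variant settles closer to the mean.
random.seed(0)
stream = [1.0 + random.gauss(0, 0.5) for _ in range(10_000)]
print(fixed_ema(stream), subharmonic_ema(stream))
```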
3. EMA Proxies in Deep Learning: Dynamics, Benefits, and Practical Guidance
Applying EMA to neural network weights during training yields a proxy model defined by
$\theta^{\mathrm{ema}}_{t} = \gamma\, \theta^{\mathrm{ema}}_{t-1} + (1 - \gamma)\, \theta_{t},$
where $\theta_t$ is the current SGD-updated parameter vector (2411.18704). Empirical analyses reveal several important dynamical and practical consequences:
- Noise reduction and implicit regularization: EMA smooths out stochastic gradient noise, allowing models to retain higher learning rates longer. This leads to exploration of flatter minima, which improves generalization and reduces overfitting, especially in the presence of noisy labels.
- Early, high-quality estimates: EMA parameters typically reach their best test performance earlier in training than the raw SGD iterates do under a full learning-rate decay schedule, enabling earlier stopping and compute savings.
- Distinct solutions: EMA-averaged models often converge to different minima or representation manifolds than last-iterate SGD, resulting in improved transferability and predictive stability.
- Practical tuning: it is recommended to maintain parallel EMAs with varying decay rates ($\gamma$ spanning [0.9, 0.9999]), to update them less frequently (e.g., every 16 steps), and to choose the best-performing proxy via validation, as sketched below.
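A minimal PyTorch-style sketch of this recipe; the class name, decay values, the 16-step interval, and the `evaluate` helper are illustrative assumptions rather than a reference implementation from the cited work:

```python
import copy
import torch

class ParallelEMA:
    """Maintain several EMA copies of a model's weights with different decay rates,
    updated only every `every` optimizer steps to reduce overhead."""

    def __init__(self, model: torch.nn.Module, decays=(0.9, 0.99, 0.999, 0.9999), every=16):
        self.decays = decays
        self.every = every
        self.step = 0
        # One frozen copy of the model per decay rate.
        self.shadows = [copy.deepcopy(model).eval() for _ in decays]

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        self.step += 1
        if self.step % self.every != 0:
            return
        for decay, shadow in zip(self.decays, self.shadows):
            for p_s, p_m in zip(shadow.parameters(), model.parameters()):
                # Only parameters are averaged here; buffers (e.g., BatchNorm stats) stay as copied.
                p_s.mul_(decay).add_(p_m, alpha=1.0 - decay)

    def best(self, evaluate):
        """Pick the proxy with the best validation score; `evaluate` is user-supplied."""
        return max(self.shadows, key=evaluate)
```

Note that because the shadows are updated only every 16 steps, each nominal decay corresponds to a longer effective averaging horizon than it would under per-step updates.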
A plausible implication is that dynamic EMA proxies—especially when coupled with early stopping and one-shot regularization scheduling (e.g., cosine annealing)—yield more robust, repeatable, and better-calibrated deep learning models than classical approaches reliant on static parameter values.
4. Role in Adaptive, Federated, and Domain-Generalizing Architectures
EMA proxies are instrumental in architecturally advanced settings requiring adaptivity, privacy, communication efficiency, or strong generalization to out-of-distribution data. For instance:
- Federated/semi-supervised learning: systems such as FedSwitch maintain local teacher EMA models on-device, which are adaptively updated and used for pseudo-labeling, with dynamic switching between teacher and student models based on KL-divergence to a prior; this design enables robust semi-supervised learning under strict privacy and communication cost constraints (2301.10114). A minimal sketch of the underlying EMA-teacher update appears after this list.
- Domain generalization via dynamic proxy domains: The Dynamic Proxy Domain (DPD) method uses momentum-EMA-updated models (and secondary, independently-updated proxy learners) to create a rich proxy domain for regularization, improving segmentation/localization under severe source–target domain shifts (2404.13992).
- Online multiclass prediction: Dynamic learners in non-stationary data streams deploy per-class or per-item EMA smoothing, or hybrid queue-based and EMA networks, with dynamic learning rates (e.g., DYAL), to maintain near-optimal adaptation to evolving data with space and time complexity bounded by the number of salient items (2402.10142).
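The common ingredient in the first two settings is a momentum-EMA teacher (or proxy) whose weights trail the student. The sketch below shows only this generic update plus a placeholder switching test; it is not the FedSwitch or DPD algorithm, and the threshold and the direction of the KL comparison are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_ema_teacher(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Momentum-EMA teacher update: teacher <- m * teacher + (1 - m) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def choose_pseudo_labeler(teacher_logits, student_logits, prior, threshold=0.1):
    """Illustrative switch: prefer the model whose predictive distribution stays
    closer (by a KL-divergence criterion) to a reference prior over classes."""
    prior = prior.expand_as(teacher_logits)
    kl_teacher = F.kl_div(F.log_softmax(teacher_logits, dim=-1), prior, reduction="batchmean")
    kl_student = F.kl_div(F.log_softmax(student_logits, dim=-1), prior, reduction="batchmean")
    return "teacher" if kl_teacher <= kl_student + threshold else "student"
```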
These applications demonstrate the value of dynamic EMA proxies as components for architectural, statistical, or privacy-driven adaptivity in contemporary machine learning frameworks.
5. Advancements: Dynamic EMA as Oscillator and Beyond
Recent theoretical frameworks interpret the EMA mechanism as equivalent to a damped harmonic oscillator system, opening paths to a broader class of dynamic EMA proxies under physically-inspired parameterizations (2310.13854). Schematically, the averaged parameters $\theta^{\mathrm{ema}}$ are attached to the reference-model parameters $\theta$ by a damped spring,
$m\, \ddot{\theta}^{\mathrm{ema}} + \lambda\, \dot{\theta}^{\mathrm{ema}} = k\, (\theta - \theta^{\mathrm{ema}}),$
whose massless, overdamped limit recovers the first-order EMA recursion.
Here, $\theta^{\mathrm{ema}}$ and $\theta$ are coupled by spring-strength and mass parameters ($k$ and $m$) that may themselves evolve or be scheduled, yielding generalized algorithms (such as BELAY) that recover classical EMA as a special case and sometimes outperform it. The possibility of scheduling or adapting $k$, $m$, and the damping coefficient during training enables new forms of bias–variance trade-off and adaptation to the length of the training regime.
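A minimal discrete-time sketch of such a second-order averager (a semi-implicit Euler integration of the damped-spring picture; the parameter names and values are illustrative, and this is not the BELAY update from the cited paper):

```python
def oscillator_ema_step(ema, velocity, target, mass=1.0, damping=0.5, stiffness=0.05, dt=1.0):
    """One integration step of a damped-spring averager: the EMA state is pulled toward
    `target` by a spring of strength `stiffness` while `damping` dissipates oscillations.
    In the massless, overdamped limit this reduces to the first-order EMA recursion."""
    force = stiffness * (target - ema) - damping * velocity
    velocity = velocity + dt * force / mass
    ema = ema + dt * velocity
    return ema, velocity

# Track a drifting target; the state carries both position (ema) and velocity.
ema, vel = 0.0, 0.0
for t in range(100):
    target = 1.0 if t < 50 else 2.0   # abrupt shift at t = 50
    ema, vel = oscillator_ema_step(ema, vel, target)
```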
A plausible implication is the emergence of dynamic EMA proxies as an active, stabilizing force within the optimization landscape—able not only to passively filter noise, but also to guide or correct the trajectory of the reference model.
6. Modern Extensions and Scaling Laws
Research into scaling EMA to large-batch training has established a rigorous rule: when increasing the batch size by a factor $\kappa$, the EMA momentum $\rho$ should be raised to the $\kappa$-th power, i.e., $\rho \mapsto \rho^{\kappa}$, to preserve the effective time horizon of the average (2307.13813). This ensures consistent behavior of EMA-based proxies (such as teacher models in self-supervised learning) at any scale, enabling large-batch training without loss of generalization. A further pattern is the rise of switching or multi-timescale EMA proxies (e.g., Switch EMA, SEMA), which periodically inject the EMA weights back into the main model, providing a robust trajectory between sharp and flat optima (2402.09240).
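Both patterns are simple to state in code. The sketch below (function names and the injection interval are illustrative assumptions) applies the momentum-scaling rule and a Switch-EMA-style periodic injection:

```python
import torch

def scale_ema_momentum(rho_base: float, kappa: float) -> float:
    """EMA scaling rule: when the batch size grows by a factor kappa, raise the momentum
    to the kappa-th power so the averaging horizon (in data seen) is preserved."""
    return rho_base ** kappa

@torch.no_grad()
def switch_ema_inject(model: torch.nn.Module, ema_model: torch.nn.Module, step: int, every: int = 1000):
    """Switch-EMA-style step: periodically copy the EMA weights back into the online
    model, so training continues from the smoothed point."""
    if step % every == 0:
        for p_m, p_e in zip(model.parameters(), ema_model.parameters()):
            p_m.copy_(p_e)

# Example: momentum 0.999 tuned at batch size 256, reused at batch size 4096.
rho_large_batch = scale_ema_momentum(0.999, kappa=4096 / 256)   # 0.999 ** 16, roughly 0.984
```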
7. Applications Across Modalities and Tasks
Dynamic EMA proxies are now integral to domains such as:
- Wireless communications: Linear combinations of multiple EMAs (ELC) calibrate predictions across fast and slow-changing wireless channels, often rivaling artificial neural networks at lower computational cost (2312.07945).
- Psychological and medical time series: GNN-based EMA proxies employing dynamic, learned graph structures outperform LSTM baselines in predictive accuracy, individually refining variable dependencies for each person's EMA sequence (2403.19442).
- Optimization with min-max games and adversarial training: Omega and related methods replace or augment the traditional extragradient scheme with EMA-smoothed gradients, achieving better stability and convergence properties in stochastic regimes (2306.07905).
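To make the last point concrete, below is a minimal, simplified gradient descent-ascent loop whose updates use EMA-smoothed gradients rather than raw ones; it conveys the smoothing idea only and is not the exact Omega update from the cited work, and the learning rate and decay values are illustrative:

```python
import torch

def ema_gda(loss_fn, x, y, lr=0.05, beta=0.9, steps=500):
    """Gradient descent-ascent on a min-max objective loss_fn(x, y),
    where each player's step uses an EMA of its past gradients."""
    gx_ema = torch.zeros_like(x)
    gy_ema = torch.zeros_like(y)
    for _ in range(steps):
        loss = loss_fn(x, y)
        gx, gy = torch.autograd.grad(loss, (x, y))
        gx_ema = beta * gx_ema + (1 - beta) * gx      # smooth the min-player's gradient
        gy_ema = beta * gy_ema + (1 - beta) * gy      # smooth the max-player's gradient
        with torch.no_grad():
            x -= lr * gx_ema                          # descent on x
            y += lr * gy_ema                          # ascent on y
    return x, y

# Toy convex-concave saddle-point problem with its saddle at (0, 0).
x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([1.0], requires_grad=True)
x, y = ema_gda(lambda a, b: (a * b + 0.1 * (a**2 - b**2)).sum(), x, y)
```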
These patterns indicate the breadth and depth of dynamic EMA proxies as essential, adaptive, and theoretically sound tools for modern statistical learning under nonstationarity, noise, and resource constraints.
Summary Table: Dynamic EMA Proxy Variants
| Variant / Setting | Distinguishing Property | Benefit |
|---|---|---|
| $p$-EMA (2505.10605) | Subharmonic-decay weighting, $\alpha_n \propto n^{-p}$ | Almost sure convergence, stronger denoising |
| BELAY / Oscillator (2310.13854) | Two-way coupled EMA–model dynamics | Adaptive stability, parameter-schedule invariance |
| Switch EMA (2402.09240) | Periodic (per-epoch) EMA-to-model injection | Improved optimization by combining flatness and sharpness |
| Multi-timescale EMA (ELC etc.) | Linear combinations of EMAs | Tracks slow and fast process trends simultaneously |
| Graph-structured EMA (2403.19442) | Dynamically learned graph adjacency for multivariate time series | Personalized, adaptive forecasts; interpretable proxies |
| Federated EMA proxies (2301.10114) | Local and global EMA teacher switching | Privacy, robustness in decentralized semi-supervision |
Conclusion
Dynamic EMA proxies encompass a broad set of theoretical advances, architectural innovations, and practical techniques in computational statistics and machine learning. By introducing adaptivity—whether in EMA decay, coupling, role, or timescale—the dynamic proxy framework ensures responsive, robust, and convergent prediction or optimization in challenging domains where the underlying data distribution, noise profile, or resource constraints vary across time and environments.