Double Exponential Moving Average (DEMA)

Updated 2 May 2026
  • DEMA is an advanced trend-estimation technique that combines two exponential moving averages to reduce lag in financial time series and gradient-based optimization.
  • It integrates an inner EMA with the current gradient through tailored weights, offering faster responsiveness and improved momentum estimation in optimizers.
  • Empirical studies and theoretical proofs confirm that DEMA-enhanced optimizers achieve optimal regret bounds and accelerated convergence in both SGDM and Adam variants.

The Double Exponential Moving Average (DEMA) is a trend-estimation technique used in both financial time series analysis and modern gradient-based optimization. Originating in the trading domain, DEMA was later adapted to optimization to provide better responsiveness and less lag than a conventional exponential moving average (EMA). The Admeta optimizer framework integrates a DEMA variant into both adaptive and non-adaptive momentum optimizers, combining a backward-looking strategy (DEMA) with a forward-looking one (dynamic lookahead); the authors report faster convergence and better empirical performance while retaining convergence guarantees (Chen et al., 2023).

1. Standard Exponential Moving Average (EMA) in Optimization

Let $\{p_t\}$ represent a time series such as iteratively computed gradients $g_t$. The standard exponential moving average $S_t$ is recursively defined as

$$S_t = \beta S_{t-1} + (1-\beta)p_t, \quad 0 \leq \beta < 1,$$

with common initialization $S_0 = 0$. This admits the closed form

$$S_t = (1-\beta)\sum_{i=1}^t \beta^{t-i}p_i.$$
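The recursion and its closed form can be checked against each other numerically. The following is a minimal Python sketch; the function names and the sample series are illustrative assumptions, not part of the source.

```python
def ema_recursive(series, beta):
    """Return S_1..S_t from S_t = beta*S_{t-1} + (1-beta)*p_t with S_0 = 0."""
    s = 0.0
    out = []
    for p in series:
        s = beta * s + (1.0 - beta) * p
        out.append(s)
    return out


def ema_closed_form(series, beta):
    """Return the final S_t via (1-beta) * sum_i beta^(t-i) * p_i."""
    t = len(series)
    return (1.0 - beta) * sum(
        beta ** (t - i) * p for i, p in enumerate(series, start=1)
    )


# Hypothetical sample data, chosen only for illustration.
series = [1.0, 2.0, 0.5, 3.0, 2.5]
beta = 0.9
recursive_last = ema_recursive(series, beta)[-1]
closed_last = ema_closed_form(series, beta)
```

Both routes give the same value up to floating-point rounding; the recursive form is the one used in practice because it needs only the previous state.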

EMA is core to classic momentum-based optimizers. For example:

  • SGDM: Applies EMA to $g_t$ to obtain momentum, $m_t = \beta m_{t-1} + (1-\beta)g_t$.
  • Adam/RAdam: Employ EMA for both the first moment (mean of gradients) and a second moment (mean of squared gradients) for adaptive step sizes.
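As a rough illustration of the two bullets above, the sketch below tracks an EMA of the gradients (the momentum or first moment) and an EMA of the squared gradients (the second moment) for a scalar parameter. The function names are hypothetical, and the defaults 0.9 and 0.999 are the commonly used Adam settings rather than values taken from this article.

```python
def ema_update(state, value, beta):
    """One EMA step: beta * previous state + (1 - beta) * new value."""
    return beta * state + (1.0 - beta) * value


def track_moments(grads, beta1=0.9, beta2=0.999):
    """Return the (first moment, second moment) pair after each gradient."""
    m, v = 0.0, 0.0
    history = []
    for g in grads:
        m = ema_update(m, g, beta1)      # EMA of gradients (SGDM momentum)
        v = ema_update(v, g * g, beta2)  # EMA of squared gradients
        history.append((m, v))
    return history


# Hypothetical scalar gradients, for illustration only.
history = track_moments([1.0, -2.0, 0.5])
```

SGDM uses only the first quantity; Adam-type methods divide a first-moment estimate by the square root of the second to obtain per-parameter step sizes.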

While widely effective, the inherent lag of EMA, scaling roughly as $1/(1-\beta)$, can limit responsiveness to rapid trend changes in gradient series during training.

2. The Classical DEMA in Financial Applications

In its financial origin (Mulloy, 1994), the DEMA is constructed to counteract EMA's lagging tendency:

$$\mathrm{DEMA}_t(p) = 2\,\mathrm{EMA}_t(p) - \mathrm{EMA}_t(\mathrm{EMA}_\bullet(p)).$$

Essentially, DEMA doubles the single EMA and subtracts the EMA of the EMA. Because the twice-smoothed series lags a steady trend by about twice as much as the single EMA, the subtraction cancels the first-order lag, yielding a signal that is still smoothed but tracks the current value more closely than the ordinary EMA. This approach is central to a variety of time series filters and trend-following algorithms.
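The lag reduction can be seen on a linear ramp, where a plain EMA settles a fixed distance behind the input. The sketch below is a minimal illustration in Python; the helper names and the choice of $\beta = 0.9$ are assumptions made for the example.

```python
def ema(series, beta):
    """Plain EMA with zero initialization."""
    s, out = 0.0, []
    for p in series:
        s = beta * s + (1.0 - beta) * p
        out.append(s)
    return out


def dema(series, beta):
    """Classical DEMA: 2 * EMA(p) - EMA(EMA(p))."""
    first = ema(series, beta)
    second = ema(first, beta)
    return [2.0 * a - b for a, b in zip(first, second)]


# A ramp p_t = t; long enough for start-up transients to die out.
ramp = [float(t) for t in range(1, 401)]
beta = 0.9
ema_lag = ramp[-1] - ema(ramp, beta)[-1]    # settles near beta / (1 - beta)
dema_lag = ramp[-1] - dema(ramp, beta)[-1]  # settles near zero
```

On this input the EMA ends roughly $\beta/(1-\beta) = 9$ units behind the ramp, while the DEMA ends essentially on it. The price is more sensitivity to noise, since the combination puts a weight larger than one on recent values.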

3. DEMA Variant in Gradient-Based Optimization: The Admeta Approach

The Admeta optimizer (Chen et al., 2023) introduces a tailored DEMA variant more suitable for optimization settings, avoiding the naive repeated application of EMA. Instead, it employs an EMA over a linear combination of the instantaneous gradient and an “inner” EMA. The procedure is as follows:

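  • Inner EMA ($I_t$):

$$I_t = \lambda I_{t-1} + (1-\lambda)g_t$$

  • DEMA-style Intermediate Signal ($h_t$):

$$h_t = \kappa g_t + \mu I_t$$

where the current-gradient weight $\kappa$ and the inner-EMA weight $\mu$ are determined by the inner rate $\lambda$ through formulas given in Chen et al. (2023).

A decaying bias term is also described in the original, but is omitted in convergence proofs.

  • Outer EMA ($m_t$):

$$m_t = \beta m_{t-1} + (1-\beta)h_t$$

In summary, the resultant DEMA signal used as momentum is

$$m_t = \beta m_{t-1} + (1-\beta)(\kappa g_t + \mu I_t),$$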

where the outer EMA smooths a corrective blend of the current gradient and its own inner EMA.
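The construction described in this section, an outer EMA over a linear combination of the instantaneous gradient and an inner EMA, can be sketched as follows. This is a minimal Python illustration, not the reference implementation: the names `lam`, `beta`, `kappa`, and `mu` and all numeric values are assumptions, and the two weights are passed in directly rather than derived from the inner rate as in Chen et al. (2023).

```python
def dema_variant_momentum(grads, lam, beta, kappa, mu):
    """Outer EMA over a blend of the raw gradient and an inner EMA.

    lam, beta: inner and outer EMA rates (hypothetical names).
    kappa, mu: weights on the gradient and on the inner EMA; supplied
    by the caller here, which is an assumption of this sketch.
    """
    inner = 0.0  # inner EMA of the gradients
    m = 0.0      # outer EMA, used as the momentum signal
    out = []
    for g in grads:
        inner = lam * inner + (1.0 - lam) * g
        blend = kappa * g + mu * inner
        m = beta * m + (1.0 - beta) * blend
        out.append(m)
    return out


def plain_ema(grads, beta):
    """Standard EMA momentum, for comparison."""
    m, out = 0.0, []
    for g in grads:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out


# Hypothetical scalar gradients and weights, for illustration only.
grads = [1.0, 0.8, 1.2, -0.5, 0.3]
reduced = dema_variant_momentum(grads, lam=0.9, beta=0.9, kappa=1.0, mu=0.0)
baseline = plain_ema(grads, beta=0.9)
blended = dema_variant_momentum(grads, lam=0.9, beta=0.9, kappa=0.5, mu=0.5)
```

Setting `kappa=1` and `mu=0` removes the inner EMA from the blend, so the signal collapses to ordinary EMA momentum; any other choice mixes in the inner trend estimate.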

4. Algorithmic Structure and Pseudocode

The DEMA variant integrates directly into both AdmetaS (SGD-family) and AdmetaR (Adam/RAdam-family) optimizers. The backward-looking update proceeds as:

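$$I_t = \lambda I_{t-1} + (1-\lambda)g_t, \quad h_t = \kappa g_t + \mu I_t, \quad m_t = \beta m_{t-1} + (1-\beta)h_t,$$

with inner EMA $I_t$, blended signal $h_t$, and momentum $m_t$, which then replaces the usual momentum or first-moment term in the parameter update.

In the Adam family, the conventional EMA on squared gradients (the adaptive variance estimate, $v_t$) is kept alongside the DEMA-style first moment; Chen et al. (2023) specify which signal it is computed on.

Typical hyperparameter settings are reported in Chen et al. (2023) for the inner EMA rate $\lambda$, the outer momentum rate $\beta$, $\beta_1$ (Adam-like outer EMA), and $\beta_2$ (variance EMA rate).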
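To make the backward-looking update concrete, the sketch below runs an SGD-style loop that uses the two-stage momentum signal on the one-dimensional quadratic $f(\theta) = \tfrac{1}{2}\theta^2$. It is a toy under stated assumptions: the function name, the learning rate, the rates, and the weights `kappa` and `mu` are all hypothetical, and the lookahead component and any bias term of Admeta are left out.

```python
def run_dema_sgd(theta0, steps, lr=0.1, lam=0.9, beta=0.9, kappa=0.5, mu=0.5):
    """Minimize f(theta) = 0.5 * theta**2 with DEMA-style momentum.

    All default values are illustrative assumptions, not settings
    taken from Chen et al. (2023).
    """
    theta = theta0
    inner = 0.0  # inner EMA of gradients
    m = 0.0      # outer EMA (momentum)
    for _ in range(steps):
        g = theta                              # gradient of 0.5 * theta**2
        inner = lam * inner + (1.0 - lam) * g  # inner EMA
        h = kappa * g + mu * inner             # blend of gradient and trend
        m = beta * m + (1.0 - beta) * h        # outer EMA
        theta = theta - lr * m                 # SGD-style parameter step
    return theta


theta_final = run_dema_sgd(theta0=5.0, steps=1000)
```

With these settings the iterate spirals into the minimum at zero. An Adam-style variant would additionally rescale the step by a running estimate of squared gradients.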

5. Hyperparameter Semantics and Operational Roles

| Parameter | Role in DEMA Variant | Typical Range |
|---|---|---|
| $\lambda$ | Inner EMA memory ($I_t$ speed) | see Chen et al. (2023) |
| $\beta$ | Outer EMA/momentum smoothing | see Chen et al. (2023) |
| $\kappa$ | Current-gradient weight | function of $\lambda$ |
| $\mu$ | Inner EMA weight | function of $\lambda$ |
| $\beta_1$ | Outer EMA in Adam-family | see Chen et al. (2023) |
| $\beta_2$ | EMA on squares in Adam-family | see Chen et al. (2023) |

The inner rate $\lambda$ and the outer rate $\beta$ independently control “inner” and “outer” smoothing, allowing DEMA to balance noise reduction against trend responsiveness. The dependence of the current-gradient weight $\kappa$ and the inner-EMA weight $\mu$ on $\lambda$ ensures a mathematically consistent weighting of instantaneous and historical gradients.

A small, decaying bias term is introduced for warm-up stability, though it is omitted in formal proofs.

6. Comparative Analysis: DEMA Versus Standard EMA

  • Responsiveness: EMA’s fixed exponential weighting leads to a lag that grows with $1/(1-\beta)$. DEMA’s injection of the instantaneous $g_t$ term ($\kappa g_t$, where $\kappa$ is the current-gradient weight) atop the inner EMA introduces a corrective mechanism that tracks sudden trend changes more rapidly.
  • Smoothing: Both filters reduce variance; DEMA accepts slightly higher noise for the benefit of responsiveness.
  • Overshoot and Bias: Standard EMA momentum can induce overshoot, especially near optima. DEMA counterbalances this via instantaneous gradient correction, lowering overshoot tendency.
  • Bias Correction: Both mechanisms support standard bias correction (e.g., division by $1-\beta^t$ in early iterations) to mitigate warm-start bias. In practical setups, the bias from multi-stage EMA in DEMA is negligible after initial warm-up.
  • Implementation: DEMA is realized by replacing the standard momentum/first-moment update with the DEMA-style momentum term $m_t$ in parameter updates.
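The bias-correction point can be illustrated on a constant gradient stream. In the sketch below, dividing by $1-\beta^t$ removes the zero-initialization bias of a plain EMA exactly, while the two-stage signal keeps a residual from its inner EMA that fades during warm-up. The function names and the weights `kappa = mu = 0.5` are hypothetical choices for this example only.

```python
def corrected_plain(grads, beta):
    """Bias-corrected plain EMA: m_t / (1 - beta**t)."""
    m, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g
        out.append(m / (1.0 - beta ** t))
    return out


def corrected_two_stage(grads, lam, beta, kappa, mu):
    """Two-stage signal with the same outer bias correction applied."""
    inner, m, out = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        inner = lam * inner + (1.0 - lam) * g
        m = beta * m + (1.0 - beta) * (kappa * g + mu * inner)
        out.append(m / (1.0 - beta ** t))
    return out


# Constant gradient of 3.0 for 200 steps (illustrative values).
constant = [3.0] * 200
plain = corrected_plain(constant, beta=0.9)
two_stage = corrected_two_stage(constant, lam=0.9, beta=0.9, kappa=0.5, mu=0.5)
```

The corrected plain EMA equals the gradient from the first step, whereas the two-stage signal starts low because the inner EMA is still warming up and then approaches the same value.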

7. Theoretical Guarantees and Empirical Significance

Rigorous proofs under standard assumptions (Lipschitz continuity, bounded gradient noise) demonstrate that DEMA-based optimizers match the best-known regret and convergence rates:

  • AdmetaR (Adam-family + DEMA):
    • Convex: regret bound (rate stated in Chen et al., 2023)
    • Nonconvex: convergence rate (rate stated in Chen et al., 2023)
  • AdmetaS (SGD-family + DEMA):
    • Convex: regret bound (rate stated in Chen et al., 2023)
    • Nonconvex: convergence rate (rate stated in Chen et al., 2023)

These rates indicate that the DEMA variant does not degrade convergence, matching the best-known rates for the respective optimizer families.

Bidirectional integration, which pairs DEMA for backward-looking trend estimation with a dynamic lookahead strategy for forward-looking exploration, is reported to give Admeta faster early-stage convergence, less overshooting in the final epochs, and empirical improvements over both pure EMA-momentum optimizers (SGDM, Adam) and combined methods (RAdam, Ranger), while retaining its theoretical guarantees (Chen et al., 2023).

8. Context within Optimizer Evolution and Effects on Training Dynamics

The Admeta framework’s bidirectional structure, with DEMA-driven backward estimation and dynamic lookahead, addresses the central tension between stability and adaptation in stochastic gradient optimization. EMA-only momentum can be overly conservative, introducing training lag and excessive “stickiness” near optima; DEMA’s corrective structure accelerates tracking without destabilizing convergence.

The design generalizes across both adaptive and non-adaptive classes, enabling direct drop-in replacement for momentum and first-moment terms in both SGD-type and Adam-type optimizers. Empirical results in the cited work demonstrate consistent performance advantages and robustness across diverse model architectures and optimization landscapes.

A plausible implication is that DEMA’s two-stage trend estimation, combining short-term correction with long-term smoothing, constitutes a broadly applicable improvement mechanism for a wide range of stochastic optimization algorithms. Future directions may include alterations to the functional forms of the current-gradient and inner-EMA weights ($\kappa$, $\mu$) or further integration of bidirectional smoothing concepts for even finer tradeoff control.


For implementation details, mathematical proofs, and empirical evaluations, see "Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers" (Chen et al., 2023).
