Double Exponential Moving Average (DEMA)
- DEMA is an advanced trend-estimation technique that combines two exponential moving averages to reduce lag in financial time series and gradient-based optimization.
- It integrates an inner EMA with the current gradient through tailored weights, offering faster responsiveness and improved momentum estimation in optimizers.
- Convergence proofs show that DEMA-enhanced optimizers match the best-known regret bounds, and experiments report faster convergence, for both the SGDM-based and Adam-based variants.
The Double Exponential Moving Average (DEMA) is an advanced trend-estimation technique crucial in both financial time series analysis and modern gradient-based optimization. Originating in the trading domain, DEMA was adapted in optimization frameworks to provide improved responsiveness and reduced lag compared to conventional exponential moving averages (EMA). The recent Admeta optimizer framework integrates a DEMA variant to enhance both adaptive and non-adaptive momentum optimizers by combining backward-looking and forward-looking strategies, delivering faster convergence and superior empirical performance without violating convergence guarantees (Chen et al., 2023).
1. Standard Exponential Moving Average (EMA) in Optimization
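Let $x_t$ represent a time series such as iteratively computed gradients $g_t$. The standard exponential moving average with decay rate $\beta \in [0, 1)$ is recursively defined as
$$\mathrm{EMA}_t = \beta\,\mathrm{EMA}_{t-1} + (1-\beta)\,x_t$$
with common initialization $\mathrm{EMA}_0 = 0$. This admits the closed form
$$\mathrm{EMA}_t = (1-\beta)\sum_{i=1}^{t}\beta^{\,t-i}\,x_i$$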
EMA is core to classic momentum-based optimizers. For example:
- SGDM: Applies EMA to the gradients $g_t$ to obtain momentum, $m_t = \beta\,m_{t-1} + (1-\beta)\,g_t$.
- Adam/RAdam: Employ EMA for both the first moment (mean of gradients) and a second moment (mean of squared gradients) for adaptive step sizes.
While widely effective, the inherent lag of EMA, scaling roughly as $1/(1-\beta)$ for decay rate $\beta$, can limit responsiveness to rapid trend changes in the gradient series during training.
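The recursion, its closed form, and the lag can be checked numerically. The following is a minimal sketch, assuming zero initialization and the normalized form `e_t = beta * e_{t-1} + (1 - beta) * x_t`; the function names and the ramp input are illustrative choices, not taken from the cited work.

```python
def ema(xs, beta):
    """Recursive EMA with zero initialization: e_t = beta*e_{t-1} + (1-beta)*x_t."""
    out, e = [], 0.0
    for x in xs:
        e = beta * e + (1.0 - beta) * x
        out.append(e)
    return out


def ema_closed_form(xs, beta):
    """Closed form at the final step: (1-beta) * sum_i beta**(t-i) * x_i."""
    t = len(xs)
    return (1.0 - beta) * sum(beta ** (t - i) * x for i, x in enumerate(xs, start=1))


beta = 0.9
ramp = [float(t) for t in range(1, 501)]  # x_t = t, a steadily trending series
smoothed = ema(ramp, beta)

# The recursion and the closed form agree at the last step.
recursive_last = smoothed[-1]
closed_last = ema_closed_form(ramp, beta)

# On a ramp the EMA settles a fixed distance behind the input:
# beta / (1 - beta) steps, which is on the order of 1 / (1 - beta).
steady_state_lag = ramp[-1] - smoothed[-1]
```

On a linearly trending input the smoothed value trails the input by `beta / (1 - beta)` samples once the transient has died out, which is the sense in which the lag scales with `1 / (1 - beta)`.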
2. The Classical DEMA in Financial Applications
In its financial origin (Mulloy, 1994), the DEMA is constructed to counteract EMA's lagging tendency:
$$\mathrm{DEMA}_t = 2\,\mathrm{EMA}_t - \mathrm{EMA}(\mathrm{EMA})_t$$
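Essentially, DEMA doubles the single EMA and subtracts the EMA of the EMA. Because the second pass lags the first by about the same amount that the first lags the input, the subtraction cancels most of the lag, producing a signal that tracks the input more closely than the ordinary EMA while remaining smoother than the raw series. This approach is central to a variety of time series filters and trend-following algorithms.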
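The lag cancellation can be seen on a ramp input. This is a minimal sketch of the classical construction, assuming both EMA passes share one decay rate and start from zero; the helper names are illustrative.

```python
def ema(xs, beta):
    """Recursive EMA with zero initialization."""
    out, e = [], 0.0
    for x in xs:
        e = beta * e + (1.0 - beta) * x
        out.append(e)
    return out


def dema(xs, beta):
    """Classical DEMA: 2*EMA(x) - EMA(EMA(x)), both passes using the same decay."""
    first = ema(xs, beta)
    second = ema(first, beta)
    return [2.0 * a - b for a, b in zip(first, second)]


beta = 0.9
ramp = [float(t) for t in range(1, 501)]  # x_t = t

ema_lag = ramp[-1] - ema(ramp, beta)[-1]    # settles near beta / (1 - beta)
dema_lag = ramp[-1] - dema(ramp, beta)[-1]  # settles near zero
```

The single EMA trails the ramp by about `beta / (1 - beta)` samples and the EMA of the EMA by about twice that, so `2 * EMA - EMA(EMA)` recovers the ramp once the start-up transient has decayed.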
3. DEMA Variant in Gradient-Based Optimization: The Admeta Approach
The Admeta optimizer (Chen et al., 2023) introduces a tailored DEMA variant more suitable for optimization settings, avoiding the naive repeated application of EMA. Instead, it employs an EMA over a linear combination of the instantaneous gradient and an “inner” EMA. The procedure is as follows:
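- Inner EMA ($I_t$): an exponential moving average of the gradients $g_t$, governed by its own decay rate $\lambda$.
- DEMA-style intermediate signal ($h_t$):
$$h_t = \kappa\,g_t + \mu\,I_t$$
where the weights $\kappa$ and $\mu$ are determined by closed-form expressions in $\lambda$ given in the original (Chen et al., 2023). A decaying bias term is also described in the original, but is omitted in convergence proofs.
- Outer EMA ($m_t$):
$$m_t = \beta\,m_{t-1} + (1-\beta)\,h_t$$
In summary, the resultant DEMA signal used as momentum is
$$m_t = \beta\,m_{t-1} + (1-\beta)\,(\kappa\,g_t + \mu\,I_t)$$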
where the outer EMA smooths a corrective blend of the current gradient and its own inner EMA.
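The three-stage structure described above (inner EMA, weighted blend, outer EMA) can be sketched as follows. This is an illustration under stated assumptions rather than the Admeta implementation: the inner EMA is written in normalized form, both states start at zero, the decaying bias term is omitted, and the blend weights are passed in by the caller instead of being derived from the paper's formulas. The names `lam`, `beta`, `kappa`, and `mu` are hypothetical parameter names chosen for this sketch.

```python
def dema_style_momentum(grads, lam, beta, kappa, mu):
    """Outer EMA over the blend kappa*g_t + mu*I_t, where I_t is an inner EMA of g_t.

    Assumptions (not taken from the paper): normalized inner EMA, zero-initialized
    states, no bias term, and caller-supplied weights kappa and mu.
    """
    inner, m, out = 0.0, 0.0, []
    for g in grads:
        inner = lam * inner + (1.0 - lam) * g  # inner EMA of the gradient
        h = kappa * g + mu * inner             # blend of gradient and inner EMA
        m = beta * m + (1.0 - beta) * h        # outer EMA, used as momentum
        out.append(m)
    return out


grads = [1.0, -0.5, 2.0, 0.25, -1.5, 0.75]

# kappa = 1, mu = 0 removes the inner EMA and leaves ordinary EMA momentum.
plain = dema_style_momentum(grads, lam=0.9, beta=0.9, kappa=1.0, mu=0.0)

# kappa = 2, mu = -1 with equal decays reproduces the classical 2*EMA - EMA(EMA),
# because the EMA is a linear filter.
classical = dema_style_momentum(grads, lam=0.9, beta=0.9, kappa=2.0, mu=-1.0)
```

The two special cases are useful sanity checks for this sketch only; the weights actually used by Admeta are those defined in Chen et al. (2023).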
4. Algorithmic Structure and Pseudocode
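The DEMA variant integrates directly into both AdmetaS (SGD-family) and AdmetaR (Adam/RAdam-family) optimizers. At each step, the backward-looking update proceeds as:
- Compute the current gradient $g_t$.
- Update the inner EMA $I_t$ from $g_t$.
- Form the DEMA-style intermediate signal as a weighted combination of $g_t$ and $I_t$.
- Update the outer EMA over that intermediate signal to obtain the momentum.
- Use this momentum in place of the standard momentum or first-moment term in the parameter update.
In the Adam-family, the conventional EMA on squared gradients (adaptive variance estimation) is retained; Chen et al. (2023) specify the signal it is computed on.
The tunable hyperparameters are the inner and outer decay rates and, in the Adam-family, the Adam-like outer EMA rate and the variance EMA rate; typical settings are reported in Chen et al. (2023).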
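To make the update order concrete, the sketch below runs an SGD-family loop with a DEMA-style momentum on the one-dimensional quadratic `f(x) = 0.5 * x**2`. It is a toy illustration, not AdmetaS: the forward-looking lookahead step is left out, the inner EMA is assumed to be normalized, and every hyperparameter value (`lr`, `lam`, `beta`, `kappa`, `mu`) is a hypothetical choice made so the example converges, not a setting from the paper.

```python
def minimize_quadratic(x0, steps, lr, lam, beta, kappa, mu):
    """Minimize f(x) = 0.5 * x**2 with a DEMA-style momentum update (illustrative)."""
    x, inner, m = x0, 0.0, 0.0
    for _ in range(steps):
        g = x                                  # gradient of 0.5 * x**2
        inner = lam * inner + (1.0 - lam) * g  # inner EMA of gradients
        h = kappa * g + mu * inner             # blend of gradient and inner EMA
        m = beta * m + (1.0 - beta) * h        # outer EMA, used as momentum
        x = x - lr * m                         # SGD-family parameter step
    return x


# Hypothetical settings chosen for this toy problem only.
x_final = minimize_quadratic(
    x0=5.0, steps=1000, lr=0.1, lam=0.9, beta=0.9, kappa=1.0, mu=0.5
)
```

In an Adam-family variant the same momentum would be divided by a square-root variance estimate before the step; that part is omitted here because the sketch does not assume which signal the variance EMA is computed on.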
5. Hyperparameter Semantics and Operational Roles
| Parameter | Role in DEMA Variant |
|---|---|
| $\lambda$ | Inner EMA memory ($I_t$ speed) |
| $\beta$ | Outer EMA/momentum smoothing |
| $\kappa$ | Current-gradient weight |
| $\mu$ | Inner EMA weight |
| $\beta_1$ | Outer EMA in Adam-family |
| $\beta_2$ | EMA on squares in Adam-family |

Typical ranges for each parameter are reported in Chen et al. (2023).
The inner decay rate $\lambda$ and the outer decay rate $\beta$ independently control “inner” and “outer” smoothing, allowing DEMA to balance noise reduction against trend responsiveness. The dependence of the current-gradient weight $\kappa$ and the inner-EMA weight $\mu$ on $\lambda$ ensures a mathematically consistent weighting of instantaneous and historical gradients.
A small, decaying bias term is introduced for warm-up stability, though it is omitted in formal proofs.
6. Comparative Analysis: DEMA Versus Standard EMA
- Responsiveness: EMA’s fixed exponential weighting leads to a lag that grows with $1/(1-\beta)$, where $\beta$ is the decay rate. DEMA’s injection of a weighted instantaneous gradient term atop the inner EMA introduces a corrective mechanism that tracks sudden trend changes more rapidly.
- Smoothing: Both filters reduce variance; DEMA accepts slightly higher noise for the benefit of responsiveness.
- Overshoot and Bias: Standard EMA momentum can induce overshoot, especially near optima. DEMA counterbalances this via instantaneous gradient correction, lowering overshoot tendency.
- Bias Correction: Both mechanisms support standard bias correction (e.g., dividing by $1-\beta^t$ in early iterations) to mitigate warm-start bias. In practical setups, the bias from multi-stage EMA in DEMA is negligible after initial warm-up.
- Implementation: DEMA is realized by replacing the standard momentum/first-moment update with the DEMA-style momentum term in parameter updates.
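The responsiveness and smoothing trade-off in the list above can be quantified for the classical filter. The sketch below uses the classical DEMA (`2*EMA - EMA(EMA)`), not the Admeta variant, and two deterministic probes chosen for illustration: the number of samples needed to reach 90% of a unit step, and the white-noise variance gain computed as the sum of squared impulse-response weights.

```python
def ema(xs, beta):
    """Recursive EMA with zero initialization."""
    out, e = [], 0.0
    for x in xs:
        e = beta * e + (1.0 - beta) * x
        out.append(e)
    return out


def dema(xs, beta):
    """Classical DEMA: 2*EMA(x) - EMA(EMA(x))."""
    first = ema(xs, beta)
    second = ema(first, beta)
    return [2.0 * a - b for a, b in zip(first, second)]


def steps_to_reach(series, level):
    """1-based index of the first sample at or above `level`."""
    return next(i for i, v in enumerate(series, start=1) if v >= level)


beta, n = 0.9, 400
step = [1.0] * n                   # input jumps from 0 to 1 at the first sample
impulse = [1.0] + [0.0] * (n - 1)  # probes the filter weights directly

ema_rise = steps_to_reach(ema(step, beta), 0.9)
dema_rise = steps_to_reach(dema(step, beta), 0.9)

# For unit-variance white noise, the output variance equals the sum of the
# squared impulse-response weights.
ema_noise_gain = sum(w * w for w in ema(impulse, beta))
dema_noise_gain = sum(w * w for w in dema(impulse, beta))
```

With `beta = 0.9` the classical DEMA reaches 90% of the step at sample 8 versus sample 22 for the EMA, while its noise gain is larger (roughly 0.13 versus 0.05), which is the responsiveness-for-noise exchange described in the list.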
7. Theoretical Guarantees and Empirical Significance
Rigorous proofs under standard assumptions (Lipschitz continuity, bounded gradient noise) demonstrate that DEMA-based optimizers match the best-known regret and convergence rates:
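- AdmetaR (Adam-family + DEMA): one bound for the convex setting and one for the nonconvex setting.
- AdmetaS (SGD-family + DEMA): one bound for the convex setting and one for the nonconvex setting.
The explicit rates for all four cases are stated in Chen et al. (2023).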
These rates indicate that the DEMA variant does not degrade convergence: each optimizer matches the best-known rate for its respective family.
Admeta's bidirectional design pairs DEMA for backward-looking trend estimation with a dynamic lookahead strategy for forward-looking exploration. Chen et al. (2023) report improved early-stage convergence speed, reduced final-epoch overshooting, and empirical gains over both pure EMA-momentum optimizers (SGDM, Adam) and combined methods (RAdam, Ranger), alongside full theoretical guarantees.
8. Context within Optimizer Evolution and Effects on Training Dynamics
The Admeta framework’s bidirectional structure, with DEMA-driven backward estimation and dynamic lookahead, addresses the central tension between stability and adaptation in stochastic gradient optimization. EMA-only momentum can be overly conservative, introducing training lag and excessive “stickiness” near optima; DEMA’s corrective structure accelerates tracking without destabilizing convergence.
The design generalizes across both adaptive and non-adaptive classes, enabling direct drop-in replacement for momentum and first-moment terms in both SGD-type and Adam-type optimizers. Empirical results in the cited work demonstrate consistent performance advantages and robustness across diverse model architectures and optimization landscapes.
A plausible implication is that DEMA’s two-stage trend estimation, combining short-term correction with long-term smoothing, constitutes a broadly applicable improvement mechanism for a wide range of stochastic optimization algorithms. Future directions may include alterations to the functional forms of the two blend weights or further integration of bidirectional smoothing concepts for even finer tradeoff control.
For implementation details, mathematical proofs, and empirical evaluations, see "Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers" (Chen et al., 2023).