DPWMixer: Dual-Path Time Series Forecasting
- DPWMixer is a dual-path architecture for long-term time series forecasting that uses a lossless Haar wavelet pyramid to retain both high- and low-frequency details.
- It combines a global linear path for capturing macro trends with a local MLP-Mixer path for modeling intricate short-term dynamics.
- Adaptive multi-scale fusion with learned weights integrates predictions from all scales, achieving state-of-the-art performance with efficient linear complexity.
DPWMixer is a dual-path architecture for long-term time series forecasting (LTSF), integrating a lossless multi-scale decomposition and two complementary mixing pathways to address both global trend anchoring and local dynamic evolution. Transformer-based models, despite their proficiency in capturing long-range dependencies, are hampered by quadratic complexity and overfitting on sparse data; linear models, conversely, are inadequate for modeling complex non-linear local dynamics. Previous multi-scale approaches, typically reliant on average pooling, suffer from spectral aliasing and irreversible loss of high-frequency information. DPWMixer resolves these issues using an orthogonal Haar wavelet pyramid for lossless decomposition, dual-path mixers per scale, and an adaptive multi-scale fusion. The model delivers state-of-the-art forecasting results across standard benchmarks, with efficient linear complexity and effective retention of both global and local temporal features (Qianyang et al., 30 Nov 2025).
1. Architectural Design
DPWMixer comprises three sequential stages: multi-resolution decomposition, dual-path mixing, and adaptive fusion.
- Multi-Resolution Decomposition: Input is processed through instance-wise normalization (RevIN; a minimal normalization sketch follows the block diagram below), followed by a lossless Haar Wavelet Pyramid generating $N+1$ scales $X^{(0)}, X^{(1)}, \dots, X^{(N)}$.
- Dual-Path Mixing: Each scale is independently routed through:
- Global Linear Path: Projects the full sequence to the prediction horizon, anchoring macro-trend information.
- Local MLP-Mixer Path: Subdivides the sequence into non-overlapping patches and processes them with stacked MLP-Mixer layers, capturing micro-dynamics.
- Adaptive Multi-Scale Fusion: Outputs from all scales are integrated using learned, channel-wise softmax fusion weights, yielding the final prediction through a weighted sum. De-normalization (RevIN) restores each channel’s original distribution.
ASCII Block Diagram
Input X (B×L×C)
│
▼ RevIN (instance-wise normalization)
│
▼ Haar Wavelet Pyramid
├─ Scale 0 (X⁽⁰⁾)
├─ Scale 1 (X⁽¹⁾)
├─ …
└─ Scale N (X⁽ᴺ⁾)
│
▼ Dual-Path Trend Mixer (each scale j=0…N)
├─ Global Linear Path → H_global⁽ʲ⁾
├─ Local MLP-Mixer Path → H_local⁽ʲ⁾
└─ Gate fusion → Ŷ⁽ʲ⁾ = w_g·H_global⁽ʲ⁾ + w_l·H_local⁽ʲ⁾
│
▼ Adaptive Fusion
Ŷ = ∑_{j=0}^N W_fusion[j,·] ⊙ Ŷ⁽ʲ⁾
│
▼ De-Normalize (RevIN⁻¹) → final Ŷ (B×T×C)
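The RevIN steps at the top and bottom of the diagram can be made concrete with a minimal sketch. The function names and the omission of RevIN's learnable affine parameters are simplifying assumptions, not the released implementation:

```python
import torch

def revin_normalize(x, eps=1e-5):
    # x: (B, L, C). Per-instance, per-channel statistics over the time axis.
    mean = x.mean(dim=1, keepdim=True)                                   # (B, 1, C)
    std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)   # (B, 1, C)
    return (x - mean) / std, (mean, std)

def revin_denormalize(y, stats):
    # y: (B, T, C) forecast in the normalized space; restore the original scale.
    mean, std = stats
    return y * std + mean

x = torch.randn(8, 96, 7)          # batch of 8 windows, length 96, 7 channels
x_norm, stats = revin_normalize(x)
y_hat = torch.randn(8, 192, 7)     # stand-in forecast, horizon 192
y = revin_denormalize(y_hat, stats)
```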
2. Lossless Haar Wavelet Pyramid
The multi-scale decomposition applies fixed Haar wavelet filters, ensuring orthogonal separation of trends and local fluctuations without loss of information or spectral aliasing. At each scale $j$:
- Filters: low-pass $h = \tfrac{1}{\sqrt{2}}[1,\,1]$ and high-pass $g = \tfrac{1}{\sqrt{2}}[1,\,-1]$.
- Computation: $X^{(j+1)} = (X^{(j)} * h)\!\downarrow\!2$ (approximation) and $D^{(j+1)} = (X^{(j)} * g)\!\downarrow\!2$ (detail), as sketched in code at the end of this subsection.
- "↓2" denotes strided convolution (stride=2) with symmetric edge padding.
By Parseval’s theorem and Haar orthogonality, perfect reconstruction is possible and aliasing is prevented. This contrasts with previous reliance on average pooling, which discards high-frequency details and induces spectral contamination.
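A minimal sketch of the pyramid under these definitions, using fixed Haar filters and stride-2 grouped convolution. The helper names (`haar_step`, `haar_pyramid`), the use of edge-replication padding as a stand-in for symmetric padding, and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

SQRT2 = 2 ** 0.5
H_LOW  = torch.tensor([1.0, 1.0]) / SQRT2   # low-pass (approximation) filter
H_HIGH = torch.tensor([1.0, -1.0]) / SQRT2  # high-pass (detail) filter

def haar_step(x):
    """One pyramid level. x: (B, C, L) -> approximation and detail, each (B, C, L//2)."""
    B, C, L = x.shape
    if L % 2 == 1:                           # edge padding (stand-in for symmetric padding)
        x = F.pad(x, (0, 1), mode="replicate")
    low  = H_LOW.view(1, 1, 2).repeat(C, 1, 1)
    high = H_HIGH.view(1, 1, 2).repeat(C, 1, 1)
    approx = F.conv1d(x, low,  stride=2, groups=C)   # down-sampled trend
    detail = F.conv1d(x, high, stride=2, groups=C)   # retained high-frequency part
    return approx, detail

def haar_pyramid(x, num_scales):
    """Return [X^(0), ..., X^(N)] approximations; details are kept alongside for losslessness."""
    scales, details = [x], []
    for _ in range(num_scales):
        a, d = haar_step(scales[-1])
        scales.append(a)
        details.append(d)
    return scales, details

x = torch.randn(4, 7, 96)                    # (batch, channels, length)
scales, details = haar_pyramid(x, num_scales=3)
print([s.shape[-1] for s in scales])         # [96, 48, 24, 12]
```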
3. Dual-Path Trend Mixer
Each scale is modeled with two parallel, gated pathways:
(a) Global Linear Path
Captures long-horizon macro trends by a single linear projection of the full scale-$j$ sequence to the prediction horizon:
$$H_{\text{global}}^{(j)} = W_{\text{global}}^{(j)} X^{(j)} + b^{(j)},$$
where $W_{\text{global}}^{(j)} \in \mathbb{R}^{T \times L_j}$ is shared across channels.
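A minimal sketch of the global path at one scale, assuming the projection acts along the time axis (and is therefore shared across channels); the sizes `L_j` and `horizon` are illustrative:

```python
import torch
import torch.nn as nn

B, C, L_j, horizon = 4, 7, 48, 192          # illustrative sizes for scale j
x_j = torch.randn(B, C, L_j)                # scale-j input, channels first

# One weight matrix maps the whole input window to the forecast horizon;
# applied along the last (time) axis, it is shared across all channels.
global_proj = nn.Linear(L_j, horizon)
H_global = global_proj(x_j)                 # (B, C, horizon)
```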
(b) Local Evolution Path (Patch-based MLP-Mixer)
- Patching: Split $X^{(j)}$ into non-overlapping patches of length $P$, yielding $N_p = L_j / P$ patches.
- Embedding: Linear mapping of each patch to hidden dimension $D$.
- Mixer Layers (typically up to 6):
  - Token-Mixing: an MLP applied across the patch (token) dimension, $Z \leftarrow Z + \mathrm{MLP}_{\text{token}}(\mathrm{LayerNorm}(Z)^{\top})^{\top}$.
  - Channel-Mixing: an MLP applied across the embedding dimension, $Z \leftarrow Z + \mathrm{MLP}_{\text{chan}}(\mathrm{LayerNorm}(Z))$.
- Output: Flatten the mixed patch representations and project to the horizon, resulting in $H_{\text{local}}^{(j)}$ (a code sketch of this path follows the list).
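A sketch of the local path under the description above; the expansion factor, the three-layer depth, and the per-channel (channel-independent) patching are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """One MLP-Mixer layer: token-mixing across patches, then channel-mixing across the embedding."""
    def __init__(self, num_patches, hidden_dim, expansion=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, expansion * num_patches), nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches))
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, expansion * hidden_dim), nn.GELU(),
            nn.Linear(expansion * hidden_dim, hidden_dim))

    def forward(self, z):                           # z: (B*C, num_patches, hidden_dim)
        z = z + self.token_mlp(self.norm1(z).transpose(1, 2)).transpose(1, 2)
        z = z + self.channel_mlp(self.norm2(z))
        return z

B, C, L_j, P, D, horizon = 4, 7, 48, 16, 64, 192    # illustrative sizes
num_patches = L_j // P

x_series = torch.randn(B * C, L_j)                  # each channel handled independently (assumption)
x_patches = x_series.view(B * C, num_patches, P)    # non-overlapping patches of length P

embed = nn.Linear(P, D)                             # patch embedding
mixer = nn.Sequential(*[MixerLayer(num_patches, D) for _ in range(3)])
head = nn.Linear(num_patches * D, horizon)          # flatten and project to the horizon

z = mixer(embed(x_patches))                         # (B*C, num_patches, D)
H_local = head(z.flatten(1)).view(B, C, horizon)    # (B, C, horizon)
```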
(c) Fusion
Per-scale output is combined via gated fusion:
$$\hat{Y}^{(j)} = w_g \cdot H_{\text{global}}^{(j)} + w_l \cdot H_{\text{local}}^{(j)},$$
with $w_g$, $w_l$ as learned scalars.
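A minimal sketch of the per-scale gate. Normalizing the two learned scalars with a softmax is an assumption here; the source only states that $w_g$ and $w_l$ are learned:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-scale gate combining the two path outputs with two learned scalars."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))   # softmax keeps w_g + w_l = 1 (an assumption)

    def forward(self, h_global, h_local):
        w_g, w_l = torch.softmax(self.logits, dim=0)
        return w_g * h_global + w_l * h_local

fuse = GatedFusion()
y_j = fuse(torch.randn(4, 7, 192), torch.randn(4, 7, 192))   # Ŷ^(j): (B, C, horizon)
```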
4. Adaptive Multi-Scale Fusion and Output Synthesis
Following per-scale mixing, DPWMixer learns a fusion weight matrix $W_{\text{fusion}} \in \mathbb{R}^{(N+1) \times C}$, normalized by a channel-wise softmax over scales. Forecasts across scales are combined as:
$$\hat{Y} = \sum_{j=0}^{N} W_{\text{fusion}}[j, \cdot] \odot \hat{Y}^{(j)}.$$
Restoration of original data statistics is achieved via RevIN.
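A sketch of this fusion step with a per-channel softmax over scales, matching the description above; the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveScaleFusion(nn.Module):
    """Channel-wise softmax weights over the N+1 per-scale forecasts."""
    def __init__(self, num_scales, num_channels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_scales, num_channels))  # (N+1, C)

    def forward(self, per_scale):                    # list of (B, C, T) forecasts
        w = torch.softmax(self.logits, dim=0)        # normalize over scales, per channel
        stacked = torch.stack(per_scale, dim=0)      # (N+1, B, C, T)
        return (w[:, None, :, None] * stacked).sum(dim=0)   # (B, C, T)

fusion = AdaptiveScaleFusion(num_scales=4, num_channels=7)   # N = 3 scales plus the original
y_hat = fusion([torch.randn(4, 7, 192) for _ in range(4)])
```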
5. Training Paradigm and Regularization
- Objective: Mean Squared Error (MSE): $\mathcal{L} = \frac{1}{TC}\sum_{t=1}^{T}\sum_{c=1}^{C}\big(\hat{Y}_{t,c} - Y_{t,c}\big)^{2}$.
- Distribution Alignment: RevIN normalization and de-normalization.
- Optimization: Adam optimizer, cosine-annealing learning rate schedule, early stopping (patience=5).
- Regularization: No extra weight decay beyond the Adam defaults; an illustrative training sketch follows this list.
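The sketch below follows the stated recipe (MSE loss, Adam, cosine annealing, patience-5 early stopping); the stand-in model, the learning rate, and the synthetic batches are assumptions, not the released training script:

```python
import torch
from torch import nn

model = nn.Linear(96, 192)                     # stand-in for DPWMixer (assumption)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr is illustrative
max_epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(max_epochs):
    # --- training pass (single synthetic batch shown for brevity) ---
    x, y = torch.randn(32, 96), torch.randn(32, 192)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

    # --- early stopping on a validation metric (synthetic here) ---
    val_loss = criterion(model(x), y).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```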
6. Computational Complexity
DPWMixer exhibits linear time complexity: the cost at scale $j$ is $\mathcal{O}(L_j)$ with $L_j = L/2^{j}$, so the total over all $N+1$ scales is $\mathcal{O}(L)$. All mixer and linear operations scale linearly in sequence length, whereas vanilla Transformers require $\mathcal{O}(L^{2})$ (or $\mathcal{O}(L \log L)$ for sparse variants).
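The linear bound follows because each Haar level halves the sequence length, so the per-scale costs form a geometric series (assuming $L_j = L/2^{j}$):

$$\mathcal{O}\Big(\textstyle\sum_{j=0}^{N} L_j\Big) = \mathcal{O}\Big(\textstyle\sum_{j=0}^{N} \tfrac{L}{2^{j}}\Big) \le \mathcal{O}(2L) = \mathcal{O}(L), \qquad \text{vs. } \mathcal{O}(L^{2}) \text{ for full self-attention.}$$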
7. Benchmark Evaluation and Ablation Insights
Datasets and Implementation
- Benchmarks: ETTh1/2, ETTm1/2, Electricity, Weather, Exchange, Traffic.
- Metrics: MSE, MAE (lower is better).
- PyTorch implementation; key hyperparameters: number of wavelet scales $N$, patch length $P$, hidden dimension $D$, batch size, and max epochs = 10.
Performance Summary
| Dataset (horizon) | DPWMixer (MSE) | Next best (MSE) |
|---|---|---|
| ETTm2 (T=96) | 0.169 | iTrans 0.180 |
| Electricity (avg) | 0.177 | iTrans 0.178 |
| Weather (avg) | 0.240 | iTrans 0.258 |
DPWMixer ranks 1st in 44/64 MSE and 24/32 MAE settings. Qualitatively, on Electricity (T=192/336), DPWMixer maintains fidelity to peaks and troughs, unlike the smoothing observed in Transformer outputs.
Ablation Study
| Component removed | Electricity ΔMSE | ETTm1 ΔMSE |
|---|---|---|
| w/o Wavelet (→ avg pool) | +11.8% | +6.5% |
| w/o Global Path | +7.5% | +8.9% |
| w/o Local Path | +13.5% | +8.1% |
| w/o Adaptive Fusion | +3.2% | +2.1% |
Removing any sub-module yields a notable degradation, confirming that each component is necessary.
8. Sensitivity Analysis and Implementation Guidance
- Wavelet scales $N$: On ETTh2 and Weather, MSE improves as $N$ increases to 3, then saturates or worsens beyond that due to over-downsampling; the optimum is $N = 3$.
- Parameters: the patch length $P$ balances local context against computational tractability; a moderate hidden dimension $D$ and up to 6 Mixer layers are recommended.
- Normalization: RevIN critical for domain transfer.
- Training: Learning rate warmup and early stopping (converges in 5–8 epochs).
- Reproducibility: Open-source code and weights available at https://github.com/hit636/DPWMixer.
A plausible implication is that DPWMixer’s systematic disentanglement and multi-path mixing strategy can be extended or adapted to other multi-scale time series tasks where trend-fluctuation separation is critical (Qianyang et al., 30 Nov 2025).