DPWMixer: Dual-Path Time Series Forecasting
- DPWMixer is a dual-path architecture for long-term time series forecasting that uses a lossless Haar wavelet pyramid to retain both high- and low-frequency details.
- It combines a global linear path for capturing macro trends with a local MLP-Mixer path for modeling intricate short-term dynamics.
- Adaptive multi-scale fusion with learned weights integrates predictions from all scales, achieving state-of-the-art performance with efficient linear complexity.
DPWMixer is a dual-path architecture for long-term time series forecasting (LTSF), integrating a lossless multi-scale decomposition and two complementary mixing pathways to address both global trend anchoring and local dynamic evolution. Transformer-based models, despite their proficiency in capturing long-range dependencies, are hampered by quadratic complexity and overfitting on sparse data; linear models, conversely, are inadequate for modeling complex non-linear local dynamics. Previous multi-scale approaches, typically reliant on average pooling, suffer from spectral aliasing and irreversible loss of high-frequency information. DPWMixer resolves these issues using an orthogonal Haar wavelet pyramid for lossless decomposition, dual-path mixers per scale, and an adaptive multi-scale fusion. The model delivers state-of-the-art forecasting results across standard benchmarks, with efficient linear complexity and effective retention of both global and local temporal features (Qianyang et al., 30 Nov 2025).
1. Architectural Design
DPWMixer comprises three sequential stages: multi-resolution decomposition, dual-path mixing, and adaptive fusion.
- Multi-Resolution Decomposition: Input is processed through instance-wise normalization (RevIN; a minimal normalization sketch follows the block diagram below), followed by a lossless Haar Wavelet Pyramid generating $N+1$ scales $X^{(0)}, X^{(1)}, \dots, X^{(N)}$.
- Dual-Path Mixing: Each scale is independently routed through:
- Global Linear Path: Projects the full sequence to the prediction horizon, anchoring macro-trend information.
- Local MLP-Mixer Path: Subdivides the sequence into non-overlapping patches and processes them with stacked MLP-Mixer layers, capturing micro-dynamics.
- Adaptive Multi-Scale Fusion: Outputs from all scales are integrated using learned, channel-wise softmax fusion weights, yielding the final prediction through a weighted sum. De-normalization (RevIN) restores each channel’s original distribution.
ASCII Block Diagram
Input X (B×L×C)
│
▼ RevIN (instance-wise normalization)
│
▼ Haar Wavelet Pyramid
├─ Scale 0 (X⁽⁰⁾)
├─ Scale 1 (X⁽¹⁾)
├─ …
└─ Scale N (X⁽ᴺ⁾)
│
▼ Dual-Path Trend Mixer (each scale j=0…N)
├─ Global Linear Path → H_global⁽ʲ⁾
├─ Local MLP-Mixer Path → H_local⁽ʲ⁾
└─ Gate fusion → Ŷ⁽ʲ⁾ = w_g·H_global⁽ʲ⁾ + w_l·H_local⁽ʲ⁾
│
▼ Adaptive Fusion
Ŷ = ∑_{j=0}^N W_fusion[j,·] ⊙ Ŷ⁽ʲ⁾
│
▼ De-Normalize (RevIN⁻¹) → final Ŷ (B×T×C)
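The RevIN steps at the top and bottom of the diagram can be made concrete with a minimal sketch. The function names and the omission of RevIN's learnable affine parameters are simplifying assumptions, not the released implementation:

```python
import torch

def revin_normalize(x, eps=1e-5):
    # x: (B, L, C). Per-instance, per-channel statistics over the time axis.
    mean = x.mean(dim=1, keepdim=True)                                   # (B, 1, C)
    std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + eps)   # (B, 1, C)
    return (x - mean) / std, (mean, std)

def revin_denormalize(y, stats):
    # y: (B, T, C) forecast in the normalized space; restore the original scale.
    mean, std = stats
    return y * std + mean

x = torch.randn(8, 96, 7)          # batch of 8 windows, length 96, 7 channels
x_norm, stats = revin_normalize(x)
y_hat = torch.randn(8, 192, 7)     # stand-in forecast, horizon 192
y = revin_denormalize(y_hat, stats)
```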
2. Lossless Haar Wavelet Pyramid
The multi-scale decomposition applies fixed Haar wavelet filters, ensuring orthogonal separation of trends and local fluctuations without loss of information or spectral aliasing. At each scale $j$:
- Filters: low-pass $h = \tfrac{1}{\sqrt{2}}[1,\,1]$ and high-pass $g = \tfrac{1}{\sqrt{2}}[1,\,-1]$.
- Computation: $X^{(j+1)} = (X^{(j)} * h)\!\downarrow\!2$ (approximation) and $D^{(j+1)} = (X^{(j)} * g)\!\downarrow\!2$ (detail), as sketched in code at the end of this subsection.
- "↓2" denotes strided convolution (stride=2) with symmetric edge padding.
By Parseval’s theorem and Haar orthogonality, perfect reconstruction is possible and aliasing is prevented. This contrasts with previous reliance on average pooling, which discards high-frequency details and induces spectral contamination.
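A minimal sketch of the pyramid under these definitions, using fixed Haar filters and stride-2 grouped convolution. The helper names (`haar_step`, `haar_pyramid`), the use of edge-replication padding as a stand-in for symmetric padding, and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

SQRT2 = 2 ** 0.5
H_LOW  = torch.tensor([1.0, 1.0]) / SQRT2   # low-pass (approximation) filter
H_HIGH = torch.tensor([1.0, -1.0]) / SQRT2  # high-pass (detail) filter

def haar_step(x):
    """One pyramid level. x: (B, C, L) -> approximation and detail, each (B, C, L//2)."""
    B, C, L = x.shape
    if L % 2 == 1:                           # edge padding (stand-in for symmetric padding)
        x = F.pad(x, (0, 1), mode="replicate")
    low  = H_LOW.view(1, 1, 2).repeat(C, 1, 1)
    high = H_HIGH.view(1, 1, 2).repeat(C, 1, 1)
    approx = F.conv1d(x, low,  stride=2, groups=C)   # down-sampled trend
    detail = F.conv1d(x, high, stride=2, groups=C)   # retained high-frequency part
    return approx, detail

def haar_pyramid(x, num_scales):
    """Return [X^(0), ..., X^(N)] approximations; details are kept alongside for losslessness."""
    scales, details = [x], []
    for _ in range(num_scales):
        a, d = haar_step(scales[-1])
        scales.append(a)
        details.append(d)
    return scales, details

x = torch.randn(4, 7, 96)                    # (batch, channels, length)
scales, details = haar_pyramid(x, num_scales=3)
print([s.shape[-1] for s in scales])         # [96, 48, 24, 12]
```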
3. Dual-Path Trend Mixer
Each scale is modeled with two parallel, gated pathways:
(a) Global Linear Path
Captures long-horizon macro trends by a single linear projection of the full scale-$j$ sequence to the prediction horizon:
$$H_{\text{global}}^{(j)} = W_{\text{global}}^{(j)} X^{(j)} + b^{(j)},$$
where $W_{\text{global}}^{(j)} \in \mathbb{R}^{T \times L_j}$ is shared across channels.
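A minimal sketch of the global path at one scale, assuming the projection acts along the time axis (and is therefore shared across channels); the sizes `L_j` and `horizon` are illustrative:

```python
import torch
import torch.nn as nn

B, C, L_j, horizon = 4, 7, 48, 192          # illustrative sizes for scale j
x_j = torch.randn(B, C, L_j)                # scale-j input, channels first

# One weight matrix maps the whole input window to the forecast horizon;
# applied along the last (time) axis, it is shared across all channels.
global_proj = nn.Linear(L_j, horizon)
H_global = global_proj(x_j)                 # (B, C, horizon)
```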
(b) Local Evolution Path (Patch-based MLP-Mixer)
- Patching: Split $X^{(j)}$ into non-overlapping patches of length $P$, yielding $N_p = L_j / P$ patches.
- Embedding: Linear mapping of each patch to hidden dimension $D$.
- Mixer Layers (typically up to 6):
  - Token-Mixing: an MLP applied across the patch (token) dimension, $Z \leftarrow Z + \mathrm{MLP}_{\text{token}}(\mathrm{LayerNorm}(Z)^{\top})^{\top}$.
  - Channel-Mixing: an MLP applied across the embedding dimension, $Z \leftarrow Z + \mathrm{MLP}_{\text{chan}}(\mathrm{LayerNorm}(Z))$.
- Output: Flatten the mixed patch representations and project to the horizon, resulting in $H_{\text{local}}^{(j)}$ (a code sketch of this path follows the list).
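A sketch of the local path under the description above; the expansion factor, the three-layer depth, and the per-channel (channel-independent) patching are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    """One MLP-Mixer layer: token-mixing across patches, then channel-mixing across the embedding."""
    def __init__(self, num_patches, hidden_dim, expansion=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, expansion * num_patches), nn.GELU(),
            nn.Linear(expansion * num_patches, num_patches))
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, expansion * hidden_dim), nn.GELU(),
            nn.Linear(expansion * hidden_dim, hidden_dim))

    def forward(self, z):                           # z: (B*C, num_patches, hidden_dim)
        z = z + self.token_mlp(self.norm1(z).transpose(1, 2)).transpose(1, 2)
        z = z + self.channel_mlp(self.norm2(z))
        return z

B, C, L_j, P, D, horizon = 4, 7, 48, 16, 64, 192    # illustrative sizes
num_patches = L_j // P

x_series = torch.randn(B * C, L_j)                  # each channel handled independently (assumption)
x_patches = x_series.view(B * C, num_patches, P)    # non-overlapping patches of length P

embed = nn.Linear(P, D)                             # patch embedding
mixer = nn.Sequential(*[MixerLayer(num_patches, D) for _ in range(3)])
head = nn.Linear(num_patches * D, horizon)          # flatten and project to the horizon

z = mixer(embed(x_patches))                         # (B*C, num_patches, D)
H_local = head(z.flatten(1)).view(B, C, horizon)    # (B, C, horizon)
```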
(c) Fusion
Per-scale output is combined via gated fusion:
$$\hat{Y}^{(j)} = w_g \cdot H_{\text{global}}^{(j)} + w_l \cdot H_{\text{local}}^{(j)},$$
with $w_g$, $w_l$ as learned scalars.
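A minimal sketch of the per-scale gate. Normalizing the two learned scalars with a softmax is an assumption here; the source only states that $w_g$ and $w_l$ are learned:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-scale gate combining the two path outputs with two learned scalars."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))   # softmax keeps w_g + w_l = 1 (an assumption)

    def forward(self, h_global, h_local):
        w_g, w_l = torch.softmax(self.logits, dim=0)
        return w_g * h_global + w_l * h_local

fuse = GatedFusion()
y_j = fuse(torch.randn(4, 7, 192), torch.randn(4, 7, 192))   # Ŷ^(j): (B, C, horizon)
```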
4. Adaptive Multi-Scale Fusion and Output Synthesis
Following per-scale mixing, DPWMixer learns a fusion weight matrix $W_{\text{fusion}} \in \mathbb{R}^{(N+1) \times C}$, normalized by a channel-wise softmax over scales. Forecasts across scales are combined as:
$$\hat{Y} = \sum_{j=0}^{N} W_{\text{fusion}}[j, \cdot] \odot \hat{Y}^{(j)}.$$
Restoration of original data statistics is achieved via RevIN.
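A sketch of this fusion step with a per-channel softmax over scales, matching the description above; the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class AdaptiveScaleFusion(nn.Module):
    """Channel-wise softmax weights over the N+1 per-scale forecasts."""
    def __init__(self, num_scales, num_channels):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_scales, num_channels))  # (N+1, C)

    def forward(self, per_scale):                    # list of (B, C, T) forecasts
        w = torch.softmax(self.logits, dim=0)        # normalize over scales, per channel
        stacked = torch.stack(per_scale, dim=0)      # (N+1, B, C, T)
        return (w[:, None, :, None] * stacked).sum(dim=0)   # (B, C, T)

fusion = AdaptiveScaleFusion(num_scales=4, num_channels=7)   # N = 3 scales plus the original
y_hat = fusion([torch.randn(4, 7, 192) for _ in range(4)])
```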
5. Training Paradigm and Regularization
- Objective: Mean Squared Error (MSE): $\mathcal{L} = \frac{1}{TC}\sum_{t=1}^{T}\sum_{c=1}^{C}\big(\hat{Y}_{t,c} - Y_{t,c}\big)^{2}$.
- Distribution Alignment: RevIN normalization and de-normalization.
- Optimization: Adam optimizer, cosine-annealing learning rate schedule, early stopping (patience=5).
- Regularization: No extra weight decay beyond the Adam defaults; an illustrative training sketch follows this list.
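The sketch below follows the stated recipe (MSE loss, Adam, cosine annealing, patience-5 early stopping); the stand-in model, the learning rate, and the synthetic batches are assumptions, not the released training script:

```python
import torch
from torch import nn

model = nn.Linear(96, 192)                     # stand-in for DPWMixer (assumption)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # lr is illustrative
max_epochs = 10
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(max_epochs):
    # --- training pass (single synthetic batch shown for brevity) ---
    x, y = torch.randn(32, 96), torch.randn(32, 192)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()

    # --- early stopping on a validation metric (synthetic here) ---
    val_loss = criterion(model(x), y).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```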
6. Computational Complexity
DPWMixer exhibits linear time complexity: the cost at scale $j$ is $\mathcal{O}(L_j)$ with $L_j = L/2^{j}$, so the total over all $N+1$ scales is $\mathcal{O}(L)$. All mixer and linear operations scale linearly in sequence length, whereas vanilla Transformers require $\mathcal{O}(L^{2})$ (or $\mathcal{O}(L \log L)$ for sparse variants).
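The linear bound follows because each Haar level halves the sequence length, so the per-scale costs form a geometric series (assuming $L_j = L/2^{j}$):

$$\mathcal{O}\Big(\textstyle\sum_{j=0}^{N} L_j\Big) = \mathcal{O}\Big(\textstyle\sum_{j=0}^{N} \tfrac{L}{2^{j}}\Big) \le \mathcal{O}(2L) = \mathcal{O}(L), \qquad \text{vs. } \mathcal{O}(L^{2}) \text{ for full self-attention.}$$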
7. Benchmark Evaluation and Ablation Insights
Datasets and Implementation
- Benchmarks: ETTh1/2, ETTm1/2, Electricity, Weather, Exchange, Traffic.
- Metrics: MSE, MAE (lower is better).
- PyTorch implementation; key hyperparameters: number of wavelet scales $N$, patch length $P$, hidden dimension $D$, batch size, and max epochs = 10.
Performance Summary
| Dataset (horizon) | DPWMixer (MSE) | Next best (MSE) |
|---|---|---|
| ETTm2 (T=96) | 0.169 | iTrans 0.180 |
| Electricity (avg) | 0.177 | iTrans 0.178 |
| Weather (avg) | 0.240 | iTrans 0.258 |
DPWMixer ranks 1st in 44/64 MSE and 24/32 MAE settings. Qualitatively, on Electricity (T=192/336), DPWMixer maintains fidelity to peaks and troughs, unlike the smoothing observed in Transformer outputs.
Ablation Study
| Component removed | Electricity ΔMSE | ETTm1 ΔMSE |
|---|---|---|
| w/o Wavelet (→ avg pool) | +11.8% | +6.5% |
| w/o Global Path | +7.5% | +8.9% |
| w/o Local Path | +13.5% | +8.1% |
| w/o Adaptive Fusion | +3.2% | +2.1% |
Removing any sub-module yields a notable degradation, confirming that each component is necessary.
8. Sensitivity Analysis and Implementation Guidance
- Wavelet scales $N$: On ETTh2 and Weather, MSE improves as $N$ increases to 3, then saturates or worsens beyond that due to over-downsampling; the optimum is $N = 3$.
- Parameters: the patch length $P$ balances local context against computational tractability; a moderate hidden dimension $D$ and up to 6 Mixer layers are recommended.
- Normalization: RevIN critical for domain transfer.
- Training: Learning rate warmup and early stopping (converges in 5–8 epochs).
- Reproducibility: Open-source code and weights available at https://github.com/hit636/DPWMixer.
A plausible implication is that DPWMixer’s systematic disentanglement and multi-path mixing strategy can be extended or adapted to other multi-scale time series tasks where trend-fluctuation separation is critical (Qianyang et al., 30 Nov 2025).