
Interleaved Dual-Branch Probability Distribution Network

Updated 5 December 2025
  • interPDN is a deep learning architecture for probabilistic time series forecasting that models predictions as full discrete distributions using dual interleaved branches.
  • It leverages fine and coarse temporal branches with self-supervised consistency losses to enhance prediction accuracy, quantization robustness, and uncertainty calibration.
  • Empirical results show interPDN achieves state-of-the-art performance on multiple multivariate benchmarks with significant error reductions and efficient inference.

The interleaved dual-branch Probability Distribution Network (interPDN) is a deep learning architecture for time series forecasting (TSF) that reconceptualizes the output at each forecast step as a full discrete probability distribution, rather than as a scalar prediction. The model utilizes a dual-branch mechanism at both fine and coarse temporal resolutions, interleaved categorical support sets, and multiple self-supervised consistency constraints to enhance prediction accuracy, quantization robustness, and uncertainty calibration. It achieves state-of-the-art (SOTA) empirical results on a broad range of multivariate time series benchmarks, offering a principled solution for uncertainty-aware TSF without restrictive parametric assumptions on the predictive distribution (Kong et al., 28 Nov 2025).

1. Architectural Design and Backbone

interPDN comprises four parallel backbone branches: two at the fine (original) time scale and two at a coarser temporal scale obtained by downsampling by a factor $k$. No backbone weights are shared across branches.

Each branch processes every channel independently, employing the following channel-wise pipeline:

  • Normalization and Decomposition: RevIN instance normalization followed by exponential moving average-based decomposition into trend and seasonal components.
  • Trend extraction: The trend is modeled through two Linear → Pool → LayerNorm blocks.
  • Seasonal extraction: The seasonal component is derived via temporal patching and passes successively through a Linear layer, a 1D convolution, a ResNet stack, and an MLP decoder.
  • Concatenation: Trend and seasonal outputs are concatenated, yielding $X_{\text{out}} \in \mathbb{R}^{C \times 2T}$ for each branch.

This design allows for channel-independent modeling, which is well suited to multivariate time series with heterogeneous channel dynamics.
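
To make the pipeline concrete, here is a minimal PyTorch sketch of a single branch. The patch length, hidden width, depth of the ResNet stack, and EMA smoothing factor are illustrative assumptions, not the paper's hyperparameters, and RevIN's learned affine terms are omitted:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """One residual block of the seasonal ResNet stack."""
    def __init__(self, d: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1),
        )

    def forward(self, x):                 # x: (N, d, n_patches)
        return x + self.body(x)

class BranchBackbone(nn.Module):
    """One of the four branches; input rows are individual channels."""
    def __init__(self, seq_len: int, horizon: int,
                 patch_len: int = 16, d: int = 64, alpha: float = 0.3):
        super().__init__()
        assert seq_len % patch_len == 0
        self.alpha, self.patch_len = alpha, patch_len
        n_patches = seq_len // patch_len
        # Trend: two Linear -> Pool -> LayerNorm blocks.
        self.trend_head = nn.Sequential(
            nn.Linear(seq_len, 2 * horizon), nn.AvgPool1d(2), nn.LayerNorm(horizon),
            nn.Linear(horizon, 2 * horizon), nn.AvgPool1d(2), nn.LayerNorm(horizon),
        )
        # Seasonal: patch embedding -> 1D conv -> ResNet stack -> MLP decoder.
        self.embed = nn.Linear(patch_len, d)
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.resnet = nn.Sequential(ResBlock(d), ResBlock(d))
        self.decoder = nn.Sequential(nn.Flatten(), nn.Linear(n_patches * d, horizon))

    def forward(self, x):                 # x: (N, seq_len)
        # RevIN-style instance normalization (learned affine terms omitted).
        mu, sigma = x.mean(-1, keepdim=True), x.std(-1, keepdim=True) + 1e-5
        x = (x - mu) / sigma
        # EMA decomposition into trend and seasonal components.
        trend = [x[:, :1]]
        for t in range(1, x.shape[1]):
            trend.append(self.alpha * x[:, t:t + 1] + (1 - self.alpha) * trend[-1])
        trend = torch.cat(trend, dim=1)
        seasonal = x - trend
        # Seasonal pipeline over non-overlapping temporal patches.
        p = seasonal.unfold(1, self.patch_len, self.patch_len)  # (N, n_patches, patch_len)
        p = self.embed(p).transpose(1, 2)                       # (N, d, n_patches)
        s_out = self.decoder(self.resnet(self.conv(p)))         # (N, horizon)
        t_out = self.trend_head(trend)                          # (N, horizon)
        return torch.cat([t_out, s_out], dim=-1)                # (N, 2*horizon) ~ X_out

# Example: 7 channels processed independently, lookback 96, horizon 48.
print(BranchBackbone(96, 48)(torch.randn(7, 96)).shape)  # torch.Size([7, 96])
```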

2. Discrete Probability Distribution Modeling and Interleaved Support Sets

interPDN's probabilistic generation module projects each branch's $X_{\text{out}}$ through a fully connected (FC) layer to $T \times S$ logits $X_f \in \mathbb{R}^{T \times S}$, where $S$ is the size of the categorical support set. A softmax along the $S$ dimension yields a discrete probability distribution $p_t$ at each time step $t$ over the support points $Z = \{z_1, \ldots, z_S\}$.

The support set $Z$ is constructed by partitioning the interval $[-B, B]$ into $S$ equiprobable subintervals under the standard normal CDF $F$ restricted to $[-B, B]$: the breakpoints $b_j$ satisfy $F(b_j) = F(-B) + \frac{j}{S}\bigl(F(B) - F(-B)\bigr)$ for $j = 0, \ldots, S$, and the support points $z_j$ are the midpoints of these subintervals.

Fine-scale dual branches use interleaved support sets $Sp_1$ and $Sp_2$: $Sp_2$ is obtained from $Sp_1$ by placing its points at the midpoints between $Sp_1$'s adjacent values (with added boundary points). This interleaving mitigates the quantization error and boundary anomalies that arise when the true target lies near support-set boundaries.
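
A small sketch of this construction, assuming the equiprobable partition is taken under the standard normal CDF truncated to $[-B, B]$, with illustrative values $B = 3$ and $S = 8$:

```python
import numpy as np
from scipy.stats import norm

def make_support(S: int, B: float = 3.0) -> np.ndarray:
    """S support points: midpoints of S equiprobable subintervals of [-B, B]."""
    lo, hi = norm.cdf(-B), norm.cdf(B)
    breakpoints = norm.ppf(lo + (hi - lo) * np.linspace(0.0, 1.0, S + 1))
    return 0.5 * (breakpoints[:-1] + breakpoints[1:])   # midpoints z_1..z_S

def interleave(sp1: np.ndarray, B: float = 3.0) -> np.ndarray:
    """Sp2: midpoints of Sp1's adjacent values, plus the two boundaries."""
    return np.concatenate(([-B], 0.5 * (sp1[:-1] + sp1[1:]), [B]))

sp1 = make_support(S=8)
sp2 = interleave(sp1)
print(np.round(sp1, 3))   # fine grid Sp1
print(np.round(sp2, 3))   # offset grid Sp2 covering Sp1's gaps
```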

3. Fusion and Coarse-Temporal Branches

At each time step, the categorical outputs $p_{1,t}$ (from $Sp_1$) and $p_{2,t}$ (from $Sp_2$) are fused using a confidence weighting. Defining $e_{i,t} = \max_s p_{i,t,s}$ for branch $i$ at time $t$ and $w_t = e_{1,t}/(e_{1,t} + e_{2,t})$, the final prediction at $t$ is

$$\hat{X}_t = w_t \, \mathbb{E}_{Sp_1}[p_{1,t}] + (1 - w_t)\, \mathbb{E}_{Sp_2}[p_{2,t}],$$

where $\mathbb{E}_{Sp}[p_t] = \sum_{s=1}^{S} z_s\, p_{t,s}$.
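
The fusion step is simple enough to state directly in code. The following NumPy sketch implements the formulas above, using uniform stand-in grids rather than the CDF-based supports:

```python
import numpy as np

def fuse(p1, p2, sp1, sp2):
    """p1: (T, S1), p2: (T, S2) per-step distributions over grids sp1, sp2."""
    e1, e2 = p1.max(axis=1), p2.max(axis=1)   # confidences e_{i,t}
    w = e1 / (e1 + e2)                        # w_t = e_{1,t}/(e_{1,t}+e_{2,t})
    m1, m2 = p1 @ sp1, p2 @ sp2               # E_{Sp}[p_t] = sum_s z_s p_{t,s}
    return w * m1 + (1.0 - w) * m2            # fused point forecast \hat{X}_t

# Toy usage with uniform stand-in grids (S = 8 and the offset S + 1 = 9 grid).
rng = np.random.default_rng(0)
sp1 = np.linspace(-3.0, 3.0, 8)
sp2 = np.concatenate(([-3.0], 0.5 * (sp1[:-1] + sp1[1:]), [3.0]))
softmax = lambda z: np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
p1, p2 = softmax(rng.normal(size=(5, 8))), softmax(rng.normal(size=(5, 9)))
print(fuse(p1, p2, sp1, sp2))                 # 5 fused step forecasts
```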

Two additional coarse-scale branches operate at the downsampled horizon $T/k$, projecting $X_{\text{out}}$ to $(T/k) \times S$ logits and yielding distributions $p_{3,u}, p_{4,u}$ over the same pair of support sets. Their fused expectations $X_{sp,3}$ and $X_{sp,4}$ produce a coarse fused signal $X_{sp,c}$, which serves exclusively as a self-supervised trend anchor for the fine-grained branches.

4. Self-Supervised Consistency Constraints and Loss Formulation

interPDN employs multiple self-supervised losses to regularize learning:

  • Primary prediction loss: An xPatch-style weighted $L_1$ loss,

$$L_p = \frac{1}{TC} \sum_{i=1}^{C} \sum_{t=1}^{T} \theta(t)\, \bigl| x_t^i - \hat{y}_t^i \bigr|,$$

where $\theta(t)$ is an arctan-decayed weight that emphasizes near-term predictions (a sketch of the full objective appears below).

  • Fine-scale dual-branch consistency:

$$L_f = \frac{1}{TC} \sum_{i=1}^{C} \bigl\| X_{sp,1}^i - X_{sp,2}^i \bigr\|_2^2$$

  • Coarse-scale dual-branch consistency:

$$L_c = \frac{k}{TC} \sum_{i=1}^{C} \bigl\| X_{sp,3}^i - X_{sp,4}^i \bigr\|_2^2$$

  • Cross-scale consistency: Downsample the fine-scale fused output by average pooling, $X_{d,f}^i = \operatorname{AvgPool}(X_{sp,f}^i;\, k)$, and compute

$$L_t = \frac{k}{TC} \sum_{i=1}^{C} \bigl\| X_{sp,c}^i - X_{d,f}^i \bigr\|_2^2$$

  • Total objective:

$$L_{\text{total}} = L_p + \alpha L_f + \beta L_c + \gamma L_t$$

where $\alpha, \beta, \gamma \geq 0$ are tuned hyperparameters (typical values: 0.02–0.4).

Consistency losses across and within scales ensure balanced branch behaviors, preventing pathological collapse/divergence and enforcing agreement on forecasts.
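
A minimal sketch assembling the total objective from the fused per-channel expectations named above. The arctan shape of $\theta(t)$ is an assumption (only its decaying character is stated here), and the coefficient values are illustrative picks from the stated range:

```python
import torch
import torch.nn.functional as F

def total_loss(y, x_f, x_sp1, x_sp2, x_sp3, x_sp4, x_spc, k,
               alpha=0.1, beta=0.1, gamma=0.2):
    """y, x_f, x_sp1, x_sp2: (C, T); x_sp3, x_sp4, x_spc: (C, T//k)."""
    C, T = y.shape
    t = torch.arange(1, T + 1, dtype=y.dtype)
    theta = 1.0 - (2.0 / torch.pi) * torch.atan(t / T)     # assumed arctan decay
    L_p = (theta * (y - x_f).abs()).sum() / (T * C)        # weighted L1
    L_f = ((x_sp1 - x_sp2) ** 2).sum() / (T * C)           # fine-scale consistency
    L_c = k * ((x_sp3 - x_sp4) ** 2).sum() / (T * C)       # coarse-scale consistency
    x_df = F.avg_pool1d(x_f.unsqueeze(1), k).squeeze(1)    # AvgPool(X_{sp,f}; k)
    L_t = k * ((x_spc - x_df) ** 2).sum() / (T * C)        # cross-scale consistency
    return L_p + alpha * L_f + beta * L_c + gamma * L_t

C, T, k = 7, 48, 4
y, x_f, x_sp1, x_sp2 = (torch.randn(C, T) for _ in range(4))
x_sp3, x_sp4, x_spc = (torch.randn(C, T // k) for _ in range(3))
print(total_loss(y, x_f, x_sp1, x_sp2, x_sp3, x_sp4, x_spc, k))
```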

5. Training, Inference, and Anomaly Mitigation

Training minimizes $L_{\text{total}}$ using Adam or SGD with weight decay and learning rates between $1 \times 10^{-4}$ and $1 \times 10^{-2}$. Early stopping on validation loss and RevIN denormalization are employed.
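
A generic training-loop sketch matching this recipe; the model, data loaders, and loss function are placeholders rather than the paper's code:

```python
import torch

def train(model, train_loader, val_loader, loss_fn, epochs=50, patience=5):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    best, bad = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
        if val < best:
            best, bad = val, 0            # checkpointing omitted for brevity
        else:
            bad += 1
            if bad >= patience:           # early stopping on validation loss
                break
```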

Inference is deterministic: for each channel and step, the two fine-branch categorical distributions are fused (see above), and the point forecast is their expectation. No sampling or quantile regression is required.

Anomaly mitigation is achieved through three mechanisms:

  • Interleaved dual branches address quantization boundary artifacts: when the true value falls between two support points of one grid, the offset grid of the other branch provides fine-grained coverage (a toy numeric check follows this list).
  • Consistency losses stabilize training, ensuring neither branch becomes degenerate.
  • Coarse-scale branches anchor predictions to robust long-term trends, limiting local outlier influence.
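
A toy numeric check of the first mechanism, using a uniform stand-in grid: rounding to a single grid has worst-case error of half a grid step, and adding the midpoint-offset grid roughly halves it again:

```python
import numpy as np

step = 0.5
sp1 = np.arange(-3.0, 3.0 + step, step)         # uniform stand-in grid
sp2 = sp1[:-1] + step / 2                       # midpoint-offset grid

targets = np.random.default_rng(0).uniform(-2.5, 2.5, 10_000)
err1 = np.abs(targets[:, None] - sp1[None, :]).min(axis=1)
err2 = np.abs(targets[:, None] - sp2[None, :]).min(axis=1)
print(err1.max())                    # worst case on one grid: ~step/2
print(np.minimum(err1, err2).max())  # with both grids: ~step/4
```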

6. Empirical Performance and Comparative Results

interPDN's effectiveness is established on nine real-world multivariate datasets (ETTh1, ETTh2, ETTm1, ETTm2, Electricity, Traffic, Weather, Exchange-rate, Illness) with forecast horizons ranging from 24 to 720. Evaluation metrics include mean squared error (MSE), mean absolute error (MAE), CRPS, and MASE.

Compared to prominent TSF baselines (xPatch, RAFT, AMD, MOMENT, TimeMixer, iTransformer, TimesNet, PatchTST, DLinear), interPDN achieves SOTA on 71.1% (MSE) and 84.4% (MAE) of tasks, ranking first on 32/45 settings by MSE and 38/45 by MAE. It demonstrates 2.44% lower MSE and 1.51% lower MAE than xPatch, and improvements of up to 35.15% and 20.27%, respectively, over the Transformer-based baselines iTransformer and PatchTST. interPDN yields 13.96%, 4.54%, and 15.65% MAE reductions compared to RAFT, AMD, and MOMENT, respectively.

Ablation studies show: single-branch probabilistic modeling provides moderate gains, adding interleaved dual branches delivers uniform improvement, incorporating the coarse-scale branch further boosts accuracy, and full interPDN (both features combined) achieves top results. A naïve four-branch scalar ensemble does not match interPDN, underscoring the critical role of probabilistic heads and consistency regularization.

In terms of probabilistic calibration, CRPS is on average within 3.24% of TMDM (a diffusion-based TSF SOTA) and 39.4% lower than DiffusionTS. On the ETT datasets, interPDN achieves MASE $< 1$ on shorter horizons and a MASE reduction of more than 19% versus xPatch on ETTm2. Despite its four-branch design, the backbone is computationally efficient (MLP + 1D-Conv); epoch time is 6.8×–10.1× faster than PatchTST/TimesNet, with comparable or lower parameter count and memory requirements.

7. Context, Applications, and Significance

By modeling per-step discrete distributions on interleaved grids and enforcing multi-scale, multi-branch self-supervised constraints, interPDN establishes a new distribution-centric paradigm for TSF. The architecture is robust to quantization error and outlier anomalies, and it supports reliable uncertainty quantification without parametric output assumptions. Its empirical SOTA results and efficiency suggest strong applicability to large-scale TSF tasks across energy, economics, climate, and epidemiology, where both accuracy and calibrated uncertainty are critical for downstream decision-making (Kong et al., 28 Nov 2025).
