
HTV-Trans: Hierarchical Variational Transformer

Updated 18 January 2026
  • The paper introduces HTV-Trans, which integrates hierarchical latent variable modeling with transformer encoding to enhance forecasting in non-stationary and stochastic multivariate time series.
  • It employs a multi-scale variational framework that captures both long-term trends and local fluctuations, leading to up to 20% lower forecasting error on standard benchmarks.
  • The architecture fuses series stationarization with latent interpolation, enabling robust recovery of non-stationary information and improved performance in long-horizon predictions.

The Hierarchical Time Series Variational Transformer (HTV-Trans) is a variational generative dynamic model designed for forecasting in multivariate time series (MTS), particularly under challenging non-stationary and stochastic regimes. HTV-Trans integrates a hierarchical probabilistic generative module (HTPGM) with a transformer-based encoder, enabling explicit modeling of multi-scale non-stationarity and stochastic characteristics. This approach recovers the intrinsic non-stationary information removed by normalization and reinjects it into the learned temporal dependencies, and it has demonstrated superior forecasting performance across multiple standard time series datasets (Wang et al., 2024).

1. Architectural Overview

HTV-Trans comprises two tightly coupled components:

  • Hierarchical Time-series Probabilistic Generative Module (HTPGM): A multi-layer latent-variable architecture akin to a hierarchical variational autoencoder (VAE), designed to capture both stochasticity and non-stationary structure across multiple temporal scales.
  • Transformer Encoder with MLP Forecaster: Processes the stationarized input sequence, with self-attention representations augmented by a scalar-weighted interpolation of the latent summaries from HTPGM. This fusion injects non-stationary distributional cues into the deterministic transformer pathway.

For a given input series $X_n \in \mathbb{R}^{T \times V}$, the processing sequence is as follows:

  1. Series Stationarization: Apply sliding-window normalization to obtain $X_n'$.
  2. Transformer Encoding: Feed $X_n'$ to the transformer, with each attention input merged with an upsampled sum of latent states from HTPGM (modulated by $\alpha$).
  3. Forecasting Head: Deterministic latent states $h_{1:T,n}$, generated by the transformer, are used by a simple MLP for prediction and act as part of the generative process.
  4. Hierarchical Inference and Generation: In parallel, an inference network over $X_n'$ produces variational posteriors $q(z^i_{t,n} \mid x)$ for each latent tier; generation samples top-down through the hierarchy of priors, conditioned on both higher-level latents and $h_{1:t-1,n}$, to reconstruct the original series.
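Step 1 (and the matching de-normalization in the forecasting head) can be illustrated with a minimal numpy sketch of per-window normalization; the function names and shapes here are assumptions for illustration, not the paper's code:

```python
import numpy as np

def stationarize(window, eps=1e-5):
    """Normalize one input window X_n (T x V) by its per-variate
    mean and std; also return the stats needed to de-normalize later."""
    mu = window.mean(axis=0, keepdims=True)           # (1, V)
    sigma = window.std(axis=0, keepdims=True) + eps   # (1, V)
    return (window - mu) / sigma, mu, sigma

def denormalize(forecast, mu, sigma):
    """Restore model outputs (e.g., forecasts) to the original scale."""
    return forecast * sigma + mu

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(96, 7))  # T=96 steps, V=7 variates
X_norm, mu, sigma = stationarize(X)                # fed to the transformer
X_back = denormalize(X_norm, mu, sigma)            # round-trip check
```

This stabilizes scale per window, which is precisely why the latent pathway below is needed to carry the discarded non-stationary statistics back into the model.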

2. Generative and Inference Framework

2.1 Hierarchical Generative Model

The generative pathway maintains $L$ layers of latent variables $\{z^i_t\}_{i=1}^L$, each potentially operating at its own temporal resolution $T_i$:

  • Top Layer Prior: $p(z^L_t) = \mathcal{N}(0, I)$ (unconditional Gaussian).
  • Conditional Priors: For $i = L-1, \dots, 1$:

$$p(z^i_t \mid z^{i+1}_t, h_{t-1}) = \mathcal{N}\big(\mu^i(z^{i+1}_t, h_{t-1}),\ \operatorname{diag}(\sigma^i(z^{i+1}_t, h_{t-1}))\big)$$

  • Observation Model: The lowest layer $z^1_t$ decodes to the original data:

$$p(x_t \mid z^1_t) = \mathcal{N}(f_{\mathrm{dec}}(z^1_t),\ \tau^2 I)$$

where $f_{\mathrm{dec}}$ is a small neural decoder and $\tau^2$ is a fixed variance.

  • Forecasting Term: An auxiliary prediction distribution $p(x_T \mid h_{1:T-1})$ is added to strengthen long-term predictions.

The full joint is:

$$p(x_{1:T}, z^{1:L}_{1:T}) = \prod_{t=1}^T p(x_t \mid z^1_t) \left[\prod_{i=1}^{L-1} p(z^i_t \mid z^{i+1}_t, h_{t-1})\right] p(z^L_t)\; p(x_T \mid h_{1:T-1})^\gamma$$

where $\gamma > 0$ trades off direct reconstruction and forecasting fidelity.
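The top-down generation described above can be sketched as ancestral sampling through the hierarchy. In this sketch the linear maps standing in for the learned prior networks $\mu^i$ and $\sigma^i$ are hypothetical placeholders, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(42)
L, d_z, d_h = 3, 8, 16          # layers, latent dim, transformer state dim

# Hypothetical stand-ins for the prior networks mu^i, sigma^i: each maps
# the concatenation [z^{i+1}_t ; h_{t-1}] to a mean and a positive scale.
W_mu = [rng.normal(size=(d_z + d_h, d_z)) * 0.1 for _ in range(L - 1)]
W_sig = [rng.normal(size=(d_z + d_h, d_z)) * 0.1 for _ in range(L - 1)]

def sample_top_down(h_prev):
    """Ancestral sampling: z^L ~ N(0, I), then z^i | z^{i+1}, h_{t-1}."""
    z = rng.standard_normal(d_z)                   # top-layer prior N(0, I)
    for i in range(L - 2, -1, -1):                 # layers L-1 down to 1
        inp = np.concatenate([z, h_prev])
        mu = inp @ W_mu[i]
        sigma = np.logaddexp(0.0, inp @ W_sig[i])  # softplus keeps scale > 0
        z = mu + sigma * rng.standard_normal(d_z)  # Gaussian draw
    return z                                       # z^1_t, fed to the decoder

z1 = sample_top_down(h_prev=np.zeros(d_h))
```

Conditioning each layer on both the layer above and $h_{t-1}$ is what lets coarse non-stationary trends propagate down to the finest temporal scale.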

2.2 Variational Inference

The recognition (inference) model uses factorized Gaussians for each latent variable, conditioned only on the normalized windows:

$$q(z^{1:L}_{1:T} \mid \hat x_{1:T}) = \prod_{t=1}^T \prod_{i=1}^L \mathcal{N}\big(z^i_t;\ \mu^i_q(\hat x_t),\ \operatorname{diag}(\sigma^i_q(\hat x_t))\big)$$

Latents are sampled via the standard reparameterization trick.
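As a reminder, the reparameterization trick draws each latent as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, keeping the randomness outside the learnable parameters. A minimal sketch (standard VAE practice, not HTV-Trans-specific code):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    through mu and log_var while the noise source stays external."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(4), np.log(np.full(4, 0.25))   # sigma = 0.5
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
```

Averaged over many draws, the samples recover the posterior's mean and standard deviation.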

2.3 Evidence Lower Bound (ELBO)

The objective maximizes a multi-term ELBO:

$$\mathcal{L} = \sum_{t=1}^T \mathbb{E}_{q(z^1_t)}\big[\log p(x_t \mid z^1_t)\big] + \gamma\,\mathbb{E}_{q(z^1_{1:T})}\big[\log p(x_T \mid h_{1:T-1})\big] - \sum_{t=1}^T \sum_{i=1}^L \mathrm{KL}\big(q(z^i_t \mid \hat x_t)\,\|\,p(z^i_t \mid z^{i+1}_t, h_{t-1})\big)$$
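Since both the posterior and the conditional prior are diagonal Gaussians, each KL term in the bound has the usual closed form. A small sketch of that term (not the paper's implementation):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ),
    summed over latent dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# KL between identical Gaussians is exactly zero.
zero = kl_diag_gauss(np.zeros(8), np.ones(8), np.zeros(8), np.ones(8))
```

Because the prior's $\mu^i, \sigma^i$ depend on $z^{i+1}_t$ and $h_{t-1}$, these KL terms are what couple the inference network to the hierarchical dynamics during training.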

3. Mechanisms for Handling Non-Stationarity and Stochasticity

  • Series Stationarization: Each input window is normalized via sliding-window mean and standard deviation, as in series-stationarization or RevIN. This stabilizes scale but removes local non-stationary structure from the initial transformer input.
  • Hierarchical Latents at Multiple Scales: Latent variable hierarchy allows coarse-to-fine modeling: upper layers (with downsampled time resolution) capture slow, large-scale distributional drift; lower layers capture fast, local shifts. Top-down priors convey non-stationary trends to finer scales.
  • Stochastic Generation: All latent variables are stochastic, supporting noise and multi-modality in time series distributions and enhancing expressivity.
  • Transformer Fusion: Latents are upsampled to $T$ steps, summed, and interpolated to produce a summary matrix. This is added to the attention input, modulated by the scalar $\alpha$, thus restoring non-stationary information to the transformer's attention mechanism.

4. Transformer Attention Implementation

The transformer encoder operates on the stationarized, embedded inputs:

$$\tilde X = \mathrm{Embedding}(\hat X) + \alpha\,\mathrm{sum}$$

where $\mathrm{sum}$ is the interpolated, upsampled latent summary and $\alpha$ is learnable.
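The fusion step can be sketched as upsampling each layer's latents to $T$ steps, summing across layers, and adding the $\alpha$-scaled result to the embedded input. The shapes and the repeat-style upsampling here are illustrative assumptions:

```python
import numpy as np

T, d_model = 96, 16
alpha = 0.1                                   # learnable scalar in the model

def upsample_to_T(z, T):
    """Repeat a (T_i, d) latent sequence along time to length T."""
    reps = int(np.ceil(T / z.shape[0]))
    return np.repeat(z, reps, axis=0)[:T]

rng = np.random.default_rng(1)
embedded = rng.standard_normal((T, d_model))  # Embedding(X_hat)
latents = [rng.standard_normal((T // s, d_model)) for s in (1, 2, 4)]  # 3 tiers

summary = sum(upsample_to_T(z, T) for z in latents)  # the "sum" matrix
X_tilde = embedded + alpha * summary                 # attention input
```

Because $\alpha$ is learned, the model can decide how strongly the stochastic, non-stationary summary should perturb the stationarized embedding.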

Self-attention is multi-headed:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

with $Q_i, K_i, V_i$ the projections of $\tilde X$. Outputs are concatenated and projected through $W_O$.
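One head of this (standard scaled dot-product) attention over the fused input can be rendered compactly in numpy; the projection weights here are random placeholders:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """head = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return scores @ V                         # (T, d_k) head output

rng = np.random.default_rng(0)
T, d_model, d_k = 96, 16, 8
X_tilde = rng.standard_normal((T, d_model))   # fused input from Section 3
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
head = attention_head(X_tilde, Wq, Wk, Wv)
```

Since the latent summary is already mixed into $\tilde X$, every query, key, and value carries the recovered non-stationary signal without any change to the attention mechanism itself.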

The final encoder state $h_{1:T}$ is mapped to the prediction horizon $\hat{x}_{T+1:T+H}$ via an MLP. The design does not use any explicit de-stationary attention; instead, the stochastic latent summary fulfills this function.

5. Training and Inference Procedures

  • Objective: Jointly optimize the ELBO over all model parameters using Adam with a conservative learning rate (e.g., $10^{-4}$), batch size $\sim 32$, and early stopping based on validation performance.
  • Regularization: Comes primarily from the KL divergence terms in the ELBO; neither additional weight decay nor dropout is critical.
  • Normalization: Predictions are de-normalized at the output stage to restore the original scale.
  • Inference: At test time, each window is processed via the encoder, latents are sampled, passed through the transformer, and the forecast is de-normalized.

6. Empirical Evaluation and Results

HTV-Trans was rigorously evaluated on multiple benchmarks:

| Dataset | Series Type | Prediction Horizons |
| --- | --- | --- |
| ETTh1/ETTh2 | electricity transformer temperature | {96, 192, 336, 720} |
| ETTm1/ETTm2 | electricity transformer temperature | {96, 192, 336, 720} |
| Weather | meteorological (hourly) | {96, 192, 336, 720} |
| ILI | weekly flu incidence | {24, 36, 48, 60} |
| Exchange | daily currency rates | {96, 192, 336, 720} |
  • Metrics: MSE and MAE (lower is better).
  • Baselines: Informer, Autoformer, Fedformer, Pyraformer, Crossformer, Non-stationary Transformer.

Performance: HTV-Trans achieved the lowest MSE/MAE in the majority of settings, improving long-horizon (720-step) forecasts by 10–20% over the best transformer baseline. Ablation studies confirm that the dynamic prior in HTPGM, the auxiliary forecasting term ($\gamma > 0$), and a well-tuned balancing scalar $\alpha$ are all critical. Using 3–5 hierarchical latent layers best balances expressivity against overfitting. Visualizations illustrate the model's superior ability to capture spikes, trend shifts, and volatility.

7. Significance and Implications

Explicitly modeling both non-stationarity (through a hierarchical latent structure) and stochasticity (via variational sampling) confers advantages for MTS forecasting, especially for complex, long-range, or distribution-shifting sequences. HTV-Trans demonstrates that the inductive biases needed for robust multivariate time series forecasting extend beyond stationarization, requiring structured latent generative modeling integrated with state-of-the-art sequence models. These results establish a new empirical and methodological baseline for probabilistic time series forecasting under non-stationary conditions (Wang et al., 2024).
