
HTV-Trans: Hierarchical Variational Transformer

Updated 18 January 2026
  • The paper introduces HTV-Trans, which integrates hierarchical latent variable modeling with transformer encoding to enhance forecasting in non-stationary and stochastic multivariate time series.
  • It employs a multi-scale variational framework that captures both long-term trends and local fluctuations, leading to up to 20% lower forecasting error on standard benchmarks.
  • The architecture fuses series stationarization with latent interpolation, enabling robust recovery of non-stationary information and improved performance in long-horizon predictions.

The Hierarchical Time Series Variational Transformer (HTV-Trans) is a variational generative dynamic model designed for forecasting in multivariate time series (MTS), particularly under challenging non-stationary and stochastic regimes. HTV-Trans integrates a hierarchical probabilistic generative module (HTPGM) with a transformer-based encoder, enabling explicit modeling of multi-scale non-stationarity and stochastic characteristics. This approach recovers the intrinsic non-stationary information removed by normalization and reinjects it into the learned temporal dependencies, and it has demonstrated superior forecasting performance across multiple standard time series datasets (Wang et al., 2024).

1. Architectural Overview

HTV-Trans comprises two tightly coupled components:

  • Hierarchical Time-series Probabilistic Generative Module (HTPGM): A multi-layer latent-variable architecture akin to a hierarchical variational autoencoder (VAE), designed to capture both stochasticity and non-stationary structure across multiple temporal scales.
  • Transformer Encoder with MLP Forecaster: Processes the stationarized input sequence, with self-attention representations augmented by a scalar-weighted interpolation of the latent summaries from HTPGM. This fusion injects non-stationary distributional cues into the deterministic transformer pathway.

For a given input series $X_n \in \mathbb{R}^{T \times V}$, the processing sequence is as follows:

  1. Series Stationarization: Apply sliding-window normalization to obtain $X_n'$.
  2. Transformer Encoding: Feed $X_n'$ to the transformer, with each attention input merged with an upsampled sum of latent states from HTPGM (modulated by $\alpha$).
  3. Forecasting Head: Deterministic latent states $h_{1:T,n}$, generated by the transformer, are used by a simple MLP for prediction and act as part of the generative process.
  4. Hierarchical Inference and Generation: In parallel, an inference network over $X_n'$ produces variational posteriors $q(z^i_{t,n} \mid x)$ for each latent tier; generation samples top-down through the hierarchy of priors, conditioned on both higher-level latents and $h_{1:t-1,n}$, to reconstruct the original series.
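Step 1 (and the matching de-normalization in the forecasting head) can be illustrated with a minimal numpy sketch of per-window normalization; the function names and shapes here are assumptions for illustration, not the paper's code:

```python
import numpy as np

def stationarize(window, eps=1e-5):
    """Normalize one input window X_n (T x V) by its per-variate
    mean and std; also return the stats needed to de-normalize later."""
    mu = window.mean(axis=0, keepdims=True)           # (1, V)
    sigma = window.std(axis=0, keepdims=True) + eps   # (1, V)
    return (window - mu) / sigma, mu, sigma

def denormalize(forecast, mu, sigma):
    """Restore model outputs (e.g., forecasts) to the original scale."""
    return forecast * sigma + mu

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(96, 7))  # T=96 steps, V=7 variates
X_norm, mu, sigma = stationarize(X)                # fed to the transformer
X_back = denormalize(X_norm, mu, sigma)            # round-trip check
```

This stabilizes scale per window, which is precisely why the latent pathway below is needed to carry the discarded non-stationary statistics back into the model.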

2. Generative and Inference Framework

2.1 Hierarchical Generative Model

The generative pathway maintains $L$ layers of latent variables $\{z^i_t\}_{i=1}^L$, each potentially operating at its own temporal resolution $T_i$:

  • Top Layer Prior: $p(z^L_t) = \mathcal{N}(0, I)$ (unconditional Gaussian).
  • Conditional Priors: For $i = L-1, \dots, 1$:

$$p(z^i_t \mid z^{i+1}_t, h_{t-1}) = \mathcal{N}\big(\mu^i(z^{i+1}_t, h_{t-1}),\ \operatorname{diag}(\sigma^i(z^{i+1}_t, h_{t-1}))\big)$$

  • Observation Model: The lowest layer $z^1_t$ decodes to the original data:

$$p(x_t \mid z^1_t) = \mathcal{N}(f_{\mathrm{dec}}(z^1_t),\ \tau^2 I)$$

where $f_{\mathrm{dec}}$ is a small neural decoder and $\tau^2$ is a fixed variance.

  • Forecasting Term: An auxiliary prediction distribution $p(x_T \mid h_{1:T-1})$ is added to strengthen long-term predictions.

The full joint is:

$$p(x_{1:T}, z^{1:L}_{1:T}) = \prod_{t=1}^T p(x_t \mid z^1_t) \left[\prod_{i=1}^{L-1} p(z^i_t \mid z^{i+1}_t, h_{t-1})\right] p(z^L_t)\; p(x_T \mid h_{1:T-1})^\gamma$$

where $\gamma > 0$ trades off direct reconstruction and forecasting fidelity.
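The top-down generation described above can be sketched as ancestral sampling through the hierarchy. In this sketch the linear maps standing in for the learned prior networks $\mu^i$ and $\sigma^i$ are hypothetical placeholders, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(42)
L, d_z, d_h = 3, 8, 16          # layers, latent dim, transformer state dim

# Hypothetical stand-ins for the prior networks mu^i, sigma^i: each maps
# the concatenation [z^{i+1}_t ; h_{t-1}] to a mean and a positive scale.
W_mu = [rng.normal(size=(d_z + d_h, d_z)) * 0.1 for _ in range(L - 1)]
W_sig = [rng.normal(size=(d_z + d_h, d_z)) * 0.1 for _ in range(L - 1)]

def sample_top_down(h_prev):
    """Ancestral sampling: z^L ~ N(0, I), then z^i | z^{i+1}, h_{t-1}."""
    z = rng.standard_normal(d_z)                   # top-layer prior N(0, I)
    for i in range(L - 2, -1, -1):                 # layers L-1 down to 1
        inp = np.concatenate([z, h_prev])
        mu = inp @ W_mu[i]
        sigma = np.logaddexp(0.0, inp @ W_sig[i])  # softplus keeps scale > 0
        z = mu + sigma * rng.standard_normal(d_z)  # Gaussian draw
    return z                                       # z^1_t, fed to the decoder

z1 = sample_top_down(h_prev=np.zeros(d_h))
```

Conditioning each layer on both the layer above and $h_{t-1}$ is what lets coarse non-stationary trends propagate down to the finest temporal scale.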

2.2 Variational Inference

The recognition (inference) model uses factorized Gaussians for each latent variable, conditioned only on the normalized windows:

$$q(z^{1:L}_{1:T} \mid \hat x_{1:T}) = \prod_{t=1}^T \prod_{i=1}^L \mathcal{N}\big(z^i_t;\ \mu^i_q(\hat x_t),\ \operatorname{diag}(\sigma^i_q(\hat x_t))\big)$$

Latents are sampled via the standard reparameterization trick.
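As a reminder, the reparameterization trick draws each latent as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, keeping the randomness outside the learnable parameters. A minimal sketch (standard VAE practice, not HTV-Trans-specific code):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow
    through mu and log_var while the noise source stays external."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(4), np.log(np.full(4, 0.25))   # sigma = 0.5
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
```

Averaged over many draws, the samples recover the posterior's mean and standard deviation.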

2.3 Evidence Lower Bound (ELBO)

The objective maximizes a multi-term ELBO:

$$\mathcal{L} = \sum_{t=1}^T \mathbb{E}_{q(z^1_t)}\big[\log p(x_t \mid z^1_t)\big] + \gamma\,\mathbb{E}_{q(z^1_{1:T})}\big[\log p(x_T \mid h_{1:T-1})\big] - \sum_{t=1}^T \sum_{i=1}^L \mathrm{KL}\big(q(z^i_t \mid \hat x_t)\,\|\,p(z^i_t \mid z^{i+1}_t, h_{t-1})\big)$$
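Since both the posterior and the conditional prior are diagonal Gaussians, each KL term in the bound has the usual closed form. A small sketch of that term (not the paper's implementation):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ),
    summed over latent dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# KL between identical Gaussians is exactly zero.
zero = kl_diag_gauss(np.zeros(8), np.ones(8), np.zeros(8), np.ones(8))
```

Because the prior's $\mu^i, \sigma^i$ depend on $z^{i+1}_t$ and $h_{t-1}$, these KL terms are what couple the inference network to the hierarchical dynamics during training.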

3. Mechanisms for Handling Non-Stationarity and Stochasticity

  • Series Stationarization: Each input window is normalized via sliding-window mean and standard deviation, as in series-stationarization or RevIN. This stabilizes scale but removes local non-stationary structure from the initial transformer input.
  • Hierarchical Latents at Multiple Scales: Latent variable hierarchy allows coarse-to-fine modeling: upper layers (with downsampled time resolution) capture slow, large-scale distributional drift; lower layers capture fast, local shifts. Top-down priors convey non-stationary trends to finer scales.
  • Stochastic Generation: All latent variables are stochastic, supporting noise and multi-modality in time series distributions and enhancing expressivity.
  • Transformer Fusion: Latents are upsampled to $T$ steps, summed, and interpolated to produce a summary matrix. This is added to the attention input, modulated by the scalar $\alpha$, thus restoring non-stationary information to the transformer's attention mechanism.

4. Transformer Attention Implementation

The transformer encoder operates on the stationarized, embedded inputs:

$$\tilde X = \mathrm{Embedding}(\hat X) + \alpha\,\mathrm{sum}$$

where $\mathrm{sum}$ is the interpolated, upsampled latent summary and $\alpha$ is learnable.
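The fusion step can be sketched as upsampling each layer's latents to $T$ steps, summing across layers, and adding the $\alpha$-scaled result to the embedded input. The shapes and the repeat-style upsampling here are illustrative assumptions:

```python
import numpy as np

T, d_model = 96, 16
alpha = 0.1                                   # learnable scalar in the model

def upsample_to_T(z, T):
    """Repeat a (T_i, d) latent sequence along time to length T."""
    reps = int(np.ceil(T / z.shape[0]))
    return np.repeat(z, reps, axis=0)[:T]

rng = np.random.default_rng(1)
embedded = rng.standard_normal((T, d_model))  # Embedding(X_hat)
latents = [rng.standard_normal((T // s, d_model)) for s in (1, 2, 4)]  # 3 tiers

summary = sum(upsample_to_T(z, T) for z in latents)  # the "sum" matrix
X_tilde = embedded + alpha * summary                 # attention input
```

Because $\alpha$ is learned, the model can decide how strongly the stochastic, non-stationary summary should perturb the stationarized embedding.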

Self-attention is multi-headed:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$

with $Q_i, K_i, V_i$ the projections of $\tilde X$. Outputs are concatenated and projected through $W_O$.
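One head of this (standard scaled dot-product) attention over the fused input can be rendered compactly in numpy; the projection weights here are random placeholders:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """head = softmax(Q K^T / sqrt(d_k)) V for a single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return scores @ V                         # (T, d_k) head output

rng = np.random.default_rng(0)
T, d_model, d_k = 96, 16, 8
X_tilde = rng.standard_normal((T, d_model))   # fused input from Section 3
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
head = attention_head(X_tilde, Wq, Wk, Wv)
```

Since the latent summary is already mixed into $\tilde X$, every query, key, and value carries the recovered non-stationary signal without any change to the attention mechanism itself.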

The final encoder state $h_{1:T}$ is mapped to the prediction horizon $\hat{x}_{T+1:T+H}$ via an MLP. The design does not use any explicit de-stationary attention; instead, the stochastic latent summary fulfills this function.

5. Training and Inference Procedures

  • Objective: Jointly optimize the ELBO over all model parameters using Adam with a conservative learning rate (e.g., $10^{-4}$), batch size $\sim 32$, and early stopping based on validation performance.
  • Regularization: Comes primarily from the KL divergence terms in the ELBO; neither additional weight decay nor dropout is critical.
  • Normalization: Predictions are de-normalized at the output stage to restore the original scale.
  • Inference: At test time, each window is processed via the encoder, latents are sampled, passed through the transformer, and the forecast is de-normalized.

6. Empirical Evaluation and Results

HTV-Trans was rigorously evaluated on multiple benchmarks:

| Dataset | Series Type | Prediction Horizons |
| --- | --- | --- |
| ETTh1/ETTh2 | electricity transformer temperature | {96, 192, 336, 720} |
| ETTm1/ETTm2 | electricity transformer temperature | {96, 192, 336, 720} |
| Weather | meteorological (hourly) | {96, 192, 336, 720} |
| ILI | weekly flu incidence | {24, 36, 48, 60} |
| Exchange | daily currency rates | {96, 192, 336, 720} |
  • Metrics: MSE and MAE (lower is better).
  • Baselines: Informer, Autoformer, Fedformer, Pyraformer, Crossformer, Non-stationary Transformer.

Performance: HTV-Trans achieved the lowest MSE/MAE in the majority of settings, improving long-horizon (720-step) forecasts by 10–20% over the best transformer baseline. Ablation studies confirm that the dynamic prior in HTPGM, the auxiliary forecasting term ($\gamma > 0$), and a well-tuned balancing scalar $\alpha$ are all critical. Using 3–5 hierarchical latent layers best balances expressivity against overfitting. Visualizations illustrate the model's superior ability to capture spikes, trend shifts, and volatility.

7. Significance and Implications

Explicitly modeling both non-stationarity (through a hierarchical latent structure) and stochasticity (via variational sampling) confers advantages for MTS forecasting, especially for complex, long-range, or distribution-shifting sequences. HTV-Trans demonstrates that the inductive biases needed for robust multivariate time series forecasting extend beyond stationarization, requiring structured latent generative modeling integrated with state-of-the-art sequence models. These results establish a new empirical and methodological baseline for probabilistic time series forecasting under non-stationary conditions (Wang et al., 2024).
