HTV-Trans: Hierarchical Variational Transformer
- The paper introduces HTV-Trans, which integrates hierarchical latent variable modeling with transformer encoding to enhance forecasting in non-stationary and stochastic multivariate time series.
- It employs a multi-scale variational framework that captures both long-term trends and local fluctuations, leading to up to 20% lower forecasting error on standard benchmarks.
- The architecture fuses series stationarization with latent interpolation, enabling robust recovery of non-stationary information and improved performance in long-horizon predictions.
The Hierarchical Time Series Variational Transformer (HTV-Trans) is a variational generative dynamic model designed for forecasting in multivariate time series (MTS), particularly under challenging non-stationary and stochastic regimes. HTV-Trans integrates a hierarchical probabilistic generative module (HTPGM) with a transformer-based encoder, enabling explicit modeling of multi-scale non-stationarity and stochastic characteristics. This approach recovers intrinsic non-stationary information, reinjects it into the temporal dependencies learned by the encoder, and has demonstrated superior forecasting performance across multiple standard time series datasets (Wang et al., 2024).
1. Architectural Overview
HTV-Trans comprises two tightly coupled components:
- Hierarchical Time-series Probabilistic Generative Module (HTPGM): A multi-layer latent-variable architecture akin to a hierarchical variational autoencoder (VAE), designed to capture both stochasticity and non-stationary structure across multiple temporal scales.
- Transformer Encoder with MLP Forecaster: Processes the stationarized input sequence, with self-attention representations augmented by a scalar-weighted interpolation of the latent summaries from HTPGM. This fusion injects non-stationary distributional cues into the deterministic transformer pathway.
For a given input series $x_{1:T} \in \mathbb{R}^{T \times d}$, the processing sequence is as follows:
- Series Stationarization: Apply sliding-window normalization to obtain $\hat{x}_{1:T}$.
- Transformer Encoding: Feed $\hat{x}_{1:T}$ to the transformer, with each attention input merged with an upsampled sum of latent states from HTPGM (modulated by a learnable scalar $\alpha$).
- Forecasting Head: Deterministic latent states $h_{1:T}$, generated by the transformer, are used by a simple MLP for prediction and act as part of the generative process.
- Hierarchical Inference and Generation: In parallel, an inference network over $\hat{x}_{1:T}$ produces variational posteriors for each latent tier; generation samples top-down through the hierarchy of priors, conditioned on both higher-level latents and $h_{1:T}$, to reconstruct the original series.
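To make the dataflow above concrete, here is a minimal numpy sketch of the four steps. The transformer encoder and MLP head are reduced to toy linear maps, and the weights, dimensions, fusion scale, and the latent summary are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def stationarize(x):
    """Per-window normalization: subtract mean, divide by std (per variable)."""
    mu, sigma = x.mean(axis=0), x.std(axis=0) + 1e-8
    return (x - mu) / sigma, (mu, sigma)

def encode(x_hat, latent_summary, alpha=0.1):
    """Stand-in for the transformer encoder (a single nonlinear mixing layer).
    The latent summary from the generative module is added, scaled by alpha."""
    h = x_hat + alpha * latent_summary           # fuse stochastic cues
    return np.tanh(h @ W_enc)                    # deterministic states h_{1:T}

def forecast(h, horizon):
    """MLP forecasting head, sketched as one linear map from the last state."""
    return np.tile(h[-1] @ W_out, (horizon, 1))

T, d, horizon = 96, 7, 24
W_enc = rng.normal(scale=0.1, size=(d, d))       # toy weights (assumptions)
W_out = rng.normal(scale=0.1, size=(d, d))

x = rng.normal(size=(T, d))                      # raw input window
x_hat, (mu, sigma) = stationarize(x)
z_summary = rng.normal(size=(T, d))              # would come from HTPGM
h = encode(x_hat, z_summary)
y_hat = forecast(h, horizon) * sigma + mu        # de-normalize forecast
```

The key point of the sketch is the ordering: normalization precedes encoding, the stochastic summary is fused before attention, and the statistics $(\mu, \sigma)$ are retained to restore the original scale at the output.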
2. Generative and Inference Framework
2.1 Hierarchical Generative Model
The generative pathway maintains $L$ layers of latent variables $z^1, \dots, z^L$, each potentially operating at its own temporal resolution $T_l$:
- Top Layer Prior: $p(z_t^L) = \mathcal{N}(0, I)$ (unconditional Gaussian).
- Conditional Priors: For $l = L-1, \dots, 1$:
$$p(z_t^l \mid z_t^{l+1}, h_t) = \mathcal{N}\!\big(\mu_\theta^l(z_t^{l+1}, h_t),\, \mathrm{diag}\,\sigma_\theta^l(z_t^{l+1}, h_t)^2\big)$$
- Observation Model: The lowest layer decodes to the original data:
$$p(x_t \mid z_t^1) = \mathcal{N}\!\big(g_\theta(z_t^1),\, \sigma_0^2 I\big)$$
where $g_\theta$ is a small neural decoder; $\sigma_0^2$ is a fixed variance.
- Forecasting Term: An auxiliary prediction distribution $p(x_{T+1:T+\tau} \mid h_{1:T})$ is added to strengthen long-term predictions.
The full joint is:
$$p(x_{1:T+\tau}, z^{1:L}) = p(z^L)\,\prod_{l=1}^{L-1} p(z^l \mid z^{l+1}, h)\; p(x_{1:T} \mid z^1)\; p(x_{T+1:T+\tau} \mid h_{1:T})^{\lambda}$$
where $\lambda$ trades off direct reconstruction and forecasting fidelity.
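Ancestral (top-down) sampling through this hierarchy can be sketched in a few lines. The linear prior networks, dimensions, and fixed scale below are illustrative assumptions standing in for the learned conditional-prior networks:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_top_down(h, n_layers=3, dim=4):
    """Ancestral sampling through the hierarchy of priors.
    Each layer's prior mean/scale are toy linear functions of (z_upper, h);
    the real model uses learned networks (an illustrative assumption here)."""
    z = rng.normal(size=dim)                     # top layer: z^L ~ N(0, I)
    for _ in range(n_layers - 1):
        mu = 0.5 * z + 0.1 * h[:dim]             # conditional prior mean
        sigma = np.full(dim, 0.3)                # conditional prior scale
        z = mu + sigma * rng.normal(size=dim)    # z^l ~ N(mu, diag(sigma^2))
    return z                                     # z^1 feeds the decoder

h = np.zeros(4)                                  # deterministic state (stub)
z1 = sample_top_down(h)
x_recon_mean = np.tanh(z1)                       # stand-in for decoder g(z^1)
```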
2.2 Variational Inference
The recognition (inference) model uses factorized Gaussians for each latent variable, conditioned only on the normalized windows:
$$q(z_t^l \mid \hat{x}_{1:T}) = \mathcal{N}\!\big(\mu_\phi^l(\hat{x}_{1:T}),\, \mathrm{diag}\,\sigma_\phi^l(\hat{x}_{1:T})^2\big)$$
Latents are sampled via the standard reparameterization trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
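The reparameterization trick itself is a one-liner; a minimal sketch (numpy stands in for an autodiff framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps, so the
    sample stays differentiable w.r.t. (mu, log_var) in an autodiff setting."""
    eps = rng.normal(size=mu.shape)              # noise is the only stochastic input
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.0, 1.0])
log_var = np.array([0.0, -2.0])
z = reparameterize(mu, log_var)
```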
2.3 Evidence Lower Bound (ELBO)
The objective maximizes a multi-term ELBO:
$$\mathcal{L} = \mathbb{E}_{q(z^{1:L} \mid \hat{x})}\!\big[\log p(x_{1:T} \mid z^1) + \lambda \log p(x_{T+1:T+\tau} \mid h_{1:T})\big] - \sum_{l=1}^{L} \mathrm{KL}\!\big(q(z^l \mid \hat{x}) \,\|\, p(z^l \mid z^{l+1}, h)\big)$$
with the convention that the top-layer prior $p(z^L \mid \cdot)$ is the unconditional $\mathcal{N}(0, I)$.
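Since both the posteriors and the priors are diagonal Gaussians, the KL terms in the ELBO have a closed form. A minimal sketch of that computation (function name and parameterization by log-variance are my own):

```python
import numpy as np

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag e^{logvar_q}) || N(mu_p, diag e^{logvar_p}) ),
    summed over dimensions, using the closed form for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

# KL between identical Gaussians is zero
mu = np.array([0.2, -0.1])
lv = np.array([0.0, -1.0])
kl_same = kl_diag_gauss(mu, lv, mu, lv)

# KL of N(1, 1) from N(0, 1) in one dimension is 0.5
kl_shift = kl_diag_gauss(np.array([1.0]), np.array([0.0]),
                         np.array([0.0]), np.array([0.0]))
```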
3. Mechanisms for Handling Non-Stationarity and Stochasticity
- Series Stationarization: Each input window is normalized via sliding-window mean and standard deviation, as in series-stationarization or RevIN. This stabilizes scale but removes local non-stationary structure from the initial transformer input.
- Hierarchical Latents at Multiple Scales: Latent variable hierarchy allows coarse-to-fine modeling: upper layers (with downsampled time resolution) capture slow, large-scale distributional drift; lower layers capture fast, local shifts. Top-down priors convey non-stationary trends to finer scales.
- Stochastic Generation: All latent variables are stochastic, supporting noise and multi-modality in time series distributions and enhancing expressivity.
- Transformer Fusion: Latents are upsampled to $T$ steps, summed, and interpolated to produce a summary matrix. This is added to the attention input, modulated by a learnable scalar $\alpha$, thus restoring non-stationary information to the transformer's attention mechanism.
4. Transformer Attention Implementation
The transformer encoder operates on the stationarized, embedded inputs:
$$H = \mathrm{Embed}(\hat{x}_{1:T}) + \alpha S$$
where $S$ is the interpolated, upsampled latent summary and $\alpha$ is learnable.
Self-attention is multi-headed:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right) V_i$$
with $Q_i, K_i, V_i$ = projections of $H$. Outputs are concatenated and projected through $W^O$.
The final encoder state is mapped to the prediction horizon via an MLP. The design does not use any explicit de-stationary attention; instead, the stochastic latent summary fulfills this function.
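A single head of the scaled dot-product attention above, applied to the fused input $H$, can be sketched as (toy dimensions and random projection weights are assumptions):

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One scaled dot-product attention head over the fused input H."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T) attention logits
    return softmax(scores) @ V                   # weighted sum of values

rng = np.random.default_rng(0)
T, d = 6, 4
H = rng.normal(size=(T, d))                      # Embed(x_hat) + alpha * S
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
out = self_attention(H, Wq, Wk, Wv)
```

In the multi-head case, several such heads run in parallel on different projections, and their outputs are concatenated and mapped through $W^O$.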
5. Training and Inference Procedures
- Objective: Jointly optimize the ELBO over all model parameters using the Adam optimizer with a conservative learning rate and early stopping based on validation performance.
- Regularization: Only the KL divergence terms in the ELBO act as regularizers; no additional weight decay or dropout is essential.
- Normalization: Predictions are de-normalized at the output stage to restore the original scale.
- Inference: At test time, each window is processed via the encoder, latents are sampled, passed through the transformer, and the forecast is de-normalized.
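The de-normalization step is a simple inversion of the window statistics; a minimal round-trip sketch (data and shapes are illustrative):

```python
import numpy as np

def denormalize(y_hat_norm, mu, sigma):
    """Undo the sliding-window statistics so forecasts return to the data scale."""
    return y_hat_norm * sigma + mu

x = np.array([[10.0], [12.0], [14.0]])           # toy raw window
mu, sigma = x.mean(axis=0), x.std(axis=0)        # stats saved at normalization
x_hat = (x - mu) / sigma                         # stationarized input
y_hat = denormalize(x_hat, mu, sigma)            # round-trip recovers the input
```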
6. Empirical Evaluation and Results
HTV-Trans was rigorously evaluated on multiple benchmarks:
| Dataset | Series Type | Prediction Horizons |
|---|---|---|
| ETTh1/ETTh2 | electricity transformer temp | {96,192,336,720} |
| ETTm1/ETTm2 | electricity transformer temp | {96,192,336,720} |
| Weather | meteorological (hourly) | {96,192,336,720} |
| ILI | weekly flu incidence | {24,36,48,60} |
| Exchange | daily currency rates | {96,192,336,720} |
- Metrics: MSE and MAE (lower is better).
- Baselines: Informer, Autoformer, Fedformer, Pyraformer, Crossformer, Non-stationary Transformer.
Performance: HTV-Trans achieved the lowest MSE/MAE in the majority of settings, improving long-horizon (720-step) forecasts by 10–20% over the best transformer baseline. Ablation studies confirm that the dynamic prior in HTPGM, the auxiliary forecasting term, and a well-chosen weighting $\lambda$ are all critical. Using 3–5 hierarchical latent layers best balances expressivity against overfitting. Visualizations illustrate the model’s superior ability to capture spikes, trend shifts, and volatility.
7. Significance and Implications
Explicitly modeling both non-stationarity (through a hierarchical latent structure) and stochasticity (via variational sampling) confers advantages for MTS forecasting, especially for complex, long-range, or distribution-shifting sequences. HTV-Trans demonstrates that the inductive biases needed for robust multivariate time series forecasting extend beyond stationarization, requiring structured latent generative modeling integrated with state-of-the-art sequence models. These results establish a new empirical and methodological baseline for probabilistic time series forecasting under non-stationary conditions (Wang et al., 2024).