Decoder-Only Temporal Fusion Transformer
- Decoder-Only TFT is an interpretable, attention-based architecture that leverages autoregressive causal masked self-attention and static enrichment for multi-horizon forecasting.
- It streamlines traditional sequence models by removing LSTM encoder components, while retaining feature selection, gating, and diagnostic interpretability.
- The model integrates GRNs, GLUs, and a single interpretable multi-head self-attention layer to produce accurate multi-quantile predictions for real-world time-series data.
The Decoder-Only Temporal Fusion Transformer (TFT) is an interpretable, attention-based architecture for multi-horizon time-series forecasting, capable of handling mixtures of static covariates, known future, and observed past exogenous variables. The decoder-only configuration eliminates sequence-to-sequence or LSTM encoder components, resulting in a single-stack Transformer-style model that preserves TFT’s feature selection, gating, and interpretability mechanisms while relying exclusively on autoregressive, causal masked self-attention and position-based processing (Lim et al., 2019).
1. Architectural Specification
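The decoder-only TFT architecture processes inputs at each time step $t$ and relative position $n$ by aggregating information from temporal embeddings and static context vectors. Its primary sub-layers, executed sequentially for each position and then aggregated, are:
- Gated Skip over Local Processing: $\tilde{\phi}(t,n) = \mathrm{LayerNorm}\big(\tilde{\xi}_{t+n} + \mathrm{GLU}_{\tilde{\phi}}(\phi(t,n))\big)$
In the absence of an encoder, $\phi(t,n)$ is replaced by a learned or sinusoidal positional encoding or a lightweight convolution.
- Static Enrichment: $\theta(t,n) = \mathrm{GRN}_{\theta}\big(\tilde{\phi}(t,n), c_e\big)$
- Interpretable Multi-Head Self-Attention (causal masked): Computed by averaging attention matrices over heads and using a shared value projection: $\mathrm{InterpretableMultiHead}(Q,K,V) = \tilde{H} W_H$ with $\tilde{H} = \Big\{\frac{1}{m_H}\sum_{h=1}^{m_H} A\big(Q W_Q^{(h)}, K W_K^{(h)}\big)\Big\} V W_V$,
where $A(Q,K) = \mathrm{Softmax}\big(QK^{\top}/\sqrt{d_{\text{attn}}}\big)$ is evaluated under the causal mask. The layer is applied as $B(t) = \mathrm{InterpretableMultiHead}(\Theta(t), \Theta(t), \Theta(t))$, where $\Theta(t)$ stacks the enriched features $\theta(t,n)$ and $B(t)$ stacks the outputs $\beta(t,n)$.
- Attention-Gate Skip: $\delta(t,n) = \mathrm{LayerNorm}\big(\theta(t,n) + \mathrm{GLU}_{\delta}(\beta(t,n))\big)$
- Position-wise Feed-Forward + Residual: $\psi(t,n) = \mathrm{GRN}_{\psi}(\delta(t,n))$, followed by $\tilde{\psi}(t,n) = \mathrm{LayerNorm}\big(\tilde{\phi}(t,n) + \mathrm{GLU}_{\tilde{\psi}}(\psi(t,n))\big)$
- Quantile Output: For each quantile $q$, $\hat{y}(q,t,\tau) = W_q \tilde{\psi}(t,\tau) + b_q$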
All GLU and GRN modules include dropout prior to gating and a final layer normalization (Lim et al., 2019).
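The attention sub-layer above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `interpretable_causal_attention`, the weight shapes, and the random weights are all hypothetical, and dropout and batching are omitted. It shows the two properties the text describes: attention matrices are averaged over heads, and all heads share one value projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_causal_attention(theta, W_q, W_k, W_v, W_h):
    """Causal interpretable multi-head self-attention over one sequence.

    theta: (T, d_model) enriched features, oldest position first.
    W_q, W_k: (m_H, d_model, d_attn) per-head query/key projections.
    W_v: (d_model, d_v) value projection shared by all heads.
    W_h: (d_v, d_model) output projection.
    Returns the attended features (T, d_model) and the head-averaged
    attention matrix (T, T).
    """
    T = theta.shape[0]
    d_attn = W_q.shape[-1]
    # True above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    V = theta @ W_v  # one value projection for every head
    per_head = []
    for Wq_h, Wk_h in zip(W_q, W_k):
        scores = (theta @ Wq_h) @ (theta @ Wk_h).T / np.sqrt(d_attn)
        scores = np.where(future, -np.inf, scores)
        per_head.append(softmax(scores, axis=-1))
    A_mean = np.mean(per_head, axis=0)  # average attention over heads
    return A_mean @ V @ W_h, A_mean

# Toy usage with random (untrained) weights.
rng = np.random.default_rng(0)
T, d_model, d_attn, m_H = 6, 8, 4, 4
theta = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(m_H, d_model, d_attn))
W_k = rng.normal(size=(m_H, d_model, d_attn))
W_v = rng.normal(size=(d_model, d_model))
W_h = rng.normal(size=(d_model, d_model))
out, attn = interpretable_causal_attention(theta, W_q, W_k, W_v, W_h)
```

Because the heads share the value projection, the single matrix `attn` can be read directly as a per-position importance map, which is the property that makes the layer interpretable.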
2. Mathematical Building Blocks and Formulas
The feature selection, gating, and context integration in decoder-only TFT are accomplished via the following modules:
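| Component | Formula | Notes |
|---|---|---|
| Gated Linear Unit (GLU) | $\mathrm{GLU}_{\omega}(\gamma) = \sigma(W_{4,\omega}\gamma + b_{4,\omega}) \odot (W_{5,\omega}\gamma + b_{5,\omega})$ | Gate and transform paths |
| Gated Residual Network (GRN) | $\mathrm{GRN}_{\omega}(a,c) = \mathrm{LayerNorm}(a + \mathrm{GLU}_{\omega}(\eta_1))$, with $\eta_1 = W_{1,\omega}\eta_2 + b_{1,\omega}$ and $\eta_2 = \mathrm{ELU}(W_{2,\omega}a + W_{3,\omega}c + b_{2,\omega})$ | Context input $c$ optional |
| Variable Selection (per timestep) | $v_{\chi_t} = \mathrm{Softmax}(\mathrm{GRN}_{v_\chi}(\Xi_t, c_s))$, $\tilde{\xi}_t = \sum_{j} v_{\chi_t}^{(j)}\,\tilde{\xi}_t^{(j)}$ | Learned selection weights |
| Interpretable Multi-Head Attention | $\tilde{H} = \frac{1}{m_H}\sum_{h=1}^{m_H} A(QW_Q^{(h)}, KW_K^{(h)})\,V W_V$ | Heads averaged, $V$ shared |
| Quantile Output | $\hat{y}(q,t,\tau) = W_q\tilde{\psi}(t,\tau) + b_q$ | Predicts all required quantiles |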
Editor’s note: The table above summarizes the core component equations as given by Lim et al. (2019); only the component labels are editorial.
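The GLU and GRN building blocks are small enough to sketch directly. The following NumPy version is a simplified illustration under several assumptions: it uses a row-vector convention (`x @ W` rather than $Wx$), omits dropout and the learnable scale and shift of layer normalization, and relies on a hypothetical helper `init_params` whose keys merely mirror the $W_1 \dots W_5$ indices of the GLU and GRN definitions in Lim et al. (2019).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x):
    # ELU with alpha = 1; the minimum avoids overflow for large positive x.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def layer_norm(x, eps=1e-5):
    # Plain normalization over the feature axis (no learnable gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init_params(d, d_ctx, rng):
    """Hypothetical helper: small random weights for one GRN."""
    w = lambda m, n: rng.normal(scale=0.1, size=(m, n))
    return {
        "W1": w(d, d), "b1": np.zeros(d),
        "W2": w(d, d), "b2": np.zeros(d),
        "W3": w(d_ctx, d),                 # context projection (no bias)
        "W4": w(d, d), "b4": np.zeros(d),  # GLU gate path
        "W5": w(d, d), "b5": np.zeros(d),  # GLU transform path
    }

def glu(gamma, p):
    gate = sigmoid(gamma @ p["W4"] + p["b4"])
    return gate * (gamma @ p["W5"] + p["b5"])

def grn(a, p, c=None):
    pre = a @ p["W2"] + p["b2"]
    if c is not None:          # the context input is optional
        pre = pre + c @ p["W3"]
    eta2 = elu(pre)
    eta1 = eta2 @ p["W1"] + p["b1"]
    return layer_norm(a + glu(eta1, p))  # gated residual connection

rng = np.random.default_rng(0)
d, d_ctx, T = 8, 8, 5
params = init_params(d, d_ctx, rng)
a = rng.normal(size=(T, d))
c_e = rng.normal(size=(d_ctx,))
enriched = grn(a, params, c=c_e)

# Driving the gate bias strongly negative closes the gate, so the GRN
# collapses to a normalized skip connection.
closed = dict(params, b4=np.full(d, -50.0))
skipped = grn(a, closed, c=c_e)
```

The last two lines illustrate why the gate is useful as a diagnostic: when the gate saturates near zero, the block passes its input through essentially unchanged.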
3. Input Representation and Data Flow
Decoder-only TFT supports three major input classes:
- Static Covariates ($s$): Each static feature (categorical via embeddings; continuous via linear projection) is mapped to $\xi^{(j)} \in \mathbb{R}^{d_{\text{model}}}$, forwarded through a static variable-selection GRN to yield $\zeta$, and passed through four GRNs to form context vectors $c_s, c_e, c_c, c_h$.
- Past-Observed Features ($z_t$), Known-Future Inputs ($x_t$): For each time position $t$, all features are embedded, variable selection is conducted by a GRN with context $c_s$ to produce selection weights $v_{\chi_t}$, leading to $\tilde{\xi}_t$ via a weighted sum.
- Temporal Embeddings: In pure decoder settings, these may derive from learned or sinusoidal positional encoding or from a 1×1 convolution (in place of the LSTM).
The processing flow follows the route: Covariate encoders and variable selection → local processing (positional) → static enrichment → interpretable self-attention → attention gating → position-wise GRN and residual → linear layer for multi-quantile forecasts (Lim et al., 2019).
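The variable-selection step at the start of this flow can be illustrated as follows. This is a deliberately reduced sketch: a single linear map stands in for the selection GRN, the per-variable GRNs are replaced by the identity, and the names `select_variables`, `W_sel`, and `W_ctx` are assumptions introduced only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_variables(xi, c_s, W_sel, W_ctx):
    """Context-conditioned variable selection for one sequence.

    xi:    (T, n_vars, d_model) embedded features per time position.
    c_s:   (d_ctx,) static context vector.
    W_sel: (n_vars * d_model, n_vars) stand-in for the selection GRN.
    W_ctx: (d_ctx, n_vars) projection of the static context.
    Returns the selected features (T, d_model) and the weights (T, n_vars).
    """
    T, n_vars, d_model = xi.shape
    flat = xi.reshape(T, n_vars * d_model)   # flattened input per position
    logits = flat @ W_sel + c_s @ W_ctx      # context shifts every position
    v = softmax(logits, axis=-1)             # selection weights sum to 1
    xi_tilde = np.einsum("tj,tjd->td", v, xi)  # weighted sum over variables
    return xi_tilde, v

rng = np.random.default_rng(0)
T, n_vars, d_model, d_ctx = 5, 3, 4, 3
xi = rng.normal(size=(T, n_vars, d_model))
c_s = np.ones(d_ctx)
W_sel = rng.normal(scale=0.1, size=(n_vars * d_model, n_vars))
W_ctx = rng.normal(scale=0.1, size=(d_ctx, n_vars))
xi_tilde, v = select_variables(xi, c_s, W_sel, W_ctx)
```

The weights `v` are the quantities later inspected for feature-level interpretability; each row is a distribution over the input variables at one time position.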
4. Pure Decoder Instantiation: Modifications and Interpretability
To realize the decoder-only TFT, all LSTM encoder and decoder structures are omitted. Local processing is replaced with either Transformer-style positional encodings or lightweight convolutions. The interpretability machinery is fully retained: variable selection weights (for time-varying and static features), attention distributions from the multi-head block, and static context vectors. The resulting architecture is interpretable at both feature and temporal levels while matching the core operational logic of the original TFT (Lim et al., 2019).
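One of the replacements named above, a sinusoidal positional encoding, might look like the sketch below. It assumes the standard Transformer sine/cosine formulation and simply adds the encoding to the selected features; the function names `sinusoidal_encoding` and `local_processing` are hypothetical, and a learned encoding or a convolution would be equally valid choices under the description above.

```python
import numpy as np

def sinusoidal_encoding(T, d_model):
    """Standard sine/cosine positional encoding of shape (T, d_model)."""
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)
    # Slice so that odd d_model (one fewer cosine column) also works.
    pe[:, 1::2] = np.cos(angle[:, : d_model // 2])
    return pe

def local_processing(xi_tilde):
    """Assumed stand-in for the LSTM: inject position information only."""
    T, d_model = xi_tilde.shape
    return xi_tilde + sinusoidal_encoding(T, d_model)

xi_tilde = np.zeros((4, 6))       # dummy selected features
phi = local_processing(xi_tilde)  # here equal to the encoding itself
```

Unlike an LSTM, this stand-in mixes no information across neighboring positions, so in this configuration all temporal interaction is left to the self-attention layer.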
A single interpretable multi-head self-attention layer is retained in the decoder. This configuration enables direct interpretability hooks, such as:
- Variable selection weights per timestep and feature
- Attention patterns over the prediction horizon
- Gating diagnostics via GRNs and GLUs
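The first two hooks can be turned into simple summaries once the selection weights and the head-averaged attention matrix have been extracted from a model. The sketch below uses small hand-written arrays in place of real model outputs; the feature names and the helper functions `feature_importance` and `temporal_importance` are illustrative assumptions, and averaging is only one of several reasonable ways to aggregate.

```python
import numpy as np

def feature_importance(v, names):
    """Rank features by their mean selection weight across time positions."""
    means = v.mean(axis=0)
    return sorted(zip(names, means.tolist()), key=lambda kv: -kv[1])

def temporal_importance(attn, n_horizon):
    """Average attention that the last n_horizon positions pay to each position."""
    return attn[-n_horizon:].mean(axis=0)

# Selection weights for 2 time positions and 3 hypothetical features.
v = np.array([[0.7, 0.2, 0.1],
              [0.5, 0.3, 0.2]])
ranked = feature_importance(v, ["price", "promo", "weekday"])

# Head-averaged causal attention over 3 positions (rows sum to 1).
attn = np.array([[1.0, 0.0, 0.0],
                 [0.5, 0.5, 0.0],
                 [0.2, 0.3, 0.5]])
profile = temporal_importance(attn, n_horizon=2)
```

Here `ranked` orders the features from most to least selected, and `profile` gives one attention score per position, averaged over the two forecast rows.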
5. Recommended Hyperparameters
Lim et al. (2019) report the following hyperparameter values from their empirical studies:
| Hyperparameter | Typical Value(s)/Range |
|---|---|
| State size | $160$–$320$ |
| #Attention heads | $4$ (or $1$) |
| Dropout rate | $0.1$–$0.3$ |
| Minibatch size | $64$–$128$ |
| Learning rate | $10^{-3}$ ($10^{-2}$ for small/noisy data) |
| Max gradient norm | $0.01$–$100$ |
These values are derived from the experiments of Lim et al. (2019), where only a single interpretable multi-head attention layer is typically used in the decoder. For decoder-only settings, the GRNs and skip connections are regularized mainly through the dropout rate, which may be tuned accordingly.
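For experimentation, these settings can be collected in a small configuration object. The sketch below is only one way to organize them: the class name `DecoderOnlyTFTConfig` and the method `out_of_range` are hypothetical, and the defaults for `d_model` and `learning_rate` are illustrative assumptions rather than recommendations. The range checks cover only the heads, dropout, minibatch size, and gradient-norm ranges listed in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderOnlyTFTConfig:
    d_model: int = 160            # state size (illustrative assumption)
    num_heads: int = 4            # 4, or 1 for a single head
    dropout: float = 0.1          # typical range 0.1 to 0.3
    batch_size: int = 64          # typical range 64 to 128
    learning_rate: float = 1e-3   # illustrative assumption
    max_grad_norm: float = 1.0    # typical range 0.01 to 100

    def out_of_range(self):
        """Return the names of fields outside the typical ranges above."""
        flagged = []
        if self.num_heads not in (1, 4):
            flagged.append("num_heads")
        if not 0.1 <= self.dropout <= 0.3:
            flagged.append("dropout")
        if not 64 <= self.batch_size <= 128:
            flagged.append("batch_size")
        if not 0.01 <= self.max_grad_norm <= 100.0:
            flagged.append("max_grad_norm")
        return flagged

default_cfg = DecoderOnlyTFTConfig()
aggressive_cfg = DecoderOnlyTFTConfig(dropout=0.5, batch_size=256)
```

Returning a list of flagged fields instead of raising an error keeps the typical ranges advisory, since values outside them may still be reasonable for a given dataset.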
6. Significance and Use Cases
Decoder-only TFT preserves the interpretability and performance characteristics of the original architecture for multi-horizon forecasting applications across diverse real-world datasets. Its modular design, which combines static context, variable selection, and attention, yields both accurate quantile forecasts and explanatory attributions, so temporal and feature dependencies need not be treated as a black box. Removing the encoder streamlines deployment while retaining all interpretable diagnostics for model auditing and operational insight (Lim et al., 2019). A plausible implication is enhanced applicability in settings where only autoregressive or forward-facing architectures are feasible.
7. Comparison with Original TFT and Related Architectures
While the original TFT employed LSTM layers to capture local and temporal dependencies, the decoder-only variant relies purely on attention and position-based mechanisms. This brings the architecture structurally closer to Transformer-style decoders, yet it remains distinct through its specialized variable selection, static enrichment, and explicit focus on multi-quantile regression output. The hybrid retains the full range of TFT interpretability and dynamic feature selection, unlike standard sequence decoders, which lack these gates and selection layers (Lim et al., 2019). This suggests that decoder-only TFT can be viewed as a bridge between interpretable, gated attention-based models and streamlined Transformer decoders.