Decoder-Only Temporal Fusion Transformer

Updated 26 February 2026

Decoder-Only TFT is an interpretable, attention-based architecture that combines autoregressive, causally masked self-attention with static enrichment for multi-horizon forecasting.
It streamlines traditional sequence models by removing LSTM encoder components, while retaining feature selection, gating, and diagnostic interpretability.
The model integrates GRNs, GLUs, and a single interpretable multi-head self-attention layer to produce accurate multi-quantile predictions for real-world time-series data.

The Decoder-Only Temporal Fusion Transformer (TFT) is an interpretable, attention-based architecture for multi-horizon time-series forecasting that handles mixtures of static covariates, known future inputs, and past-observed exogenous variables. The decoder-only configuration drops the sequence-to-sequence LSTM encoder, yielding a single-stack Transformer-style model that preserves TFT’s feature selection, gating, and interpretability mechanisms while relying exclusively on autoregressive, causally masked self-attention and position-based processing (Lim et al., 2019).

1. Architectural Specification

The decoder-only TFT architecture processes inputs at each time step $t$ and relative position $n \in \{-k, \ldots, \tau_{max}\}$ by aggregating information from temporal embeddings and static context vectors. Its primary sub-layers, executed sequentially for each position and then aggregated, are:

Gated Skip over Local Processing:

$\phi(t,n) \leftarrow \mathrm{LN}[\phi_{input}(t,n) + \mathrm{GLU}_{\tilde{\phi}}(\phi_{LSTM}(t,n))]$

In the absence of an encoder, $\phi_{LSTM}(t,n)$ is replaced by a learned or sinusoidal positional encoding or a lightweight $1 \times 1$ convolution.

Static Enrichment:

$\theta(t,n) = \mathrm{GRN}_\theta(\phi(t,n), c_e)$

Interpretable Multi-Head Self-Attention (causal masked): Computed by averaging attention matrices over heads and using a shared value projection:

$B(t) = \mathrm{InterpretableMultiHead}(\Theta(t), \Theta(t), \Theta(t))$

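where $\Theta(t) = [\theta(t,-k); \ldots; \theta(t,\tau_{max})]$ stacks the enriched features and $B(t) = [\beta(t,-k); \ldots; \beta(t,\tau_{max})]$ stacks the per-position attention outputs $\beta(t,n)$.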

Attention-Gate Skip:

$\delta(t,n) \leftarrow \mathrm{LN}[\theta(t,n) + \mathrm{GLU}_\delta(\beta(t,n))]$

Position-wise Feed-Forward + Residual:

$\tilde{\psi}(t,n) = \mathrm{GRN}_\psi(\delta(t,n))$

$\psi(t,n) \leftarrow \mathrm{LN}[\phi(t,n) + \mathrm{GLU}_{\tilde{\psi}}(\tilde{\psi}(t,n))]$

Quantile Output: For each quantile $q \in Q$ and horizon $\tau \in \{1, \ldots, \tau_{max}\}$,

$\hat{y}(q, t, \tau) = W_q \psi(t, \tau) + b_q$

Each GRN and gated skip connection applies dropout before its gating layer and layer normalization after the residual sum (Lim et al., 2019).
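
The causal, head-averaged attention step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name, the weight shapes, the random weights, and the row-vector convention (`x @ W` rather than $Wx$) are all choices made for the example. Each head has its own query and key projections, the value projection is shared, and the per-head attention matrices are averaged before being applied.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_causal_attention(theta, W_q, W_k, W_v, W_h):
    """Hypothetical helper.
    theta: (L, d_model) stacked theta(t, n) for n = -k..tau_max
    W_q, W_k: (H, d_model, d_attn) per-head projections
    W_v: (d_model, d_attn) value projection shared by all heads
    W_h: (d_attn, d_model) output projection
    """
    L = theta.shape[0]
    d_attn = W_q.shape[-1]
    # Additive causal mask: position i may attend only to positions j <= i.
    idx = np.arange(L)
    mask = np.where(idx[None, :] > idx[:, None], -np.inf, 0.0)
    q = np.einsum("ld,hda->hla", theta, W_q)               # (H, L, d_attn)
    k = np.einsum("ld,hda->hla", theta, W_k)               # (H, L, d_attn)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_attn) + mask
    a_bar = softmax(scores, axis=-1).mean(axis=0)          # average over heads
    b = (a_bar @ (theta @ W_v)) @ W_h                      # (L, d_model)
    return b, a_bar

# Example with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
L, d_model, d_attn, H = 6, 8, 4, 2
theta = rng.normal(size=(L, d_model))
W_q = rng.normal(size=(H, d_model, d_attn))
W_k = rng.normal(size=(H, d_model, d_attn))
W_v = rng.normal(size=(d_model, d_attn))
W_h = rng.normal(size=(d_attn, d_model))
B, A_bar = interpretable_causal_attention(theta, W_q, W_k, W_v, W_h)
```

Because the value projection is shared, the single matrix `A_bar` summarizes how much each position draws on every earlier position, which is what makes the attention weights directly inspectable.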

2. Mathematical Building Blocks and Formulas

The feature selection, gating, and context integration in decoder-only TFT are accomplished via the following modules:

Component	Formula	Notes
Gated Linear Unit (GLU)	$g(\gamma) = \sigma(W_4\gamma + b_4),\;\;u(\gamma) = W_5\gamma + b_5,\;\;\mathrm{GLU}(\gamma) = g(\gamma) \odot u(\gamma)$	Gate and transform paths
Gated Residual Network (GRN)	$\eta_2 = \mathrm{ELU}(W_2 a + W_3 c + b_2), \;\eta_1 = W_1\eta_2 + b_1,\; x = \mathrm{Drop}(\eta_1),\; y = \mathrm{GLU}(x),\; \mathrm{GRN}(a,c) = \mathrm{LN}(a + y)$	Context input optional
Variable Selection (per timestep)	$v_t = \mathrm{Softmax}(\mathrm{GRN}_{vs}(\Xi_t, c_s)),\; \tilde{\xi}_t^{(j)} = \mathrm{GRN}_{\xi(j)}(\xi_t^{(j)}),\; \phi_{input}(t) = \sum_{j=1}^{m_\chi} v_t[j] \cdot \tilde{\xi}_t^{(j)}$	Learned selection weights
Interpretable Multi-Head Attention	$A_h=\mathrm{Softmax}((Q_hK_h^{\mathrm{T}})/\sqrt{d_{attn}}+ \mathrm{Mask}),\; \bar{A} = (1/m_H)\sum_{h=1}^{m_H} A_h,\; B_{out} = \bar{A}(\Theta W_V),\; \mathrm{MultiHead}(\Theta) = B_{out}W_H$	Heads averaged over $m_H$ heads, V shared
Quantile Output	$\hat{y}(q, t, \tau) = W_q \psi(t,\tau) + b_q$	Predicts all required quantiles

Editor’s note: The table above collects the core component equations from the source; only the row labels have been added.
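
The GLU and GRN rows of the table can be sketched in NumPy as follows. This is a minimal illustration, assuming square $d \times d$ weights, a row-vector convention (`x @ W`), layer normalization without learned gain or bias, and inference-time behavior, where $\mathrm{Drop}(\cdot)$ is the identity. The helper names and the parameter dictionary are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x):
    # ELU with alpha = 1; np.minimum keeps exp() from overflowing for large x.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def layer_norm(x, eps=1e-5):
    # Normalizes over the last axis; learned scale/shift omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def glu(gamma, W4, b4, W5, b5):
    # Gate path (sigmoid) multiplies the linear transform path element-wise.
    return sigmoid(gamma @ W4 + b4) * (gamma @ W5 + b5)

def grn(a, p, c=None):
    """Gated residual network; c is the optional context vector."""
    pre = a @ p["W2"] + p["b2"]
    if c is not None:
        pre = pre + c @ p["W3"]
    eta2 = elu(pre)
    eta1 = eta2 @ p["W1"] + p["b1"]
    # Dropout would be applied to eta1 here during training.
    y = glu(eta1, p["W4"], p["b4"], p["W5"], p["b5"])
    return layer_norm(a + y)

def init_params(d, rng):
    # Hypothetical initializer: small random weights, zero biases.
    p = {f"W{i}": rng.normal(scale=0.1, size=(d, d)) for i in range(1, 6)}
    p.update({f"b{i}": np.zeros(d) for i in (1, 2, 4, 5)})
    return p

rng = np.random.default_rng(0)
d = 8
params = init_params(d, rng)
a = rng.normal(size=(5, d))   # five positions
c = rng.normal(size=d)        # one static context vector, broadcast over positions
out = grn(a, params, c)
```

When the gate saturates near zero, `y` vanishes and the GRN reduces to `layer_norm(a)`, so the nonlinear path is effectively skipped.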

3. Input Representation and Data Flow

Decoder-only TFT supports three major input classes:

Static Covariates ($s_i$): Each static feature (categorical via embeddings; continuous via linear projection) is mapped to $\mathbb{R}^{d_{model}}$, forwarded through a static variable-selection GRN to yield $\zeta$, and passed through four GRNs to form context vectors $\{c_s, c_h, c_c, c_e\}$.
Past-Observed Features ($z_{t-k:t}$) and Known-Future Inputs ($x_{t-k:t+\tau_{max}}$): For each time position $n$, all features $j$ are embedded, and a variable-selection GRN with context $c_s$ produces selection weights $v_n$, which yield $\phi_{input}(t,n)$ as a weighted sum.
Temporal Embeddings: In pure decoder settings, these may derive from a learned or sinusoidal positional encoding or from a 1×1 convolution (in place of the LSTM).
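
The variable-selection step for a single position can be sketched as follows. To keep the example short, plain linear maps stand in for $\mathrm{GRN}_{vs}$ and for the per-feature $\mathrm{GRN}_{\xi(j)}$. That substitution, the function name, and all weight shapes are assumptions of this sketch. What it preserves is the structure: a softmax over features conditioned on $c_s$, followed by a weighted sum of transformed feature embeddings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def variable_selection(xi, c_s, W_flat, W_ctx, W_feat):
    """Hypothetical helper for one time position.
    xi:     (m, d) embedded features xi^(j)
    c_s:    (d,)   static context vector
    W_flat: (m*d, m) and W_ctx: (d, m): linear stand-in for GRN_vs
    W_feat: (m, d, d): per-feature linear stand-ins for GRN_xi(j)
    """
    logits = xi.reshape(-1) @ W_flat + c_s @ W_ctx            # (m,)
    v = softmax(logits)                                       # selection weights
    xi_tilde = np.tanh(np.einsum("jd,jde->je", xi, W_feat))   # (m, d)
    phi_input = (v[:, None] * xi_tilde).sum(axis=0)           # (d,)
    return phi_input, v

rng = np.random.default_rng(0)
m, d = 3, 4
xi = rng.normal(size=(m, d))
c_s = rng.normal(size=d)
phi_input, v = variable_selection(
    xi, c_s,
    rng.normal(size=(m * d, m)),
    rng.normal(size=(d, m)),
    rng.normal(size=(m, d, d)),
)
```

The vector `v` is the quantity later read off as per-feature importance at that position.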

The processing flow follows the route: Covariate encoders and variable selection → local processing (positional) → static enrichment → interpretable self-attention → attention gating → position-wise GRN and residual → linear layer for multi-quantile forecasts (Lim et al., 2019).
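
The final step of this route, the linear layer for multi-quantile forecasts, is small enough to show in full. The sketch below assumes the rows of `psi` are the decoder outputs $\psi(t,\tau)$ for the horizon positions only, with one weight row and one bias per quantile. The quantile levels, sizes, and names are illustrative.

```python
import numpy as np

def quantile_head(psi, W, b):
    """psi: (tau_max, d_model); W: (n_quantiles, d_model); b: (n_quantiles,).
    Returns forecasts of shape (tau_max, n_quantiles)."""
    return psi @ W.T + b

quantiles = (0.1, 0.5, 0.9)      # illustrative choice of Q
rng = np.random.default_rng(0)
tau_max, d_model = 4, 8
psi = rng.normal(size=(tau_max, d_model))
W = rng.normal(size=(len(quantiles), d_model))
b = np.zeros(len(quantiles))
y_hat = quantile_head(psi, W, b)  # y_hat[tau - 1, i] is the forecast for quantiles[i]
```

Each quantile has its own independent linear map, so nothing in this formula by itself forces the predicted quantiles to be ordered.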

4. Pure Decoder Instantiation: Modifications and Interpretability

To realize the decoder-only TFT, all LSTM encoder and decoder structures are omitted, and local processing is replaced with either Transformer-style positional encodings or lightweight convolutions. The interpretability machinery (variable selection weights for time-varying and static features, attention distributions from the multi-head block, and static context vectors) is fully retained. The resulting architecture is interpretable at both the feature and temporal levels and follows the core operational logic of the original TFT (Lim et al., 2019).
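
One of the replacements mentioned above, a sinusoidal positional encoding, can be sketched over the relative positions $n = -k, \ldots, \tau_{max}$. This follows the standard Transformer sine/cosine formulation and assumes an even $d_{model}$; the function name is hypothetical, and indexing by relative position rather than absolute position is a choice made for this example.

```python
import numpy as np

def sinusoidal_encoding(k, tau_max, d_model):
    """Returns an array of shape (k + tau_max + 1, d_model).
    Row index n + k holds the encoding of relative position n."""
    n = np.arange(-k, tau_max + 1)[:, None]      # (L, 1) relative positions
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2) even channel indices
    angles = n / np.power(10000.0, i / d_model)
    pe = np.zeros((n.shape[0], d_model))
    pe[:, 0::2] = np.sin(angles)                 # even channels
    pe[:, 1::2] = np.cos(angles)                 # odd channels
    return pe

pe = sinusoidal_encoding(k=3, tau_max=2, d_model=8)
```

In the gated skip over local processing, an array like `pe` would take the place of $\phi_{LSTM}(t,n)$ before the GLU and layer normalization are applied.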

A single interpretable multi-head self-attention layer is retained in the decoder. This configuration enables direct interpretability hooks, such as:

Variable selection weights per timestep and feature
Attention patterns over the prediction horizon
Gating diagnostics via GRNs and GLUs
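
The first two hooks can be turned into simple summaries once the weights have been collected from a forward pass. The sketch below assumes the selection weights are stored as an array of shape (samples, positions, features) and the head-averaged attention as an $(L, L)$ matrix. The function names, feature names, and synthetic data are hypothetical.

```python
import numpy as np

def feature_importance(selection_weights, names):
    """Average selection weights over samples and positions, then rank features.
    selection_weights: (n_samples, n_positions, n_features), each row sums to 1."""
    mean_w = selection_weights.mean(axis=(0, 1))
    order = np.argsort(-mean_w)
    return [(names[j], float(mean_w[j])) for j in order]

def attention_profile(a_bar, k, n=1):
    """Attention weights that relative position n assigns to positions -k..tau_max.
    a_bar: (L, L) head-averaged attention; row n + k corresponds to position n."""
    return a_bar[n + k]

# Synthetic weights standing in for values recorded from a trained model.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(3), size=(10, 5))      # (10, 5, 3)
ranking = feature_importance(weights, ["price", "promo", "holiday"])
```

Averaging is only one possible aggregation; percentiles of the same weights would show how stable a feature's importance is across samples.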

5. Recommended Hyperparameters

Lim et al. (2019) report the following hyperparameter choices from their empirical studies:

Hyperparameter	Typical Value(s)/Range
State size $d_{model}$	$160,\;240,\;320$
#Attention heads $m_H$	$1$ or $4$
Dropout rate	$0.1$–$0.3$
Minibatch size	$64$–$128$
Learning rate	$10^{-3}$ ( $10^{-2}$ for small/noisy data)
Max gradient norm	$0.01$–$100$

These values are derived from experiments, in which only a single interpretable multi-head attention layer is typically used in the decoder. For decoder-only settings, the regularization of the GRNs and gated skip connections may be tuned via the dropout rate (Lim et al., 2019).
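
The table can be captured as a small search space for random hyperparameter search. This is a sketch only: the class and function names are hypothetical, the defaults are arbitrary picks from within the table, and the intermediate grid points (dropout $0.2$, gradient norm $1.0$) are assumptions added to fill the stated ranges.

```python
import random
from dataclasses import dataclass

# Candidate values taken from the table; intermediate points are assumptions.
SEARCH_SPACE = {
    "d_model": [160, 240, 320],
    "num_heads": [1, 4],
    "dropout": [0.1, 0.2, 0.3],
    "batch_size": [64, 128],
    "learning_rate": [1e-3, 1e-2],
    "max_grad_norm": [0.01, 1.0, 100.0],
}

@dataclass(frozen=True)
class DecoderTFTConfig:
    d_model: int = 160
    num_heads: int = 4
    dropout: float = 0.1
    batch_size: int = 64
    learning_rate: float = 1e-3
    max_grad_norm: float = 1.0

def sample_config(rng: random.Random) -> DecoderTFTConfig:
    """Draw one configuration uniformly from SEARCH_SPACE."""
    return DecoderTFTConfig(
        **{name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    )

config = sample_config(random.Random(0))
```

Every listed `d_model` is divisible by both head counts, so any sampled combination gives an integer per-head width.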

6. Significance and Use Cases

Decoder-only TFT preserves the interpretability and performance characteristics of the original architecture for multi-horizon forecasting across diverse real-world datasets. Because static context, variable selection, and attention are separate modules, the model can produce accurate quantile forecasts together with explanatory attributions, rather than treating temporal or feature dependencies as a black box. Removing the encoder streamlines deployment while retaining the interpretable diagnostics used for model auditing and operational insight (Lim et al., 2019). A plausible implication is enhanced applicability in settings where only autoregressive or forward-facing architectures are feasible.

7. Comparison with Original TFT and Related Architectures

While the original TFT employed LSTM layers to capture local and temporal dependencies, the decoder-only variant transitions to pure attention and position-based mechanisms. This configuration brings the architecture structurally closer to Transformer-style decoders, but it remains distinct through its specialized variable selection, static enrichment, and explicit focus on multi-quantile regression output. Such hybridization supports the full range of TFT interpretability and dynamic feature selection, unlike standard sequence decoders, which lack these gates and selection layers (Lim et al., 2019). This suggests that decoder-only TFT can be viewed as a bridge between interpretable, gated attention-based models and streamlined Transformer decoders.


References

Lim et al. (2019). Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting.
