
Decoder-Only Temporal Fusion Transformer

Updated 26 February 2026
  • Decoder-Only TFT is an interpretable, attention-based architecture that leverages autoregressive causal masked self-attention and static enrichment for multi-horizon forecasting.
  • It streamlines traditional sequence models by removing LSTM encoder components, while retaining feature selection, gating, and diagnostic interpretability.
  • The model integrates GRNs, GLUs, and a single interpretable multi-head self-attention layer to produce accurate multi-quantile predictions for real-world time-series data.

The Decoder-Only Temporal Fusion Transformer (TFT) is an interpretable, attention-based architecture for multi-horizon time-series forecasting, capable of handling mixtures of static covariates, known future, and observed past exogenous variables. The decoder-only configuration eliminates sequence-to-sequence or LSTM encoder components, resulting in a single-stack Transformer-style model that preserves TFT’s feature selection, gating, and interpretability mechanisms while relying exclusively on autoregressive, causal masked self-attention and position-based processing (Lim et al., 2019).

1. Architectural Specification

The decoder-only TFT architecture processes inputs at each time step $t$ and relative position $n \in \{-k, \ldots, \tau_{max}\}$ by aggregating information from temporal embeddings and static context vectors. Its primary sub-layers, executed sequentially for each position and then aggregated, are:

  1. Gated Skip over Local Processing:

$$\phi(t,n) \leftarrow \mathrm{LN}[\phi_{input}(t,n) + \mathrm{GLU}_{\tilde{\phi}}(\phi_{LSTM}(t,n))]$$

In the absence of an encoder, $\phi_{LSTM}(t,n)$ is replaced by a learned or sinusoidal positional encoding or a lightweight $1 \times 1$ convolution.

  2. Static Enrichment:

$$\theta(t,n) = \mathrm{GRN}_\theta(\phi(t,n), c_e)$$

  3. Interpretable Multi-Head Self-Attention (causal masked): Computed by averaging attention matrices over heads and using a shared value projection:

$$B(t) = \mathrm{InterpretableMultiHead}(\Theta(t), \Theta(t), \Theta(t))$$

where $\Theta(t) = [\theta(t,-k); \ldots; \theta(t,\tau_{max})]$ and $B(t) = [\beta(t,-k); \ldots; \beta(t,\tau_{max})]$, so that $\beta(t,n)$ is the attention output at position $n$.

  4. Attention-Gate Skip:

$$\delta(t,n) \leftarrow \mathrm{LN}[\theta(t,n) + \mathrm{GLU}_\delta(\beta(t,n))]$$

  5. Position-wise Feed-Forward + Residual:

$$\tilde{\psi}(t,n) = \mathrm{GRN}_\psi(\delta(t,n))$$

$$\psi(t,n) \leftarrow \mathrm{LN}[\phi(t,n) + \mathrm{GLU}_{\tilde{\psi}}(\tilde{\psi}(t,n))]$$

  6. Quantile Output: For each quantile $q \in Q$,

$$\hat{y}(q, t, \tau) = W_q \psi(t, \tau) + b_q$$

Dropout is applied before each gating (GLU) layer, and every GRN and gated skip connection ends with layer normalization (Lim et al., 2019).
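The first sub-layer can be sketched in NumPy as follows. This is a minimal illustration, assuming a sinusoidal positional encoding is used as the stand-in for $\phi_{LSTM}(t,n)$; the function names, the parameter shapes, and the layer normalization without learnable scale or shift are illustrative assumptions, not details taken from Lim et al.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """Sinusoidal positional encoding of shape (num_positions, d_model)."""
    positions = np.arange(num_positions)[:, None]
    dims = np.arange(d_model)[None, :]
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    # Even feature indices use sine, odd indices use cosine.
    return np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable scale or shift (a simplification)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def glu(x, W_gate, b_gate, W_lin, b_lin):
    """GLU(x) = sigmoid(x W_gate + b_gate) * (x W_lin + b_lin)."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate + b_gate)))
    return gate * (x @ W_lin + b_lin)

def gated_local_skip(phi_input, glu_params):
    """Step 1 with a positional encoding standing in for phi_LSTM."""
    local = sinusoidal_encoding(*phi_input.shape)
    return layer_norm(phi_input + glu(local, *glu_params))

# Hypothetical sizes: k = 3 past steps, tau_max = 2 future steps, d_model = 8.
rng = np.random.default_rng(0)
k, tau_max, d_model = 3, 2, 8
num_positions = k + tau_max + 1  # n ranges over {-k, ..., tau_max}
phi_input = rng.normal(size=(num_positions, d_model))
glu_params = (rng.normal(size=(d_model, d_model)), np.zeros(d_model),
              rng.normal(size=(d_model, d_model)), np.zeros(d_model))
phi = gated_local_skip(phi_input, glu_params)
print(phi.shape)  # (6, 8)
```

With this substitution the gated branch depends only on position, so the GLU effectively learns how much positional signal to add to each selected input. A learned encoding or a $1 \times 1$ convolution would replace `sinusoidal_encoding` without changing the rest of the step.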

2. Mathematical Building Blocks and Formulas

The feature selection, gating, and context integration in decoder-only TFT are accomplished via the following modules:

| Component | Formula | Notes |
| --- | --- | --- |
| Gated Linear Unit (GLU) | $g(\gamma) = \sigma(W_4\gamma + b_4),\; u(\gamma) = W_5\gamma + b_5,\; \mathrm{GLU}(\gamma) = g(\gamma) \odot u(\gamma)$ | Gate and transform paths |
| Gated Residual Network (GRN) | $\eta_2 = \mathrm{ELU}(W_2 a + W_3 c + b_2),\; \eta_1 = W_1\eta_2 + b_1,\; x = \mathrm{Drop}(\eta_1),\; y = \mathrm{GLU}(x),\; \mathrm{GRN}(a,c) = \mathrm{LN}(a + y)$ | Context input optional |
| Variable Selection (per timestep) | $v_t = \mathrm{Softmax}(\mathrm{GRN}_{vs}(\Xi_t, c_s)),\; \tilde{\xi}_t^{(j)} = \mathrm{GRN}_{\xi(j)}(\xi_t^{(j)}),\; \phi_{input}(t) = \sum_{j=1}^{m_\chi} v_t[j] \cdot \tilde{\xi}_t^{(j)}$ | Learned selection weights |
| Interpretable Multi-Head Attention | $A_h = \mathrm{Softmax}((Q_h K_h^{\mathrm{T}})/\sqrt{d_{attn}} + \mathrm{Mask}),\; \bar{A} = (1/H)\sum_{h=1}^{H} A_h,\; B_{out} = \bar{A}(\Theta W_V),\; \mathrm{MultiHead}(\Theta) = B_{out} W_H$ | Heads averaged, V shared |
| Quantile Output | $\hat{y}(q, t, \tau) = W_q \psi(t,\tau) + b_q$ | Predicts all required quantiles |

Editor’s note: The table above reproduces the core component equations from the source; only the row labels have been added.
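The GLU and GRN definitions translate almost line by line into code. The sketch below is one possible NumPy rendering, assuming a row-vector convention (`a @ W` rather than $Wa$), random weight initialization, inverted dropout, and layer normalization without learnable parameters; the class name `GRN` and its constructor arguments are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x):
    # Clamp the exponent input so large positive values cannot overflow.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class GRN:
    """Gated Residual Network: LN(a + GLU(Drop(W1 ELU(W2 a + W3 c + b2) + b1)))."""

    def __init__(self, d_model, d_context, dropout=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        w = lambda rows, cols: self.rng.normal(0.0, 0.1, size=(rows, cols))
        self.W1, self.b1 = w(d_model, d_model), np.zeros(d_model)
        self.W2, self.b2 = w(d_model, d_model), np.zeros(d_model)
        self.W3 = w(d_context, d_model)                            # context projection
        self.W4, self.b4 = w(d_model, d_model), np.zeros(d_model)  # GLU gate path
        self.W5, self.b5 = w(d_model, d_model), np.zeros(d_model)  # GLU transform path
        self.dropout = dropout

    def glu(self, x):
        return sigmoid(x @ self.W4 + self.b4) * (x @ self.W5 + self.b5)

    def __call__(self, a, c=None, training=False):
        pre = a @ self.W2 + self.b2
        if c is not None:  # the context input is optional
            pre = pre + c @ self.W3
        eta2 = elu(pre)
        eta1 = eta2 @ self.W1 + self.b1
        x = eta1
        if training and self.dropout > 0.0:
            keep = self.rng.random(x.shape) >= self.dropout
            x = x * keep / (1.0 - self.dropout)  # inverted dropout before gating
        return layer_norm(a + self.glu(x))

# Static enrichment as an example: theta(t, n) = GRN(phi(t, n), c_e).
grn = GRN(d_model=8, d_context=4)
a = np.random.default_rng(1).normal(size=(6, 8))  # six positions, d_model = 8
c_e = np.random.default_rng(2).normal(size=4)     # hypothetical static context
theta = grn(a, c_e)
```

Because the gate output lies in $(0, 1)$, a GRN whose gate saturates near zero reduces to a layer-normalized identity on `a`, which is how the architecture can effectively skip a block.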

3. Input Representation and Data Flow

Decoder-only TFT supports three major input classes:

  • Static Covariates ($s_i$): Each static feature (categorical via embeddings; continuous via linear projection) is mapped to $\mathbb{R}^{d_{model}}$, forwarded through a static variable-selection GRN to yield $\zeta$, and passed through four GRNs to form context vectors $\{c_s, c_h, c_c, c_e\}$.
  • Past-Observed Features ($z_{t-k:t}$), Known-Future Inputs ($x_{t-k:t+\tau_{max}}$): For each time position $n$, all features $j$ are embedded, variable selection is conducted by a GRN with context $c_s$ to produce selection weights $v_n$, leading to $\phi_{input}(t,n)$ via a weighted sum.
  • Temporal Embeddings: In pure decoder settings, these may derive from learned or sinusoidal positional encoding or from a 1×1 convolution (in place of the LSTM).

The processing flow follows the route: Covariate encoders and variable selection → local processing (positional) → static enrichment → interpretable self-attention → attention gating → position-wise GRN and residual → linear layer for multi-quantile forecasts (Lim et al., 2019).
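The variable selection stage at the start of this flow can be illustrated with a simplified NumPy sketch. It assumes single linear maps in place of $\mathrm{GRN}_{vs}$ and the per-variable GRNs so that the softmax weighting stays in focus; `select_variables` and all weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def select_variables(xi, c_s, W_select, W_context, W_feature):
    """
    xi:        (T, m, d) embeddings of m variables at T positions
    c_s:       (d_c,)    static selection context
    W_select:  (m * d, m), W_context: (d_c, m), W_feature: (m, d, d)
    Returns phi_input of shape (T, d) and weights v of shape (T, m).
    """
    T, m, d = xi.shape
    flat = xi.reshape(T, m * d)                          # Xi_t: concatenated embeddings
    logits = flat @ W_select + c_s @ W_context           # linear stand-in for GRN_vs
    v = softmax(logits, axis=-1)                         # selection weights per position
    xi_tilde = np.einsum("tmd,mde->tme", xi, W_feature)  # stand-in for per-variable GRNs
    phi_input = np.einsum("tm,tme->te", v, xi_tilde)     # weighted sum over variables
    return phi_input, v

# Hypothetical sizes: 6 positions, 3 variables, d_model = 4, context size 2.
rng = np.random.default_rng(0)
T, m, d, d_c = 6, 3, 4, 2
xi = rng.normal(size=(T, m, d))
c_s = rng.normal(size=d_c)
phi_input, v = select_variables(
    xi, c_s,
    W_select=rng.normal(size=(m * d, m)),
    W_context=rng.normal(size=(d_c, m)),
    W_feature=rng.normal(size=(m, d, d)),
)
```

Each row of `v` sums to one, so it can be read directly as the relative importance of every variable at that position.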

4. Pure Decoder Instantiation: Modifications and Interpretability

To realize the decoder-only TFT, all LSTM encoder and decoder structures are omitted, and local processing is replaced with either Transformer-style positional encodings or lightweight convolutions. The interpretability machinery is fully retained: variable selection weights (for time-varying and static features), attention distributions from the multi-head block, and static context vectors. The resulting architecture is interpretable at both the feature and temporal levels and follows the core operational logic of the original TFT (Lim et al., 2019).

A single interpretable multi-head self-attention layer is retained in the decoder. This configuration enables direct interpretability hooks, such as:

  • Variable selection weights per timestep and feature
  • Attention patterns over the prediction horizon
  • Gating diagnostics via GRNs and GLUs
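The attention hook can be made concrete with a small NumPy sketch of causal, interpretable multi-head attention. It assumes per-head query and key projections, one value projection shared across heads, and a lower-triangular causal mask; the function name and all weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_attention(theta, W_q, W_k, W_v, W_h):
    """
    theta: (N, d_model) enriched features for N positions
    W_q, W_k: (H, d_model, d_attn) per-head projections
    W_v: (d_model, d_attn) value projection shared by all heads
    W_h: (d_attn, d_model) output projection
    Returns B of shape (N, d_model) and the head-averaged attention (N, N).
    """
    N = theta.shape[0]
    d_attn = W_q.shape[-1]
    # Causal mask: position i may attend only to positions j <= i.
    allowed = np.tril(np.ones((N, N), dtype=bool))
    mask = np.where(allowed, 0.0, -np.inf)
    q = np.einsum("nd,hde->hne", theta, W_q)
    k = np.einsum("nd,hde->hne", theta, W_k)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_attn) + mask  # (H, N, N)
    A = softmax(scores, axis=-1)
    A_bar = A.mean(axis=0)                 # average attention over heads
    B = A_bar @ (theta @ W_v) @ W_h        # shared values, then output projection
    return B, A_bar

# Hypothetical sizes: 6 positions, d_model = 8, 4 heads, d_attn = 2.
rng = np.random.default_rng(0)
N, d_model, H, d_attn = 6, 8, 4, 2
theta = rng.normal(size=(N, d_model))
B, A_bar = interpretable_attention(
    theta,
    W_q=rng.normal(size=(H, d_model, d_attn)),
    W_k=rng.normal(size=(H, d_model, d_attn)),
    W_v=rng.normal(size=(d_model, d_attn)),
    W_h=rng.normal(size=(d_attn, d_model)),
)
```

Because the value projection is shared, `A_bar` is a single attention matrix that fully describes how positions are mixed, so it can be plotted or aggregated as a temporal attribution without reconciling per-head patterns.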

Lim et al. (2019) report the following hyperparameter choices from their empirical studies:

| Hyperparameter | Typical Value(s)/Range |
| --- | --- |
| State size $d_{model}$ | $160,\;240,\;320$ |
| Number of attention heads $m_H$ | $4$ (suggested: $1$ or $4$) |
| Dropout rate | $0.1$–$0.3$ |
| Minibatch size | $64$–$128$ |
| Learning rate | $10^{-3}$ ($10^{-2}$ for small/noisy data) |
| Max gradient norm | $0.01$–$100$ |

These values come from the experiments reported by Lim et al. (2019), and the decoder typically uses only a single interpretable multi-head attention layer. In decoder-only settings, the GRNs and skip connections may additionally be regularized by tuning the dropout rate (Lim et al., 2019).
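One way to work with these ranges is to encode them as a small search grid. The sketch below uses only the Python standard library; the class `DecoderOnlyTFTConfig`, its defaults, the intermediate grid points (dropout $0.2$, gradient norm $1.0$), and the divisibility check on the head count are illustrative assumptions rather than values or constraints stated in the table.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DecoderOnlyTFTConfig:
    """Hypothetical configuration container for a decoder-only TFT."""
    d_model: int = 160
    num_heads: int = 4
    dropout: float = 0.1
    batch_size: int = 64
    learning_rate: float = 1e-3
    max_grad_norm: float = 1.0

    def __post_init__(self):
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must lie in [0, 1)")
        # Assumption: each head works in a d_model / num_heads dimensional space.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by num_heads")

# Endpoints follow the table; intermediate points are assumptions.
SEARCH_SPACE = {
    "d_model": [160, 240, 320],
    "num_heads": [1, 4],
    "dropout": [0.1, 0.2, 0.3],
    "batch_size": [64, 128],
    "learning_rate": [1e-3, 1e-2],
    "max_grad_norm": [0.01, 1.0, 100.0],
}

def grid(space):
    """Yield one validated config per combination in the search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield DecoderOnlyTFTConfig(**dict(zip(keys, values)))

configs = list(grid(SEARCH_SPACE))
print(len(configs))  # 216
```

An exhaustive grid of this size is rarely affordable in practice, so sampling a random subset of `configs` is a common alternative.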

6. Significance and Use Cases

Decoder-only TFT preserves the interpretability and performance characteristics of the original architecture for multi-horizon forecasting across diverse real-world datasets. Its modular design, which combines static context, variable selection, and attention, yields both accurate quantile forecasts and explanatory attributions without treating temporal or feature dependencies as a black box. Removing the encoder streamlines deployment while retaining all interpretable diagnostics for model auditing and operational insight (Lim et al., 2019). A plausible implication is broader applicability in settings where only autoregressive or forward-facing architectures are feasible.

While the original TFT employed LSTM layers to capture local and temporal dependencies, the decoder-only variant transitions to pure attention and position-based mechanisms. This configuration aligns the architecture structurally closer to Transformer-style decoders but distinguishes itself via specialized variable selection, static enrichment, and an explicit focus on multi-quantile regression output. Such hybridization supports the full gamut of TFT interpretability and dynamic feature selection, differing from standard sequence decoders that lack these gates and selection layers (Lim et al., 2019). This suggests that decoder-only TFT can be viewed as a bridge between interpretably gated attention-based models and streamlined Transformer decoders.
