PDTrans: Probabilistic Decomposition Transformer

Updated 27 March 2026

PDTrans is a framework for interpretable, hierarchical, and robust probabilistic modeling in Transformer architectures, enabling clear trend/seasonal separation and mechanistic circuit discovery.
It uses output-space decomposition with a latent variable model to reduce cumulative forecasting errors and improve the extraction of interpretable sub-series.
Its parameter-space decomposition via SPD isolates rank-1 subcomponents, allowing for targeted intervention on model circuits and scalable, parallelizable inference.

The Probabilistic Decomposition Transformer (PDTrans) is a general framework for interpretable, hierarchical, and robust probabilistic modeling in Transformer architectures. It encompasses both architectural innovations for forecasting and systematic methods for parameter-space decomposition, with reported utility in time-series forecasting, mechanistic interpretability, and uncertainty quantification for tasks such as tabular prediction. Its hallmark is explicit, inherently probabilistic decomposition, either in output space (e.g., trend/seasonal separation) or in parameter space (e.g., sparse mechanistic circuits), while controlling error propagation and enabling efficient, parallelizable inference.

1. Architectural Principles of PDTrans

Two canonical realizations of PDTrans dominate the literature: output-space decomposition (via latent variable modeling layered over a Transformer) and parameter-space decomposition (via Stochastic Parameter Decomposition, SPD).

In output-space decomposition, PDTrans concatenates a canonical encoder–decoder Transformer with a conditional generative latent variable model. The Transformer provides stepwise autoregressive forecasts by learning temporal dependencies, whereas the VAE-style latent module absorbs the preliminary forecasts and context into a lower-dimensional stochastic encoding. This latent encoding is then decoded in a non-autoregressive, sequence-level manner, enabling joint correction across the forecasting horizon and decomposing the output into trend and seasonal components. This two-stage hierarchical design reduces cumulative error (exposure bias) and extracts interpretable sub-series for downstream analysis (Tong et al., 2022).

In parameter-space decomposition, PDTrans leverages SPD to decompose each weight matrix in a Transformer layer into a sum of rank-1 subcomponents. Causal-importance networks modulate the activation of each subcomponent for an input, yielding a context-sensitive, probabilistic mixture over mechanistic subroutines. This enables pinpointing and intervention upon interpretable circuits responsible for specific behaviors, discoveries, or facts in the model (Christensen et al., 12 Nov 2025).

2. Methodological Components

2.1 Output-Space Decomposition

The Transformer module receives, at each timestep $t$, a vector of past observations $Y_{t-1}$, covariates $X_t$, and positional/embedding information. Standard encoder–decoder stacks process sequence inputs via multi-head self-attention, cross-attention, and feed-forward networks. The output embedding $f_t \in \mathbb{R}^{d_{\rm model}}$ is projected to likelihood parameters (e.g., for a Gaussian, $\mu_t$ and $\sigma_t$).

A conditional generative module then models the predictive parameters for the future horizon as random variables depending on a latent $z$ , with inference via a diagonal Gaussian encoder. Decoder output is split additively into trend and seasonal stochastic components,

$Y_t^{\rm trend} \sim \mathcal{N}(\mu_t^{\rm trend}, \sigma_t^2/2),\quad Y_t^{\rm seasonal} \sim \mathcal{N}(\mu_t^{\rm seasonal}, \sigma_t^2/2),\quad \hat\mu_t = \mu_t^{\rm trend} + \mu_t^{\rm seasonal}.$

This decomposition is enforced by designing $\mu_t^{\rm trend}$ as an average-pooled MLP projection of $z$, isolating slow-varying patterns, and $\mu_t^{\rm seasonal}$ as an MLP, capturing rapid periodicity.
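The trend/seasonal split described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the helper names (`init_mlp`, `mlp`, `moving_average`, `decompose`), the two-layer tanh MLPs, the edge-padded moving average used as the average pooling, and all dimensions.

```python
import numpy as np


def init_mlp(rng, d_in, d_hidden, d_out):
    """Random parameters for a two-layer MLP (illustrative initialisation only)."""
    return (
        rng.normal(scale=0.1, size=(d_in, d_hidden)),
        np.zeros(d_hidden),
        rng.normal(scale=0.1, size=(d_hidden, d_out)),
        np.zeros(d_out),
    )


def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a tanh hidden activation."""
    return np.tanh(x @ w1 + b1) @ w2 + b2


def moving_average(series, kernel):
    """Average-pool along time; edge padding keeps the output length unchanged."""
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    padded = np.pad(series, (pad_left, pad_right), mode="edge")
    return np.convolve(padded, np.ones(kernel) / kernel, mode="valid")


def decompose(z, params, kernel=5):
    """Map a latent vector z to trend and seasonal means over the horizon."""
    # Trend: MLP projection of z, then average pooling to keep slow variation.
    mu_trend = moving_average(mlp(z, *params["trend"]), kernel)
    # Seasonal: a plain MLP projection of z, free to vary rapidly.
    mu_seasonal = mlp(z, *params["seasonal"])
    # The point forecast is the additive recombination of both parts.
    return mu_trend, mu_seasonal, mu_trend + mu_seasonal


rng = np.random.default_rng(0)
d_z, d_hidden, horizon = 8, 16, 24  # hypothetical sizes
params = {
    "trend": init_mlp(rng, d_z, d_hidden, horizon),
    "seasonal": init_mlp(rng, d_z, d_hidden, horizon),
}
z = rng.normal(size=d_z)
mu_trend, mu_seasonal, mu_hat = decompose(z, params)
```

The pooling step is what biases the first branch toward slow-varying structure; the second branch has no such constraint, so it is left to absorb the faster periodic variation.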

Training optimizes a weighted sum of the autoregressive negative log-likelihood, a Kullback–Leibler penalty on latent mismatch, and a reconstruction loss on the decomposed forecast, with an explicit scaling weight (Tong et al., 2022).
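As a rough sketch of how such a three-term objective could be assembled, the following assumes Gaussian likelihoods for both the autoregressive and the decoded forecasts and a standard-normal prior on the latent variable. The function names and the weights `kl_weight` and `rec_weight` are hypothetical; the exact weighting scheme of the original method is not reproduced here.

```python
import numpy as np


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2))


def kl_diag_gaussian_to_standard(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var)


def pdtrans_style_loss(y, ar_mu, ar_sigma, dec_mu, dec_sigma, z_mean, z_log_var,
                       kl_weight=1.0, rec_weight=1.0):
    """Weighted sum of autoregressive NLL, latent KL, and reconstruction NLL."""
    nll_ar = gaussian_nll(y, ar_mu, ar_sigma)                # stage 1: Transformer forecast
    kl = kl_diag_gaussian_to_standard(z_mean, z_log_var)     # latent regularisation
    rec = gaussian_nll(y, dec_mu, dec_sigma)                 # stage 2: decomposed forecast
    return nll_ar + kl_weight * kl + rec_weight * rec


rng = np.random.default_rng(0)
y = rng.normal(size=24)
loss = pdtrans_style_loss(
    y,
    ar_mu=y + 0.1, ar_sigma=np.ones(24),
    dec_mu=y + 0.05, dec_sigma=np.ones(24),
    z_mean=rng.normal(size=8) * 0.1, z_log_var=np.zeros(8),
    kl_weight=0.5, rec_weight=1.0,
)
```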

2.2 Parameter-Space Decomposition

For each layer $l$ and component $c$, learn vectors $u_{l,c}$ and $v_{l,c}$ such that $W^{l} \approx \sum_{c} u_{l,c} v_{l,c}^{\top}$. For each input $x$, a position- and token-sensitive causal-importance score $g_{l,c}(x) \in [0,1]$ is produced by minimal-attention-augmented MLPs. The effective weight matrix for an input is $W^{l}(x) = \sum_{c} m_{l,c}(x)\, u_{l,c} v_{l,c}^{\top}$, with the mask $m_{l,c}(x)$ sampled or set by $g_{l,c}(x)$.
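The masked sum of rank-1 subcomponents can be sketched in a few lines of NumPy. The array layout (one column per subcomponent), the function names, and the particular stochastic masking rule, which draws each mask uniformly between the causal-importance score and 1, are assumptions chosen for illustration and not a statement of the reference implementation.

```python
import numpy as np


def effective_weight(U, V, mask):
    """Sum over c of mask[c] * outer(U[:, c], V[:, c]).

    U has shape (d_out, C) and V has shape (d_in, C), one column per
    rank-1 subcomponent; mask has shape (C,).
    """
    # Broadcasting scales column c of U by mask[c] before the matrix product.
    return (U * mask) @ V.T


def sample_mask(importance, rng):
    """Stochastic mask in [importance, 1]: important components stay near 1,
    unimportant components are randomly attenuated (assumed sampling rule)."""
    return importance + (1.0 - importance) * rng.uniform(size=importance.shape)


rng = np.random.default_rng(0)
d_out, d_in, n_components = 3, 5, 4  # hypothetical sizes
U = rng.normal(size=(d_out, n_components))
V = rng.normal(size=(d_in, n_components))

importance = np.array([0.0, 0.5, 1.0, 0.2])   # would come from a causal-importance network
W_stochastic = effective_weight(U, V, sample_mask(importance, rng))

# Deterministic ablation of subcomponent 1: its mask entry is set to zero.
ablation_mask = np.array([1.0, 0.0, 1.0, 1.0])
W_ablated = effective_weight(U, V, ablation_mask)
```

Setting a mask entry to zero removes exactly one rank-1 term from the layer, which is the handle used for targeted ablation of a candidate circuit.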

SPD loss combines faithfulness (weight reconstruction), minimality (activations), and stochastic/deterministic reconstruction (output KL divergence), permitting efficient isolation of sparse, actionable parameter subspaces for mechanistic editing or scientific probing (Christensen et al., 12 Nov 2025).
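A toy version of these three loss terms for a single linear layer followed by a softmax is sketched below. The specific forms (squared Frobenius error for faithfulness, an L_p penalty on importance scores for minimality, and a categorical KL between the original and masked outputs for reconstruction) and the weights `w_faith`, `w_min`, and `w_rec` are assumptions for illustration only.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D vector."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def kl_categorical(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))


def spd_style_loss(W, U, V, importance, x, mask, p=1.0,
                   w_faith=1.0, w_min=1e-2, w_rec=1.0):
    """Faithfulness + minimality + output-reconstruction terms for one layer."""
    # Faithfulness: the subcomponents should sum back to the original weights.
    faithfulness = np.sum((W - U @ V.T) ** 2)
    # Minimality: penalise causal-importance scores so few components are active.
    minimality = np.sum(np.abs(importance) ** p)
    # Reconstruction: the masked layer should reproduce the original output distribution.
    target = softmax(W @ x)
    masked = softmax(((U * mask) @ V.T) @ x)
    reconstruction = kl_categorical(target, masked)
    return w_faith * faithfulness + w_min * minimality + w_rec * reconstruction


rng = np.random.default_rng(0)
U = rng.normal(size=(3, 4))
V = rng.normal(size=(5, 4))
W = U @ V.T                      # a layer that is exactly decomposable
x = rng.normal(size=5)

loss_exact = spd_style_loss(W, U, V, importance=np.zeros(4), x=x, mask=np.ones(4))
loss_all_off = spd_style_loss(W, U, V, importance=np.zeros(4), x=x, mask=np.zeros(4))
```

With an exact decomposition, zero importance scores, and an all-ones mask, every term vanishes; switching all components off leaves only the reconstruction term, which grows as the masked output drifts from the original.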

3. Applications and Empirical Evaluation

Time Series Forecasting

PDTrans reports state-of-the-art performance on classical time-series datasets:

Datasets: Electricity, Traffic, Solar, Exchange, and M4-Hourly, evaluated via the quantile losses $R_{0.5}$ (median) and $R_{0.9}$ (tail).
On Electricity and Traffic (24h-ahead), PDTrans attains lower quantile losses than contemporaneous baselines such as SSDNet, Informer, and N-BEATS (Tong et al., 2022).
Ablations indicate that the probabilistic decomposition contributes improvements (up to 15% on Traffic) and robustness to hyperparameter changes.
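For readers unfamiliar with the median and tail quantile losses used in this evaluation, the sketch below computes a normalized quantile (pinball) loss in the form commonly used in the probabilistic forecasting literature. The function name `rho_risk` and the normalization by the sum of absolute targets are assumptions; the exact definition used in the cited evaluation should be checked against the paper.

```python
import numpy as np


def rho_risk(y_true, y_quantile, rho):
    """Normalized quantile loss: 2 * sum(pinball) / sum(|y_true|).

    y_quantile is the model's predicted rho-quantile for each target value.
    """
    diff = y_true - y_quantile
    # Under-forecasts are weighted by rho, over-forecasts by (1 - rho).
    pinball = np.where(diff >= 0, rho * diff, (rho - 1.0) * diff)
    return 2.0 * pinball.sum() / np.abs(y_true).sum()


y_true = np.array([1.0, 2.0, 3.0])
under_forecast = np.array([0.0, 1.0, 2.0])   # every prediction is 1 too low

median_loss = rho_risk(y_true, under_forecast, rho=0.5)
tail_loss = rho_risk(y_true, under_forecast, rho=0.9)
```

At the 0.5 level the loss reduces to a scaled absolute error, while at the 0.9 level under-forecasts are penalized more heavily than over-forecasts, so `tail_loss` exceeds `median_loss` for this systematically low forecast.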

Mechanistic Interpretability and Circuit Discovery

Application of SPD-based PDTrans to GPT-2-small identifies a small fraction of all parameter subcomponents as causally responsible for specific factual completions ("Kobe Bryant" $\to$ "basketball", "Tiger Woods" $\to$ "golf"). Ablating these subcomponents sharply reduces the probability of the correct prediction for the associated fact, with minimal collateral impact, establishing direct mechanistic control (Christensen et al., 12 Nov 2025). Recovery of classic "induction head" circuits demonstrates SPD's efficacy in isolating interpretable algorithmic motifs in sequence models.

Uncertainty Decomposition in Tabular Prediction

The TabPFN instantiation of PDTrans amortizes Bayesian predictive inference into a transformer. A predictive central limit theorem (CLT) for the output sequence enables fast, frequentist-calibrated credible bands for epistemic uncertainty, based on volatility of predictive updates along the evidence context (Fortini et al., 4 Feb 2026). For classification, entropy-based decomposition cleanly separates aleatoric and epistemic entropy, with all relevant quantities derivable from model outputs via closed-form Beta approximations. Empirical evaluation demonstrates near-nominal coverage for credible intervals across multiple simulated data-generating processes and real-world datasets.
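The entropy-based split for classification can be illustrated with the standard decomposition of total predictive entropy into an expected (aleatoric) entropy and a remainder (epistemic, the mutual information). The sketch below is a Monte Carlo version for binary classification under an assumed Beta distribution over the class probability; it does not reproduce the closed-form Beta approximations mentioned above, and the function names and Beta parameters are hypothetical.

```python
import numpy as np


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) variable, clipped for stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def entropy_decomposition(p_samples):
    """Split total predictive entropy into aleatoric and epistemic parts.

    p_samples: draws of the class-1 probability reflecting parameter uncertainty.
    """
    total = binary_entropy(np.mean(p_samples))       # entropy of the averaged prediction
    aleatoric = np.mean(binary_entropy(p_samples))   # expected entropy per draw
    epistemic = total - aleatoric                    # mutual information, >= 0 by concavity
    return total, aleatoric, epistemic


rng = np.random.default_rng(0)

# Little evidence: a diffuse Beta gives a large epistemic share.
diffuse = entropy_decomposition(rng.beta(0.5, 0.5, size=10_000))

# Much evidence: a concentrated Beta leaves mostly aleatoric uncertainty.
concentrated = entropy_decomposition(rng.beta(300.0, 300.0, size=10_000))
```

As the distribution over the class probability concentrates, the epistemic term shrinks toward zero while the aleatoric term approaches the entropy of the underlying label noise.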

4. Interpretability, Complexity, and Limitations

Output-space PDTrans guarantees explicit trend/seasonal separation, aligning decomposed components with real-world pattern classes (e.g., business hours, daily cycles). Parameter-space SPD-based PDTrans yields precise, fine-grained circuit handles per-layer and position, suitable for targeted model editing or ablation.

PDTrans adds negligible computational overhead per batch for latent encoding/decoding and, in output-space formulations, supports parallel joint inference across the forecast horizon, avoiding the error drift inherent to purely autoregressive models (Tong et al., 2022). In SPD-based approaches, current practice incurs overhead from per-subcomponent gating networks and requires careful hyperparameter tuning. The method's success in large-scale LLMs depends on advances in scalable, blockwise, or hierarchical SPD (Christensen et al., 12 Nov 2025).

5. Extensions and Theoretical Generalizations

Research has extended PDTrans in several directions:

Scalability: Hierarchical or blockwise SPD holds promise for scaling to large models (e.g., GPT-3).
Generalized Causal Importance: Augmenting the causal-importance networks with richer attention and feature interactions aims to capture complex, non-rank-1 phenomena.
Uncertainty Quantification: Predictive-CLT-based PDTrans variants (as in TabPFN) realize black-box, efficiently computable credible bands and entropy decompositions, attaining practical frequentist coverage in supervised settings (Fortini et al., 4 Feb 2026).
Transferability: Output decomposition modules can be attached to alternative architectures (e.g., "PD-DeepAR"), transferring the benefits of trend/seasonal separation.

Limitations remain: compute overhead in SPD, difficulty of circuit extraction in high-capacity LLMs, the need for hyperparameter tuning (e.g., number of components, loss weights), and incomplete ground-truth circuit recovery all constrain full generality.

6. Significance and Broader Impact

PDTrans provides a principled approach to probabilistic modeling, interpretability, and uncertainty estimation in Transformers. Its design addresses three deficiencies of classic sequence models: exposure bias in autoregressive decoding, opacity of learned representations, and the lack of uncertainty-aware predictions. Reported results in time-series forecasting, mechanistic modeling, and tabular prediction point to its generality and practical value (Tong et al., 2022; Christensen et al., 12 Nov 2025; Fortini et al., 4 Feb 2026). These advances position PDTrans as a foundational framework for interpretable, robust, and uncertainty-calibrated machine learning in sequential and structured data domains.


References

Tong et al. (2022). Probabilistic Decomposition Transformer for Time Series Forecasting.

Christensen et al. (2025). Decomposition of Small Transformer Models.

Fortini et al. (2026). A principled framework for uncertainty decomposition in TabPFN.
