
Transformer Is Inherently a Causal Learner

Published 9 Jan 2026 in cs.LG and cs.AI | (2601.05647v1)

Abstract: We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.

Summary

  • The paper demonstrates that decoder-only transformers inherently learn lagged causal structures through gradient-based attribution during autoregressive prediction.
  • It leverages Layer-wise Relevance Propagation to extract robust population-level sensitivities, yielding scalable and high-fidelity causal graphs.
  • Empirical analyses show that transformers maintain precision across noise, non-stationarity, and nonlinear dynamics, outperforming classical Granger causality methods.

Theoretical Foundations

The paper establishes that decoder-only transformers, when trained autoregressively for next-step prediction in multivariate time series, naturally encode lagged causal structures in their learned representations. Crucially, under standard identifiability assumptions (conditional exogeneity, no instantaneous effects, lag-window coverage, faithfulness), the causal graph governing temporal dependencies can be directly recovered from the gradient sensitivities of the transformer's outputs with respect to past inputs. Formally, a causal link $j \stackrel{\ell}{\to} i$ exists if the score gradient energy $H_{j,i}^\ell = \mathbb{E}\big[\big(\partial_{x_{j,t-\ell}} \log p(X_{i,t} \mid X_{<t})\big)^2\big]$ is positive.
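In the homoscedastic-Gaussian case, the score gradient is a rescaled gradient of the predicted mean, so the criterion can be checked in closed form on a toy system. The sketch below uses a linear VAR(1) fitted by least squares as a stand-in for the trained transformer (an illustrative assumption, not the paper's method): the gradient-energy matrix recovers the support of the true coefficient matrix.

```python
import numpy as np

# Toy check of the criterion H_{j,i} = E[(d/dx_{j,t-1} log p)^2] > 0.
# A linear-Gaussian VAR(1) stands in for the transformer so the score
# gradient is available in closed form (illustration only).
rng = np.random.default_rng(0)
d, T = 3, 5000
A = np.array([[0.8, 0.0, 0.0],   # true lag-1 graph: self-loops plus
              [0.5, 0.6, 0.0],   # 1 -> 2 and 2 -> 3
              [0.0, 0.4, 0.7]])
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A @ X[t - 1] + 0.1 * rng.standard_normal(d)

# Fit by least squares (proxy for autoregressive training). Under
# homoscedastic Gaussian noise, H[i, j] is proportional to A_hat[i, j]^2.
B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
A_hat = B.T                      # A_hat[i, j]: effect of x_{j,t-1} on x_{i,t}
H = A_hat ** 2                   # gradient energy per (target, parent) pair
G_hat = H > 0.01                 # threshold away small estimation noise
print(G_hat.astype(int))         # prints the support of A
```

With enough samples, the squared-gradient energies of true edges dominate the estimation noise on absent edges, so a simple threshold suffices here; the paper's top-k rules play the same role for the transformer attributions.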

Compared to classical Granger causality, this criterion operates on the full conditional distribution, not only the mean, providing higher fidelity for nonlinear and non-Gaussian dynamics. Notably, any model that fits the relevant conditional distributions is suitable for this approach; the paper advances decoder-only transformers as the architecture of choice due to scalability and alignment with contemporary deep learning practice.

Figure 1: The autoregressive transformer connects lagged tokens to targets, modeling temporal generative processes and allowing for causal attribution via relevance scores.
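To see why conditioning on the full distribution matters, consider a hypothetical system in which one variable drives only the variance of another: a mean-based Granger regression finds nothing, while the dependence is plainly visible in the conditional distribution. A minimal sketch (illustrative only, not from the paper):

```python
import numpy as np

# x1 drives only the *variance* of x2, so E[x2_t | x1_{t-1}] = 0:
# a mean regression sees no effect, but the distributional link is real.
rng = np.random.default_rng(3)
T = 20000
x1 = rng.standard_normal(T)
x2 = np.empty(T)
x2[0] = 0.0
x2[1:] = rng.standard_normal(T - 1) * (0.5 + np.abs(x1[:-1]))  # variance link

# Mean effect: slope of x2_t on x1_{t-1} is ~0 (Granger-style test fails).
mean_coef = np.polyfit(x1[:-1], x2[1:], 1)[0]
# Distributional effect: |x2_t| scales with |x1_{t-1}| (clear dependence).
var_coef = np.polyfit(np.abs(x1[:-1]), np.abs(x2[1:]), 1)[0]
print(mean_coef, var_coef)
```

The score-gradient criterion above detects this edge because the conditional density of $X_{2,t}$ changes with $x_{1,t-1}$ even though its mean does not.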

Causal Structure Extraction via Gradient Attribution

Rather than relying on raw attention weights—which are confounded and non-faithful in deep transformer stacks due to token mixing—the procedure uses Layer-wise Relevance Propagation (LRP) to compute gradient-based attributions. For each output variable and lagged input, relevance scores are aggregated across the dataset to estimate population-level sensitivities, which are then binarized using top-k or uniform-threshold selection rules to yield the inferred causal graph.

This operationalizes the link between forecasting and structure discovery, turning the standard next-step prediction objective into a scalable estimator for temporal causal graphs within data-rich, nonlinear, and high-dimensional domains. The paper theoretically proves identifiability, derives explicit forms for various noise models (notably homoscedastic Gaussian), and connects practical LRP implementations to these theoretical results.
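Assuming per-sample attribution tensors are already available (from LRP or plain input gradients of a trained model), the aggregation and top-k binarization step might look like the following sketch; shapes and names are assumptions, not the paper's code.

```python
import numpy as np

def extract_graph(R, k):
    """Aggregate per-sample attributions into a binary lagged causal graph.

    R[n, i, j, l]: relevance of input x_{j,t-l} for output x_{i,t} in
    sample n (assumed layout). Returns a boolean graph of the same
    (target, parent, lag) shape with k edges kept per target.
    """
    # Population-level sensitivity: mean squared relevance over samples.
    H = np.mean(R ** 2, axis=0)                # shape (d, d, L)
    scores = H.reshape(H.shape[0], -1)         # flatten (parent, lag) pairs
    # Top-k binarization per target variable.
    G = np.zeros_like(scores, dtype=bool)
    idx = np.argsort(scores, axis=1)[:, -k:]   # k largest per row
    np.put_along_axis(G, idx, True, axis=1)
    return G.reshape(H.shape)

# Synthetic attributions: mostly noise plus one strong edge x_{0,t-1} -> x_{1,t}.
rng = np.random.default_rng(1)
R = rng.normal(size=(100, 3, 3, 2)) * 0.05
R[:, 1, 0, 0] += 1.0
G = extract_graph(R, k=1)
print(G[1, 0, 0])                              # the planted edge is recovered
```

The squared aggregation matches the gradient-energy criterion, and top-k per target mirrors the paper's observation that top-k selection stabilizes precision as the lag window grows.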

Empirical Analysis: Robustness and Scaling

Extensive simulation experiments demonstrate that transformer-based causal discovery is robust and exceptionally competitive across a spectrum of regimes:

  • High dimensionality: Transformers maintain precision and recall as the number of variables or maximum lag increases, outperforming constraint- and score-based baselines which degrade with dimensionality.
  • Nonlinear dynamics: Accuracy steadily improves with sample size. Transformers outperform traditional methods as functional complexity grows, especially for non-additive or complex nonlinear structures.
  • Non-stationarity: The approach scales favorably as data heterogeneity (number of domains) increases, in contrast to classical algorithms that struggle to separate shifting regimes.
  • Noise heterogeneity: Transformer performance is invariant to noise type and variance, while baselines show pronounced degradation under heteroscedasticity.

    Figure 2: Mean F1 score analysis across experimental regimes, quantifying transformer robustness to high dimensionality and data complexity.

    Figure 3: F1 scores under nonlinear dependency structures, illustrating superior scaling with sample size for transformers.

    Figure 4: F1 scores in non-stationary environments, showing improved accuracy with increasing sample size and number of domains.

    Figure 5: Robustness of transformer causal inference to latent variable prevalence and noise type.

Addressing Latent Confounding and Instantaneous Effects

The native architecture does not explicitly model latent confounders or contemporaneous relationships, which can produce spurious edges when those assumptions are violated. However, integrating domain knowledge, post-processing with latent-aware methods (e.g., L-PCMCI), or combining transformer-inferred lagged structures with traditional causal discovery substantially improves robustness and accuracy.

Figure 6: Post-processing and domain-indicator integration to optimize causal graph recovery under latent confounding and regime shifts.

Transformer Architecture and Attribution Analysis

Experiments dissect the effects of architecture choices:

  • Model depth: Increasing layers enhances the model's ability to capture long-range and nonlinear dependencies, with marked improvements in causal recovery fidelity.
  • Attention interpretation: Raw attention scores are a poor proxy for causal influence in deep models, validated by the misalignment between attention-based and gradient-based attributions.
  • Binarization: Top-k selection stabilizes precision as the lag window and variable count grow.

    Figure 7: Transformer variant comparison; deep models with LRP attributions perform best on long-range, nonlinear, and non-stationary regimes.

Uncertainty Quantification and Model Selection

Aggregation of per-sample relevance scores and rank-based metrics enables uncertainty quantification, yielding confidence estimates for inferred edges. These estimates can prioritize edges for further analysis in domains where false-positive rates must be tightly controlled.

Figure 8: Uncertainty analysis via mean and variance of edge relevance rank, surfacing high-confidence causal edges.
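One way to realize this rank-based uncertainty estimate, under assumed shapes for the per-sample relevance scores (illustrative only, not the paper's implementation):

```python
import numpy as np

# Rank-based edge uncertainty: R[n, e] is the relevance of candidate
# edge e in sample n (assumed layout). An edge that ranks near the top
# in every sample, with low rank variance, is a high-confidence edge.
rng = np.random.default_rng(2)
N, E = 200, 10
R = rng.normal(0.0, 1.0, size=(N, E))
R[:, 3] += 4.0                          # one consistently strong edge

# Rank edges within each sample (0 = most relevant), then summarize.
ranks = np.argsort(np.argsort(-R, axis=1), axis=1)
mean_rank = ranks.mean(axis=0)
var_rank = ranks.var(axis=0)

# Low mean rank + low variance -> high-confidence edge.
best = int(np.argmin(mean_rank))
print(best, mean_rank[best], var_rank[best])
```

In a false-positive-sensitive application one would accept only edges whose mean rank stays below a cutoff and whose rank variance is small, deferring the rest for further analysis.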

Implications and Future Directions

This work posits a unifying paradigm for causality in sequence modeling: large foundation models are inherently causal learners when equipped with gradient-based attribution. This reframes causal discovery as a secondary effect of scalable representation learning, with implications for zero-shot adaptation, data efficiency, cross-modal transfer, and interpretability in foundation models. The underlying theory motivates future model developments—such as native integration of structural priors, latent variable handling, and environment-invariant mechanism modeling—potentially leading to universal causal structure learning across scientific and industrial domains.

Conclusion

The paper rigorously demonstrates that decoder-only transformers trained for autoregressive prediction can recover temporal causal graphs via gradient-based relevance attributions, under modest and standard identifiability assumptions. Empirical results highlight strong accuracy, robustness, and scalability that surpass classical causal discovery algorithms, particularly in data-rich, complex, and heterogeneous regimes. This elevates the transformer architecture as an interpretable and robust causal learner, suggesting that the next generation of foundation models may unify forecasting and structure discovery, achieving generalizable mechanism-level understanding and diagnostic capability for real-world systems (2601.05647).
