Causal Attention for Sequential Recommendation
- The paper introduces a method that integrates causal graph discovery into attention mechanisms to filter non-causal dependencies.
- It details the CausalRec architecture, including an embedding layer, a causal discovery block, and a CausalBooster, which together yield improvements of up to 15% in key metrics.
- The approach enhances interpretability and robustness by ensuring that recommendations rely on true cause–effect relationships, mitigating spurious correlations.
Causal attention for sequential recommendation is a class of techniques that augment standard attention mechanisms by explicitly modeling or inferring the causal influences underlying user interaction sequences. The goal is to separate genuine cause–effect relationships from mere statistical correlations, yielding more robust, explainable, and accurate recommendation systems. Methods in this domain typically involve learning or leveraging causal graphs, probabilistic latent spaces, or counterfactual structures and then using this causal information to refine the traditional sequence-to-sequence attention models.
1. Principles and Motivation
Traditional attention-based sequential recommendation models, such as Transformer architectures, excel at modeling dependencies by learning soft correlations between items through mechanisms like scaled dot-product attention. However, such models can easily capture spurious patterns emerging from item co-occurrences or dataset bias, leading to recommendations that are not reflective of the true causal dynamics within user behavior sequences. Causal attention methods aim to rectify these shortcomings by:
- Discovering and encoding the underlying causal graph among items,
- Filtering out non-causal (spurious) dependencies,
- Prioritizing historical behaviors demonstrably causally relevant to the target prediction.
This paradigm is grounded in the principles of graphical causal modeling and structural causal models (SCMs), in which user behaviors are viewed as outcomes of directed acyclic relations among previous actions, exogenous factors, and system interventions.
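As a minimal illustration of the distinction at stake, the toy simulation below (not from the paper; all coefficients are arbitrary) draws data from a three-variable linear SCM in which item A causes both B and C. The two effects end up strongly correlated despite having no causal link between them, which is exactly the kind of dependency a purely correlational attention model would absorb:

```python
import numpy as np

# Toy linear SCM over three "items": A -> B and A -> C, with no edge between B and C.
rng = np.random.default_rng(0)
n = 10_000
eps_a, eps_b, eps_c = rng.normal(size=(3, n))

a = eps_a                    # exogenous cause
b = 0.8 * a + 0.3 * eps_b    # caused by A
c = 0.7 * a + 0.3 * eps_c    # also caused by A

# B and C are strongly correlated despite having no causal link (common cause A):
print(np.corrcoef(b, c)[0, 1])                        # ~0.86, a spurious dependency
# Removing the contribution of the common cause eliminates the association:
print(np.corrcoef(b - 0.8 * a, c - 0.7 * a)[0, 1])    # ~0.0
```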
2. Architecture and Components of CausalRec
CausalRec exemplifies the integration of causality into attention mechanisms for sequential recommendation (Hou et al., 24 Oct 2025). Its architecture comprises:
- Embedding Layer: Encodes raw user–item sequences into dense vectors augmented with position embeddings.
- Causal Discovery Block: Learns a data-driven, item-level causal graph from user sequences using SCM-inspired analysis. Specifically, it models each item vector $\mathbf{x}_i$ as
  $$\mathbf{x}_i = \sum_{\mathbf{x}_j \in \mathrm{Pa}(\mathbf{x}_i)} A_{ji}\,\mathbf{x}_j + \beta\,\boldsymbol{\epsilon}_i,$$
  or in matrix form:
  $$\mathbf{X} = A^{\top}\mathbf{X} + \beta\,\mathbf{E},$$
  where $A$ is the learned adjacency matrix encoding directed causal relations, $\beta$ scales the exogenous noise $\boldsymbol{\epsilon}$, and $\mathrm{Pa}(\mathbf{x}_i)$ is the set of parent items causing $\mathbf{x}_i$.
- CausalBooster: Refines self-attention so that items causally linked to a prediction receive a multiplicative score boost. Formally, for attention layer $l$,
  $$\tilde{A}^{(l)} = A^{(l)} \odot \left(\mathbf{1} + \gamma\,G\right),$$
  where $A^{(l)}$ is the vanilla attention matrix, $G$ is the learned causal adjacency, $\gamma$ modulates boost strength, and $\odot$ is the Hadamard (element-wise) product (a layer-level sketch of the full pipeline appears after this list).
- Prediction Layer: Computes relevance between the causally refined user sequence representation and candidate items, typically using dot-product scoring followed by a softmax for ranking.
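The following PyTorch sketch shows how these components can fit together in a single attention layer. It is a hypothetical reconstruction under the multiplicative-boost form described above; the module names, dimensions, and hyperparameters (e.g. `gamma`) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBoostedAttention(nn.Module):
    """Hypothetical single-layer sketch of the CausalRec pipeline: item/position
    embeddings, self-attention whose scores are multiplicatively boosted by a
    causal adjacency G, and dot-product prediction. Names, sizes, and the value
    of gamma are illustrative assumptions, not the authors' implementation."""

    def __init__(self, num_items: int, max_len: int = 200, d: int = 64, gamma: float = 0.5):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)  # embedding layer
        self.pos_emb = nn.Embedding(max_len, d)                        # position embeddings
        self.wq = nn.Linear(d, d)
        self.wk = nn.Linear(d, d)
        self.wv = nn.Linear(d, d)
        self.gamma, self.d = gamma, d

    def forward(self, seq: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
        # seq: (B, n) item ids; G: (B, n, n) adjacency from the causal discovery block.
        n = seq.size(1)
        x = self.item_emb(seq) + self.pos_emb(torch.arange(n, device=seq.device))
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5    # vanilla attention scores
        scores = scores * (1.0 + self.gamma * G)            # CausalBooster: multiplicative boost
        attn = F.softmax(scores, dim=-1)                     # normalize the boosted scores
        user_repr = (attn @ v)[:, -1]                        # last position summarizes the user
        # Prediction layer: dot-product relevance against every candidate item embedding.
        return user_repr @ self.item_emb.weight.T            # (B, num_items + 1) ranking scores
```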
3. Causal Discovery: Theory, Constraints, and Identifiability
The causal discovery block distinguishes itself from classical correlation-based sequence modeling by learning a directed acyclic graph (DAG) that encodes the plausible generative process of user actions. Its methodology includes:
- Covariance-based Causal Graph Recovery: Post layer normalization, the user sequence embeddings $\mathbf{X}$ are used to estimate the covariance,
  $$\hat{\Sigma} = \tfrac{1}{d}\,\mathbf{X}\mathbf{X}^{\top},$$
  from which the adjacency $A$ is recovered. The identifiability of $A$ is guaranteed (under equal exogenous noise variance assumptions) by known results for linear SEMs.
- Acyclicity and Sparsity Regularization: To ensure the learned $A$ represents a DAG, a continuous acyclicity constraint is enforced:
  $$h(A) = \operatorname{tr}\!\left(e^{A \odot A}\right) - n = 0,$$
  along with an $\ell_1$ sparsity penalty to discourage overly dense graphs (a code sketch of both terms follows at the end of this section).
This design guarantees that, under the right conditions, the estimated causal graph uniquely reflects the true item-level causal structure in behavior sequences (Hou et al., 24 Oct 2025).
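A compact sketch of how the covariance estimate and the acyclicity/sparsity regularizer might be implemented is shown below; the NOTEARS-style trace-exponential penalty and the `lambda_sparse` weight are assumptions consistent with the constraint stated above, not the paper's exact losses.

```python
import torch

def covariance(X: torch.Tensor) -> torch.Tensor:
    """Item-level covariance from layer-normalized sequence embeddings X of shape (n, d)."""
    return X @ X.T / X.size(1)

def dag_regularizer(A: torch.Tensor, lambda_sparse: float = 0.01) -> torch.Tensor:
    """Continuous acyclicity penalty h(A) = tr(exp(A * A)) - n plus an l1 sparsity
    term; h(A) = 0 exactly when A encodes a DAG (NOTEARS-style formulation)."""
    n = A.size(0)
    h = torch.trace(torch.matrix_exp(A * A)) - n
    return h + lambda_sparse * A.abs().sum()
```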
4. CausalBooster: Integrating Causality into Attention
The CausalBooster mechanism transforms the attention process by assigning greater weight to sequence positions with higher causal significance:
- The conventional self-attention computes weights reflecting statistical dependencies between candidate and historical items.
- CausalBooster modifies these scores using the discovered DAG: for items $i$ and $j$,
  $$\tilde{A}^{(l)}_{ij} = A^{(l)}_{ij}\left(1 + \gamma\,G_{ij}\right),$$
  permitting a flexible and differentiable adjustment while retaining the ability to propagate information from non-causal but informative contexts (illustrated numerically below).
- The adjusted attention is then normalized (with softmax) to aggregate information and passed through subsequent network layers.
This form of "causal attention" ensures that short- and long-term dependencies are prioritized according to their causal efficacy, leading to recommendations that are semantically grounded and less prone to being derailed by irrelevant correlations.
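The toy computation below makes the adjustment concrete with illustrative numbers only: boosting a 3×3 score matrix with a hypothetical adjacency `G` before the softmax visibly increases the weight assigned to a causal parent while leaving other positions reachable.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[1.0, 0.8, 0.9],
                   [0.5, 1.2, 0.7],
                   [0.6, 0.6, 1.0]])          # vanilla attention scores A^(l)
G = np.array([[0, 1, 0],
              [0, 0, 0],
              [1, 0, 0]], dtype=float)        # hypothetical causal adjacency
gamma = 0.5

plain   = softmax(scores)                     # correlation-only attention weights
boosted = softmax(scores * (1 + gamma * G))   # CausalBooster-adjusted weights

# For the first query position, the weight on its causal parent (column 1)
# rises from ~0.30 to ~0.39, while the other positions remain reachable.
print(plain[0], boosted[0])
```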
5. Empirical Results and Evaluation
CausalRec demonstrates consistent and substantial improvements across multiple benchmark datasets, including Movielens-1M, Foursquare, LastFM, and KGRec-music:
- Average improvements of 7.21% in Hit Rate (HR) and 8.65% in Normalized Discounted Cumulative Gain (NDCG) over existing methods.
- On specific datasets (e.g., Foursquare), gains in NDCG and HR exceeded 15% and 14% relative to competitive baselines.
- Ablation studies confirm that both causal boosting and sparsity constraints are essential; naive filtering or removal of causal adjustment degrades accuracy.
- The performance improvement is robust across diverse recommendation contexts, further reinforcing the broad applicability of the method.
| Dataset | NDCG improvement over best baseline | HR improvement over best baseline |
|---|---|---|
| Movielens-1M | +8.65% | +7.21% |
| Foursquare | +15.49% | +14.65% |
(Values are relative improvements over competitive baselines, sourced from the experimental results in (Hou et al., 24 Oct 2025).)
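For reference, the two reported metrics can be computed per test interaction as follows; this is a standard HR@K / NDCG@K implementation for a single held-out target item, not code released with the paper.

```python
import numpy as np

def hr_ndcg_at_k(ranked_items: np.ndarray, target: int, k: int = 10):
    """HR@K is 1 if the held-out target appears in the top-K list, else 0.
    NDCG@K for a single relevant item is 1 / log2(rank + 1), since IDCG = 1."""
    topk = ranked_items[:k]
    hits = np.where(topk == target)[0]
    if hits.size == 0:
        return 0.0, 0.0
    rank = int(hits[0]) + 1                      # 1-indexed rank of the target
    return 1.0, 1.0 / np.log2(rank + 1)

# Example: target ranked 3rd -> HR@10 = 1.0, NDCG@10 = 1 / log2(4) = 0.5
print(hr_ndcg_at_k(np.array([42, 7, 99, 13]), target=99, k=10))
```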
6. Theoretical and Practical Implications
The causal attention framework in sequential recommendation delivers a range of theoretical and practical advantages:
- Interpretability: The causal graph provides insight into which behaviors or actions are the true drivers behind recommendations, facilitating transparent human-understandable rationales for system outputs.
- Robustness: Prioritizing causally meaningful dependencies confers resilience against dataset-specific noise and spurious correlations.
- Generalizability: The core methodology is compatible with the Transformer family and can be deployed in large-scale real-world systems requiring both accuracy and interpretability.
- Extension: The architecture suggests a foundation for future work in multi-modal or feature-level causal discovery, deeper integration of side-information, or adaptation to domains where external interventions are frequent (e.g., marketing-driven recommendation).
7. Mathematical Formulation
Key mathematical constructs from CausalRec include:
- Self-Attention: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
- Causal Attention Adjustment: $\tilde{A}^{(l)} = A^{(l)} \odot \left(\mathbf{1} + \gamma\,G\right)$
- Linear SCM: $\mathbf{X} = A^{\top}\mathbf{X} + \beta\,\mathbf{E}$
- Acyclicity Constraint: $h(A) = \operatorname{tr}\!\left(e^{A \odot A}\right) - n = 0$
These equations formalize the integration of causal graph-based reasoning with neural attention, ensuring both identifiability and practical trainability.
In summary, causal attention for sequential recommendation, as instantiated by CausalRec, introduces a principled, mathematically grounded framework for refining attention weights with explicit causal discovery, delivering marked gains in both accuracy and reliability. The synergy between SCM-based graph learning and context-sensitive attention mechanisms positions causal attention as a central methodology for the next generation of explainable, robust sequential recommender systems (Hou et al., 24 Oct 2025).