AttnLRP: Attention-Aware LRP for Transformers

Updated 18 April 2026

AttnLRP is an explanation method that decomposes Transformer outputs into layer-wise relevance scores for both input features and latent representations.
The method extends classical LRP with specialized rules for attention: relevance is split uniformly between the two operands of each attention matrix product (Q and K, A and V) and propagated through the softmax.
AttnLRP costs about as much as a single backward pass and has been validated on models such as BERT, GPT, and ViT; known limitations include violations of implementation invariance.

AttnLRP (Attention-Aware Layer-Wise Relevance Propagation) is an explanation method for Transformer networks that extends Layer-Wise Relevance Propagation (LRP) to faithfully and efficiently attribute the output prediction not only to input features but also to latent representations, especially across the non-linear attention mechanism. AttnLRP propagates a scalar output (typically a logit or a contrastive logit difference) backward through the full computational graph, assigning "relevance" scores to each neuron and input. These scores are designed to decompose the prediction and enable fine-grained interpretability of large-scale neural models such as BERT, GPT, ViT, and more recent LLMs. The approach keeps its computational cost comparable to a single backward pass and is motivated by the need for transparent, precise explanations in domains such as language understanding, vision, and scientific modeling (Achtibat et al., 2024, Arras et al., 21 Feb 2025, You et al., 21 Oct 2025).

1. Formalism and Operational Principles

AttnLRP builds on the classical LRP principle: the output prediction $f(x)$ is recursively decomposed as a sum of relevance scores $R_i$ across network layers so that, for each layer $l$,

$\sum_j R_j^l = \sum_i R_i^{l-1} = f(x).$

In Transformer layers, specialized rules are introduced to handle linear projections, elementwise activations, normalization layers, and, crucially, the attention modules.

Linear Layers: The LRP-ε rule is applied:

$R_i^{l-1} = \sum_j \frac{w_{ji} a_i}{z_j + \varepsilon\,\mathrm{sign}(z_j)} R_j^l,$

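where $a_i$ is the input activation and $z_j$ is the pre-activation. The stabilizer $\varepsilon$ prevents division by near-zero $z_j$, and relevance is conserved up to the small share absorbed by $\varepsilon$ and any bias term.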
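The LRP-ε rule above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the reference implementation: the function name `lrp_epsilon_linear`, the list-of-lists weight layout `W[j][i]`, and the convention that sign(0) is +1 are assumptions made for the example.

```python
def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """Redistribute relevance R_out (one value per output neuron j) onto the
    inputs a_i of a linear layer z_j = sum_i W[j][i] * a_i + b[j]."""
    n_out, n_in = len(W), len(a)
    # Pre-activations z_j of the layer.
    z = [sum(W[j][i] * a[i] for i in range(n_in)) + b[j] for j in range(n_out)]
    R_in = [0.0] * n_in
    for j in range(n_out):
        # Stabilised denominator z_j + eps * sign(z_j); sign(0) is taken as +1 (assumption).
        denom = z[j] + (eps if z[j] >= 0 else -eps)
        for i in range(n_in):
            R_in[i] += W[j][i] * a[i] / denom * R_out[j]
    return R_in


a = [1.0, 2.0, -1.0]
W = [[0.5, -1.0, 2.0], [1.5, 0.25, 0.0]]
b = [0.0, 0.0]
R_out = [-3.5, 2.0]  # here chosen equal to the pre-activations z_j
R_in = lrp_epsilon_linear(a, W, b, R_out)
```

With zero bias and a small ε, the input relevances sum to (almost exactly) the output relevance, which is the conservation property stated above.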

Attention Layers: For a single head, with the forward computation

$Q = XW_Q,\quad K = XW_K,\quad V = XW_V,\quad S = \frac{Q K^T}{\sqrt{d_k}},\quad A = \mathrm{softmax}(S),\quad O = AV,$

AttnLRP proceeds by:

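Splitting incoming relevance at $O$ evenly (uniform rule) between $A$ and $V$:

$R_{A_{ij}} = \sum_p \frac{A_{ij} V_{jp}}{2\,O_{ip} + \varepsilon}\, R_{O_{ip}}, \qquad R_{V_{jp}} = \sum_i \frac{A_{ij} V_{jp}}{2\,O_{ip} + \varepsilon}\, R_{O_{ip}}.$

Propagating $R_A$ back to $R_S$ via a customized softmax rule based on the local Jacobian:

$R_{S_{ij}} = S_{ij} \Big( R_{A_{ij}} - A_{ij} \sum_k R_{A_{ik}} \Big).$

Splitting relevance from $R_S$ equally between the components of $Q$ and $K$.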
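The three attention steps can be sketched for a single head in plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the helper names are hypothetical, matrices are lists of rows, the matrix-product rule gives each operand half of every elementary product's share, the softmax rule is taken as R_S[i][j] = S[i][j] * (R_A[i][j] - A[i][j] * sum_k R_A[i][k]), and a sign-matched ε is used as stabiliser.

```python
import math


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def softmax_rows(S):
    out = []
    for row in S:
        m = max(row)  # subtract the row maximum for numerical stability
        e = [math.exp(s - m) for s in row]
        t = sum(e)
        out.append([v / t for v in e])
    return out


def split_matmul_relevance(X, Y, R_out, eps=1e-9):
    """Uniform rule for Z = X @ Y: every product X[i][k] * Y[k][j] gets its
    proportional share of R_out[i][j], split half/half between X and Y."""
    Z = matmul(X, Y)
    R_X = [[0.0] * len(X[0]) for _ in X]
    R_Y = [[0.0] * len(Y[0]) for _ in Y]
    for i in range(len(X)):
        for j in range(len(Y[0])):
            denom = 2.0 * Z[i][j] + (eps if Z[i][j] >= 0 else -eps)
            for k in range(len(Y)):
                share = X[i][k] * Y[k][j] / denom * R_out[i][j]
                R_X[i][k] += share
                R_Y[k][j] += share
    return R_X, R_Y


def softmax_relevance(S, A, R_A):
    """Softmax rule from the first-order linearisation of A = softmax(S)."""
    R_S = []
    for s_row, a_row, r_row in zip(S, A, R_A):
        total = sum(r_row)
        R_S.append([s * (r - a * total) for s, a, r in zip(s_row, a_row, r_row)])
    return R_S


def attention_relevance(Q, K, V, R_O):
    """Propagate relevance R_O at O = softmax(Q K^T / sqrt(d_k)) V to Q, K, V."""
    d_k = len(Q[0])
    Kt = [list(col) for col in zip(*K)]
    S = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, Kt)]
    A = softmax_rows(S)
    R_A, R_V = split_matmul_relevance(A, V, R_O)      # step 1: O = A V
    R_S = softmax_relevance(S, A, R_A)                # step 2: A = softmax(S)
    R_Q, R_Kt = split_matmul_relevance(Q, Kt, R_S)    # step 3: S ~ Q K^T
    R_K = [list(col) for col in zip(*R_Kt)]
    return R_Q, R_K, R_V
```

The constant factor 1/sqrt(d_k) does not affect step 3, because each product's share of S is a ratio that is invariant to rescaling. Note that only the two matrix-product steps conserve relevance exactly; the softmax step generally does not preserve the row sums of R_A.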

Elementwise activations are handled by the identity rule; normalizations by rewiring to an affine transformation and applying the z-rule.

These rules yield a complete backward pass that faithfully decomposes relevance layer-by-layer (Achtibat et al., 2024, Arras et al., 21 Feb 2025).

2. Theoretical Justification and Distinctive Choices

AttnLRP's attention-specific propagation mechanisms are grounded in Deep Taylor Decomposition (DTD) and conservation principles tailored for the multi-head attention structure. The method’s principal innovations over standard LRP and prior rollout-based approaches are:

Bilinear/Uniform-Split Rule: For matrix multiplications such as $Q K^T$ or $AV$, AttnLRP decomposes relevance equally between the two multiplicands. The uniform split is justified by recasting the product as a sum of two symmetric terms, yielding exactly 0.5 allocation to each pathway when using grad×input operations (Arras et al., 21 Feb 2025).
Softmax Rule: Instead of treating attention weights as constants, relevance is redistributed through the softmax based on the first-order local linearization (the true layerwise Jacobian), yielding an explicit analytical formula for backward relevance assignment (Arras et al., 21 Feb 2025, Achtibat et al., 2024).
Complete Attribution to Intermediate Features: All neuron activations in the network—inputs and latent units—are assigned relevance scores, enabling circuit-level and feature-level interpretability.
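The 0.5 allocation of the uniform-split rule can be checked numerically. For a single output element of a matrix product, the input×gradient sum over the left operand and the sum over the right operand each equal the output, so together they give twice the output and each pathway receives half. The sketch below verifies this with central finite differences; the function names and the toy values are illustrative assumptions.

```python
def dot(a, v):
    """One output element of a matrix product: O_ip = sum_j A_ij * V_jp."""
    return sum(x * y for x, y in zip(a, v))


def input_times_gradient(a, v, h=1e-6):
    """Return (sum_j a_j * dO/da_j, sum_j v_j * dO/dv_j) via central differences."""
    def bump(vec, j, delta):
        out = list(vec)
        out[j] += delta
        return out

    gi_a = sum(a[j] * (dot(bump(a, j, h), v) - dot(bump(a, j, -h), v)) / (2 * h)
               for j in range(len(a)))
    gi_v = sum(v[j] * (dot(a, bump(v, j, h)) - dot(a, bump(v, j, -h))) / (2 * h)
               for j in range(len(v)))
    return gi_a, gi_v


a_row = [0.2, 0.8]    # one row of A (toy values)
v_col = [3.0, -1.0]   # one column of V (toy values)
o = dot(a_row, v_col)
gi_a, gi_v = input_times_gradient(a_row, v_col)
# gi_a + gi_v is twice the output, so normalising by 2 * o gives each operand one half.
```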

3. Implementation Algorithms and Efficiency

AttnLRP is implemented as a backward-pass heuristic relying on PyTorch or similar frameworks, requiring only a single forward and single backward pass. The core steps are:

Forward Pass: Compute and cache necessary activations (Q, K, V, A, O, etc.).
Backward Initialization: Set the initial relevance on the output (e.g., equal to the target logit).
Custom Backward Hooks: At each layer, apply the relevant AttnLRP backward rule via PyTorch hooks or gradient×input wrappers, especially for attention matmul and softmax (Achtibat et al., 2024, Arras et al., 21 Feb 2025, Schall et al., 8 Dec 2025).
Extraction: Gather and process the token/input-level relevance vector and optionally the per-layer, per-neuron relevance.
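The four steps above can be illustrated on a toy feed-forward network without any framework. This sketch replaces PyTorch hooks with plain Python objects that cache their own activations; the class names `Linear` and `ReLU`, the method name `relprop`, and the function `explain` are hypothetical, and attention layers are left out to keep the example short.

```python
class Linear:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, a):
        self.a = a  # cache the input activation for the backward rule
        self.z = [sum(w * x for w, x in zip(row, a)) + bj
                  for row, bj in zip(self.W, self.b)]
        return self.z

    def relprop(self, R, eps=1e-6):
        # LRP-epsilon rule with a sign-matched stabiliser.
        R_in = [0.0] * len(self.a)
        for row, zj, Rj in zip(self.W, self.z, R):
            denom = zj + (eps if zj >= 0 else -eps)
            for i, (w, x) in enumerate(zip(row, self.a)):
                R_in[i] += w * x / denom * Rj
        return R_in


class ReLU:
    def forward(self, a):
        return [max(0.0, x) for x in a]

    def relprop(self, R):
        return R  # identity rule: relevance passes through unchanged


def explain(layers, x, target):
    a = x
    for layer in layers:              # 1. forward pass, caching activations
        a = layer.forward(a)
    R = [0.0] * len(a)
    R[target] = a[target]             # 2. initialise relevance with the target logit
    for layer in reversed(layers):    # 3. apply each layer's rule in reverse order
        R = layer.relprop(R)
    return a, R                       # 4. logits and input-level relevance


net = [Linear([[1.0, -1.0], [0.5, 2.0]], [0.0, 0.0]),
       ReLU(),
       Linear([[1.0, -0.5], [0.3, 0.3]], [0.0, 0.0])]
logits, R_input = explain(net, [2.0, 1.0], target=1)
```

In a real Transformer the same loop would be realised by the framework's autograd machinery, with the attention-specific rules attached to the matrix products and the softmax.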

For a single input, the computational cost is constant in the number of passes (one forward and one backward) and matches the cost of a single backward pass when checkpointing is used (Achtibat et al., 2024). The method is reported to be notably faster (1.5–2×) than ALTI-Logit and more efficient than explicit perturbation-based explanation techniques (Arras et al., 21 Feb 2025).

4. Empirical Behavior, Benchmarks, and Practical Limitations

AttnLRP has been extensively benchmarked across NLP and vision tasks, including subject–verb agreement, IMDB sentiment, ImageNet classification, QA on SQuAD, and re-ID pipelines for wildlife monitoring.

On LLMs (BERT, LLaMA-3), AttnLRP outperforms gradient-based baselines and attention rollout in faithfulness metrics (e.g., area under MoRF/LeRF perturbation curves, Intersection-over-Union plausibility, and pointing-game accuracy) (Achtibat et al., 2024, Arras et al., 21 Feb 2025). In real-world multi-modal settings (e.g., GorillaWatch for gorilla re-ID), differentiable proxies for $k$-NN retrieval (e.g., soft-$k$-NN margin) align AttnLRP with dense embedding models and yield interpretable pixel-level explanations that can distinguish real biometric signals from spurious background cues (Schall et al., 8 Dec 2025).
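A perturbation curve of the MoRF/LeRF kind mentioned above can be sketched as follows. This is one common formulation and only an assumption about the protocol: features are replaced by a fixed baseline in order of most (MoRF) or least (LeRF) relevance, the model output is recorded after each step, and the area under the curve is normalised by the number of steps. The function names and the toy linear model are illustrative; the cited benchmarks may use different baselines or normalisations.

```python
def perturbation_curve(model, x, relevance, most_relevant_first=True, baseline=0.0):
    """Model output after successively replacing features by `baseline`."""
    order = sorted(range(len(x)), key=lambda i: relevance[i],
                   reverse=most_relevant_first)
    perturbed = list(x)
    curve = [model(perturbed)]
    for i in order:
        perturbed[i] = baseline
        curve.append(model(perturbed))
    return curve


def area_under_curve(curve):
    """Trapezoidal rule with unit spacing, normalised by the number of steps."""
    steps = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2.0 for k in range(steps)) / steps


def toy_model(x):
    return 3.0 * x[0] + 1.0 * x[1] + 0.5 * x[2]


x = [1.0, 1.0, 1.0]
relevance = [3.0, 1.0, 0.5]  # attribution under test (here the exact contributions)
morf = perturbation_curve(toy_model, x, relevance, most_relevant_first=True)
lerf = perturbation_curve(toy_model, x, relevance, most_relevant_first=False)
gap = area_under_curve(lerf) - area_under_curve(morf)
```

A faithful attribution makes the MoRF curve drop quickly and the LeRF curve drop slowly, so a larger gap between the two areas indicates better faithfulness under this protocol.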

Observed Failure Modes:

Implementation Invariance Violation: AttnLRP’s bilinear split violates the implementation-invariance axiom: networks that compute the same input–output function (differing only in how the computation graph associates operations) can yield different attributions. Analytical and empirical tests in linear attention models confirm that the relevance assignment can depend on how operations are grouped rather than solely on the function computed (You et al., 21 Oct 2025).
Softmax Linearization Error: The Taylor-based softmax rule can misallocate relevance, especially for flat or ambiguous attention logit distributions—the non-local nature of softmax limits first-order approximations (You et al., 21 Oct 2025).
Empirical Inconsistency with LOO: AttnLRP correlates poorly with Leave-One-Out (LOO) feature importance scores, with low Pearson $r$ observed between AttnLRP and LOO in both toy and standard NLP tasks. This divergence is especially significant in mid-to-late transformer layers (You et al., 21 Oct 2025).
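The implementation-invariance issue can be seen in a toy scalar analogue of linear attention, f = q * k * v, where the uniform rule is applied to two different groupings of the same product. This is an illustrative construction, not the experiment reported in the cited work; the function names are hypothetical and the softmax is omitted.

```python
def uniform_product_rule(R_out):
    """Uniform rule for a scalar product z = x * y (z != 0):
    each factor receives half of the incoming relevance."""
    return 0.5 * R_out, 0.5 * R_out


def attribute_left_grouping(q, k, v):
    """Attribution when the graph computes f = (q * k) * v."""
    f = q * k * v
    R_qk, R_v = uniform_product_rule(f)      # outer product: (q*k) and v
    R_q, R_k = uniform_product_rule(R_qk)    # inner product: q and k
    return {"q": R_q, "k": R_k, "v": R_v}


def attribute_right_grouping(q, k, v):
    """Attribution when the graph computes f = q * (k * v)."""
    f = q * k * v
    R_q, R_kv = uniform_product_rule(f)      # outer product: q and (k*v)
    R_k, R_v = uniform_product_rule(R_kv)    # inner product: k and v
    return {"q": R_q, "k": R_k, "v": R_v}


left = attribute_left_grouping(2.0, 3.0, 0.5)
right = attribute_right_grouping(2.0, 3.0, 0.5)
```

Both groupings compute the same function and both conserve the total relevance, yet `q` receives a quarter of it under the left grouping and half under the right grouping.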

5. Variant Methods and Diagnostic Baselines

CP-LRP (Conservative Propagation LRP) serves as a critical diagnostic: it treats $O = AV$ as linear in $V$ only, assigns no relevance to $A$, and bypasses the softmax altogether (You et al., 21 Oct 2025). Empirically, CP-LRP yields much higher agreement with LOO scores than AttnLRP in mid-to-late layers, with the correlation on SST-2 sentiment classification rising from AttnLRP to CP-LRP (smaller but consistent gains on IMDB).
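The CP-LRP treatment of the attention output can be sketched as a linear-layer rule in which the attention weights act as fixed coefficients. This is a minimal sketch under that reading of the description above; the function name and the sign-matched ε stabiliser are assumptions.

```python
def cp_lrp_attention_output(A, V, R_O, eps=1e-9):
    """Propagate relevance through O = A V with A held constant:
    all of R_O is redistributed onto V, and nothing flows to A (or Q, K)."""
    n, m, d = len(A), len(V), len(V[0])
    R_V = [[0.0] * d for _ in range(m)]
    for i in range(n):
        for p in range(d):
            o = sum(A[i][j] * V[j][p] for j in range(m))
            denom = o + (eps if o >= 0 else -eps)
            for j in range(m):
                # Contribution of V[j][p] to O[i][p], weighted by the fixed A[i][j].
                R_V[j][p] += A[i][j] * V[j][p] / denom * R_O[i][p]
    return R_V


A = [[0.25, 0.75], [0.5, 0.5]]   # attention weights (rows sum to 1)
V = [[1.0, 2.0], [3.0, -1.0]]
R_O = [[1.0, 0.5], [0.25, -0.5]]
R_V = cp_lrp_attention_output(A, V, R_O)
```

In contrast to the uniform rule, the full incoming relevance reaches V here, so no softmax backpropagation step is needed.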

Ablation studies demonstrate that errors concentrate in the softmax backpropagation step and are amplified in deeper layers; disabling softmax propagation selectively yields dramatic improvements in LOO alignment (You et al., 21 Oct 2025).
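The LOO agreement used in these comparisons can be sketched as a Pearson correlation between an attribution vector and leave-one-out scores. The function names, the zero baseline, and the toy model are assumptions for illustration; the cited study may ablate tokens differently.

```python
import math


def leave_one_out(model, x, baseline=0.0):
    """LOO importance of feature i: drop in output when x[i] is replaced by baseline."""
    full = model(x)
    scores = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline
        scores.append(full - model(ablated))
    return scores


def pearson_r(u, v):
    """Pearson correlation; undefined (division by zero) if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def toy_model(x):
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2]


x = [1.0, 2.0, 4.0]
loo = leave_one_out(toy_model, x)
good_attribution = [2.0, -2.0, 2.0]   # matches the per-feature contributions
bad_attribution = [-2.0, 2.0, -2.0]   # sign-flipped attribution
r_good = pearson_r(good_attribution, loo)
r_bad = pearson_r(bad_attribution, loo)
```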

Other decomposition-based explainers (ALTI-Logit) rely on two forward passes, do not attribute through MLPs, and treat attention as constant; compared to these, AttnLRP achieves higher granularity and computational efficiency (Arras et al., 21 Feb 2025).

6. Applications, Strengths, and Ongoing Directions

AttnLRP is broadly applicable for:

Model auditing: Detecting unintended reliance on background or spurious features, e.g., in scientific workflows and wildlife biometrics (Schall et al., 8 Dec 2025).
Concept-based explanations: Attribution of relevance to hidden units enables the discovery of "knowledge neurons" and context-dependent interpretation (Achtibat et al., 2024).
Efficient interpretability: Provides holistic, faithful attributions for large models without the computational burden of perturbation methods.

Strengths:

Single-pass, neuron-level, and layer-level decomposition (including latent units).
Explicit, local propagation schemes for all common Transformer modules.
Open-source and modular implementation for both PyTorch and HuggingFace ecosystems.

Limitations and Future Work:

The lack of implementation invariance and the limited accuracy of the softmax linearization constrain the method's use for absolute faithfulness assessments; suggested remedies include block-wise canonization of attention and higher-order Taylor expansions (You et al., 21 Oct 2025).
Faithfulness remains task- and instance-dependent, especially under distributional shifts and ambiguous attention alignments.
Extending to circuit discovery, active learning loops, and integrating relevance-based regularizers into training are active areas of exploration (Arras et al., 21 Feb 2025, Schall et al., 8 Dec 2025).

Method	Passes	Attributes Q/K	Attributes MLP	Invariant	Faithfulness to LOO	Efficiency
AttnLRP	1 fwd+1 bwd	Yes	Yes	No	Varies (low r)	$\sim$ backward pass
CP-LRP	1 fwd+1 bwd	No	Yes	Yes	High (mid-late layers)	$\sim$ backward pass
ALTI-Logit	2 fwd	No	No	Yes	Good (GPT-2)	$\sim$ forward
Rollout	1 fwd	No	No	Yes	Low/negative	Very fast
I×G, IG	1 fwd+1 bwd	No	No	Yes	Noisy/weak	Standard backprop

Data from (Achtibat et al., 2024, Arras et al., 21 Feb 2025, You et al., 21 Oct 2025).

AttnLRP provides a unique balance of computational tractability, fine-grained attributions, and explicit attention logic. Ongoing research focuses on mitigating its fundamental sources of error and integrating attribution more tightly with training and circuit-level model analysis.