Layer-Wise Relevance Propagation Explained

Updated 7 November 2025
  • Layer-Wise Relevance Propagation (LRP) is a method that explains neural network predictions by assigning both positive and negative relevance scores to input features.
  • It propagates output relevance backward through layers, enforcing a conservation property and using stabilizers to maintain numerical stability.
  • Empirical studies in NLP, such as word-deletion experiments and PCA clustering, show LRP’s superior interpretability compared to gradient-based approaches.

Layer-Wise Relevance Propagation (LRP) is a principled framework for explaining predictions of complex neural network classifiers by attributing the model output backward through the architecture to generate additive input-level relevance scores. Designed to provide diagnostic insight into black-box models, LRP rigorously decomposes a prediction so that each input feature receives a signed relevance value indicating its quantitative contribution—positive or negative—to the decision.

1. Core Principles and Mathematical Foundations

LRP operates by propagating the classifier's output f(x) backward through the network, distributing relevance R to each lower-layer neuron such that the total relevance at every layer remains equal to the output (the conservation property). For input x, the decomposition is governed by

f(x) = \sum_d R_d

where R_d denotes the relevance attributed to input dimension d.

In a general feedforward network, consider a neuron x_j = g(z_j) in layer l+1, where

z_j = \sum_i z_{ij}, \quad z_{ij} = x_i w_{ij} + \frac{b_j}{n}

with input neurons x_i, weights w_{ij}, bias b_j, activation function g, and n the number of inputs feeding neuron j. LRP propagates relevance R_j from a higher-layer neuron x_j to its inputs x_i via

R_{i \leftarrow j} = \frac{z_{ij} + \frac{s(z_j)}{n}}{\sum_{i'} z_{i'j} + s(z_j)} R_j

where the stabilizer s(z_j) = \epsilon \cdot (1_{z_j \geq 0} - 1_{z_j < 0}) with \epsilon = 0.01 prevents division by near-zero denominators. The accumulated relevance of each lower-layer neuron is R_i = \sum_j R_{i \leftarrow j}.
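This rule is straightforward to implement for a single dense layer. The following NumPy sketch (function and variable names are ours, not from the paper) illustrates the epsilon-stabilized redistribution, including the bias share b_j/n; note that the fractions over each column sum to one, so relevance is conserved exactly:

```python
import numpy as np

def lrp_dense(x, W, b, R_out, eps=0.01):
    """Epsilon-LRP for one dense layer: redistribute R_out over the inputs x.

    x: (n,) inputs, W: (n, m) weights, b: (m,) biases, R_out: (m,) relevances.
    Returns R_in: (n,) relevances for the lower layer (sum(R_in) == sum(R_out)).
    """
    n = x.shape[0]
    z_ij = x[:, None] * W + b[None, :] / n      # contributions z_ij = x_i w_ij + b_j / n
    z_j = z_ij.sum(axis=0)                      # pre-activations z_j
    s = eps * np.where(z_j >= 0, 1.0, -1.0)     # sign-matched stabilizer s(z_j)
    # R_{i <- j} = (z_ij + s/n) / (z_j + s) * R_j, then sum over j
    R_in = ((z_ij + s[None, :] / n) / (z_j + s)[None, :] * R_out[None, :]).sum(axis=1)
    return R_in
```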

In deep architectures using embedding layers or max-pooling, the relevance for an embedding-based input (e.g., a word in NLP) is summed across the D embedding dimensions:

R(w_t) = \sum_{i=1}^{D} R_{i,t}

Max-pooling layers use a winner-take-all redistribution: all relevance is assigned to the single activation that attained the pool's maximum.
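Both conventions are simple to express in code. A minimal sketch (names are ours) of winner-take-all redistribution through 1-max-pooling and of collapsing per-dimension relevances into per-word scores:

```python
import numpy as np

def lrp_max_pool(z, R_out):
    """Winner-take-all LRP through 1-max-pooling over the time axis.

    z: (T, F) feature map, R_out: (F,) relevance of each pooled feature.
    Returns R_in: (T, F) with all relevance placed on the argmax positions.
    """
    R_in = np.zeros_like(z)
    winners = z.argmax(axis=0)                  # time step that won each pool
    R_in[winners, np.arange(z.shape[1])] = R_out
    return R_in

def word_relevance(R_embed):
    """Collapse per-dimension relevances (T, D) into per-word scores (T,)."""
    return R_embed.sum(axis=1)
```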

2. LRP in Practice: Natural Language Processing

The primary domain explored is topic categorization using convolutional neural networks over the 20Newsgroups dataset. Documents are encoded as sequences of 300-dimensional pretrained word2vec embeddings (first 400 tokens), fed to a CNN structured as Conv → ReLU → 1-Max-Pool → Fully Connected. The network obtains a test set accuracy of 80.19%. LRP is conducted post-training to analyze per-word and per-dimension relevance, with heatmaps produced for both granular and word-level inspection.
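The described pipeline corresponds roughly to the PyTorch sketch below; hyperparameters such as the filter count and kernel size are illustrative assumptions, not values confirmed by the paper:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Conv -> ReLU -> 1-Max-Pool -> Fully Connected over word2vec inputs."""
    def __init__(self, emb_dim=300, n_filters=800, kernel=2, n_classes=20):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=kernel)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, x):                              # x: (batch, 400, 300) embeddings
        h = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, T')
        h = h.max(dim=2).values                        # 1-max-pool over time
        return self.fc(h)                              # class scores f(x)
```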

3. Comparative Approach: LRP Versus Sensitivity Analysis

LRP is benchmarked against sensitivity analysis (SA), a standard gradient-based method that scores each input dimension by its squared partial derivative

R_d = \left( \frac{\partial f}{\partial x_d} \right)^2

while for words in embedding space the per-word score is the squared gradient norm:

R_{\text{SA}}(w_t) = \| \nabla_{w_t} f(d) \|_2^2

SA measures local, positive-only influence based on the magnitude of change in output under infinitesimal input variations; it does not distinguish between evidence supporting or contradicting the class decision.
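For reference, per-word SA scores take only a few lines of autograd code. This sketch assumes the hypothetical TextCNN above and is not taken from the paper:

```python
def sa_word_relevance(model, x, target):
    """Squared gradient norm per word: R_SA(w_t) = ||grad_{w_t} f(d)||^2."""
    x = x.clone().requires_grad_(True)   # x: (1, 400, 300) document embeddings
    score = model(x)[0, target]          # prediction score f(x) for the target class
    score.backward()
    return (x.grad ** 2).sum(dim=2)[0]   # (400,) one score per word, always >= 0
```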

In contrast, LRP provides signed relevances, attributing both supporting and inhibiting contributions, which is demonstrated to be critical for nuanced interpretability.

4. Empirical Validation and Evaluation Protocols

A. Perturbation-Based Deletion Experiments

A robust evaluation uses a word-deletion protocol: document words are removed in order of decreasing or increasing relevance (per LRP or SA) by zeroing their embedding vectors (a sketch of the loop follows the list below):

  • For correct predictions: Deleting highest-relevance words first (by LRP) results in a steeper drop in classification accuracy than SA or random deletion, indicating LRP's superior identification of crucial evidence.
  • For incorrect predictions: Deleting least relevant words (LRP) can increase accuracy, as inhibiting words are suppressed. SA does not produce a comparable gain, highlighting its limitation in distinguishing negative evidence.
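A minimal sketch of the deletion loop, assuming per-document relevance scores have already been computed; `predict` is an assumed helper returning the model's predicted class:

```python
import numpy as np

def deletion_curve(model, docs, labels, scores, n_steps=20, descending=True):
    """Track accuracy as the most (or least) relevant words are zeroed out."""
    accs = []
    for k in range(n_steps):
        correct = 0
        for x, y, r in zip(docs, labels, scores):   # x: (400, 300), r: (400,)
            order = np.argsort(-r if descending else r)
            x_del = x.copy()
            x_del[order[:k]] = 0.0                  # zero the top-k words' embeddings
            correct += int(predict(model, x_del) == y)  # predict(): assumed helper
        accs.append(correct / len(docs))
    return accs  # a steep early drop (descending=True) indicates faithful relevances
```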

B. PCA Analysis of Document Representation

Document representations are recalculated by relevance-weighted linear combinations of word embeddings (per LRP and SA); PCA reveals that LRP-based reweighting produces well-separated clusters corresponding to document categories, exhibiting a better alignment with decision-relevant features learned by the CNN.
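The reweighting itself is a one-liner; a sketch of the document-vector construction and 2-D projection, using scikit-learn's PCA for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

def relevance_weighted_doc(embeddings, relevances):
    """Document vector as a relevance-weighted sum of its word embeddings.

    embeddings: (T, D) word vectors, relevances: (T,) per-word scores.
    """
    return (relevances[:, None] * embeddings).sum(axis=0)   # (D,)

# docs: list of (T, D) arrays; word_scores: matching list of (T,) relevances
# doc_vecs = np.stack([relevance_weighted_doc(e, r) for e, r in zip(docs, word_scores)])
# projected = PCA(n_components=2).fit_transform(doc_vecs)   # 2-D view of class clusters
```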

C. Visualization of Relevance Heatmaps

LRP and SA visualizations color-code input text by relevance. LRP distinctly highlights both positive and negative evidence—clarifying which words support or undermine the assigned class. For example, a term like “ride” contributes negative relevance for non-motorcycle classes but positive for the motorcycle class. SA and related gradient methods attribute mostly positive scores and cannot disambiguate support versus contradiction.
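Such heatmaps are easy to render, for instance as HTML with red for positive and blue for negative relevance; a minimal sketch in which the styling choices are ours:

```python
def heatmap_html(words, relevances):
    """Color each word red (positive) or blue (negative), scaled by |relevance|."""
    m = max(abs(r) for r in relevances) or 1.0
    spans = []
    for w, r in zip(words, relevances):
        alpha = abs(r) / m                           # opacity proportional to magnitude
        color = (255, 0, 0) if r >= 0 else (0, 0, 255)
        spans.append(f'<span style="background: rgba{(*color, round(alpha, 2))}">{w}</span>')
    return " ".join(spans)
```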

5. Theoretical and Practical Implications

LRP delivers a function-value-conserving signed decomposition, yielding a relevance distribution across input features that is both targeted and contextually meaningful. In NLP tasks with deep CNNs:

  • LRP heatmaps are actionable and semantically interpretable at both embedding-dimension and word level.
  • LRP enables direct comparison between input features in terms of their decisive versus inhibitive roles for the predicted outcome.
  • LRP quantitatively surpasses gradient-based methods in deletion experiments, and qualitatively in visualizations and semantic alignment.

A plausible implication is that LRP’s ability to generate negative relevance is particularly valuable in applications where understanding contradiction or model uncertainty is crucial, such as adversarial text detection or model debugging.

6. Summary of Key Results and Broader Significance

Empirical evidence validates LRP as a superior technique for explaining CNN-based NLP models relative to SA:

  • Word deletion: LRP-selected deletions yield rapid performance deterioration (correct samples) or recovery (incorrect samples).
  • PCA clustering: LRP-weighted document embeddings allow clearer class separation.
  • Heatmap interpretability: LRP captures context and negation more richly than SA, reflecting model usage of both presence and absence of evidence.

These findings position LRP as an effective and reliable method for interpreting neural classifiers in NLP, offering actionable explanations that go beyond the capability of prevalent gradient-based attribution techniques. The paper’s methodology and validations align with results from LRP’s prior applications in image classification, demonstrating its robust generalization to text. This establishes a foundation for principled interpretability in deep NLP systems and suggests broader relevance for domains requiring function-value-conserving, fine-grained, and signed input attributions.
