Relevance Patching (RelP): Efficient Neural Attribution
- Relevance Patching (RelP) is a method that uses LRP-derived propagation coefficients to efficiently attribute neural behaviors to specific internal components.
- RelP overcomes the limitations of activation and attribution patching, achieving high correlation with activation-patching ground truth at minimal computational cost (two forward passes, one backward pass).
- RelP aids in circuit discovery and model debugging by providing sparse, interpretable causal maps, enabling targeted pruning and safer model deployment.
Relevance Patching (RelP) is a principled, efficient methodology for mechanistically attributing behaviors within deep neural networks to specific internal components or "circuits." Motivated by the limitations of existing patching approaches (activation patching is compute-intensive, while attribution patching suffers significant faithfulness loss in deep or highly non-linear architectures), RelP leverages propagation coefficients derived from Layer-wise Relevance Propagation (LRP) to yield efficient, reliable causal maps of model circuitry. RelP delivers dramatically higher correlation with the activation-patching "ground truth," achieves sparse, interpretable circuit identification, and is directly applicable to large models where intervention-heavy methods are prohibitively expensive.
1. Conceptual Foundations and Motivation
RelP is designed for mechanistic interpretability, specifically to address the drawbacks of activation and attribution patching. Activation patching, which measures the causal impact of replacing a component's activations with those from a patched input, is accurate but requires extensive intervention sweeps, making it computationally infeasible in large or deep models. Attribution patching approximates it using local gradients (the derivative of the output with respect to the component activation), but in highly non-linear networks, especially those employing LayerNorm, residual connections, or large MLPs, these gradients are too noisy and yield low faithfulness: for MLP outputs in GPT-2 Large, attribution patching reaches a Pearson correlation of only 0.006, compared to 0.956 for RelP (Jafari et al., 28 Aug 2025). RelP's innovation is to replace these noisy gradients with theoretically principled propagation coefficients obtained via Layer-wise Relevance Propagation, ensuring conservation and an improved signal-to-noise ratio.
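To make the gradient-based estimate concrete, here is a minimal toy sketch of the first-order score that attribution patching computes; all tensors, shapes, and the metric are invented stand-ins, not the paper's setup:

```python
import torch

# Toy illustration: attribution patching approximates the effect of patching
# a component activation a by the first-order term (a_patch - a_clean) * dL/da,
# with the gradient taken on the clean run.
a_clean = torch.randn(8, requires_grad=True)    # component activation, clean run
a_patch = torch.randn(8)                        # same activation, patched input
metric = (a_clean ** 2).sum()                   # stand-in scalar behavior metric
(grad,) = torch.autograd.grad(metric, a_clean)  # local gradient dL/da
score = ((a_patch - a_clean) * grad).sum()      # linear estimate of patch effect
print(f"attribution-patching estimate: {score.item():.4f}")
```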
2. Mechanism: LRP-based Relevance Patching
RelP operates in three primary steps, mirroring the efficiency of attribution patching but fundamentally altering the scoring rule. Given a model $f$, inputs $x_{\text{clean}}$ and $x_{\text{patch}}$, and a component $c$ with activation $a_c$ whose contribution to a scalar behavior metric $m$ is assessed:

1. Run a forward pass on $x_{\text{clean}}$ and cache $a_c(x_{\text{clean}})$.
2. Run a forward pass on $x_{\text{patch}}$ and cache $a_c(x_{\text{patch}})$.
3. Perform a single LRP backward pass on the clean run and score the component as

$$\mathrm{RelP}(c) = \big(a_c(x_{\text{patch}}) - a_c(x_{\text{clean}})\big)\, g_c$$

Here, $g_c$ is the LRP-derived propagation coefficient for $c$, which takes the place of the raw gradient $\partial m / \partial a_c$ used by attribution patching. Layer-wise, LRP decomposes output activations via a first-order Taylor expansion,

$$a_j^{(l+1)} \;\approx\; \sum_i \frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}}\, a_i^{(l)} \;=\; \sum_i z_{ij},$$

where $z_{ij}$ is the contribution of input $i$ to output $j$. Relevance for input component $i$ of layer $l$ is then:

$$R_i^{(l)} = \sum_j \frac{z_{ij}}{\sum_{i'} z_{i'j}}\, R_j^{(l+1)}$$
LRP’s local propagation rules can be customized for relevance conservation, zero constraints, or other interpretability desiderata. These coefficients allow RelP to sidestep gradient pathologies, yielding faithful component-level attribution in both linear and non-linear regions.
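As a concrete example of such a rule, the following is a minimal sketch of one LRP-epsilon propagation step through a linear layer; the helper name and the value of `EPS` are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

EPS = 1e-6  # stabilizer; an assumed value for the LRP-epsilon rule

def lrp_epsilon_linear(layer: nn.Linear, a_in: torch.Tensor,
                       rel_out: torch.Tensor) -> torch.Tensor:
    """One LRP-epsilon propagation step through a linear layer.

    Redistributes the relevance of the layer's outputs, rel_out, onto its
    inputs in proportion to each contribution z_ij = a_i * w_ji, so that
    total relevance is (approximately) conserved across the layer.
    """
    z = layer(a_in)                        # pre-activation outputs z_j
    s = rel_out / (z + EPS * z.sign())     # relevance per unit of output
    c = s @ layer.weight                   # fold back through the weights
    return a_in * c                        # input relevances R_i

# Usage on random data: relevance mass is approximately conserved.
layer = nn.Linear(16, 4)
a = torch.randn(2, 16)
r_out = torch.randn(2, 4)
r_in = lrp_epsilon_linear(layer, a, r_out)
```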
3. Computational Efficiency and Practical Implementation
RelP requires only two forward passes (on the original and patched inputs) and a single backward pass for relevance propagation, matching the computational profile of attribution patching. This efficiency, in contrast to the repeated intervention sweeps required by activation patching and the integration steps demanded by methods such as Integrated Gradients (IG), makes RelP tractable even in high-dimensional settings and for large model families. Implementations can reuse frameworks supporting LRP and are compatible with standard auto-differentiation and vectorization approaches.
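The overall pass structure can be sketched as follows; `run_with_cache` and `lrp_backward` are hypothetical placeholders standing in for whatever LRP-enabled framework is used:

```python
def relp_attributions(run_with_cache, lrp_backward, x_clean, x_patch):
    """Sketch of RelP's cost profile (helpers are hypothetical placeholders).

    run_with_cache(x) -> {component: activation} does one forward pass;
    lrp_backward(cache) -> {component: coefficient} reuses the clean
    forward pass and performs the single LRP backward pass.
    """
    acts_clean = run_with_cache(x_clean)    # forward pass 1
    acts_patch = run_with_cache(x_patch)    # forward pass 2
    coeffs = lrp_backward(acts_clean)       # single LRP backward pass
    return {name: ((acts_patch[name] - acts_clean[name]) * coeffs[name]).sum()
            for name in coeffs}
```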
4. Faithfulness and Quantitative Validation
Experimental validation on the Indirect Object Identification (IOI) task demonstrates consistently high faithfulness. In GPT-2 Large, RelP attains a Pearson correlation of 0.956 for MLP outputs, and similarly high values for other challenging components, while attribution patching's correlation is near zero (Jafari et al., 28 Aug 2025). RelP additionally matches the faithfulness of Integrated Gradients in sparse feature circuit discovery at considerably lower computational expense. Qualitative results confirm reliability across non-linear modules such as the residual stream and MLPs, where attribution patching suffers from non-linearity and the activation magnitude variance introduced by LayerNorm.
| Method | MLP Output Correlation (GPT-2 Large) | Computational Cost |
| --- | --- | --- |
| Activation patching | Baseline (ground truth) | Dozens to hundreds of intervention sweeps |
| Attribution patching | 0.006 | 2× forward, 1× backward |
| RelP | 0.956 | 2× forward, 1× backward |
| Integrated Gradients (IG) | Comparable to RelP | Multiple integration steps |
Faithfulness here refers to how closely each method’s component-level impact scores match those from “ground truth” activation patching.
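A minimal sketch of how such a faithfulness value can be computed over per-component scores (the function name is an assumption):

```python
import torch

def faithfulness(approx_scores: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """Pearson correlation between an approximate method's per-component
    scores (e.g. RelP or attribution patching) and activation-patching
    ground truth, computed over components of one type (e.g. MLP outputs)."""
    return torch.corrcoef(torch.stack([approx_scores, ground_truth]))[0, 1].item()
```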
5. Applications to Circuit Discovery and Model Debugging
RelP is applicable wherever internal causal circuit discovery is required, particularly for mechanistic investigation of large-scale LLMs. Examples include the following (a minimal component-selection sketch appears after the list):
- Identifying and visualizing sparse feature circuits governing behaviors such as subject–verb agreement, IOI, or other semantic dependencies.
- Localizing critical heads, MLPs, or neurons for tasks, facilitating targeted pruning, interpretation, or fine-grained editing.
- Debugging models by mapping causal flow and pinpointing “faulty” or spurious components responsible for undesirable outputs.
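As referenced above, here is a minimal sketch of turning per-component RelP scores into a sparse circuit candidate; the helper name and the cutoff `k` are illustrative:

```python
def select_circuit(scores: dict, k: int = 20) -> list:
    """Keep the k components with the largest-magnitude RelP scores.

    scores maps component names (heads, MLP layers, neurons) to scalar
    attributions; the returned list is a candidate sparse circuit that
    should still be validated causally (e.g. by activation patching).
    """
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)[:k]
```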
RelP’s reliability in deep non-linear transformer modules suggests direct utility in research settings scaling to models like Pythia, Qwen2, or Gemma2 (Jafari et al., 28 Aug 2025).
6. Implications for Interpretability Research
RelP’s integration of LRP principles into the patching paradigm represents an advance in balancing efficiency and faithfulness. Models analyzed via RelP yield interpretable causal maps at component resolution, informing both scientific understanding and downstream engineering (e.g., safe model deployment, robust policy behavior, or regulatory compliance). A plausible implication is that RelP will supplant attribution patching in settings prioritizing high correlation with ground truth, especially as architectures grow deeper and more non-linear.
7. Limitations and Future Directions
While RelP has been tested on LLMs and tasks such as IOI, broader validation across other modalities, network architectures, and adversarial settings remains an open avenue. LRP propagation rules may need further customization for particular circuit topologies or regularization objectives. This suggests future work could refine propagation rules, extend RelP to multimodal models, and automate the identification of optimal patching locations in large architectures.
RelP is thus a methodologically rigorous, computationally tractable, and experimentally validated approach for circuit discovery and analysis in neural networks, directly advancing mechanistic interpretability of complex AI systems.