Layerwise & Tokenwise Attribution
- Layerwise and tokenwise attribution are analytical methods that decompose neural predictions by distributing credit among intermediate layers and input tokens.
- These techniques draw on decomposition, gradient, integrated-gradient, and norm-based strategies; the decomposition-based variants additionally enforce that attributions add up to the final prediction.
- They enhance model interpretability, aid in efficient pruning, and support robust evaluation across language, vision, and federated learning applications.
Layerwise and tokenwise attribution are formal analytical methodologies for decomposing model predictions in deep neural architectures, especially Transformers, into contributions from each layer and each input token. These approaches address the “credit assignment” problem: how do specific tokens and representations at various depths drive network outputs? Recent advances have refined these methods for both interpretability and computational tractability, providing granular insights into neural model behavior across language, vision, and federated learning contexts.
1. Mathematical Foundations and Attribution Definitions
Layerwise attribution refers to distributing the model’s output (e.g., predicted logit) across the intermediate layers such that each layer’s operations (attention, MLP, normalization, etc.) receive an interpretable “credit” for their role in generating the final prediction. Tokenwise attribution aims to resolve how much each input token (or hidden token at any layer) contributes to the prediction, often through scalar or vector scores.
Core classes of attribution methods include:
- Decomposition-based approaches: These redistribute the model’s output (typically a logit) through the full network by conservation rules, ensuring additivity. Notable representatives are LRP, AttnLRP, and ALTI-Logit, all of which precisely track relevance from output back to either inputs or internal components (Arras et al., 21 Feb 2025).
- Gradient-based methods: Saliency maps and gradient×input perform a single backward pass to compute local sensitivities, often aggregated per token (Xiong et al., 2018).
- Integrated gradients and path methods: These integrate the gradients along a counterfactual path (e.g., from a baseline to the actual attention matrix) for more faithful globally-consistent attribution (Hao et al., 2020).
- Norm- and activation-based aggregations: Rollout and norm-based methods aggregate single-layer influence over the model depth, adjusted for architectural specifics (residuals, LayerNorm, FFN pathways) (Modarressi et al., 2022).
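Explicit conservation is a central principle: for decomposition approaches, the sum of layerwise or tokenwise attributions must match the model output or contrastive logit. For example: $$f(x) \approx \sum_{i} R_{i} \quad\text{and}\quad f(x) \approx \sum_{l} R^{\text{MLP}}_{l} + (\text{relevance of the remaining components}),$$ where $f(x)$ is the explained output, $R^{\text{MLP}}_{l}$ is the MLP relevance for layer $l$, and $R_{i}$ the token relevance for input token $i$ (Arras et al., 21 Feb 2025).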
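The conservation property can be checked by hand on a deliberately simple case. The sketch below assumes a bias-free linear readout over three toy token embeddings (all numbers are illustrative, not taken from the cited papers); for such a model, gradient×input scores sum exactly to the logit, which is the behavior decomposition methods aim to preserve in deep networks.

```python
# Toy setup (illustrative assumptions): three tokens with 2-d embeddings and
# a bias-free linear readout f(x) = sum_t w . x_t.
embeddings = [
    [0.5, -1.0],   # token 0
    [2.0, 0.3],    # token 1
    [-0.4, 0.8],   # token 2
]
readout = [1.5, -0.5]


def logit(tokens, w):
    """Bias-free linear logit: sum over tokens of the dot product w . x_t."""
    return sum(sum(wd * xd for wd, xd in zip(w, x)) for x in tokens)


def gradient_x_input(tokens, w):
    """Per-token gradient x input scores.

    For this linear model the gradient of the logit with respect to every
    token embedding is w, so the score of token t is sum_d w_d * x_{t,d}.
    """
    return [sum(wd * xd for wd, xd in zip(w, x)) for x in tokens]


token_scores = gradient_x_input(embeddings, readout)
output = logit(embeddings, readout)

# Conservation check: the token scores add up to the logit. This is exact
# only because the toy model is linear and has no bias term.
conservation_gap = abs(sum(token_scores) - output)
print(token_scores, output, conservation_gap)
```

In nonlinear models with biases, plain gradient×input generally loses this exact additivity, which is one motivation for the dedicated propagation rules of decomposition-based methods.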
2. Prominent Methodologies: Algorithms and Formalism
a) Self-Attention Attribution (Integrated Gradient Across Layers)
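“AttAttr” computes an attribution for every attention score in each head and layer by integrating the gradient from a “null” (all-zero) attention matrix up to the actual attention in use: $$\mathrm{Attr}_h(A) = A_h \odot \int_{0}^{1} \frac{\partial F(\alpha A)}{\partial A_h}\, d\alpha,$$ where $A_h$ is the attention matrix of head $h$, $F$ is the model output, and $\odot$ denotes element-wise multiplication. Aggregating these attributions over heads and layers yields both layerwise and tokenwise scores. These attributions are directly usable for head-pruning and for constructing information flow trees revealing hierarchical model processing (Hao et al., 2020).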
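As a rough illustration of the integration step, the following sketch approximates the path integral with a Riemann sum on a toy scalar "model" of a 2×2 attention matrix. The function `model_output`, its analytic gradient, and the matrices `A` and `C` are assumptions made for this example; a real implementation would obtain gradients from the Transformer via automatic differentiation.

```python
import math

# Hypothetical 2x2 attention matrix and mixing coefficients (illustrative only).
A = [[0.7, 0.3],
     [0.4, 0.6]]
C = [[1.0, -0.5],
     [0.2, 0.8]]


def model_output(attn):
    """Toy nonlinear 'model': F(A) = tanh(sum_ij C_ij * A_ij), so F(0) = 0."""
    return math.tanh(sum(c * a for c_row, a_row in zip(C, attn)
                         for c, a in zip(c_row, a_row)))


def grad_wrt_attention(attn):
    """Analytic gradient dF/dA_ij = (1 - F(A)^2) * C_ij for the toy model."""
    scale = 1.0 - model_output(attn) ** 2
    return [[scale * c for c in c_row] for c_row in C]


def attention_attribution(attn, steps=500):
    """Riemann-sum approximation of A * integral_0^1 dF(alpha * A)/dA d(alpha)."""
    n_rows, n_cols = len(attn), len(attn[0])
    grad_sum = [[0.0] * n_cols for _ in range(n_rows)]
    for k in range(1, steps + 1):
        alpha = k / steps
        scaled = [[alpha * a for a in row] for row in attn]
        grad = grad_wrt_attention(scaled)
        for i in range(n_rows):
            for j in range(n_cols):
                grad_sum[i][j] += grad[i][j] / steps
    # Element-wise product with the actual attention scores.
    return [[attn[i][j] * grad_sum[i][j] for j in range(n_cols)]
            for i in range(n_rows)]


attr = attention_attribution(A)
total_attr = sum(sum(row) for row in attr)
# Completeness: attributions approximately sum to F(A) - F(0) = F(A).
completeness_gap = abs(total_attr - model_output(A))
print(attr, completeness_gap)
```

The gap shrinks as `steps` grows, which is the usual trade-off between approximation quality and the number of gradient evaluations.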
b) Decomposition-based Attribution: LRP, AttnLRP, ALTI-Logit
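- LRP leverages layer-wise conservation, propagating backward through every operation (linear layers, activations, LayerNorm, softmax) with rigorously defined redistribution rules (e.g., ε-rule for stabilization): $$R_i = \sum_j \frac{a_i w_{ij}}{z_j + \epsilon\,\operatorname{sign}(z_j)}\, R_j, \qquad z_j = \sum_{i'} a_{i'} w_{i'j},$$ where $a_i$ are the layer inputs, $w_{ij}$ the weights, and $R_j$ the relevance arriving at output $j$.
- AttnLRP refines LRP for Transformers by giving softmax and matrix-product layers dedicated rules (including a gradient×input mapping at softmax): $$R_i^{\text{in}} = x_i \left(R_i^{\text{out}} - s_i \sum_j R_j^{\text{out}}\right), \qquad s = \operatorname{softmax}(x).$$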
- ALTI-Logit decomposes the prediction logit into explicit additive contributions from each residual block and attention head using linearization of attention weights and LayerNorm variance (Arras et al., 21 Feb 2025).
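To make the ε-rule concrete, here is a minimal sketch for a single bias-free linear layer, written in plain Python. The function name `lrp_epsilon_linear` and the numbers are hypothetical; full LRP implementations apply analogous rules to every operation in the network.

```python
def lrp_epsilon_linear(activations, weights, relevance_out, eps=1e-6):
    """Redistribute relevance through a bias-free linear layer.

    Forward pass: z_j = sum_i a_i * w[i][j].
    Epsilon rule: R_i = sum_j a_i * w[i][j] / (z_j + eps * sign(z_j)) * R_j.
    """
    n_in, n_out = len(weights), len(weights[0])
    z = [sum(activations[i] * weights[i][j] for i in range(n_in))
         for j in range(n_out)]
    # Stabilized denominators; sign(0) is treated as +1 to avoid dividing by zero.
    denom = [zj + eps * (1.0 if zj >= 0 else -1.0) for zj in z]
    return [sum(activations[i] * weights[i][j] / denom[j] * relevance_out[j]
                for j in range(n_out))
            for i in range(n_in)]


# Illustrative inputs: 3 input units, 2 output units.
a = [1.0, 2.0, -1.0]
W = [[0.5, -1.0],
     [0.75, 0.5],
     [1.0, 0.2]]
R_out = [0.8, 0.2]

R_in = lrp_epsilon_linear(a, W, R_out)
# With a small eps and no bias, relevance is (almost exactly) conserved.
print(R_in, sum(R_in), sum(R_out))
```

A larger `eps` absorbs more relevance when `z_j` is close to zero, trading exact conservation for numerical stability.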
c) Rollout and Norm-based Global Attribution
GlobEnc aggregates tokenwise attributions from every encoder layer, accounting for all major architectural pathways (attention, both residuals, both LayerNorms). It propagates token-to-token contribution matrices via a rollout operator: $$\tilde{N}_{\ell} = N_{\ell}\,\tilde{N}_{\ell-1}, \qquad \tilde{N}_{1} = N_{1},$$ where $N_{\ell}$ is the token-to-token contribution matrix of layer $\ell$. Iterating this through all $L$ layers yields global per-token scores that correlate strongly with saliency, outperforming attention-only baselines (Modarressi et al., 2022).
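The rollout step itself is a chain of matrix products. The sketch below assumes two hypothetical row-normalized contribution matrices for a three-token input, with rows indexing output positions and columns indexing input positions; in GlobEnc these matrices would come from the norm-based analysis of each encoder layer, which is not reproduced here.

```python
def matmul(left, right):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(left[i][k] * right[k][j] for k in range(len(right)))
             for j in range(len(right[0]))]
            for i in range(len(left))]


def rollout(layer_matrices):
    """Compose per-layer matrices from the first layer upward: N~_l = N_l @ N~_{l-1}."""
    result = layer_matrices[0]
    for layer in layer_matrices[1:]:
        result = matmul(layer, result)
    return result


# Hypothetical per-layer contribution matrices (each row sums to 1).
N1 = [[0.6, 0.3, 0.1],
      [0.2, 0.7, 0.1],
      [0.1, 0.2, 0.7]]
N2 = [[0.5, 0.4, 0.1],
      [0.3, 0.6, 0.1],
      [0.2, 0.2, 0.6]]

global_matrix = rollout([N1, N2])
# Row 0 plays the role of the [CLS] position: its entries are the global
# contribution of each input token to the classification representation.
cls_scores = global_matrix[0]
print(cls_scores)
```

Because each layer matrix is row-normalized, every row of the rolled-out matrix still sums to one, so the scores can be read as shares of the final representation.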
d) Target-Propagation and Efficient Dual Paths
Dual Path Attribution (DPA) leverages the analytical linearity of SwiGLU Transformer blocks to propagate the output’s unembedding vector downward through all component pathways (MHSA, GLU, LayerNorm) in a single backward pass. The critical innovation is that only one forward and one backward pass are required, in contrast with the many runs needed for activation patching or integrated gradients, so layerwise and tokenwise attributions are obtained with a constant, $O(1)$, number of passes with respect to model size (Jantsch et al., 20 Mar 2026).
3. Empirical Protocols and Evaluation Metrics
Quantitative evaluation protocols and metrics are essential for validating faithfulness and specificity of attributions. Standard approaches include:
- Feature-removal perturbation: Remove top- or bottom-scoring tokens, embedding features, or filters; re-run prediction and measure the drop in confidence or accuracy. LRP consistently yields the sharpest drop when its top-ranked features are removed, surpassing gradient saliency (Xiong et al., 2018).
- Pointing Game (PG₂), MRR, RMA, Per-Token Accuracy (PTA): Benchmarks for attribution alignment with ground-truth linguistic evidence, such as subject tokens driving subject–verb agreement in Transformers (Arras et al., 21 Feb 2025).
- Head Pruning/Layer Pruning: Sort heads/layers by attribution, prune lowest-scoring, and quantify degradation in performance. Attribution-based criteria are more predictive of impact than mean attention or Taylor residuals (Hao et al., 2020).
- Correlation with saliency baselines: GlobEnc’s per-token attributions demonstrate much higher rank correlation with gradient×input saliency than attention or naive norm rollups (Modarressi et al., 2022).
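The feature-removal protocol from the list above can be sketched on a toy classifier. Everything here is an assumption for illustration: the bag-of-tokens scorer, its weights, and the helper `confidence_drop`; real evaluations re-run the actual model on the perturbed input.

```python
import math

# Hypothetical bag-of-tokens sentiment scorer (weights are illustrative).
WEIGHTS = {"great": 2.0, "movie": 0.1, "but": -0.2, "slow": -1.0, "the": 0.0}


def confidence(tokens):
    """Sigmoid of the summed token weights: the 'positive' class probability."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def confidence_drop(tokens, attributions, k, remove_top=True):
    """Delete the k highest- (or lowest-) attributed tokens and report the drop."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i],
                   reverse=remove_top)
    removed = set(order[:k])
    kept = [tok for i, tok in enumerate(tokens) if i not in removed]
    return confidence(tokens) - confidence(kept)


sentence = ["the", "movie", "great", "but", "slow"]
# For this linear scorer the weights themselves serve as signed attributions.
scores = [WEIGHTS[tok] for tok in sentence]

drop_top = confidence_drop(sentence, scores, k=1, remove_top=True)
drop_bottom = confidence_drop(sentence, scores, k=1, remove_top=False)
print(drop_top, drop_bottom)
```

A faithful attribution should produce a large positive drop when its top-ranked token is removed; with signed scores, removing the lowest-ranked token (evidence against the class) should instead raise the confidence, giving a negative drop.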
4. Applications and Case Studies
Layerwise and tokenwise attribution methodologies yield actionable insights in multiple domains:
- Transformer pruning: Bottom-ranked attention heads (by integrated-gradient attribution) can be pruned with minimal accuracy loss on diverse NLU tasks; these attribution metrics outperform attention mean as predictors (Hao et al., 2020).
- Information flow and interpretability: Attribution trees reveal how information about key tokens (“positive,” “matched,” or class-based pivots) emerges and converges in deep networks. Tree visualization and edge distance distributions display the network’s receptive field evolution from local to global aggregation with depth (Hao et al., 2020).
- Adversarial triggers: Regions with highest attribution scores signal vulnerable input spans; inserting these as triggers in held-out data causes catastrophic accuracy drops (e.g., in MNLI from 82.87% to 0.8%) (Hao et al., 2020).
- Federated learning provenance: ProToken’s layerwise selection and gradient-weighted per-token attribution achieve 98.6% client localization accuracy, and accuracy is preserved as the client population scales from 6 to 55 (Gill et al., 27 Jan 2026).
- Faithful LLM circuit tracing: DPA efficiently measures contribution of heads and neurons at scale, achieving near-perfect faithfulness and recovery on factual, comprehension, and sentiment tasks with 40× speedup over existing baselines (Jantsch et al., 20 Mar 2026).
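For the pruning use case in the list above, a ranking step might look like the following sketch. It assumes that per-head attribution maps have already been computed for a small set of examples, and it scores each head by the mean over examples of its largest attribution entry; that scoring rule, the `(layer, head)` keys, and all numbers are assumptions for illustration rather than a confirmed reproduction of the cited method.

```python
def head_importance(attribution_maps):
    """Mean over examples of the largest entry in a head's attribution map."""
    per_example_max = [max(max(row) for row in attr) for attr in attribution_maps]
    return sum(per_example_max) / len(per_example_max)


def heads_to_prune(head_attributions, num_prune):
    """Return the num_prune least important heads and the full importance table."""
    importance = {head: head_importance(maps)
                  for head, maps in head_attributions.items()}
    ranked = sorted(importance, key=importance.get)  # ascending importance
    return ranked[:num_prune], importance


# Hypothetical attribution maps: (layer, head) -> one 2x2 map per example.
head_attributions = {
    (0, 0): [[[0.02, 0.01], [0.00, 0.03]], [[0.01, 0.04], [0.02, 0.00]]],
    (0, 1): [[[0.30, 0.05], [0.10, 0.20]], [[0.25, 0.02], [0.15, 0.05]]],
    (1, 0): [[[0.08, 0.01], [0.02, 0.06]], [[0.05, 0.09], [0.01, 0.03]]],
}

pruned, importance = heads_to_prune(head_attributions, num_prune=2)
print(pruned, importance)
```

After pruning the selected heads, the model would be re-evaluated on held-out data to confirm that accuracy degrades only slightly.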
5. Best Practices, Method Comparisons, and Recommendations
Direct quantitative comparison across methods on real tasks yields several operational guidelines (Arras et al., 21 Feb 2025):
- Model-family dependency: ALTI-Logit is optimal for pre-LayerNorm autoregressive models (GPT-2, OPT); AttnLRP excels for LLaMA; AttnLRP/LRP are best in post-LayerNorm bidirectional models (BERT). Proper handling of LayerNorm and attention linearization is necessary for conservation.
- Contrastive attribution and metric selection: Reporting both logit-value and logit-difference attributions (e.g., for the predicted class vs alternatives) with evaluation on linguistically-grounded datasets gives more interpretable, robust results than toy per-token sign tests.
- Aggregation and visualization: Combining layerwise and tokenwise attributions reveals both depthwise and tokenwise focalization of evidence, useful in both interpretation and debugging.
- Algorithmic efficiency: Techniques such as DPA and GlobEnc deliver attribution scores with a single forward/backward pass (plus efficient matrix computations), making them tractable even in large LLMs.
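The difference between logit-value and logit-difference (contrastive) attribution can be illustrated with a linear readout. The sketch assumes that some decomposition method has already expressed the final hidden state as a sum of per-token contribution vectors; the vectors, the two-dimensional unembedding rows, and the subject-verb agreement prompt are all hypothetical.

```python
# Hypothetical per-token contributions to the final hidden state for the
# prompt "The keys to the cabinet ___" (2-d vectors, illustrative only).
contributions = [
    ("The", [0.1, 0.0]),
    ("keys", [1.2, -0.4]),
    ("to", [0.0, 0.1]),
    ("the", [0.05, 0.05]),
    ("cabinet", [-0.3, 0.6]),
]
# Hypothetical unembedding rows for the target and the contrast (foil) word.
unembed = {"are": [1.0, -0.5], "is": [-1.0, 0.5]}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logit_attribution(word):
    """Per-token share of the logit of `word` under the linear readout."""
    return [dot(unembed[word], vec) for _, vec in contributions]


def contrastive_attribution(target, foil):
    """Per-token share of logit(target) - logit(foil)."""
    direction = [t - f for t, f in zip(unembed[target], unembed[foil])]
    return [dot(direction, vec) for _, vec in contributions]


hidden = [sum(vec[d] for _, vec in contributions) for d in range(2)]
logit_diff = dot(unembed["are"], hidden) - dot(unembed["is"], hidden)

value_scores = logit_attribution("are")
contrast_scores = contrastive_attribution("are", "is")
top_token = contributions[contrast_scores.index(max(contrast_scores))][0]
print(value_scores, contrast_scores, top_token)
```

In this toy case the contrastive scores sum to the logit difference and single out the plural subject as the main evidence for "are" over "is", which is the kind of linguistically grounded check the recommendation refers to.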
A table summarizing key attribution methods and their distinctive features is provided below:
| Method | Conservation Principle | Efficiency | Best-Suited Architecture |
|---|---|---|---|
| LRP | Layerwise sum matches output | Single backward | BERT (bidirectional, Post-LN) |
| AttnLRP | Additive, softmax-invariant | Single backward | LLaMA (Pre-LN), general Transformers |
| ALTI-Logit | Block-additive, token mapping | Two passes | GPT-2, OPT (autoregressive) |
| GlobEnc | Norm+residual+LN, rollout | Forward + matrices | Encoder Transformers (BERT, Electra) |
| DPA | Pathwise target propagation | 1 fwd, 1 bwd (O(1)) | Modern SwiGLU LLMs |
| ProToken | Layer selection + gradients | Last-N layers, per-token | Federated LLMs |
6. Limitations and Ongoing Challenges
Layerwise and tokenwise attribution methods face limitations and continue to evolve:
- Model assumptions: LRP and ALTI-Logit require architectural priors (e.g., LayerNorm placement, block structure).
- Faithfulness vs. computational cost: Integrated gradients and activation patching are faithful but expensive; path-propagation, decomposition, and norm-based rollouts address this trade-off with architectural insights (Jantsch et al., 20 Mar 2026, Modarressi et al., 2022).
- Hindsight bias: Classic erasure methods can prune tokens solely because the model already “knows” the target, not because those tokens are truly superfluous at inference; amortized differentiable masking mitigates this (Cao et al., 2020).
- Granularity: Attribution can be extended below the token/head level, to neurons or specific substructures; current work (e.g., DPA) supports dense scaling at these fine granularities.
A plausible implication is that combining sparsity-inducing regularization, targeted component selection, and pathwise conservation offers a scalable route to “white-box” interpretability for large, deep LLMs.
7. Future Directions
Research is trending toward combining attribution with robustness diagnostics, fairness auditing, and real-time provenance in collaborative environments:
- Groupwise and structured attributions: Attribution for interacting sets of tokens or structured elements (edges, subtrees) remains a frontier (Cao et al., 2020).
- Attribution under privacy and aggregation constraints: Provenance under secure federated aggregation, differential privacy, and streaming generation is an active area (Gill et al., 27 Jan 2026).
- Circuit and subnetwork tracing: Attribution scores enable not just input-explanation, but tracing minimal sub-circuits responsible for key behaviors, e.g., through early recovery and disruption metrics (Jantsch et al., 20 Mar 2026).
- Compositional and cross-modal explanations: Extending these methodologies to multimodal systems and structured outputs is a promising pathway.
Layerwise and tokenwise attribution, as formalized, evaluated, and compared across architectures, constitute a cornerstone of contemporary neural network interpretability, yielding actionable diagnosis, model editing, and principled accountability in large-scale deployed systems (Hao et al., 2020, Arras et al., 21 Feb 2025, Modarressi et al., 2022, Jantsch et al., 20 Mar 2026, Gill et al., 27 Jan 2026, Cao et al., 2020, Xiong et al., 2018).