CAGE: Context Attribution via Graph Explanations
- The paper introduces CAGE, a framework that explains LLM generation by constructing a directed graph capturing both direct and compound token influences.
- It enforces causality and row-stochasticity constraints to ensure probabilistic attribution propagation via a well-defined matrix inversion process.
- Experimental results show significant improvements in coverage and faithfulness, outperforming baseline methods by up to 40% in key metrics.
Context Attribution via Graph Explanations (CAGE) is a principled framework designed to elucidate the generative reasoning process of LLMs by constructing and analyzing a directed attribution graph over model generations. Unlike standard context attribution techniques that map generated tokens directly to the prompt, CAGE models the interdependence between each generation and prior tokens—both prompt and generated—thus capturing not only direct but also compound, multi-step influences in the sequence generation process. The approach enforces two foundational properties—causality and row-stochasticity—on the graph structure, enabling rigorous and faithful marginalization of attribution throughout the sequence. Across multiple LLM architectures, datasets, and attribution baselines, CAGE substantially enhances the interpretability and faithfulness of global context attributions (Walker et al., 17 Dec 2025).
1. Attribution Graph Formalism and Properties
At the heart of CAGE is the attribution graph, a directed, weighted graph defined over the sequence $x = (x_1, \dots, x_n)$, comprising $p$ prompt tokens $x_1, \dots, x_p$ and $n - p$ generated tokens $x_{p+1}, \dots, x_n$. Each node $i$ corresponds to a token $x_i$, and each edge $(j, i)$ is assigned a nonnegative weight $A_{ij} \ge 0$ measuring the direct normalized influence of $x_j$ on $x_i$.
CAGE enforces two crucial structural properties:
- Causality: $A_{ij} = 0$ whenever $j \ge i$, and $A_{ij} = 0$ for all prompt rows $i \le p$; only past tokens may influence future ones, and edges point only into generated tokens, enforcing acyclicity.
- Row-Stochasticity: For each generated token $i > p$, $\sum_{j < i} A_{ij} = 1$. Each generated token's attributions sum to unity, preserving a probabilistic distribution of influence.
These constraints yield a strictly lower-triangular adjacency matrix $A \in \mathbb{R}^{n \times n}$, with zero rows for prompt tokens and nonzero entries only strictly below the diagonal in the rows of generated tokens.
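As a minimal sketch of these two constraints, the following check (illustrative code, not from the paper; the indexing convention $A_{ij}$ = influence of token $j$ on token $i$ is an assumption) validates a candidate adjacency matrix:

```python
import numpy as np

def check_attribution_graph(A: np.ndarray, n_prompt: int, tol: float = 1e-9) -> bool:
    """Validate the two CAGE structural constraints on an adjacency matrix A,
    where A[i, j] is the direct influence of token j on token i
    (illustrative convention; the paper's exact indexing may differ)."""
    # Causality: only strictly earlier tokens may influence a token,
    # and prompt tokens receive no incoming edges (their rows are zero).
    causal = np.allclose(A, np.tril(A, k=-1)) and np.allclose(A[:n_prompt], 0.0)
    # Row-stochasticity: each generated token's row sums to 1.
    stochastic = np.allclose(A[n_prompt:].sum(axis=1), 1.0, atol=tol)
    # Edge weights must be nonnegative.
    nonneg = np.all(A >= -tol)
    return bool(causal and stochastic and nonneg)
```

A matrix passing this check is guaranteed to be strictly lower-triangular, which is what makes the later path marginalization well-defined.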
2. Graph Construction Algorithm
CAGE constructs $A$ using base attributions that quantify the direct influence of each prior token on a given generation according to a chosen base method (e.g., perturbation, integrated gradients, context-length probing, etc.). For each generated token $x_i$ ($i = p+1, \dots, n$):
- Direct Attribution: Compute $a_i = (a_{i1}, \dots, a_{i,i-1})$, a vector of influence scores for each predecessor token.
- Nonnegativity Enforcement: Apply $a_{ij} \leftarrow \max(a_{ij}, 0)$ per entry to disregard negative contributions.
- Row-Stochastic Normalization: If $\sum_{j<i} a_{ij} > 0$, set $A_{ij} = a_{ij} / \sum_{k<i} a_{ik}$ for $j < i$; otherwise (if all direct attributions vanish), distribute weight uniformly among the $i-1$ predecessors, $A_{ij} = 1/(i-1)$.
This procedure guarantees the resulting $A$ (i) encodes a DAG aligned with the autoregressive temporal order and (ii) interprets edge weights probabilistically for each token's direct influences.
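The three steps above can be sketched as follows (a hypothetical `base_attr` interface stands in for whichever base attribution method is used; indices are 0-based, so token `i` has `i` predecessors):

```python
import numpy as np

def build_attribution_graph(base_attr, n_prompt: int, n_total: int) -> np.ndarray:
    """Construct a CAGE-style adjacency matrix A from a base attribution method.

    `base_attr(i)` is assumed to return raw influence scores of tokens
    0..i-1 on generated token i (hypothetical interface; any base method
    such as perturbation or integrated gradients could back it).
    """
    A = np.zeros((n_total, n_total))
    for i in range(n_prompt, n_total):     # only generated tokens get rows
        a = np.maximum(base_attr(i), 0.0)  # nonnegativity: clip negative scores
        s = a.sum()
        if s > 0:
            A[i, :i] = a / s               # row-stochastic normalization
        else:
            A[i, :i] = 1.0 / i             # uniform fallback over predecessors
    return A
```

The uniform fallback keeps every generated row a valid probability distribution even when the base method assigns no positive influence at all.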
3. Attribution Computation via Path Marginalization
The constructed attribution graph admits a well-defined mechanism for context attribution propagation. Influence from any source token $s$ to target $t$ may traverse any directed path $\pi = (s = v_0 \to v_1 \to \dots \to v_k = t)$, with path weight $w(\pi) = \prod_{l=1}^{k} A_{v_l v_{l-1}}$. The marginal attribution $M_{ts}$ is then the sum of weights over all such paths.
In matrix terms, the overall influence matrix is computed as
$$M = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I,$$
where $M_{ts}$ gives the total marginal influence from $s$ to $t$. For a set of output tokens $T$, the aggregate context attribution onto the prompt is
$$a_j = \sum_{t \in T} M_{tj}, \qquad j = 1, \dots, p.$$
Each generated row of $M$, restricted to the prompt columns, sums to 1, preserving the stochastic attribution mass for each target, and the total attribution for all outputs sums to $|T|$.
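A minimal numerical sketch of this marginalization (illustrative values; function names are my own) computes the path-sum matrix by inversion and aggregates output-token influence onto the prompt:

```python
import numpy as np

def marginal_influence(A: np.ndarray) -> np.ndarray:
    """Sum path weights over all directed paths of length >= 1.

    Because A is strictly lower-triangular (hence nilpotent), the Neumann
    series A + A^2 + ... terminates and equals (I - A)^{-1} - I.
    """
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - A) - np.eye(n)

def prompt_attribution(M: np.ndarray, outputs, n_prompt: int) -> np.ndarray:
    """Aggregate marginal influence of a set of output tokens onto the prompt."""
    return M[np.ix_(outputs, range(n_prompt))].sum(axis=0)
```

On a 3-token example with one prompt token, the marginal influence of each generated token on the prompt token comes out to exactly 1, matching the mass-conservation property.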
4. Theoretical and Mathematical Guarantees
The causal (acyclic) and stochastic (probabilistic) construction ensures that:
- The Neumann series for $(I - A)^{-1}$ converges: $A$ is strictly lower-triangular and hence nilpotent, so its spectral radius is zero and the series terminates after at most $n - 1$ terms.
- All computed attributions are nonnegative, preventing oscillation or cancellation effects that distort interpretability.
- The resulting attribution graph provides interpretable, stable, and mathematically principled pathwise decompositions of influence—any future token’s “credit” can be exactly traced back to prompt and intermediate generations.
Ablation experiments reveal that omitting either nonnegativity or row normalization leads to path-sum explosion or sign oscillation, severely degrading explanation quality.
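These guarantees can be verified numerically on a toy graph (illustrative values, not from the paper): nilpotency, nonnegativity of all marginal attributions, and exact conservation of attribution mass onto the prompt.

```python
import numpy as np

# Toy 4-token graph: 1 prompt token, 3 generated tokens.
A = np.zeros((4, 4))
A[1, 0] = 1.0
A[2, :2] = [0.3, 0.7]
A[3, :3] = [0.2, 0.3, 0.5]

# Strictly lower-triangular => nilpotent: A^n = 0, so the Neumann
# series terminates exactly and (I - A) is always invertible.
assert np.allclose(np.linalg.matrix_power(A, 4), 0.0)

M = np.linalg.inv(np.eye(4) - A) - np.eye(4)
assert np.all(M >= 0)               # nonnegative attributions, no cancellation
assert np.allclose(M[1:, 0], 1.0)   # each generated token's prompt mass is exactly 1
```

Dropping the row normalization in this toy example (e.g., scaling a row above 1) inflates the path sums, which is the path-sum explosion the ablation describes.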
5. Experimental Benchmarking and Quantitative Results
CAGE was evaluated on multiple LLMs—Llama 3 (3B, 8B) and Qwen 3 (4B, 8B)—and three datasets: Facts (Feverous), Math (MultiArith, SVAMP), and MorehopQA, probing various reasoning capacities and attribution demands.
A range of base attribution methods were compared, including perturbation, context-length probing, ReAGent, integrated gradients, and Attention × IG. Metrics included Attribution Coverage (fraction of ground-truth prompt sentences receiving substantive attribution) and attribution faithfulness via RISE and MAS (sensitivity of output generation to deletions of top-attributed tokens).
Key experimental outcomes:
| Setting | Row Attribution | CAGE | Relative Gain |
|---|---|---|---|
| Math, Llama 3 3B, IG, AC | 0.551 ± 0.28 | 0.635 ± 0.28 | 15% |
| Math (avg, AC) | — | up to 40% | max 134% (17/20 improved) |
| MorehopQA, Llama 3 8B, MAS | 0.526 ± 0.16 | 0.458 ± 0.15 | 13% reduction (faithfulness) |
| MorehopQA, 8B (all metrics) | — | avg 11% | max 30% |
| Math, all models (faithful.) | — | avg 16% | max 37% |
CAGE consistently improved both coverage and faithfulness metrics, outperforming row-based baselines in 97% of tested model/method pairs.
6. Computational and Modeling Considerations
The principal computational costs are:
- $O(n)$ calls to the base attribution method for constructing $A$ (one base attribution per generation).
- Either iterative propagation of the finite Neumann series (at most $n-1$ matrix products) or $O(n^3)$ direct matrix inversion for $(I - A)^{-1}$; graphs are typically small in $n$ (sentence-level, $n \lesssim 20$), making these costs practical.
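The iterative alternative to direct inversion can be sketched as follows (my own implementation, exploiting the fact that a strictly lower-triangular $A$ makes the Neumann series finite):

```python
import numpy as np

def marginal_by_propagation(A: np.ndarray) -> np.ndarray:
    """Accumulate the finite Neumann series A + A^2 + ... term by term.

    For a strictly lower-triangular A the loop stops after at most
    n - 1 steps, since A is nilpotent; this avoids an explicit inverse.
    """
    n = A.shape[0]
    M, P = np.zeros_like(A), A.copy()
    for _ in range(n - 1):
        if not P.any():   # series has terminated early
            break
        M += P
        P = P @ A         # next power of A
    return M
```

For sparse graphs this propagation can be cheaper than dense inversion, though at the small $n$ reported here either route is inexpensive.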
Modeling limitations include the suppression of negative (inhibitory) attributions, possible underrepresentation of strong rare direct links due to normalization, and the abstraction of transformer nonlinearity via linear Neumann series propagation. The faithfulness of CAGE explanations ultimately depends on the underlying base attribution method; unfaithful base attributions propagate through the graph.
7. Context within Attribution and Explainability Research
CAGE generalizes prior context attribution techniques by explicitly modeling intermediate generational dependencies, addressing the incompleteness of direct prompt-to-token mappings. The framework unifies attribution propagation, causal interpretability, and probabilistic mass conservation. Its robust empirical and theoretical foundations solidify its role as a comprehensive approach for context attribution in autoregressive LLMs (Walker et al., 17 Dec 2025). This suggests potential for further extension to other autoregressive modeling domains where inter-step dependencies challenge standard attribution methodologies.