CAGE: Context Attribution via Graph Explanations
- The paper introduces CAGE, a framework that explains LLM generation by constructing a directed graph capturing both direct and compound token influences.
- It enforces causality and row-stochasticity constraints to ensure probabilistic attribution propagation via a well-defined matrix inversion process.
- Experimental results show significant improvements in coverage and faithfulness, outperforming baseline methods by up to 40% in key metrics.
Context Attribution via Graph Explanations (CAGE) is a principled framework designed to elucidate the generative reasoning process of LLMs by constructing and analyzing a directed attribution graph over model generations. Unlike standard context attribution techniques that map generated tokens directly to the prompt, CAGE models the interdependence between each generation and prior tokens—both prompt and generated—thus capturing not only direct but also compound, multi-step influences in the sequence generation process. The approach enforces two foundational properties—causality and row-stochasticity—on the graph structure, enabling rigorous and faithful marginalization of attribution throughout the sequence. Across multiple LLM architectures, datasets, and attribution baselines, CAGE substantially enhances the interpretability and faithfulness of global context attributions (Walker et al., 17 Dec 2025).
1. Attribution Graph Formalism and Properties
At the heart of CAGE is the attribution graph, a directed, weighted graph defined over the sequence $x = (x_1, \dots, x_n)$, comprising $p$ prompt tokens $x_1, \dots, x_p$ and $n - p$ generated tokens $x_{p+1}, \dots, x_n$. Each node $i$ corresponds to a token $x_i$, and each edge $(j, i)$ is assigned a nonnegative weight $A_{ij} \ge 0$ measuring the direct normalized influence of $x_j$ on $x_i$.
CAGE enforces two crucial structural properties:
- Causality: $A_{ij} = 0$ whenever $j \ge i$, and $A_{ij} = 0$ for all prompt rows $i \le p$; only past tokens may influence future ones, and edges point only into generated tokens, enforcing acyclicity.
- Row-Stochasticity: For each generated token $i > p$, $\sum_{j < i} A_{ij} = 1$. Each generated token's attributions sum to unity, preserving a probabilistic distribution of influence.
These constraints yield a strictly lower-triangular adjacency matrix $A \in \mathbb{R}^{n \times n}$, with zero rows for prompt tokens and nonzero entries only strictly below the diagonal in the rows of generated tokens.
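As a minimal sketch of these two constraints, the following check (illustrative code, not from the paper; the indexing convention $A_{ij}$ = influence of token $j$ on token $i$ is an assumption) validates a candidate adjacency matrix:

```python
import numpy as np

def check_attribution_graph(A: np.ndarray, n_prompt: int, tol: float = 1e-9) -> bool:
    """Validate the two CAGE structural constraints on an adjacency matrix A,
    where A[i, j] is the direct influence of token j on token i
    (illustrative convention; the paper's exact indexing may differ)."""
    # Causality: only strictly earlier tokens may influence a token,
    # and prompt tokens receive no incoming edges (their rows are zero).
    causal = np.allclose(A, np.tril(A, k=-1)) and np.allclose(A[:n_prompt], 0.0)
    # Row-stochasticity: each generated token's row sums to 1.
    stochastic = np.allclose(A[n_prompt:].sum(axis=1), 1.0, atol=tol)
    # Edge weights must be nonnegative.
    nonneg = np.all(A >= -tol)
    return bool(causal and stochastic and nonneg)
```

A matrix passing this check is guaranteed to be strictly lower-triangular, which is what makes the later path marginalization well-defined.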
2. Graph Construction Algorithm
CAGE constructs $A$ using base attributions that quantify the direct influence of each prior token on a given generation according to a chosen base method (e.g., perturbation, integrated gradients, context-length probing, etc.). For each generated token $x_i$ ($i = p+1, \dots, n$):
- Direct Attribution: Compute $a_i = (a_{i1}, \dots, a_{i,i-1})$, a vector of influence scores for each predecessor token.
- Nonnegativity Enforcement: Apply $a_{ij} \leftarrow \max(a_{ij}, 0)$ per entry to disregard negative contributions.
- Row-Stochastic Normalization: If $\sum_{j<i} a_{ij} > 0$, set $A_{ij} = a_{ij} / \sum_{k<i} a_{ik}$ for $j < i$; otherwise (if all direct attributions vanish), distribute weight uniformly among the $i-1$ predecessors, $A_{ij} = 1/(i-1)$.
This procedure guarantees the resulting $A$ (i) encodes a DAG aligned with the autoregressive temporal order and (ii) interprets edge weights probabilistically for each token's direct influences.
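The three steps above can be sketched as follows (a hypothetical `base_attr` interface stands in for whichever base attribution method is used; indices are 0-based, so token `i` has `i` predecessors):

```python
import numpy as np

def build_attribution_graph(base_attr, n_prompt: int, n_total: int) -> np.ndarray:
    """Construct a CAGE-style adjacency matrix A from a base attribution method.

    `base_attr(i)` is assumed to return raw influence scores of tokens
    0..i-1 on generated token i (hypothetical interface; any base method
    such as perturbation or integrated gradients could back it).
    """
    A = np.zeros((n_total, n_total))
    for i in range(n_prompt, n_total):     # only generated tokens get rows
        a = np.maximum(base_attr(i), 0.0)  # nonnegativity: clip negative scores
        s = a.sum()
        if s > 0:
            A[i, :i] = a / s               # row-stochastic normalization
        else:
            A[i, :i] = 1.0 / i             # uniform fallback over predecessors
    return A
```

The uniform fallback keeps every generated row a valid probability distribution even when the base method assigns no positive influence at all.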
3. Attribution Computation via Path Marginalization
The constructed attribution graph admits a well-defined mechanism for context attribution propagation. Influence from any source token $s$ to target $t$ may traverse any directed path $\pi = (s = v_0 \to v_1 \to \dots \to v_k = t)$, with path weight $w(\pi) = \prod_{l=1}^{k} A_{v_l v_{l-1}}$. The marginal attribution $M_{ts}$ is then the sum of weights over all such paths.
In matrix terms, the overall influence matrix is computed as
$$M = \sum_{k=1}^{\infty} A^k = (I - A)^{-1} - I,$$
where $M_{ts}$ gives the total marginal influence from $s$ to $t$. For a set of output tokens $T$, the aggregate context attribution onto the prompt is
$$a_j = \sum_{t \in T} M_{tj}, \qquad j = 1, \dots, p.$$
Each generated row of $M$, restricted to the prompt columns, sums to 1, preserving the stochastic attribution mass for each target, and the total attribution for all outputs sums to $|T|$.
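A minimal numerical sketch of this marginalization (illustrative values; function names are my own) computes the path-sum matrix by inversion and aggregates output-token influence onto the prompt:

```python
import numpy as np

def marginal_influence(A: np.ndarray) -> np.ndarray:
    """Sum path weights over all directed paths of length >= 1.

    Because A is strictly lower-triangular (hence nilpotent), the Neumann
    series A + A^2 + ... terminates and equals (I - A)^{-1} - I.
    """
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - A) - np.eye(n)

def prompt_attribution(M: np.ndarray, outputs, n_prompt: int) -> np.ndarray:
    """Aggregate marginal influence of a set of output tokens onto the prompt."""
    return M[np.ix_(outputs, range(n_prompt))].sum(axis=0)
```

On a 3-token example with one prompt token, the marginal influence of each generated token on the prompt token comes out to exactly 1, matching the mass-conservation property.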
4. Theoretical and Mathematical Guarantees
The causal (acyclic) and stochastic (probabilistic) construction ensures that:
- The Neumann series for $(I - A)^{-1}$ converges: $A$ is strictly lower-triangular and hence nilpotent, so its spectral radius is zero and the series terminates after at most $n - 1$ terms.
- All computed attributions are nonnegative, preventing oscillation or cancellation effects that distort interpretability.
- The resulting attribution graph provides interpretable, stable, and mathematically principled pathwise decompositions of influence—any future token’s “credit” can be exactly traced back to prompt and intermediate generations.
Ablation experiments reveal that omitting either nonnegativity or row normalization leads to path-sum explosion or sign oscillation, severely degrading explanation quality.
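These guarantees can be verified numerically on a toy graph (illustrative values, not from the paper): nilpotency, nonnegativity of all marginal attributions, and exact conservation of attribution mass onto the prompt.

```python
import numpy as np

# Toy 4-token graph: 1 prompt token, 3 generated tokens.
A = np.zeros((4, 4))
A[1, 0] = 1.0
A[2, :2] = [0.3, 0.7]
A[3, :3] = [0.2, 0.3, 0.5]

# Strictly lower-triangular => nilpotent: A^n = 0, so the Neumann
# series terminates exactly and (I - A) is always invertible.
assert np.allclose(np.linalg.matrix_power(A, 4), 0.0)

M = np.linalg.inv(np.eye(4) - A) - np.eye(4)
assert np.all(M >= 0)               # nonnegative attributions, no cancellation
assert np.allclose(M[1:, 0], 1.0)   # each generated token's prompt mass is exactly 1
```

Dropping the row normalization in this toy example (e.g., scaling a row above 1) inflates the path sums, which is the path-sum explosion the ablation describes.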
5. Experimental Benchmarking and Quantitative Results
CAGE was evaluated on multiple LLMs—Llama 3 (3B, 8B) and Qwen 3 (4B, 8B)—and three datasets: Facts (Feverous), Math (MultiArith, SVAMP), and MorehopQA, probing various reasoning capacities and attribution demands.
A range of base attribution methods were compared, including perturbation, context-length probing, ReAGent, integrated gradients, and Attention × IG. Metrics included Attribution Coverage (fraction of ground-truth prompt sentences receiving substantive attribution) and attribution faithfulness via RISE and MAS (sensitivity of output generation to deletions of top-attributed tokens).
Key experimental outcomes:
| Setting | Row Attribution | CAGE | Relative Gain |
|---|---|---|---|
| Math, Llama 3 3B, IG, AC | 0.551 ± 0.28 | 0.635 ± 0.28 | 15% |
| Math (avg, AC) | — | up to 40% | max 134% (17/20 improved) |
| MorehopQA, Llama 3 8B, MAS | 0.526 ± 0.16 | 0.458 ± 0.15 | 13% reduction (faithfulness) |
| MorehopQA, 8B (all metrics) | — | avg 11% | max 30% |
| Math, all models (faithful.) | — | avg 16% | max 37% |
CAGE consistently improved both coverage and faithfulness metrics, outperforming row-based baselines in 97% of tested model/method pairs.
6. Computational and Modeling Considerations
The principal computational costs are:
- $O(n)$ calls to the base attribution method for constructing $A$ (one base attribution per generation).
- Either iterative propagation of the finite Neumann series (at most $n-1$ matrix products) or $O(n^3)$ direct matrix inversion for $(I - A)^{-1}$; graphs are typically small in $n$ (sentence-level, $n \lesssim 20$), making these costs practical.
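The iterative alternative to direct inversion can be sketched as follows (my own implementation, exploiting the fact that a strictly lower-triangular $A$ makes the Neumann series finite):

```python
import numpy as np

def marginal_by_propagation(A: np.ndarray) -> np.ndarray:
    """Accumulate the finite Neumann series A + A^2 + ... term by term.

    For a strictly lower-triangular A the loop stops after at most
    n - 1 steps, since A is nilpotent; this avoids an explicit inverse.
    """
    n = A.shape[0]
    M, P = np.zeros_like(A), A.copy()
    for _ in range(n - 1):
        if not P.any():   # series has terminated early
            break
        M += P
        P = P @ A         # next power of A
    return M
```

For sparse graphs this propagation can be cheaper than dense inversion, though at the small $n$ reported here either route is inexpensive.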
Modeling limitations include the suppression of negative (inhibitory) attributions, possible underrepresentation of strong rare direct links due to normalization, and the abstraction of transformer nonlinearity via linear Neumann series propagation. The faithfulness of CAGE explanations ultimately depends on the underlying base attribution method; unfaithful base attributions propagate through the graph.
7. Context within Attribution and Explainability Research
CAGE generalizes prior context attribution techniques by explicitly modeling intermediate generational dependencies, addressing the incompleteness of direct prompt-to-token mappings. The framework unifies attribution propagation, causal interpretability, and probabilistic mass conservation. Its robust empirical and theoretical foundations solidify its role as a comprehensive approach for context attribution in autoregressive LLMs (Walker et al., 17 Dec 2025). This suggests potential for further extension to other autoregressive modeling domains where inter-step dependencies challenge standard attribution methodologies.