
FlashTrace: Efficient Multi-Token Attribution

Updated 9 February 2026
  • FlashTrace is an attribution framework that assigns causal credit from output spans to source inputs using span-wise aggregation and recursive hops.
  • It reduces computational complexity from O(MND) to O(ND) by aggregating token contributions, enabling rapid and scalable interpretability in long-context scenarios.
  • Empirical evaluations show that a single recursive hop recovers over 90% of attribution mass on true input tokens, greatly enhancing output explanation fidelity.

FlashTrace is an efficient, faithful multi-token attribution framework for interpreting the outputs of LLMs with extended reasoning chains. The method attributes causal credit for spans of output tokens to source input tokens, maintaining computational efficiency and high faithfulness even in long-context, multi-step reasoning scenarios. FlashTrace leverages span-wise aggregation and a recursive attribution mechanism to attribute arbitrarily large target spans with minimal computational overhead, significantly improving the speed and quality of interpretability analyses for complex transformer-based models (Pan et al., 2 Feb 2026).

1. Motivation and Problem Formulation

Modern LLMs frequently produce long output sequences, often emitting intermediate reasoning steps (T) before the final answer (O), especially on tasks like multi-hop QA, code generation, or mathematical reasoning. Traditional token-attribution methods, such as Integrated Gradients (IG), Attention-LRP, and IFR, are fundamentally designed for single-token analysis. Attributing a contiguous span of M output tokens in a context of length N requires O(MN) operations per layer, typically necessitating repeated passes over the model and leading to prohibitive runtimes for M, N ≈ 5,000–10,000. A second critical shortcoming concerns faithfulness: these methods tend to concentrate almost all attribution mass on reasoning tokens immediately preceding the answer, preventing importance from tracing back to the original input tokens (I). Empirically, the Recovery Rate on ground-truth inputs falls below 10% in long-chain settings.

2. Span-Wise Aggregation for Computational Efficiency

FlashTrace introduces span-wise aggregation as its core efficiency principle. Instead of looping over each output token i ∈ S in the target span, all output tokens to be explained are grouped, and the model attributes to the "super-token" representing the sum of their representations:

\mathbf{Y}_S = \sum_{i \in S} \mathbf{y}_i.

For a source token jj, the aggregated contribution to span SS is:

\mathbf{Z}_{j\to S} = \sum_{i\in S} \alpha_{i,j} \mathbf{v}_j = \Bigl(\sum_{i\in S} \alpha_{i,j}\Bigr) \mathbf{v}_j,

where α_{i,j} denotes the attention weight from output token i to source token j, and v_j is the Value vector of token j. Attribution is scored by an L1-based proximity metric:

\mathrm{Prox}(\mathbf{Z}, \mathbf{Y}) = \max\bigl(0, \|\mathbf{Y}\|_1 - \|\mathbf{Y}-\mathbf{Z}\|_1\bigr).

This approach computes Σ_{i∈S} α_{i,j} as scalar operations in O(MN) and then projects the Value vectors in O(ND), reducing total time per layer from O(MND) to O(MN + ND), which is practically O(ND) since D ≫ M. This reduction enables FlashTrace to run attribution over thousands of tokens in a single pass per hop.
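The two-stage computation above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the formulas in this section, not the authors' implementation; all names (`span_attribution`, `alpha`, `V`, `Y_S`) are hypothetical.

```python
import numpy as np

def span_attribution(alpha, V, Y_S):
    """Span-wise aggregation sketch: score each source token j against the
    aggregated target representation Y_S.

    alpha : (M, N) attention weights from the M span tokens to N sources
    V     : (N, D) Value vectors of the source tokens
    Y_S   : (D,)   sum of the span's output representations
    """
    # Sum attention over the span first: O(MN) scalar work.
    alpha_S = alpha.sum(axis=0)                      # (N,)
    # Project each Value vector exactly once: O(ND) vector work.
    Z = alpha_S[:, None] * V                         # Z[j] = (sum_i alpha_ij) v_j
    # L1 proximity: max(0, ||Y_S||_1 - ||Y_S - Z_j||_1), one score per source.
    prox = np.maximum(0.0, np.abs(Y_S).sum() - np.abs(Y_S - Z).sum(axis=1))
    return prox

# Toy usage: 4 span tokens, 6 source tokens, hidden size 8.
rng = np.random.default_rng(0)
alpha = rng.random((4, 6))
alpha /= alpha.sum(axis=1, keepdims=True)            # row-normalized attention
V = rng.normal(size=(6, 8))
Y_S = (alpha @ V).sum(axis=0)                        # span target from attention
scores = span_attribution(alpha, V, Y_S)             # scores has shape (6,)
```

Summing the attention weights before touching the D-dimensional Value vectors is what moves the vector work outside the loop over span tokens, yielding the O(MN + ND) cost.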

3. Recursive Attribution: Faithfully Tracing Reasoning Chains

To address faithfulness and propagate importance through intermediate reasoning steps, FlashTrace implements a recursive K-hop attribution scheme:

  1. Hop 0 (Output Attribution):
    • Target span S_0 = O (the output tokens).
    • Attribute via SpanAttribute(O, 1) to yield scores w^(0) over I ∪ T.
    • Partition w^(0) into w_I^(0) (mass on inputs) and w_T^(0) (mass on reasoning tokens).
  2. Hop k > 0 (Recursive Attribution):
    • Use w_T^(k−1) from the previous hop as weights on the reasoning tokens T.
    • Aggregate the new target: Y^(k) = Σ_{t∈T} w_t^(k−1) y_t.
    • Compute Z_j^(k) for all j ∈ I ∪ T and form new proximity scores w^(k).
    • Track ρ_{k−1} = Σ_{t∈T} w_t^(k−1), the fraction of mass on T to propagate at the next hop.
  3. Final Aggregation:

w_\mathrm{final} = w_I^{(0)} + \rho_0\, w_I^{(1)} + (\rho_0 \rho_1)\, w_I^{(2)} + \cdots

This formula ensures credit is sequentially and faithfully passed from output through reasoning to input tokens.
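The final aggregation reduces to a running product of the ρ factors. The helper below is a sketch of that weighting formula only, with illustrative names (`combine_hops`, `w_I_hops`, `rhos`).

```python
import numpy as np

def combine_hops(w_I_hops, rhos):
    """Apply w_final = w_I^(0) + rho_0 w_I^(1) + (rho_0 rho_1) w_I^(2) + ...

    w_I_hops : list of arrays, input-token attribution scores per hop
    rhos     : list of floats, mass remaining on reasoning tokens after each hop
    """
    w_final = np.zeros_like(w_I_hops[0], dtype=float)
    scale = 1.0                       # running product rho_0 * ... * rho_{k-1}
    for k, w_I in enumerate(w_I_hops):
        w_final += scale * w_I
        if k < len(rhos):
            scale *= rhos[k]
    return w_final

# Toy usage: two hops, rho_0 = 0.5.
w_final = combine_hops(
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],   # hop-0 and hop-1 scores
    [0.5],
)
# w_final == [1.0, 0.5]
```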

Empirically, a single recursive hop typically suffices to recover over 90% of attribution mass on true input tokens, with additional hops yielding diminishing returns and slight noise accumulation.

4. FlashTrace Algorithm and Implementation

A sketch of the FlashTrace procedure is as follows:

  1. Cache Model States: A single forward pass stores the hidden states X^in, X^mid, X^out and all attention weights α.
  2. SpanAttribute Routine: For any span S with per-token weights w,
    • Aggregate Y_S^mid and Y_S^out.
    • For each source token j, compute the summed weighted attention α_j^S = Σ_{i∈S} w_i α_{i,j}.
    • Calculate C_{j→S} = α_j^S · v_j.
    • Evaluate proximity to obtain the attribution score e_j.
    • Collect residual/MLP contributions and normalize attributions across tokens.
  3. Recursive Attribution: Initial hop attributes from output; further hops restrict to reasoning tokens and reuse attribution scores.

This design depends on access to cached, full attention matrices (O(LN^2) memory) and supports efficient, linear-memory attribution at each hop.
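Given cached attention and Value tensors, the procedure above can be sketched end to end. This is a simplified, single-layer reconstruction with illustrative names (`span_attribute`, `flashtrace`); the actual method also folds in mid-layer states and residual/MLP contributions, which are omitted here.

```python
import numpy as np

def span_attribute(alpha, V, Y_states, w):
    """Weighted SpanAttribute sketch: target Y = sum_i w_i y_i, scored by
    L1 proximity of each source contribution, normalized to a distribution.

    alpha    : (N, N) cached attention among all N context tokens
    V        : (N, D) Value vectors
    Y_states : (N, D) output-side hidden states y_t
    w        : (N,)   per-token weights defining the target span
    """
    Y = w @ Y_states                     # (D,) aggregated target
    alpha_S = w @ alpha                  # (N,) summed weighted attention
    Z = alpha_S[:, None] * V             # (N, D) per-source contributions
    prox = np.maximum(0.0, np.abs(Y).sum() - np.abs(Y - Z).sum(axis=1))
    total = prox.sum()
    return prox / total if total > 0 else prox

def flashtrace(alpha, V, Y_states, out_mask, in_mask, K=1):
    """K-hop recursive attribution: hop 0 explains the output span, later
    hops re-attribute the mass that landed on reasoning tokens T."""
    reason_mask = ~(out_mask | in_mask)                  # reasoning tokens T
    w = span_attribute(alpha, V, Y_states, out_mask.astype(float))   # hop 0
    w_final = np.where(in_mask, w, 0.0)                  # w_I^(0)
    scale = 1.0                                          # product of rho's
    for _ in range(K):
        w_T = np.where(reason_mask, w, 0.0)              # mass left on T
        rho = w_T.sum()
        if rho <= 1e-12:
            break
        w = span_attribute(alpha, V, Y_states, w_T)      # next hop scores
        w_final = w_final + scale * rho * np.where(in_mask, w, 0.0)
        scale *= rho
    return w_final

# Toy usage: 8 tokens (3 inputs I, 3 reasoning T, 2 outputs O), hidden size 4.
rng = np.random.default_rng(0)
N, D = 8, 4
alpha = rng.random((N, N))
alpha /= alpha.sum(axis=1, keepdims=True)
V = rng.normal(size=(N, D))
Y_states = rng.normal(size=(N, D))
in_mask = np.zeros(N, dtype=bool); in_mask[:3] = True
out_mask = np.zeros(N, dtype=bool); out_mask[-2:] = True
w_final = flashtrace(alpha, V, Y_states, out_mask, in_mask, K=2)
```

Note that only input tokens accumulate final credit; the reasoning-token mass exists solely to be propagated backward by later hops.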

5. Empirical Evaluation and Benchmarks

FlashTrace was evaluated across a suite of LLMs and tasks:

  • Models: Qwen-3 8B Instruct (primary), LLaMA-3.1-8B-It (for generalization).
  • Tasks: RULER (Needle-in-a-Haystack, Variable Tracking, long-context HotpotQA), MATH (stepwise solutions), MoreHopQA (multi-hop), and Aider (Python code editing).
  • Metrics: Recovery Rate (ground truth coverage in top 10% attribution), RISE and MAS (log-probability drop after deletion of top-attributed tokens; lower is better).

Observed results:

  • Speed: For M = N = 5,000, FlashTrace completes in ≈20 seconds vs. IFR's 38 minutes, a >130× speedup.
  • Faithfulness: Recovery Rate, RISE, and MAS indicate superior or state-of-the-art faithfulness across all evaluated regimes; on RULER, FlashTrace achieves Recovery Rates such as 0.075 vs IFR’s 0.012.
  • Trade-offs: Table 3 shows that exhaustive IFR token-level rollout requires 11.2 s, while FlashTrace takes 0.72 s with only minor degradation in RISE (0.116→0.128) and MAS (0.193→0.205).
  • Ablations: A single recursive hop substantially increases faithfulness (e.g., RISE = 0.127 → 0.128 on MoreHopQA); more hops bring diminishing gains and increased noise.

6. Faithfulness Guarantees and Theoretical Basis

FlashTrace’s proximity metric inherits completeness and conservation from ALTI/IFR: contributions from all components (attention heads, residual, MLP) sum precisely to the L1 change in the target per layer. Span-wise aggregation is mathematically exact for linear operations. The recursive attribution mechanism maintains total mass flow and ensures that causal credit is distributed through all hops to ground-truth inputs. These design principles yield state-of-the-art Recovery Rate and deletion-based faithfulness metrics across a variety of reasoning lengths and architectural configurations.
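The exactness of span-wise aggregation on the linear attention path is easy to verify numerically. The toy check below (random tensors, not model data) confirms that summing per-token contributions α_{i,j} v_j over the span equals scaling each v_j once by the summed attention.

```python
import numpy as np

# Exactness check: span-wise aggregation commutes with the linear mixing.
rng = np.random.default_rng(1)
M, N, D = 5, 7, 3
alpha = rng.random((M, N))           # attention weights from span to sources
V = rng.normal(size=(N, D))          # Value vectors of the N source tokens

# Route 1: sum each source's contribution over every span token i.
Z_per_token = sum(alpha[i][:, None] * V for i in range(M))   # (N, D)
# Route 2: sum the scalar attention first, project each v_j once.
Z_aggregated = alpha.sum(axis=0)[:, None] * V                # (N, D)

assert np.allclose(Z_per_token, Z_aggregated)
```

The nonlinearity enters only through the proximity scoring, which is applied after this exact aggregation step.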

7. Limitations, Scope, and Future Directions

FlashTrace requires access to full (typically quadratic) attention maps, with corresponding O(LN^2) memory, unless maps are recomputed on demand. Recursion depth should be limited: empirical studies suggest diminishing returns and increasing noise beyond 2–3 hops. The current formulation is designed for autoregressive decoders; adaptation to encoder–decoder, retrieval-augmented, or dynamically normalized architectures is a prospective area for extension. Active research includes further memory optimizations (e.g., on-the-fly attention recomputation), integration with layer normalization, adaptive hop-count selection, and tighter faithfulness bounds.


For rigorous, efficient, and faithful long-context interpretability, FlashTrace establishes a new standard for multi-token, chain-sensitive attributions in transformer-based LLMs (Pan et al., 2 Feb 2026).
