
FCCT: Fine-grained Cross-modal Causal Tracing

Updated 15 November 2025
  • The paper introduces FCCT, a framework that quantifies causal contributions in LVLMs using a patch-and-trace method, achieving Recovery Rates up to 0.8 in key modules.
  • FCCT categorizes input tokens into seven groups and measures layer-wise effects across MHSA, FFN, and hidden states, revealing distinct cross-modal information fusion patterns.
  • FCCT enables Intermediate Representation Injection (IRI) interventions that significantly reduce hallucinations and improve object representation fidelity on multiple benchmarks.

Fine-grained Cross-modal Causal Tracing (FCCT) is a systematic framework for mechanistic interpretability in large vision-language models (LVLMs) that precisely quantifies and analyzes the causal contributions of visual and textual tokens, as well as internal model components, to model output. FCCT enables rigorous attribution of prediction responsibility across individual tokens and modules (multi-head self-attention (MHSA), feed-forward networks (FFNs), hidden states) throughout the network’s decoder layers. Its methodology underpins actionable interventions such as Intermediate Representation Injection (IRI) for hallucination mitigation, yielding quantifiable improvements in faithfulness and perception across multiple LVLM benchmarks (Li et al., 8 Nov 2025).

1. Definitions and Objectives

FCCT addresses the interpretability gap in LVLMs by enabling targeted, component-wise assessment of causal effect. The primary objective is to “open the black box” by measuring, for any input image–prompt pair $\mathbf{x} = (\boldsymbol{V}, \boldsymbol{T})$, how individual tokens and components contribute to the probability of a target output (e.g., a specific answer in VQA) after a sequence of forward passes:

  1. Clean run: Original input, yielding $P_{\mathrm{clean}}$.
  2. Corrupted run: Applying isotropic Gaussian noise to the image, yielding $P_{\mathrm{corrupted}}$.
  3. Patched run: At a specific layer $i$, component $c$ (MHSA, FFN, or hidden state), and token subset $S$, the activations in the corrupted run are replaced with their clean counterparts at $(i, c, S)$; model execution resumes to yield $P_{\mathrm{patched}}$.

The Recovery Rate (RR) quantifies the normalized gain from this patching operation:

\mathrm{RR}_{i, c}(S) = \frac{P_{\mathrm{patched}} - P_{\mathrm{corrupted}}}{P_{\mathrm{clean}} - P_{\mathrm{corrupted}}}

An RR near 1 indicates a strong causal role for the chosen tokens/components at that layer, while a value near 0 implies little influence. This quantitative metric enables fine-grained attribution throughout the model stack.
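As a worked illustration of the Recovery Rate, a minimal numeric sketch follows; the probabilities are hypothetical and not values reported in the paper:

p_clean = 0.90      # target-token probability on the clean run
p_corrupted = 0.10  # probability after the image is corrupted with Gaussian noise
p_patched = 0.74    # probability after restoring clean activations at (i, c, S)

rr = (p_patched - p_corrupted) / (p_clean - p_corrupted)
print(round(rr, 2))  # 0.8 -> patching this module recovers about 80% of the lost probability mass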

2. Methodological Framework

The FCCT workflow rigorously interrogates LVLMs as follows:

  1. Token Categorization: The input sequence is partitioned into seven categories: early visual, object visual, late visual, early textual, textual object, late textual, and last token (a minimal sketch of such a partition follows this list).
  2. Component-wise Patch-and-Trace: For each decoder layer $i$, model component $c \in \{\mathrm{MHSA}, \mathrm{FFN}, \mathrm{Hidden}\}$, and token category $S$, forward passes are run per the protocol described above.
  3. Causal Effect Calculation: RRs are computed for each (layer, component, token) tuple, mapping out the influence landscape throughout the model hierarchy.
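
The paper does not spell out the exact category boundaries; the following sketch shows one plausible way to partition token indices, assuming the image-token span and the object-related visual/textual token positions have already been identified (all helper names and boundary choices here are illustrative assumptions):

def categorize_tokens(n_tokens, img_span, obj_visual_idx, obj_text_idx):
    """Partition token indices into the seven FCCT categories (illustrative sketch).

    img_span: (start, end) range of image tokens in the input sequence.
    obj_visual_idx / obj_text_idx: indices of object-related visual / textual tokens.
    """
    v_start, v_end = img_span                      # image tokens occupy [v_start, v_end)
    visual = set(range(v_start, v_end))
    textual = set(range(v_end, n_tokens - 1))      # prompt tokens follow the image tokens
    obj_v, obj_t = set(obj_visual_idx), set(obj_text_idx)

    v_rest = sorted(visual - obj_v)
    t_rest = sorted(textual - obj_t)
    return {
        "early_visual":   v_rest[: len(v_rest) // 2],
        "object_visual":  sorted(obj_v),
        "late_visual":    v_rest[len(v_rest) // 2 :],
        "early_textual":  t_rest[: len(t_rest) // 2],
        "textual_object": sorted(obj_t),
        "late_textual":   t_rest[len(t_rest) // 2 :],
        "last_token":     [n_tokens - 1],          # position of autoregressive prediction
    }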

Pseudocode for the patching protocol is:

P_clean = forward(x)                         # clean run (computed once per input)
P_corr  = forward(x_noisy)                   # corrupted run (Gaussian noise on the image)

for i in range(1, L + 1):                    # decoder layers
    for c in ("MHSA", "FFN", "Hidden"):      # model components
        for S in token_categories:           # seven token categories
            # Re-run the corrupted input up to layer i-1, restore the clean
            # activations at (layer i, component c, token subset S), then resume.
            state = run_until_layer(i - 1, x_noisy)
            clean_state = run_until_layer(i - 1, x)
            state[i, c, S] = clean_state[i, c, S]
            P_patch = continue_forward(state)
            RR[i, c, S] = (P_patch - P_corr) / (P_clean - P_corr)

This approach generalizes to formal Δ-notation by measuring output differences after setting $h^{(i,c)}_S$ to its clean value.
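
For concreteness, the corrupted run can be constructed as follows; this is a minimal sketch assuming the isotropic Gaussian noise is added directly to the preprocessed image tensor, with the noise scale sigma left as a free parameter (the paper's exact noise configuration is not reproduced here):

import torch

def corrupt_image(pixel_values, sigma=1.0):
    """Apply isotropic Gaussian noise to the image input (illustrative sketch).

    pixel_values: preprocessed image tensor fed to the vision encoder.
    sigma: assumed noise standard deviation, not a value taken from the paper.
    """
    noise = sigma * torch.randn_like(pixel_values)
    return pixel_values + noise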

3. Empirical Findings

Applying FCCT to LLaVA-1.5-7B and related architectures reveals a set of distinct mechanistic patterns:

  • MHSA Last Token Aggregation: The MHSA module at the “last token” (the point of autoregressive prediction) in middle layers (layers 12–18) exhibits the highest causal impact on the output, with $\mathrm{RR} \approx 0.8$ for object visual and textual object tokens. Attention visualization confirms a sharp increase in aggregating information from these token types around layer 15.
  • FFN Hierarchy: FFNs display a three-stage progression:
    • Early layers (0–10): Specialization of object visual token representations.
    • Middle layers (11–20): Cross-modal interactions with textual object tokens.
    • Deep layers (21–32): Refinement of a task-specific, cross-modal vector via last-token aggregation.
  • Hidden State Shift: Hidden state semantics shift from purely visual (early) to mixed (mid) and finally to dominantly task-aligned and cross-modal at depth.

The following table summarizes peak RR values:

| Component | Token Category | Peak Layer | Peak RR |
|---|---|---|---|
| MHSA (last token) | object visual & textual object | ~15 | ~0.80 |
| FFN | visual → textual object progression | 5 → 15 → 28 | 0.4 → 0.6 → 0.7 |
| Hidden | hidden state semantic shift | — | 0.2 → 0.8 |

This mapping elucidates the locus of cross-modal information fusion and its progression through the model.

4. Intervention: Intermediate Representation Injection (IRI)

The empirical bottleneck identified by FCCT—mid-layer last-token MHSA and FFN modules—motivates the IRI technique. IRI is a training-free, inference-time intervention that bolsters mid-layer signals likely responsible for truthful object representation.

The IRI process proceeds as follows:

  1. Source/Target Layer Selection: Using RR curves, select the top-$k_1$ source layers with maximal RR and $k_2$ downstream target layers for each component.
  2. Injection Formula: For target layer $l$:

\tilde{\boldsymbol a}^{(l)} = \boldsymbol a^{(l)} + \lambda_a \sum_{k \in \mathcal{L}_{\mathrm{src}}^{\mathrm{attn}}:\, k<l} g(k,l)\, \mathrm{RR}_k^{\mathrm{attn}}\, \boldsymbol a^{(k)}

\tilde{\boldsymbol m}^{(l)} = \boldsymbol m^{(l)} + \lambda_m \sum_{k \in \mathcal{L}_{\mathrm{src}}^{\mathrm{mlp}}:\, k<l} g(k,l)\, \mathrm{RR}_k^{\mathrm{mlp}}\, \boldsymbol m^{(k)}

where $g(k,l)=1$ if $l>k$ and $0$ otherwise; $\lambda_a$ and $\lambda_m$ are scaling coefficients.

  3. Norm Preservation: Each injected vector is normalized to match the original $\ell_2$ norm.

Default hyperparameters (LLaVA-1.5-7B): $k_1=3$, $k_2=10$, $\lambda_a=0.26$, $\lambda_m=0.16$.
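
A minimal sketch of the attention-stream injection at a target layer, following the formulas above; the tensor handling, caching scheme, and helper names are illustrative assumptions, not the authors' implementation:

import torch

def inject_attention(a_l, source_outputs, rr_attn, l, lambda_a=0.26):
    """Intermediate Representation Injection for the MHSA output at target layer l.

    a_l:            MHSA output at target layer l.
    source_outputs: dict {k: a_k} of cached MHSA outputs at the selected source layers.
    rr_attn:        dict {k: RR_k^attn} of Recovery Rates for those source layers.
    lambda_a:       scaling coefficient (0.26 is the reported default for LLaVA-1.5-7B).
    """
    injected = a_l.clone()
    for k, a_k in source_outputs.items():
        if k < l:                                   # g(k, l) = 1 only for earlier source layers
            injected = injected + lambda_a * rr_attn[k] * a_k
    # Norm preservation: rescale so the injected vector keeps the original L2 norm.
    injected = injected * (a_l.norm() / (injected.norm() + 1e-8))
    return injected

The FFN stream would be handled analogously with $\lambda_m$ and the MLP-side Recovery Rates, with the same norm preservation applied.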

IRI yields consistent benchmark improvements:

  • POPE: +5.2% F1
  • MME hallucination subset: +65.7 points
  • CHAIR ($C_S$, $C_I$): –6.4 average reduction
  • MMHal-Bench: +0.44 Score, –8.45% hallucination
  • MHumanEval: –5.1% hallucination

All results are significant ($p<0.01$) and achieved without altering inference speed or requiring retraining.

5. Implications for Model Architecture and Training

The FCCT analysis demonstrates that mid-layer aggregation heads serve as a cross-modal information bottleneck. Potential model design strategies include:

  • Expanding capacity or adaptive routing in these layers.
  • Imposing auxiliary loss terms that enforce alignment between clean and noisy activations at key positions/components identified by high RR values (a sketch of such a term follows this list).
  • Automated discovery and manipulation of causal “wires” could be achievable through end-to-end differentiable routing schemes.
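
A hedged sketch of what such an auxiliary alignment loss could look like, assuming clean and noisy activations are cached at the high-RR (layer, component) positions during training; the RR-based weighting and the MSE form are illustrative choices, not prescribed by the paper:

import torch
import torch.nn.functional as F

def rr_alignment_loss(clean_acts, noisy_acts, rr_weights):
    """Auxiliary loss encouraging noisy-input activations to match clean ones
    at high-Recovery-Rate positions (illustrative, not from the paper).

    clean_acts / noisy_acts: dicts keyed by (layer, component), holding activation
    tensors restricted to the high-RR token positions.
    rr_weights: dict of the corresponding Recovery Rates, used as loss weights.
    """
    loss = 0.0
    for key, clean in clean_acts.items():
        loss = loss + rr_weights[key] * F.mse_loss(noisy_acts[key], clean.detach())
    return loss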

By mapping the internal information flow, FCCT provides actionable signals for optimizing network depth, component specialization, and interaction patterns in future LVLM architectures.

6. Extensions, Limitations, and Future Directions

While FCCT is validated using isotropic Gaussian noise as the perturbation for counterfactual analysis, further work could extend the methodology to more ecologically valid corruptions (e.g., spatial occlusion, cropping). A plausible implication is that such extensions would uncover additional vulnerability points and inform robustness-enhancing interventions.

The paradigm of dynamic inference—injecting “trusted” mid-layer states for controlled information flow—generalizes to other modalities, such as audio or video, and to tasks like prompt-editing. Automated, fully differentiable causal tracing and rerouting remain open research directions, with possible applications in model pruning and compact routing.

FCCT thus establishes a principled toolset for interpretable, causally-governed cross-modal modeling and intervention in large-scale vision-language systems (Li et al., 8 Nov 2025).
