FCCT: Fine-grained Cross-modal Causal Tracing
- The paper introduces FCCT, a framework that quantifies causal contributions in LVLMs using a patch-and-trace method, achieving Recovery Rates up to 0.8 in key modules.
- FCCT categorizes input tokens into seven groups and measures layer-wise effects across MHSA, FFN, and hidden states, revealing distinct cross-modal information fusion patterns.
- FCCT enables Intermediate Representation Injection (IRI) interventions that significantly reduce hallucinations and improve object representation fidelity on multiple benchmarks.
Fine-grained Cross-modal Causal Tracing (FCCT) is a systematic framework for mechanistic interpretability in large vision-language models (LVLMs), aiming to precisely quantify and analyze the causal contributions of visual and textual tokens, as well as internal model components, to model output. FCCT enables rigorous attribution of prediction responsibility across individual tokens and modules (multi-head self-attention (MHSA), feed-forward networks (FFNs), and hidden states) throughout the network’s decoder layers. Its methodology underpins actionable interventions such as Intermediate Representation Injection (IRI) for hallucination mitigation, yielding quantifiable improvements in faithfulness and perception across multiple LVLM benchmarks (Li et al., 8 Nov 2025).
1. Definitions and Objectives
FCCT addresses the interpretability gap in LVLMs by enabling targeted, component-wise assessment of causal effect. The primary objective is to “open the black box” by measuring, for any input image–prompt pair, how individual tokens and components contribute to the probability of a target output (e.g., a specific answer in VQA) after a sequence of forward passes:
- Clean run: Original input, yielding $P_{\text{clean}}$.
- Corrupted run: Applying isotropic Gaussian noise to the image, yielding $P_{\text{corr}}$.
- Patched run: At a specific layer $i$, component $c$ (MHSA, FFN, or hidden state), and token subset $S$, the activations in the corrupted run are replaced with their clean counterparts at $(i, c, S)$; model execution resumes to yield $P_{\text{patch}}$.
The Recovery Rate (RR) quantifies the normalized gain from this patching operation:

$$\mathrm{RR} = \frac{P_{\text{patch}} - P_{\text{corr}}}{P_{\text{clean}} - P_{\text{corr}}}$$
An RR near 1 indicates a strong causal role for the chosen tokens/components at that layer, while a value near 0 implies little influence. This quantitative metric enables fine-grained attribution throughout the model stack.
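For illustration, the RR computation reduces to a single ratio once the three probabilities have been collected; the helper below is a minimal sketch with hypothetical names, not code from the paper.

```python
def recovery_rate(p_clean: float, p_corr: float, p_patch: float, eps: float = 1e-8) -> float:
    """Normalized probability recovered by patching: values near 1 mean the patched
    tokens/components restore nearly all of the clean prediction, values near 0
    mean patching had little effect."""
    return (p_patch - p_corr) / (p_clean - p_corr + eps)

# Example: patching recovers most of the probability mass lost to image corruption.
rr = recovery_rate(p_clean=0.92, p_corr=0.10, p_patch=0.78)   # ≈ 0.83
```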
2. Methodological Framework
The FCCT workflow rigorously interrogates LVLMs as follows:
- Token Categorization: The input sequence is partitioned into seven categories: early visual, object visual, late visual, early textual, textual object, late textual, and last token (an illustrative index-based partition is sketched after this list).
- Component-wise Patch-and-Trace: For each decoder layer $i$, model component $c$, and token category $S$, forward passes are run per the protocol described above.
- Causal Effect Calculation: RRs are computed for each (layer, component, token) tuple, mapping out the influence landscape throughout the model hierarchy.
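As an illustration of the categorization step, the hypothetical helper below partitions a flattened [visual tokens | textual tokens] input sequence into the seven categories by index ranges; the token counts and object spans in the usage example are assumptions for illustration, not values from the paper.

```python
def categorize_tokens(n_visual, obj_visual_span, n_textual, obj_textual_span):
    """Assign each position of the flattened [visual | textual] sequence to one of
    the seven FCCT categories. Spans are (start, end) index ranges of the tokens
    grounded to the queried object (assumed known, e.g. from a detector / the prompt)."""
    categories = {k: [] for k in (
        "early_visual", "object_visual", "late_visual",
        "early_textual", "textual_object", "late_textual", "last_token")}
    total = n_visual + n_textual
    for pos in range(total):
        if pos == total - 1:
            categories["last_token"].append(pos)
        elif pos < n_visual:
            if pos < obj_visual_span[0]:
                categories["early_visual"].append(pos)
            elif pos < obj_visual_span[1]:
                categories["object_visual"].append(pos)
            else:
                categories["late_visual"].append(pos)
        else:
            t = pos - n_visual
            if t < obj_textual_span[0]:
                categories["early_textual"].append(pos)
            elif t < obj_textual_span[1]:
                categories["textual_object"].append(pos)
            else:
                categories["late_textual"].append(pos)
    return categories

# Hypothetical example: 576 visual tokens with the object occupying patches 200-240,
# followed by a 40-token prompt whose object mention spans textual positions 10-13.
cats = categorize_tokens(576, (200, 240), 40, (10, 13))
```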
Pseudocode for the patching protocol is:
```
for i in 1..L:                          # decoder layers
    for c in {MHSA, FFN, Hidden}:       # components
        for S in 1..7:                  # token categories
            P_clean = forward(x)                       # clean run
            P_corr  = forward(x_noisy)                 # corrupted run
            state       = run_until_layer(i-1, x_noisy)
            clean_state = run_until_layer(i-1, x)
            state[i, c, S] = clean_state[i, c, S]      # patch clean activations for S
            P_patch = continue_forward(state)          # patched run
            RR[i, c, S] = (P_patch - P_corr) / (P_clean - P_corr)
```
This approach generalizes to formal Δ-notation by measuring output differences after setting the activation at $(i, c, S)$ to its clean value.
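In practice, the patch-and-trace protocol can be implemented with forward hooks rather than explicit `run_until_layer` / `continue_forward` helpers. The sketch below is a minimal illustration for the hidden-state component, assuming a HuggingFace-style LLaVA model whose decoder layers sit at `model.language_model.model.layers`; the module path, function names, and signatures are assumptions, not the paper's code (MHSA and FFN patching would require hooks on the corresponding sub-modules).

```python
import torch

def _hidden(output):
    """Decoder layers may return a tensor or a tuple whose first element is the hidden state."""
    return output[0] if isinstance(output, tuple) else output

def patched_logprob(model, inputs_clean, inputs_noisy, layer_idx, positions, target_id):
    """Corrupted-run forward pass with the hidden states of one decoder layer
    overwritten by their clean-run values at `positions`; returns log P(target_id)
    at the final position."""
    layers = model.language_model.model.layers            # assumed HF LLaVA-style module path
    cache = {}

    def cache_hook(module, args, output):                 # clean run: store activations
        cache["h"] = _hidden(output).detach()

    def patch_hook(module, args, output):                 # noisy run: splice clean activations in
        h = _hidden(output).clone()
        h[:, positions, :] = cache["h"][:, positions, :]
        return ((h,) + output[1:]) if isinstance(output, tuple) else h

    with torch.no_grad():
        handle = layers[layer_idx].register_forward_hook(cache_hook)
        model(**inputs_clean)                              # clean run populates the cache
        handle.remove()

        handle = layers[layer_idx].register_forward_hook(patch_hook)
        out = model(**inputs_noisy)                        # corrupted run with clean activations spliced in
        handle.remove()

    logits = out.logits[0, -1]                             # next-token distribution
    return torch.log_softmax(logits, dim=-1)[target_id].item()
```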
3. Empirical Findings
Applying FCCT to LLaVA-1.5-7B and related architectures reveals a set of distinct mechanistic patterns:
- MHSA Last Token Aggregation: The MHSA module at the “last token” position (the point of autoregressive prediction) in the middle layers (layers 12–18) exhibits the highest causal impact on the output, with RR values of roughly 0.8 for object visual and textual object tokens. Attention visualization confirms a sharp increase in the aggregation of information from these token types at layer 15.
- FFN Hierarchy: FFNs display a three-stage progression:
- Early layers (0–10): Specialization of object visual token representations.
- Middle layers (11–20): Cross-modal interactions with textual object tokens.
- Deep layers (21–32): Refinement of a task-specific, cross-modal vector via last-token aggregation.
- Hidden State Shift: Hidden state semantics shift from purely visual (early) to mixed (mid) and finally to dominantly task-aligned and cross-modal at depth.
The following table summarizes peak RR values:
| Component | Token Category | Peak Layer | Peak RR |
|---|---|---|---|
| MHSA (last token) | obj. visual & textual object | ~15 | ~0.80 |
| FFN | visual → textual object prog. | 5→15→28 | 0.4→0.6→0.7 |
| Hidden | hidden state semantic shift | – | 0.2→0.8 |
This mapping elucidates the locus of cross-modal information fusion and its progression through the model.
4. Intervention: Intermediate Representation Injection (IRI)
The empirical bottleneck identified by FCCT—mid-layer last-token MHSA and FFN modules—motivates the IRI technique. IRI is a training-free, inference-time intervention that bolsters mid-layer signals likely responsible for truthful object representation.
The IRI process proceeds as follows:
- Source/Target Layer Selection: Using RR curves, select the top-$k$ source layers with maximal RR and the corresponding downstream target layers for each component.
- Injection Formula: For each target layer $j$, the representation $h_j$ at the last-token position is updated by blending in the selected source-layer representation $h_s$:

$$h_j \leftarrow \beta\, h_j + \alpha\, \delta_j\, h_s,$$

where $\delta_j = 1$ if $j$ is a selected target layer and 0 otherwise, and $\alpha$, $\beta$ are scaling coefficients.
- Norm Preservation: Each injected vector is normalized to match the original norm.
Default hyperparameters for LLaVA-1.5-7B fix the number of source layers $k$, the target layers, and the scaling coefficients $\alpha$ and $\beta$ (values as reported in (Li et al., 8 Nov 2025)).
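As a minimal sketch of the injection and norm-preservation steps described above (function and variable names are illustrative, not the paper's implementation):

```python
import torch

def inject(h_target: torch.Tensor, h_source: torch.Tensor,
           alpha: float, beta: float, eps: float = 1e-8) -> torch.Tensor:
    """Blend a source-layer representation into the target-layer representation,
    then rescale so the injected vector keeps the original norm."""
    mixed = beta * h_target + alpha * h_source
    return mixed * (h_target.norm() / (mixed.norm() + eps))   # norm preservation
```

In a hook-based implementation, `inject` would be applied to the last-token hidden state of each selected target layer during decoding, leaving all other positions untouched.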
IRI yields consistent benchmark improvements:
- POPE: +5.2% F1
- MME hallucination subset: +65.7 points
- CHAIR ($\text{CHAIR}_S$, $\text{CHAIR}_I$): –6.4 average reduction
- MMHal-Bench: +0.44 Score, –8.45% hallucination
- MHumanEval: –5.1% hallucination
All results are statistically significant and are achieved without altering inference speed or requiring retraining.
5. Implications for Model Architecture and Training
The FCCT analysis demonstrates that mid-layer aggregation heads serve as a cross-modal information bottleneck. Potential model design strategies include:
- Expanding capacity or adaptive routing in these layers.
- Imposing auxiliary loss terms that enforce alignment between clean and noisy activations at key positions/components identified by high RR values (a minimal sketch follows this list).
- Automating discovery and manipulation of causal “wires” through end-to-end differentiable routing schemes.
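As a minimal sketch of such an auxiliary alignment term, assuming clean- and noisy-input activations have been cached per layer (the names and the MSE choice are illustrative assumptions):

```python
import torch.nn.functional as F

def rr_alignment_loss(clean_acts, noisy_acts, key_sites):
    """Penalize divergence between noisy- and clean-input activations at
    (layer, position) sites selected for their high Recovery Rates.

    clean_acts / noisy_acts: dict of layer index -> (batch, seq, hidden) tensors
    key_sites: list of (layer_idx, position) pairs taken from FCCT's RR maps
    """
    loss = 0.0
    for layer, pos in key_sites:
        loss = loss + F.mse_loss(noisy_acts[layer][:, pos, :],
                                 clean_acts[layer][:, pos, :].detach())
    return loss / max(len(key_sites), 1)
```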
By mapping the internal information flow, FCCT provides actionable signals for optimizing network depth, component specialization, and interaction patterns in future LVLM architectures.
6. Extensions, Limitations, and Future Directions
While FCCT is validated using isotropic Gaussian noise as the perturbation for counterfactual analysis, further work could extend the methodology to more ecologically valid corruptions (e.g., spatial occlusion, cropping). A plausible implication is that such extensions would uncover additional vulnerability points and inform robustness-enhancing interventions.
The paradigm of dynamic inference—injecting “trusted” mid-layer states for controlled information flow—generalizes to other modalities, such as audio or video, and to tasks like prompt-editing. Automated, fully differentiable causal tracing and rerouting remain open research directions, with possible applications in model pruning and compact routing.
FCCT thus establishes a principled toolset for interpretable, causally-governed cross-modal modeling and intervention in large-scale vision-language systems (Li et al., 8 Nov 2025).