
CARD Vector in Transformer Models

Updated 20 November 2025
  • The paper demonstrates that CARD vectors are unit-norm directions in the residual stream that capture key contextual features for model control.
  • The methodology uses efficient in-layer pooling and linear probing to detect and steer hallucinations with minimal additional computation.
  • Empirical results show improved accuracy and significant reductions in hallucination metrics, underscoring CARD's potential for causal interventions in model outputs.

The Contextual Activation Residual Direction (CARD) vector is a canonical, linearly accessible direction in the residual-stream activation space of modern Transformer-based models that delineates specific contextual characteristics, such as visual faithfulness or the presence of hallucinations, within generated outputs. CARD vectors provide a practical, low-overhead mechanism for both detecting and steering model behavior at inference time. They have been formalized and empirically validated in large vision-language models (LVLMs) for hallucination mitigation, and in language-only settings for separating contextually faithful from hallucinated text. CARD mechanisms are widely applicable due to their model-agnostic formulation, low computational cost, and capacity for causal intervention in internal model representations.

1. Formal Definition and Extraction

A CARD vector is defined as a unit-norm direction in the hidden state (residual stream) at a particular Transformer layer that encodes semantic or contextual structure relevant for model control. The specific instantiation depends on the target characteristic:

  • LVLM Setting (Zou et al., 13 Nov 2025): For each token $i$ in the prefill span (image tokens + text prompt), let $A^\ell_i \in \mathbb{R}^d$ be the self-attention output (i.e., the residual update) at decoder layer $\ell$ in a pre-norm Transformer. The per-sample CARD vector is

$$v_{\text{CARD}} = \frac{\text{Pool}(\{\Delta^\ell_i\}_{i \in \mathcal{T}_{\text{pre}}})}{\lVert \text{Pool}(\{\Delta^\ell_i\}_{i \in \mathcal{T}_{\text{pre}}}) \rVert_2}$$

where $\Delta^\ell_i = A^\ell_i$ and Pool is either the simple or norm-weighted mean over the prefill tokens.

  • LLM (Hallucination) Setting (O'Neill et al., 31 Jul 2025): Given a corpus labeled as contextually faithful or hallucinated, a linear probe (logistic regression) is trained at a selected layer $\ell^*$. For each example $i$, extract $h_i = r_{\ell^*}(t^*)$, the residual-stream state at the terminal token $t^*$ of the candidate output. The probe's weight vector $w$ is normalized to obtain $d = w / \lVert w \rVert_2$, designating the CARD direction.

Empirical studies demonstrate that a single forward pass suffices for extraction, yielding one CARD vector per (image, prompt) pair in LVLMs and a universal direction learned from labeled corpora in LLMs.
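Both extraction routes reduce to pooling or normalization over residual-stream vectors. The following minimal NumPy sketch illustrates them on toy arrays; the shapes, the norm-weighted pooling variant, and the function names are illustrative assumptions, not the papers' reference implementations:

```python
import numpy as np

def card_from_prefill(residual_updates, weights=None):
    """Pool per-token residual updates over the prefill span and
    normalize to a unit-norm direction (simple or norm-weighted mean)."""
    A = np.asarray(residual_updates, dtype=float)  # shape: (n_prefill_tokens, d)
    if weights is None:
        pooled = A.mean(axis=0)                    # simple mean pooling
    else:
        w = np.asarray(weights, dtype=float)
        pooled = (w[:, None] * A).sum(axis=0) / w.sum()  # norm-weighted mean
    return pooled / np.linalg.norm(pooled)         # unit-norm CARD vector

def card_from_probe(w):
    """Normalize a trained logistic-probe weight vector to the direction d."""
    w = np.asarray(w, dtype=float)
    return w / np.linalg.norm(w)

# Toy usage: 4 prefill tokens, hidden size 8 (stand-ins for real activations).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8))
v_card = card_from_prefill(A, weights=np.linalg.norm(A, axis=1))
d_probe = card_from_probe(rng.normal(size=8))
```

In a real LVLM, `residual_updates` would be captured from the chosen decoder layer during the standard prefill pass, so extraction adds no extra forward computation.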

2. Computational and Algorithmic Framework

The CARD vector extraction and utilization pipeline is computationally efficient and modular:

  • Extraction in LVLMs: CARD vectors are obtained at zero additional computational cost during the standard prefill pass by pooling residual updates at a single chosen decoder layer. No specialized masking, gating, or additional model forward calls are needed (Zou et al., 13 Nov 2025).
  • Linear Probing in LM Hallucination: For detection, the probe's statistic $s = w^\top h + b$ is computed once per example at inference, requiring only a dot product after the forward pass up to the selected layer. The process is parallelizable and light on memory (O'Neill et al., 31 Jul 2025).
  • Runtime Steering: At each decoding step, after the residual update (post-attention or pre-MLP), a scaled version of CARD is injected additively into the residual stream. Adaptive gating, crucial for dynamic control, is performed by computing the cosine similarity $s_t$ between the evolving hidden state and $v_{\text{CARD}}$, then modulating the steering signal according to a Beta-Bernoulli posterior mean $g_t$, constructed from $s_t$ and model-specific hyperparameters (Zou et al., 13 Nov 2025).
  • Complexity: CARD extraction and injection have negligible (<5%) impact on throughput; steering requires only inner products, softplus operations, vector additions, and a clamp per decode step.
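One decode step of the gated injection described above can be sketched as follows. The exact form of the Beta-Bernoulli gate is not spelled out in this summary, so the mapping from similarity $s_t$ to the posterior mean $g_t$ below (a sigmoid squash with sensitivity `k`, treated as `c` pseudo-observations against a Beta(`a0`, `b0`) prior) and all hyperparameter values are hypothetical placeholders:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gated_steer_step(h_t, v_card, alpha=4.0, k=10.0, c=8.0,
                     a0=1.0, b0=1.0, g_min=0.0, g_max=1.0):
    """One decode-step injection: cosine alignment -> gate g_t -> clamped
    additive steering along v_card (hypothetical instantiation)."""
    s_t = float(h_t @ v_card) / (np.linalg.norm(h_t) * np.linalg.norm(v_card))
    p_t = 1.0 / (1.0 + np.exp(-k * s_t))     # squash similarity via sensitivity k
    g_t = (a0 + c * p_t) / (a0 + b0 + c)     # Beta posterior mean, c pseudo-obs
    g_t = float(np.clip(g_t, g_min, g_max))  # keep gate within prescribed bounds
    return h_t + g_t * softplus(alpha) * v_card  # additive residual injection

# Toy usage on a stand-in hidden state and unit-norm CARD direction.
rng = np.random.default_rng(1)
h = rng.normal(size=8)
v = rng.normal(size=8)
v /= np.linalg.norm(v)
h_steered = gated_steer_step(h, v)
```

Consistent with the complexity claim above, each step uses only inner products, a softplus, a clamp, and a vector addition.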

3. Mechanistic Interpretability and Attribution

CARD vectors localize interpretable, mechanistically grounded axes of contextual informativeness:

  • Attribution to Sub-Circuits (O'Neill et al., 31 Jul 2025): Gradient-times-activation scoring reveals that the CARD direction arises from highly sparse activity in specific late-layer MLP modules. For Gemma-2-9B, maximal attributions localize to a contiguous 3-layer feed-forward subcircuit immediately before the output, implicating these as the seat of contextual faithfulness/hallucination encoding.
  • Steering Causality: Empirical ablation studies confirm that manipulating the residual stream along the CARD axis by scalar injection enables bidirectional causal control over hallucination rates at generation time. The effect is pronounced and monotonic with respect to the scaling $\alpha$ (O'Neill et al., 31 Jul 2025).

A plausible implication is that the concentration of attribution in late MLPs signals a low-dimensional bottleneck for critical semantic control, supporting targeted interventions.
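Gradient-times-activation attribution is easiest to see on a toy linear stack, where the gradient is analytic and per-unit attributions sum exactly to the probe score. The matrices and dimensions below are illustrative stand-ins, not the models analyzed in the cited work:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d1 = 6, 5
x  = rng.normal(size=d)
M1 = rng.normal(size=(d1, d))   # first (toy) MLP map
M2 = rng.normal(size=(d, d1))   # second map back to residual width
w  = rng.normal(size=d)         # probe weights defining the CARD direction

a1 = M1 @ x                     # intermediate activation
s  = w @ (M2 @ a1)              # probe score along the CARD direction

grad_a1 = M2.T @ w              # ds/da1, analytic for a linear stack
attr    = grad_a1 * a1          # gradient-times-activation per unit

# Sorting |attr| shows how concentrated (sparse) the attribution is.
ranked = np.sort(np.abs(attr))[::-1]
```

Because the stack is linear and bias-free, `attr.sum()` recovers `s` exactly; in a real Transformer the same scoring is computed via autograd, and sparsity of `attr` across layers is what localizes the sub-circuit.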

4. Applications: Hallucination Detection, Mitigation, and Decoding Regulation

CARD vectors form the basis for a range of practical and effective interventions:

  • Detection: Logistic probe scores over the CARD direction yield high-fidelity contextual hallucination detection in a single forward pass. F1 scores reach 0.96–0.99 on standard news benchmarks, with effective cross-model and cross-domain transfer (O'Neill et al., 31 Jul 2025).
  • Mitigation/Steering: In LVLMs, adaptive regulation using CARD (within the RUDDER framework) suppresses object hallucination by driving generation toward visually grounded content. Hallucination metrics are improved on benchmarks such as CHAIR and POPE, with performance gains achieved without sacrificing efficiency (Zou et al., 13 Nov 2025).
  • Decoding Regulation: By dynamically injecting the CARD direction, regulated by per-token alignment to $v_{\text{CARD}}$, models maintain output consistency with input visual or textual context.
  • Cross-Task Transfer: CARD vectors trained or extracted on one dataset (e.g., MATH500) transfer effectively to others (e.g., GSM8K), as evidenced by high cosine similarity (≈0.92) and stable performance (Azizi et al., 7 Jul 2025).
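The transfer claim above rests on comparing directions extracted on different datasets by cosine similarity. A minimal sketch, with synthetic stand-ins for the two extracted directions:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for directions extracted on two datasets: the second is a
# lightly perturbed copy of the first, mimicking a transferable direction.
rng = np.random.default_rng(3)
v_a = rng.normal(size=16)
v_a /= np.linalg.norm(v_a)
v_b = v_a + 0.05 * rng.normal(size=16)
v_b /= np.linalg.norm(v_b)

sim = cosine(v_a, v_b)  # near 1.0 suggests the same underlying axis
```

A similarity near the reported ≈0.92 between directions from distinct corpora would indicate that both probes found essentially the same residual-stream axis.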

5. Empirical Benchmarks and Quantitative Results

Key experimental findings establish the robustness and effectiveness of CARD-based interventions:

| Setting / Method | Key Metric | Baseline | CARD-based |
|---|---|---|---|
| LVLM CHAIR (S/I) | Hallucination % (↓) | 48.6 / 13.6 | 39.5 / 10.5 |
| LVLM POPE | Accuracy / F1 (↑) | 85.3 / 84.9 | 86.5 / 86.0 |
| Hallucination Probe | F1 (news / con. tales) | 0.96–0.99 / 0.70–0.84 | – |
| Throughput | Relative runtime | 1.00 | 0.96 |

CARD approaches achieve ≈19–23% reductions on open-ended hallucination metrics and ≈1 percentage point improvements in object probing accuracy over strong baselines. Throughput remains at ≈96% of vanilla inference, outperforming methods that require extra forward passes (Zou et al., 13 Nov 2025, O'Neill et al., 31 Jul 2025).

Steering along CARD in LMs yields monotonic control over hallucination rates and repetition, with functional bidirectionality (hallucination rates sweep from 0.3 to 0.86 as $\alpha$ sweeps from −60 to +60).

6. Design Choices, Hyperparameters, and Best Practices

Robust CARD deployment relies on several key design practices:

  • Layer Selection: Late decoder layers maximize leverage for steering; per-model tuning suffices (e.g., layers 28–30 in large LVLMs) (Zou et al., 13 Nov 2025).
  • Pooling Strategy: Mean or norm-weighted means for pooling residual updates in the prefill span yield similar results; empirical evaluation determines optimal choice per setting.
  • Gating Hyperparameters: The sensitivity parameter $k$ and concentration $c$ in the adaptive Beta gate modulate responsiveness; $g_t$ is kept within prescribed bounds (e.g., $[g_{\min}, g_{\max}]$) to avoid oversteering.
  • Injection Location: Steering is injected post-self-attention at the target decoder layer, and—when applicable—restricted to answer tokens for both efficiency and effect alignment.
  • No Additional Forward Passes: CARD extraction and injection require no modification of the training process or extra inference-time model evaluations, significantly reducing deployment complexity.

7. Limitations, Transfer, and Outlook

While CARD demonstrates consistently high performance and transferability across datasets and model sizes, its effectiveness depends on the presence of a low-dimensional latent encoding for the relevant contextual feature. Unsupervised adaptation (e.g., SFT on in-domain correct samples) further elevates detection F1, suggesting that domain-specific fine-tuning can enhance robustness (O'Neill et al., 31 Jul 2025).

A plausible implication is that as models and domains diversify, future research may explore multi-dimensional steering and the relationship between CARD and internal model interpretability, opening avenues for even more precise causal interventions in model behavior, both for hallucination reduction and broader alignment tasks.
