
Typographic Circuit Ablation in CLIP

Updated 27 November 2025
  • The paper introduces a training-free typographic circuit ablation method that selectively disables typographic pathways in CLIP to counter adversarial text attacks.
  • It employs linear probes and the Typographic Attention Score (TAS) to identify and iteratively select specialized attention heads within the Vision Transformer for targeted ablation.
  • Experimental evaluations show significant improvements in typo-robustness (5–32 percentage points) with minimal impact on standard zero-shot performance across various datasets.

Typographic circuit ablation is a mechanistic intervention technique for defending vision-language models, specifically CLIP, against typographic attacks. These attacks exploit the model's propensity to extract and utilize visually embedded text, leading to targeted misclassification, content manipulation, and model jailbreaks. The typographic circuit ablation method selectively disrupts causal pathways that transmit typographic signals from image regions to the model's classification token, thereby improving robustness to attacks while maintaining standard vision performance (Hufe et al., 28 Aug 2025).

1. Identification of Typographic Circuits in Vision Encoders

The procedure begins by identifying attention heads within the Vision Transformer (ViT) backbone of CLIP that act as typographic specialists. Linear probes $P_{\mathrm{typo},\ell}$ trained on the CLS embedding $h_\ell(x)$ indicate that typographic information is linearly decodable only in the latter half of the ViT layers. To quantify head specialization, the Typographic Attention Score (TAS) $T_{i,\ell}$ is introduced. Given an attention head $\mathcal{H}_{i,\ell}$ and a dataset $\mathcal{D}$ with binary text-region masks $\mathbb{1}(t)$, TAS is defined by

$$T_{i,\ell} = \sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \mathbb{1}(t)\, A^*_{i,\ell}(x)_t$$

where $A_{i,\ell}(x)$ is the head's attention pattern and $A^*_{i,\ell}(x)$ denotes its attention from the CLS token to the spatial tokens. Heads whose $T_{i,\ell}$ significantly exceeds the mean score (by a factor of at least 3) are deemed "typographic-specialist" heads. Optionally, the causal influence of each head can be quantified by measuring the drop in typographic-attack classification accuracy when that head's CLS contribution is zeroed.
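The TAS computation above can be sketched as follows. This is a minimal illustration, not the paper's implementation: array shapes, the `dirichlet` toy attention, and the function name are assumptions for the example.

```python
import numpy as np

def typographic_attention_score(attn_cls, text_masks):
    """Sketch of TAS for one head: total CLS-to-spatial attention mass
    that lands on text-containing tokens, summed over the dataset.

    attn_cls:   (N, T) array; attn_cls[n, t] is the head's attention from
                the CLS token to spatial token t for image n (A*_{i,l}).
    text_masks: (N, T) binary array; 1 where token t overlaps rendered text.
    """
    return float((text_masks * attn_cls).sum())

# Toy example: 4 images, 8 spatial tokens, text in the first two tokens.
rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(8), size=4)   # each row sums to 1
masks = np.zeros((4, 8))
masks[:, :2] = 1.0
tas = typographic_attention_score(attn, masks)
```

Because each attention row sums to 1, a head's TAS over N images is bounded by N; heads well above the mean across all heads would be flagged as typographic specialists.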

2. Mechanism and Implementation of Ablation

Typographic circuit ablation operates via residual-stream intervention in ViT architectures. For a layer $\ell$, the standard CLS residual update

$$h^\ell_{\mathrm{cls}} = h^{\ell-1}_{\mathrm{cls}} + \sum_{i=1}^{I} \mathcal{H}_{i,\ell}(x)_{\mathrm{cls}} + \text{(MLP contribution)}$$

is modified for a chosen circuit $\mathcal{C} \subset \{\mathcal{H}_{i,\ell}\}$ by zeroing the CLS updates:

$$\mathcal{H}_{i,\ell}(x)_{\mathrm{cls}} \mapsto 0 \quad \forall\, \mathcal{H}_{i,\ell} \in \mathcal{C}$$

Spatial-to-spatial contributions remain untouched, isolating the intervention to the spatial-to-classification (CLS) pathways that carry typographic signals. The mechanism requires no model fine-tuning.
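The ablated residual update can be sketched numerically. This is a toy stand-in, assuming per-head CLS contributions are available as an array; the function name and shapes are illustrative, not the paper's code.

```python
import numpy as np

def cls_residual_update(h_cls_prev, head_cls_outputs, mlp_out, ablate=()):
    """Sketch of the ablated CLS update:
    h^l = h^{l-1} + sum_i H_{i,l}(x)_cls + MLP, with heads in `ablate`
    contributing zero to the CLS stream (spatial paths untouched).

    head_cls_outputs: (I, d) per-head CLS contributions at this layer.
    """
    contrib = head_cls_outputs.copy()
    contrib[list(ablate)] = 0.0          # zero CLS updates of circuit heads
    return h_cls_prev + contrib.sum(axis=0) + mlp_out

h_prev = np.zeros(4)
heads = np.ones((3, 4))                  # 3 heads, each contributing ones
mlp = np.full(4, 0.5)
out_full = cls_residual_update(h_prev, heads, mlp)
out_abl = cls_residual_update(h_prev, heads, mlp, ablate=(1,))
```

With three unit-contribution heads, ablating one head removes exactly its share from the CLS stream while the others and the MLP term pass through unchanged.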

3. Circuit Construction Algorithm

The selection of the typographic circuit is iterative and data-driven. For each attention head, compute $T_{i,\ell}$ and sort all heads by descending score. Initialize $\mathcal{C} = \emptyset$. Sequentially add heads to $\mathcal{C}$, and after each addition evaluate the relative drop in standard image accuracy ($\Delta\mathrm{Acc}$) on a clean dataset. If $\Delta\mathrm{Acc} \ge \epsilon$, exclude the most recently added head and stop. The final circuit comprises all heads added before the accuracy threshold is violated.

The pseudocode is:

For each head H in order of decreasing T_{i,ℓ}:
    Add H to C
    Evaluate the clean-accuracy drop ΔAcc
    If ΔAcc ≥ ε:
        Remove H from C
        Break
Return C
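The greedy selection can be made concrete with a short runnable sketch. The evaluation function here is a toy stand-in (in the paper it would be clean zero-shot accuracy with the candidate circuit ablated); all names and the toy accuracy numbers are assumptions.

```python
def build_circuit(heads_by_tas, clean_acc_fn, epsilon):
    """Greedy circuit construction sketch.

    heads_by_tas: head ids, assumed pre-sorted by descending TAS.
    clean_acc_fn: returns clean accuracy with the given head set ablated.
    epsilon:      maximum tolerated clean-accuracy drop.
    """
    baseline = clean_acc_fn(set())
    circuit = set()
    for head in heads_by_tas:
        circuit.add(head)
        if baseline - clean_acc_fn(circuit) >= epsilon:  # threshold violated
            circuit.remove(head)                         # undo last addition
            break
    return circuit

# Toy evaluation: ablating head 2 is assumed to cost 3 points of clean accuracy.
def toy_acc(circuit):
    return 0.75 - (0.03 if 2 in circuit else 0.0)

circuit = build_circuit([0, 1, 2, 3], toy_acc, epsilon=0.01)
```

In this toy run, heads 0 and 1 are kept; adding head 2 breaches the threshold, so it is removed and selection stops before head 3 is ever considered.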

4. Experimental Evaluation

Empirical analysis utilizes both synthetic and real-world typographic datasets:

  • Unsplash-Typo (synthetic): Used for scoring attention heads.
  • ImageNet-100-Typo: Typographic attack subset for probe training and circuit construction.
  • RTA-100, Disentangling, PAINT-DS: Real-world benchmarks.
  • Aircraft, Food-101, ImageNet-100 (clean): Used for zero-shot performance assessment.

Zero-shot classification accuracy serves as the evaluation metric. Key results for the ViT-B "dyslexic" (ablated) model:

  • ImageNet-100-Typo: 66.50% (baseline 46.90%; +19.60 pp)
  • RTA-100: 67.70% (+8.50 pp)
  • Disentangling: 88.33% (+32.78 pp)
  • PAINT-DS: 65.45% (+6.36 pp)
  • ImageNet-100 (clean): 75.26% (+0.50 pp)
  • Aircraft: 27.48% (+1.59 pp)
  • Food-101: 86.23% (±0.00 pp)

Trade-off analysis shows typo-accuracy increases monotonically as more typographic heads are ablated, while clean accuracy remains stable. The resulting circuits are sparse (e.g., 6/144 = 4.2% of heads for ViT-B).

5. Generation and Properties of Dyslexic CLIP Models

Dyslexic CLIP models are derived from pretrained OpenCLIP ViT-X (B, L, H, G, Big-G) by:

  1. Scoring all heads on Unsplash-Typo.
  2. Constructing the ablation circuit with $\epsilon = 0.01$ using 5% of clean ImageNet-100.
  3. Applying the ablation mask at inference.

No weights or parameters are modified; only the masked heads have their CLS contributions zeroed. These variants are drop-in replacements for standard models.
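Since no weights change, the whole "dyslexic" variant can be represented as the base model plus a boolean (layer, head) ablation mask applied at inference. A minimal sketch, with the model faked as per-layer head CLS contributions; all shapes and names are illustrative assumptions.

```python
import numpy as np

LAYERS, HEADS, DIM = 2, 4, 3

def forward_cls(head_outputs, mask):
    """Accumulate the CLS residual stream, skipping masked heads.

    head_outputs: (LAYERS, HEADS, DIM) per-head CLS contributions.
    mask:         (LAYERS, HEADS) bool; True = head is in the circuit.
    """
    h = np.zeros(DIM)
    for layer in range(LAYERS):
        kept = head_outputs[layer] * (~mask[layer])[:, None]
        h = h + kept.sum(axis=0)
    return h

outputs = np.ones((LAYERS, HEADS, DIM))
mask = np.zeros((LAYERS, HEADS), dtype=bool)
mask[1, 0] = True                       # ablate one "typographic" head
h = forward_cls(outputs, mask)
```

The mask is the only artifact distinguishing the dyslexic model from the original, which is what makes these variants drop-in replacements.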

Robustness to typographic attacks improves by 5–32 percentage points across model scales, and standard zero-shot performance drops by less than 1 percentage point. Compared to the Defense-Prefix (fine-tuned prefix) approach, the training-free ablated models achieve competitive typo-robustness and superior accuracy on out-of-domain clean datasets such as Aircraft and Food-101.

6. Significance and Application Scope

Typographic circuit ablation offers a mechanistic, training-free defense against text-based adversarial manipulations in multimodal models. By isolating and disabling typographic circuits, robust zero-shot classification is achieved with negligible impact on standard vision tasks. The method is particularly relevant for safety-critical deployments where text region exploitation poses systemic risks, and the utility of intentional text recognition is secondary to system integrity. The availability of dyslexic CLIP models as drop-in replacements further enhances applicability in high-assurance contexts (Hufe et al., 28 Aug 2025).

References (1)

  • Hufe et al., 28 Aug 2025.
