Typographic Circuit Ablation in CLIP
- The paper introduces a training-free typographic circuit ablation method that selectively disables typographic pathways in CLIP to counter adversarial text attacks.
- It employs linear probes and the Typographic Attention Score (TAS) to identify and iteratively select specialized attention heads within the Vision Transformer for targeted ablation.
- Experimental evaluations show significant improvements in typo-robustness (5–32 percentage points) with minimal impact on standard zero-shot performance across various datasets.
Typographic circuit ablation is a mechanistic intervention technique for defending vision-language models, specifically CLIP, against typographic attacks. These attacks exploit the model's propensity to extract and utilize visually embedded text, leading to targeted misclassification, content manipulation, and model jailbreaks. The typographic circuit ablation method selectively disrupts causal pathways that transmit typographic signals from image regions to the model's classification token, thereby improving robustness to attacks while maintaining standard vision performance (Hufe et al., 28 Aug 2025).
1. Identification of Typographic Circuits in Vision Encoders
The procedure begins by identifying attention heads within the Vision Transformer (ViT) backbone of CLIP that act as typographic specialists. Linear probes on the CLS embedding indicate that typographic information is linearly decodable only in the latter half of the ViT layers. To quantify head specialization, the Typographic Attention Score (TAS) is introduced. Given attention head $i$ in layer $\ell$ and a dataset $\mathcal{D}$ with binary text-region masks $\mathds{1}(t)$, TAS is defined by: $T_{i,\ell} = \sum_{x \in \mathcal{D}} \sum_{t=1}^T \mathds{1}(t) A^*_{i,\ell}(x)_t$, where $A^*_{i,\ell}(x)$ is the head's attention pattern and $A^*_{i,\ell}(x)_t$ denotes the attention from the CLS token to spatial token $t$. Heads whose $T_{i,\ell}$ significantly exceeds the mean over all heads are deemed "typographic-specialist" heads. Optionally, the causal influence of each head can be quantified by measuring the drop in typographic-attack classification accuracy when that head's CLS contribution is zeroed.
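As a concrete illustration, a TAS-style score for a single head can be computed from its CLS-row attention weights and per-image text-region masks. This is a minimal NumPy sketch; the array shapes and the toy values are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def typographic_attention_score(attn_cls_rows, text_masks):
    """TAS-style score for one attention head.

    attn_cls_rows: (N, T) array; row x holds the head's attention
        weights from the CLS token to the T spatial tokens of image x.
    text_masks: (N, T) binary array; 1 where a spatial token overlaps
        a rendered-text region, 0 elsewhere.
    Returns the CLS-to-text attention mass summed over the dataset.
    """
    return float((attn_cls_rows * text_masks).sum())

# Toy example: 2 images, 4 spatial tokens each.
attn = np.array([[0.1, 0.6, 0.2, 0.1],
                 [0.3, 0.3, 0.3, 0.1]])
mask = np.array([[0, 1, 0, 0],   # token 1 covers text in image 0
                 [0, 0, 1, 0]])  # token 2 covers text in image 1
score = typographic_attention_score(attn, mask)  # 0.6 + 0.3
```

A head that consistently routes CLS attention onto text regions accumulates a high score, which is exactly the specialization signal TAS is designed to surface.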
2. Mechanism and Implementation of Ablation
Typographic circuit ablation operates via residual-stream intervention in ViT architectures. For a layer $\ell$, the standard CLS residual update $z^{\ell+1}_{\mathrm{CLS}} = z^{\ell}_{\mathrm{CLS}} + \sum_{i} h_{i,\ell}(z^{\ell})_{\mathrm{CLS}}$, where $h_{i,\ell}(z^{\ell})_{\mathrm{CLS}}$ is the CLS contribution of head $i$, is modified for a chosen circuit $\mathcal{C}$ by zeroing the CLS updates of its heads: $z^{\ell+1}_{\mathrm{CLS}} = z^{\ell}_{\mathrm{CLS}} + \sum_{(i,\ell) \notin \mathcal{C}} h_{i,\ell}(z^{\ell})_{\mathrm{CLS}}$. Spatial-to-spatial contributions remain untouched, isolating the intervention to the selected heads' spatial-to-classification (CLS) pathways. This mechanism is operationalized without any model fine-tuning.
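The CLS-only zeroing can be sketched in a few lines. This toy NumPy example assumes per-head residual contributions are available separately and that the CLS token sits at index 0; both are conventions of this sketch, not the paper's code:

```python
import numpy as np

def ablated_residual_update(z, head_outputs, ablate_heads):
    """One ViT layer's residual update with selected heads ablated.

    z:            (T+1, d) residual stream; index 0 is the CLS token.
    head_outputs: list of (T+1, d) arrays, one per attention head --
        each head's additive contribution to the residual stream.
    ablate_heads: indices of heads whose CLS update is zeroed.
    """
    z_new = z.copy()
    for i, h in enumerate(head_outputs):
        h = h.copy()
        if i in ablate_heads:
            h[0] = 0.0          # drop this head's CLS contribution only
        z_new += h              # spatial positions are left untouched
    return z_new

# Toy check: 2 heads over 3 tokens (CLS + 2 spatial), ablate head 1.
z = np.zeros((3, 2))
heads = [np.ones((3, 2)), np.ones((3, 2))]
out = ablated_residual_update(z, heads, {1})
# CLS row receives head 0 only; spatial rows receive both heads.
```

Because only the CLS row of the ablated heads is zeroed, spatial tokens still exchange information normally, which is why clean accuracy is largely preserved.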
3. Circuit Construction Algorithm
The selection of the typographic circuit is iterative and data-driven. For each attention head, compute $T_{i,\ell}$ and sort all heads by descending score. Initialize $\mathcal{C} = \emptyset$. Sequentially add heads to $\mathcal{C}$, and after each addition evaluate the relative drop in standard image accuracy ($\Delta\mathrm{Acc}$) on a clean dataset. If $\Delta\mathrm{Acc} \geq \varepsilon$, exclude the last-added head and stop. The final circuit comprises all heads added before the accuracy threshold is violated.
The pseudocode is:
For each head H in order of decreasing T_{i,ℓ}:
    Add H to C
    Evaluate relative accuracy drop ΔAcc on clean images
    If ΔAcc ≥ ε:
        Remove H from C
        Break
Return C
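The greedy loop translates directly to Python. The scoring dict and the clean-accuracy oracle below are hypothetical stand-ins, assumed here only to make the selection logic concrete:

```python
def build_typographic_circuit(scores, clean_acc_fn, eps=0.01):
    """Greedy circuit selection: add heads by descending TAS until the
    relative clean-accuracy drop reaches eps.

    scores:       dict mapping head id -> TAS score.
    clean_acc_fn: callable taking a set of head ids and returning clean
        zero-shot accuracy with those heads ablated (hypothetical).
    """
    base_acc = clean_acc_fn(set())
    circuit = set()
    for head in sorted(scores, key=scores.get, reverse=True):
        trial = circuit | {head}
        drop = (base_acc - clean_acc_fn(trial)) / base_acc
        if drop >= eps:
            break               # this head hurts clean accuracy; stop
        circuit = trial
    return circuit

# Toy usage: head 2 degrades clean accuracy past the threshold.
accs = {frozenset(): 0.90, frozenset({0}): 0.90,
        frozenset({0, 1}): 0.895, frozenset({0, 1, 2}): 0.80}
circuit = build_typographic_circuit(
    scores={0: 5.0, 1: 3.0, 2: 1.0},
    clean_acc_fn=lambda c: accs[frozenset(c)],
    eps=0.01)
# circuit == {0, 1}
```

Evaluating the oracle on a small held-out clean split keeps the selection cheap, since each candidate head requires only one extra forward pass over that split.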
4. Experimental Evaluation
Empirical analysis utilizes both synthetic and real-world typographic datasets:
- Unsplash-Typo (synthetic): Used for scoring attention heads.
- ImageNet-100-Typo: Typographic attack subset for probe training and circuit construction.
- RTA-100, Disentangling, PAINT-DS: Real-world benchmarks.
- Aircraft, Food-101, ImageNet-100 (clean): Used for zero-shot performance assessment.
Zero-shot classification accuracy serves as the evaluation metric. Key results for the ViT-B dyslexic (ablated) model:
- ImageNet-100-Typo: 66.50% (baseline 46.90%; +19.60 pp)
- RTA-100: 67.70% (+8.50 pp)
- Disentangling: 88.33% (+32.78 pp)
- PAINT-DS: 65.45% (+6.36 pp)
- ImageNet-100 (clean): 75.26% (+0.50 pp)
- Aircraft: 27.48% (+1.59 pp)
- Food-101: 86.23% (±0.00 pp)
Trade-off analysis shows typo-accuracy increases monotonically as more typographic heads are ablated, while clean accuracy remains stable. The resulting circuits are sparse (e.g., 6/144 = 4.2% of heads for ViT-B).
5. Generation and Properties of Dyslexic CLIP Models
Dyslexic CLIP models are derived from pretrained OpenCLIP ViT-X (B, L, H, G, Big-G) by:
- Scoring all heads on Unsplash-Typo.
- Constructing the ablation circuit, with the accuracy-drop threshold $\varepsilon$ evaluated on 5% of clean ImageNet-100.
- Applying the ablation mask at inference.
No weights or parameters are modified; only the masked heads have their CLS contributions zeroed. These variants are drop-in replacements for standard models.
Robustness to typographic attacks improves by 5–32 percentage points across model scales, and standard zero-shot performance drops by less than 1 percentage point. Compared to the Defense-Prefix (fine-tuned prefix) approach, the training-free ablated models achieve competitive typo-robustness and superior accuracy on out-of-domain clean datasets such as Aircraft and Food-101.
6. Significance and Application Scope
Typographic circuit ablation offers a mechanistic, training-free defense against text-based adversarial manipulations in multimodal models. By isolating and disabling typographic circuits, robust zero-shot classification is achieved with negligible impact on standard vision tasks. The method is particularly relevant for safety-critical deployments where text region exploitation poses systemic risks, and the utility of intentional text recognition is secondary to system integrity. The availability of dyslexic CLIP models as drop-in replacements further enhances applicability in high-assurance contexts (Hufe et al., 28 Aug 2025).