TeCoA: Contrastive Adversarial Training
- TeCoA is a novel adversarial fine-tuning strategy that maintains cross-modal alignment in CLIP models to ensure robust zero-shot performance.
- It employs a full cross-modal contrastive loss on adversarial examples generated via PGD, fine-tuning only the image encoder while keeping the text encoder fixed.
- Empirical results show that TeCoA significantly improves adversarial robustness with only a minor drop in clean accuracy compared to standard adversarial training.
TeCoA (Text-guided Contrastive Adversarial training) is an adversarial fine-tuning methodology for CLIP-style vision–language models, designed to restore zero-shot adversarial robustness while minimally degrading zero-shot generalization. TeCoA establishes a new defense paradigm by maintaining the alignment between image and text embeddings under adversarial perturbations, thereby addressing the collapse in transferability that arises when standard adversarial training is applied to vision–language models (Xu et al., 7 Aug 2025).
1. Motivation and Conceptual Foundation
TeCoA was introduced as the first adversarial fine-tuning strategy to target the unique needs of zero-shot vision–language models such as CLIP. Traditional adversarial training (AT) replaces the contrastive loss with a one-hot cross-entropy objective on clean labels. TeCoA is motivated by the observation that such standard AT severs the crucial alignment between the image encoder ($f_\theta$) and the frozen text encoder ($g_\phi$). This misalignment unravels the embedding geometry and causes catastrophic failure of zero-shot performance. TeCoA explicitly restores the alignment by ensuring that even adversarially perturbed images ($x^{adv}$) are mapped closest to their correct text embeddings in the joint embedding space, using a full cross-modal contrastive objective on adversarial examples.
2. Formal Objective and Methodological Implementation
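TeCoA operates on pairs of images and class labels, $(x, y)$, constructing class-level prompts $t_c$ (e.g., “a photo of a {class}”) for the frozen text encoder, $g_\phi$. The image encoder $f_\theta$ is subject to fine-tuning. The methodology consists of the following steps:
- Adversarial Example Generation: For each input $x$, an adversarial perturbation $\delta$ is generated via $\ell_\infty$-bounded Projected Gradient Descent (PGD), maximizing the contrastive loss $\mathcal{L}_{\mathrm{TeCoA}}$ subject to $\|\delta\|_\infty \le \epsilon$, yielding $x^{adv} = x + \delta$.
- Contrastive Adversarial Loss: The core fine-tuning objective is a contrastive loss over adversarial samples:
$$\mathcal{L}_{\mathrm{TeCoA}}(\theta) = -\,\mathbb{E}_{(x, y)}\left[\log \frac{\exp\big(\mathrm{sim}(f_\theta(x^{adv}),\, g_\phi(t_y))/\tau\big)}{\sum_{c=1}^{C} \exp\big(\mathrm{sim}(f_\theta(x^{adv}),\, g_\phi(t_c))/\tau\big)}\right]$$
where $\mathrm{sim}(u, v) = u^\top v / (\|u\|\,\|v\|)$ is the cosine similarity, $C$ is the number of classes, and $\tau$ is a temperature hyperparameter. Only $f_\theta$ is updated; $g_\phi$ remains fixed. There are no auxiliary one-hot or cross-entropy terms.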
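As a rough illustration of the loss described above, the following sketch computes the negative log-probability of the correct class from temperature-scaled cosine similarities between one image embedding and a set of class text embeddings. The function names, the 2-D embeddings, and the value `tau=0.07` are hypothetical choices for this example, not taken from the TeCoA implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_emb, text_embs, label, tau=0.07):
    """-log softmax of the correct class over image-to-text similarities / tau."""
    logits = [cosine(image_emb, t) / tau for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

# Hypothetical text embeddings for three class prompts (assumed values).
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# An image embedding near its class-0 prompt, and one pushed toward class 1.
aligned = contrastive_loss([0.9, 0.1], text_embs, label=0)
pushed = contrastive_loss([0.1, 0.9], text_embs, label=0)
print(aligned < pushed)
```

An embedding that drifts toward the wrong prompt incurs a much larger loss, which is the signal the attack maximizes and the fine-tuning step minimizes.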
3. Training Regimen
A standard TeCoA training loop involves:
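- Fixing the perturbation budget $\epsilon$ (under the $\ell_\infty$ norm), the number of PGD steps $K$ (commonly 3–5), the step size $\alpha$, and the temperature $\tau$ (between 0.01 and 0.07).
- For each mini-batch:
  - Pass the clean images $x$ and the class prompts through the image encoder $f_\theta$ and the text encoder $g_\phi$, respectively.
  - Generate $x^{adv}$ by running $K$-step PGD to maximize the TeCoA loss with respect to the perturbation $\delta$.
  - Compute the TeCoA loss over $x^{adv}$.
  - Update the parameters of $f_\theta$ via the AdamW optimizer, with a chosen learning rate and weight decay.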
Fine-tuning typically proceeds for 3–5 epochs over 50–100K ImageNet samples, with checkpoint selection based on zero-shot accuracy over held-out classes.
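The loop above can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the authors' implementation: the image encoder is a hypothetical 2×2 linear map `W`, gradients come from finite differences rather than backpropagation, plain gradient descent stands in for AdamW, and the values of `eps`, `alpha`, `steps`, and `lr` are arbitrary.

```python
import math

def loss(W, x, text_embs, label, tau=0.07):
    """Contrastive loss of the toy encoder z = W x against fixed text embeddings."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    nz = math.sqrt(sum(a * a for a in z)) + 1e-12
    logits = []
    for t in text_embs:
        nt = math.sqrt(sum(b * b for b in t))
        logits.append(sum(a * b for a, b in zip(z, t)) / (nz * nt * tau))
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - logits[label]

def num_grad(fn, v, h=1e-5):
    """Central finite-difference gradient of fn at the flat list v."""
    g = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        dn = v[:i] + [v[i] - h] + v[i + 1:]
        g.append((fn(up) - fn(dn)) / (2 * h))
    return g

def sign(a):
    return (a > 0) - (a < 0)

def pgd(W, x, text_embs, label, eps, alpha, steps):
    """l_inf PGD: ascend the loss w.r.t. the input, clipping delta to [-eps, eps]."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        g = num_grad(lambda d: loss(W, [a + b for a, b in zip(x, d)], text_embs, label), delta)
        delta = [max(-eps, min(eps, d + alpha * sign(gi))) for d, gi in zip(delta, g)]
    return [a + b for a, b in zip(x, delta)]

def train_step(W, batch, text_embs, eps=0.1, alpha=0.05, steps=3, lr=0.1):
    """Craft adversarial inputs, then take one descent step on the loss over them."""
    adv = [(pgd(W, x, text_embs, y, eps, alpha, steps), y) for x, y in batch]
    flat = [w for row in W for w in row]
    n = len(W[0])

    def batch_loss(p):
        Wp = [p[i:i + n] for i in range(0, len(p), n)]
        return sum(loss(Wp, xa, text_embs, y) for xa, y in adv) / len(adv)

    g = num_grad(batch_loss, flat)
    new = [w - lr * gi for w, gi in zip(flat, g)]  # plain SGD stands in for AdamW
    W_new = [new[i:i + n] for i in range(0, len(new), n)]
    return W_new, batch_loss(flat), batch_loss(new)

# Toy data (assumed): identity encoder, two class prompts, one image per class.
W0 = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]  # frozen text embeddings, never updated
batch = [([1.0, 0.5], 0), ([0.5, 1.0], 1)]

x_adv = pgd(W0, batch[0][0], text_embs, 0, eps=0.1, alpha=0.05, steps=3)
W1, before, after = train_step(W0, batch, text_embs)
print(x_adv, before, after)
```

Only `W` changes in `train_step`; the text embeddings are passed through untouched, mirroring the frozen text encoder. In a real setting the encoder would be a CLIP vision tower and the gradients would come from automatic differentiation.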
4. Geometric and Theoretical Intuition
Zero-shot transfer in CLIP-like models emerges from a hyperspherical embedding space where image and text prototypes are co-located such that nearest-neighbor search enables transfer to unseen classes. Standard adversarial training with one-hot cross-entropy disrupts this geometry, as features learned for a limited set of fine-tuning classes need not align with the text encoder. TeCoA’s approach, exclusively optimizing for image-to-text contrastive alignment even under adversarial attack, actively preserves the original hyperspherical geometry. Every PGD step attempts to push the adversarial image embedding $f_\theta(x^{adv})$ away from its target text embedding $g_\phi(t_y)$; the optimization step counteracts this push, maintaining geometric integrity and the underlying mechanism for transferability. No direct classification head or one-hot label signal is introduced, sustaining the model’s general alignment capacity.
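The nearest-neighbor mechanism can be made concrete with a small sketch. The embeddings and class names below are invented for illustration; the point is only that adding a class requires a new text embedding and no change to the image side, which is the property a classification head would destroy.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def zero_shot_predict(image_emb, prompts):
    """Return the class whose text embedding is closest in cosine similarity."""
    z = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(z, normalize(t)))
        for name, t in prompts.items()
    }
    return max(scores, key=scores.get)

# Hypothetical text embeddings for two known classes.
prompts = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
image = [0.2, 0.3, 0.9]  # hypothetical image embedding

pred_before = zero_shot_predict(image, prompts)

# A previously unseen class is added purely through its text embedding.
prompts["bird"] = [0.0, 0.1, 1.0]
pred_after = zero_shot_predict(image, prompts)

print(pred_before, pred_after)
```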
5. Empirical Evaluation
The original empirical evaluation of TeCoA (Mao et al., ICLR 2023) used ImageNet zero-shot classification under attack, with transfer assessment on 15 unseen downstream datasets (e.g., CIFAR-10/100, Caltech101, SUN397). Attack scenarios included $\ell_\infty$-PGD (5 steps) and AutoAttack. The principal metrics were clean zero-shot accuracy and adversarially robust zero-shot accuracy.
Key results (averaged over 5 runs with a prompt pool):
| Method | ImageNet Clean Acc. | ImageNet Robust Acc. | Zero-Shot Drop (15 datasets) |
|---|---|---|---|
| CLIP (no defense) | 51.5% | ≈0% | Reference |
| Cross-entropy AT | 18% | 30% | –22 pp |
| TeCoA | 48% | 15% | –4 pp |
Even at a larger perturbation budget $\epsilon$, TeCoA maintained approximately 8% robust accuracy, whereas standard AT failed entirely. TeCoA thus raises adversarial robustness from near zero to double-digit percentages, sacrificing only 2–5 percentage points of clean zero-shot performance on ImageNet and an average of 4 pp across downstream tasks. In contrast, standard AT incurred a clean accuracy drop of more than 20 pp and catastrophic transfer failure (Xu et al., 7 Aug 2025).
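The trade-off can be read directly off the table with a little arithmetic. The snippet below uses only the ImageNet figures listed above and treats the "≈0%" robust accuracy of undefended CLIP as 0.0, which is an assumption made for the calculation.

```python
# ImageNet accuracies (%) copied from the table; "≈0%" is approximated as 0.0.
results = {
    "CLIP (no defense)": {"clean": 51.5, "robust": 0.0},
    "Cross-entropy AT": {"clean": 18.0, "robust": 30.0},
    "TeCoA": {"clean": 48.0, "robust": 15.0},
}

base = results["CLIP (no defense)"]
deltas = {}
for name, r in results.items():
    deltas[name] = {
        "clean_drop_pp": base["clean"] - r["clean"],
        "robust_gain_pp": r["robust"] - base["robust"],
    }
    print(f"{name}: clean drop {deltas[name]['clean_drop_pp']:.1f} pp, "
          f"robust gain {deltas[name]['robust_gain_pp']:.1f} pp")
```

On these numbers TeCoA gives up 3.5 pp of clean accuracy for 15 pp of robust accuracy, while cross-entropy AT gives up 33.5 pp of clean accuracy for 30 pp of robust accuracy.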
6. Trade-offs and Limitations
TeCoA typically introduces a 2–5 percentage point degradation in clean zero-shot accuracy relative to vanilla CLIP. Robust accuracy plateaus, and performance gains diminish for larger perturbation budgets. Computational demands are moderate: while less costly than full cross-entropy AT on 1000-class ImageNet, TeCoA requires PGD during fine-tuning and is 3–5× slower than clean fine-tuning. Minor inter-class geometric drift may result from fine-tuning on a class subset, limiting transfer to highly divergent domains. Open challenges include eliminating the clean accuracy degradation, scaling to larger prompt vocabularies or open-vocabulary tasks, and closing the robustness gap to models subjected to full adversarial pre-training (Xu et al., 7 Aug 2025).
7. Context, Impact, and Future Directions
TeCoA established the viability of alignment-preserving adversarial fine-tuning for zero-shot vision–language models. By restoring and reinforcing the joint embedding geometry central to CLIP’s zero-shot potential, TeCoA was the first to anchor adversarial defense strategies in the preservation of cross-modal alignment. This approach influenced subsequent paradigms targeting embedding space re-engineering (LAAT, TIMA) and motivated further research into hybrid strategies, input heuristics (AOM, TTC), and latent-space purification (CLIPure). Outstanding research questions pertain to improving the trade-off between clean and robust accuracy, extending to open-vocabulary or structured prediction tasks, and integrating TeCoA’s philosophy into broader adversarial training pipelines (Xu et al., 7 Aug 2025).