
TeCoA: Contrastive Adversarial Training

Updated 3 July 2026
  • TeCoA is a novel adversarial fine-tuning strategy that maintains cross-modal alignment in CLIP models to ensure robust zero-shot performance.
  • It employs a full cross-modal contrastive loss on adversarial examples generated via PGD, fine-tuning only the image encoder while keeping the text encoder fixed.
  • Empirical results show that TeCoA significantly improves adversarial robustness with only a minor drop in clean accuracy compared to standard adversarial training.

TeCoA (Text-guided Contrastive Adversarial training) is an adversarial fine-tuning method for CLIP-style vision–language models, designed to restore zero-shot adversarial robustness while minimally degrading zero-shot generalization. TeCoA establishes a defense paradigm built on maintaining the alignment between image and text embeddings under adversarial perturbations, thereby addressing the collapse in transferability that standard adversarial training causes in vision–language models (Xu et al., 7 Aug 2025).

1. Motivation and Conceptual Foundation

TeCoA was introduced as the first adversarial fine-tuning strategy to target the specific needs of zero-shot vision–language models such as CLIP. Traditional adversarial training (AT) replaces the contrastive loss with a one-hot cross-entropy objective on clean labels. TeCoA is motivated by the observation that such standard AT severs the crucial alignment between the image encoder ($f_\theta$) and the frozen text encoder ($g_\phi$). This misalignment unravels the embedding geometry and causes catastrophic failure of zero-shot performance. TeCoA explicitly restores the alignment by ensuring that even adversarially perturbed images ($x_\mathrm{adv}$) are mapped closest to their correct text embeddings in the joint embedding space, using a full cross-modal contrastive objective on adversarial examples.

2. Formal Objective and Methodological Implementation

TeCoA operates on pairs of images and classes, $(x_i, c_i)$, constructing class-level prompts $t_i$ (e.g., “a photo of a <class>”) for the frozen text encoder, $g_\phi(\cdot)$. The image encoder $f_\theta(\cdot)$ is subject to fine-tuning. The methodology consists of the following steps:

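  • Adversarial Example Generation: For each input $x_i$, an adversarial perturbation $\delta_i$ is generated via $\ell_\infty$-bounded Projected Gradient Descent (PGD), maximizing the contrastive loss $\mathcal{L}_\mathrm{TeCoA}$ subject to $\|\delta_i\|_\infty \le \epsilon$, yielding $x_{\mathrm{adv},i} = x_i + \delta_i$.
  • Contrastive Adversarial Loss: The core fine-tuning objective is a contrastive loss over adversarial samples:

$$\mathcal{L}_\mathrm{TeCoA}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(f_\theta(x_{\mathrm{adv},i}), g_\phi(t_i)) / \tau\big)}{\sum_{j} \exp\big(\mathrm{sim}(f_\theta(x_{\mathrm{adv},i}), g_\phi(t_j)) / \tau\big)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\|u\| \|v\|)$ is cosine similarity, $j$ ranges over the candidate class prompts, and $\tau$ is a temperature hyperparameter. Only $\theta$ is updated; $\phi$ remains fixed. There are no auxiliary one-hot or cross-entropy terms.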
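The contrastive objective above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the names `l2_normalize` and `tecoa_loss`, the toy embedding sizes, and the default `tau=0.07` are assumptions made for the example, and random vectors stand in for the outputs of the image and text encoders.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def tecoa_loss(image_emb, text_emb, labels, tau=0.07):
    """Mean image-to-text contrastive loss.

    image_emb: (N, d) embeddings of (adversarial) images
    text_emb:  (C, d) embeddings of the class prompts
    labels:    (N,) index of the correct prompt for each image
    """
    z = l2_normalize(image_emb)
    t = l2_normalize(text_emb)
    logits = z @ t.T / tau                                # cosine similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy check with hypothetical stand-in embeddings (no real encoder involved).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
labels = np.array([0, 1, 2, 3])
aligned = text_emb[labels] + 0.01 * rng.normal(size=(4, 8))  # images near their prompts
shuffled = text_emb[(labels + 1) % 4]                        # images near the wrong prompt
loss_aligned = tecoa_loss(aligned, text_emb, labels)
loss_shuffled = tecoa_loss(shuffled, text_emb, labels)
print(loss_aligned < loss_shuffled)
```

Images whose embeddings sit next to their own prompt give a small loss, while images that sit next to another class's prompt give a large one, which is the signal the fine-tuning step uses to pull adversarial embeddings back toward the correct text embedding.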

3. Training Regimen

A standard TeCoA training loop involves:

  • Fixing the perturbation budget $\epsilon$ (under the $\ell_\infty$ norm), the number of PGD steps $K$ (commonly 3–5), the step size $\alpha$, and the temperature $\tau$ (between 0.01 and 0.07).
  • For each mini-batch:

    1. Pass clean samples $x_i$ through $f_\theta$ and prompts $t_i$ through $g_\phi$.
    2. Generate $x_{\mathrm{adv},i}$ by running $K$-step PGD to maximize the TeCoA loss with respect to $\delta_i$.
    3. Compute the TeCoA loss over $(x_{\mathrm{adv},i}, t_i)$.
    4. Update the parameters $\theta$ of $f_\theta$ via the AdamW optimizer (with a tuned learning rate and weight decay).
  • Fine-tuning typically proceeds for 3–5 epochs over 50–100K ImageNet samples, with checkpoint selection based on zero-shot accuracy over held-out classes.
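The loop above can be sketched end to end on a toy problem. The following is a hedged illustration under several simplifying assumptions that are not part of the original method description: the image encoder is a single linear map `W`, gradients are derived by hand instead of with an autodiff framework, plain gradient descent replaces AdamW, the perturbation starts at zero, inputs are not clipped to a valid pixel range, and all hyperparameter values and function names (`loss_and_grads`, `pgd_attack`, `tecoa_step`) are illustrative.

```python
import numpy as np

def loss_and_grads(W, x, y, T, tau):
    """Contrastive loss for one sample, with gradients w.r.t. x and W.

    W: (d_emb, d_in) toy linear image encoder (assumption, stands in for f_theta)
    T: (C, d_emb) frozen, unit-norm text embeddings (stand in for g_phi(t_j))
    """
    u = W @ x
    n = np.linalg.norm(u) + 1e-12
    z = u / n                                   # unit-norm image embedding
    logits = T @ z / tau                        # cosine similarity / temperature
    logits = logits - logits.max()
    log_p = logits - np.log(np.exp(logits).sum())
    d_logits = np.exp(log_p)                    # softmax probabilities
    d_logits[y] -= 1.0                          # dL/dlogits = p - onehot(y)
    g_z = T.T @ d_logits / tau                  # dL/dz
    g_u = (g_z - z * (z @ g_z)) / n             # backprop through the normalization
    return -log_p[y], W.T @ g_u, np.outer(g_u, x)

def pgd_attack(W, x, y, T, tau, eps, alpha, steps):
    """l_inf PGD: ascend the contrastive loss, then project into the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, g_x, _ = loss_and_grads(W, x + delta, y, T, tau)
        delta = np.clip(delta + alpha * np.sign(g_x), -eps, eps)
    return x + delta

def tecoa_step(W, batch_x, batch_y, T, tau, eps, alpha, steps, lr):
    """One update: craft adversarial inputs, then descend the loss on them."""
    grad_W = np.zeros_like(W)
    total = 0.0
    for x, y in zip(batch_x, batch_y):
        x_adv = pgd_attack(W, x, y, T, tau, eps, alpha, steps)
        loss, _, g_W = loss_and_grads(W, x_adv, y, T, tau)
        grad_W += g_W
        total += loss
    n = len(batch_x)
    return W - lr * grad_W / n, total / n       # plain SGD, not AdamW

# Toy data: random inputs, labels, and frozen text embeddings (all hypothetical).
rng = np.random.default_rng(0)
d_in, d_emb, n_cls = 16, 8, 4
T = rng.normal(size=(n_cls, d_emb))
T /= np.linalg.norm(T, axis=1, keepdims=True)
W = 0.1 * rng.normal(size=(d_emb, d_in))
batch_x = rng.normal(size=(8, d_in))
batch_y = rng.integers(0, n_cls, size=8)

losses = []
for _ in range(20):
    W, adv_loss = tecoa_step(W, batch_x, batch_y, T, tau=0.07,
                             eps=0.05, alpha=0.02, steps=3, lr=0.01)
    losses.append(adv_loss)
print(losses[0], losses[-1])
```

Only `W` (the image side) is ever updated, while `T` is treated as a constant, mirroring the frozen text encoder. In practice the same structure would be expressed with an autodiff framework and a real CLIP image encoder.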

4. Geometric and Theoretical Intuition

Zero-shot transfer in CLIP-like models emerges from a hyperspherical embedding space where image and text prototypes are co-located such that nearest-neighbor search enables transfer to unseen classes. Standard adversarial training with one-hot cross-entropy disrupts this geometry, as features learned for a limited set of fine-tuning classes need not align with the text encoder. TeCoA’s approach, exclusively optimizing for image-to-text contrastive alignment even under adversarial attack, actively preserves the original hyperspherical geometry. Every adversarial step (PGD) attempts to push $f_\theta(x_\mathrm{adv})$ away from its target $g_\phi(t_i)$; the optimization step counteracts this push, maintaining geometric integrity and the underlying mechanism for transferability. No direct classification head or one-hot label signal is introduced, sustaining the model’s general alignment capacity.
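The nearest-neighbor mechanism described here can be made concrete with a small sketch. The embeddings below are hand-picked three-dimensional stand-ins, not encoder outputs, and the class names and the helper `zero_shot_predict` are hypothetical; the point is only to show how an adversarial embedding that stays aligned keeps its label while one that drifts across the boundary does not.

```python
import numpy as np

def unit(v):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_predict(image_emb, text_protos):
    """Assign each image to the class whose text prototype is most similar."""
    return np.argmax(unit(image_emb) @ unit(text_protos).T, axis=1)

class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {name}" for name in class_names]

# Hypothetical text prototypes, one per prompt (a real model would compute these).
text_protos = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])

clean = np.array([[0.9, 0.1, 0.0]])         # clean "cat" image embedding
aligned_adv = np.array([[0.7, 0.3, 0.1]])   # perturbed, still nearest to "cat"
drifted_adv = np.array([[0.3, 0.8, 0.1]])   # pushed across the boundary to "dog"

for name, emb in [("clean", clean), ("aligned", aligned_adv), ("drifted", drifted_adv)]:
    idx = int(zero_shot_predict(emb, text_protos)[0])
    print(name, "->", prompts[idx])
```

Because classification is just an argmax over similarities to text prototypes, new classes can be added by embedding new prompts, which is why preserving this geometry under attack matters for transfer.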

5. Empirical Evaluation

The original empirical evaluation of TeCoA (Mao et al., ICLR 2023) used ImageNet zero-shot classification under attack, with transfer assessment on 15 unseen downstream datasets (e.g., CIFAR-10/100, Caltech101, SUN397). Attack scenarios included $\ell_\infty$-PGD (5 steps) and AutoAttack. The principal metrics were clean zero-shot accuracy and adversarial robust zero-shot accuracy.

Key results (averaged over 5 runs with a prompt pool):

| Method | ImageNet Clean Acc. | ImageNet Robust Acc. | Zero-Shot Drop (15 datasets) |
|---|---|---|---|
| CLIP (no defense) | 51.5% | ≈0% | Reference |
| Cross-entropy AT | 18% | 30% | –22 pp |
| TeCoA | 48% | 15% | –4 pp |

Even at a larger perturbation budget $\epsilon$, TeCoA maintained approximately 8% robust accuracy, whereas standard AT failed entirely. TeCoA thus raises adversarial robustness from near zero to double-digit percentages while sacrificing only 2–5 percentage points of clean zero-shot accuracy on ImageNet and an average of 4 pp across downstream tasks. In contrast, standard AT incurred a clean accuracy drop of more than 20 pp and catastrophic transfer failure (Xu et al., 7 Aug 2025).
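The trade-off quoted in this section follows from simple arithmetic on the reported ImageNet figures. The short sketch below recomputes it from the table values; the dictionary layout and variable names are assumptions for illustration, and the "≈0%" robust accuracy of undefended CLIP is approximated as 0.0.

```python
# ImageNet figures as reported in the results table (percent).
results = {
    "CLIP (no defense)": {"clean": 51.5, "robust": 0.0},  # robust reported as ≈0%
    "Cross-entropy AT": {"clean": 18.0, "robust": 30.0},
    "TeCoA": {"clean": 48.0, "robust": 15.0},
}

baseline = results["CLIP (no defense)"]
tradeoff = {}
for method, acc in results.items():
    tradeoff[method] = {
        # percentage points of clean accuracy lost relative to undefended CLIP
        "clean_drop_pp": baseline["clean"] - acc["clean"],
        # percentage points of robust accuracy gained relative to undefended CLIP
        "robust_gain_pp": acc["robust"] - baseline["robust"],
    }

for method, t in tradeoff.items():
    print(f"{method}: clean drop {t['clean_drop_pp']:.1f} pp, "
          f"robust gain {t['robust_gain_pp']:.1f} pp")
```

On these numbers TeCoA gives up 3.5 pp of clean accuracy, inside the quoted 2–5 pp range, while cross-entropy AT gives up 33.5 pp, consistent with the stated drop of more than 20 pp.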

6. Trade-offs and Limitations

TeCoA typically introduces a 2–5 percentage point degradation in clean zero-shot accuracy relative to vanilla CLIP. Robust accuracy plateaus, and performance gains diminish for larger perturbation budgets $\epsilon$. Computational demands are moderate: while less costly than full cross-entropy AT on 1000-class ImageNet, TeCoA requires PGD during fine-tuning and is 3–5× slower than clean fine-tuning. Minor inter-class geometric drift may result from fine-tuning on a class subset, limiting transfer to highly divergent domains. Open challenges include reducing clean accuracy degradation to zero, scaling to larger prompt vocabularies or open-vocabulary tasks, and closing the robustness gap to models subjected to full adversarial pre-training (Xu et al., 7 Aug 2025).

7. Context, Impact, and Future Directions

TeCoA established the viability of alignment-preserving adversarial fine-tuning for zero-shot vision–language models. By restoring and reinforcing the joint embedding geometry central to CLIP’s zero-shot potential, TeCoA was the first to anchor adversarial defense strategies in the preservation of cross-modal alignment. This approach influenced subsequent paradigms targeting embedding space re-engineering (LAAT, TIMA) and motivated further research into hybrid strategies, input heuristics (AOM, TTC), and latent-space purification (CLIPure). Outstanding research questions pertain to improving clean-robustness trade-offs, extending to open-vocabulary or structured prediction tasks, and integrating TeCoA’s philosophy into broader adversarial training pipelines (Xu et al., 7 Aug 2025).
