Fine-Tuned CLIP Models
- Fine-tuned CLIP models are adaptations of large-scale vision–language pre-trained models that optimize performance by targeted parameter updates and domain-specific techniques.
- They leverage methods such as mixup-based hard negative generation, context-aware regularization, and adversarial fine-tuning to improve embedding uniformity, calibration, and robustness.
- Parameter-efficient strategies and post-hoc knowledge manipulation further enhance transferability and allow for selective adaptation in specialized tasks.
Fine-tuned CLIP models are adaptations of large-scale vision–language pre-trained models, specifically CLIP, to task- and domain-specific scenarios through parameter optimization, algorithmic enhancements, data curation, or architectural manipulation. These techniques aim to improve accuracy, robustness, transferability, or calibration on specialized or difficult tasks, often addressing the limitations observed during naïve fine-tuning, such as poor generalization, context forgetting, modality misalignment, calibration drift, and susceptibility to adversarial manipulations. This article surveys advanced fine-tuning methodologies, including mixup-based hard negative generation, context-aware regularization, task-driven modifications, robustness-oriented objectives, prompt/pseudolabel-based methods, and targeted knowledge manipulation—each underpinned by explicit mathematical formulations and experimental evidence.
1. Fine-Tuning Objectives and Uniformity–Alignment Tradeoff
The geometric properties of CLIP's L₂-normalized joint embedding space create both utility and limitations for downstream adaptation. Empirical studies reveal that standard CLIP fine-tuning often maintains distinct clusters for image and text modalities, resulting in poor uniformity and alignment (i.e., embeddings do not fill the hypersphere nor tightly couple paired samples) (Oh et al., 2022). This deficiency restricts transferability and limits robustness to domain shift or rare classes.
To explicitly address this, the geodesic multi-modal mixup (m²-Mix) technique interpolates between image and text embeddings along great circles on the hypersphere: $m_\lambda(u, v) = \frac{\sin((1-\lambda)\theta)}{\sin\theta}\,u + \frac{\sin(\lambda\theta)}{\sin\theta}\,v$, with $u$, $v$ the L₂-normalized image and text embeddings, $\theta = \arccos(u^\top v)$, and $\lambda \in (0,1)$ the mixing ratio. These mixed “hard negatives” fill the unexploited region between the modality clusters, enforcing more uniform and better-aligned distributions when incorporated into the contrastive loss. Empirical results show substantially improved top-1/top-5 recall on retrieval (Flickr30k, MS COCO), stronger few-/zero-shot classification (under distribution shift), lower Expected Calibration Error (ECE), and richer embedding arithmetic, supporting the contention that hard-negative mixup robustifies CLIP and renders it more transferable.
Theoretical analysis using von Mises–Fisher distributions confirms that these interpolated samples represent “harder” negatives (i.e., are closer to positives than ordinary negatives in Kullback–Leibler divergence), further grounding the empirical gains (Oh et al., 2022).
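To make the mechanics concrete, the following is a minimal PyTorch-style sketch of geodesic mixing between normalized image and text embeddings, with the mixtures appended as extra hard negatives to a CLIP-style contrastive loss. The Beta-sampled mixing ratio, tensor names, and the exact way the mixed embeddings enter the loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def geodesic_mix(u, v, lam):
    """Spherical (great-circle) interpolation between L2-normalized embeddings u and v."""
    # u, v: (B, D) unit vectors; lam: (B, 1) mixing ratio in (0, 1)
    cos_theta = (u * v).sum(dim=-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.acos(cos_theta)
    mixed = (torch.sin((1 - lam) * theta) * u + torch.sin(lam * theta) * v) / torch.sin(theta)
    return F.normalize(mixed, dim=-1)

def contrastive_with_m2mix(img_emb, txt_emb, temperature=0.07, alpha=1.0):
    """CLIP-style InfoNCE augmented with geodesic image-text mixtures as extra hard negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    lam = torch.distributions.Beta(alpha, alpha).sample((img_emb.size(0), 1)).to(img_emb)  # assumed sampling scheme
    hard_neg = geodesic_mix(img_emb, txt_emb, lam)            # lies between the two modality clusters

    logits = img_emb @ txt_emb.t() / temperature               # (B, B): positives on the diagonal
    neg_logits = (img_emb * hard_neg).sum(-1, keepdim=True) / temperature  # (B, 1): extra hard negatives
    full = torch.cat([logits, neg_logits], dim=1)
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(full, labels)
```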
2. Context Preservation and Regularization under Fine-Tuning
While vanilla fine-tuning can boost in-distribution accuracy, it typically corrupts the context-aware reasoning CLIP acquires during pre-training—evidenced by a dramatic fall in context-recognition accuracy upon naïve optimization (Mao et al., 2022). The Context-Aware Robust Fine-Tuning (CAR-FT) framework addresses this directly by regularizing the drift of context distributions between the pre-trained and fine-tuned models: $\mathcal{L}_{\mathrm{ctx}} = \mathrm{KL}\big(p_{0}(c \mid x)\,\|\,p_{\theta}(c \mid x)\big)$, with $p(c \mid x) = \mathrm{softmax}\big(\langle f(x), t_c\rangle / \tau\big)$, where $t_c$ encodes context prompts derived from the fixed text encoder. The total loss combines the usual classifier cross-entropy and $\mathcal{L}_{\mathrm{ctx}}$ with a balancing hyperparameter: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{ctx}}$.
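A minimal sketch of this context-regularized objective is shown below, assuming the context distribution is a softmax over similarities between the image feature and a fixed bank of context-prompt text embeddings; the prompt bank, temperature, and loss weighting are illustrative choices.

```python
import torch.nn.functional as F

def context_distribution(image_features, context_text_embeds, tau=0.01):
    """Softmax over similarities to fixed context-prompt embeddings (e.g., 'a photo of ...', 'a sketch of ...')."""
    image_features = F.normalize(image_features, dim=-1)
    context_text_embeds = F.normalize(context_text_embeds, dim=-1)
    return F.softmax(image_features @ context_text_embeds.t() / tau, dim=-1)

def car_ft_loss(logits, labels, feats_finetuned, feats_pretrained, ctx_embeds, lam=1.0):
    """Cross-entropy plus a KL penalty on the drift between pre-trained and fine-tuned context distributions."""
    ce = F.cross_entropy(logits, labels)
    p_pre = context_distribution(feats_pretrained, ctx_embeds)   # frozen encoder's view of context
    p_ft = context_distribution(feats_finetuned, ctx_embeds)     # fine-tuned encoder's view
    kl = F.kl_div(p_ft.log(), p_pre, reduction="batchmean")      # KL(p_pre || p_ft)
    return ce + lam * kl
```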
CAR-FT strictly preserves the multi-contextual features acquired during contrastive pre-training, yielding both higher in-distribution (ID) (e.g., top-1 83.3% vs. 81.0% for plain FT on ImageNet) and substantial out-of-distribution (OOD) performance gain (56.9% vs. 45.1% averaged OOD accuracy). On domain generalization benchmarks (DomainBed), CAR-FT achieves a new state-of-the-art (78.5% average), indicating that explicit context preservation via KL regularization is a practical countermeasure to the loss of robustness typically observed in downstream fine-tuning (Mao et al., 2022).
3. Task-Driven Adaptation and Efficient Design
Fine-tuned CLIP models are leveraged as strong baselines for video learning and other application domains in which the pre-training data or architecture do not natively align with the target (e.g., temporal information in video or few-shot adaptation). In video settings (Rasheed et al., 2022), CLIP's image encoder is applied framewise, then the temporal axis is pooled (usually via mean aggregation) to create an aggregate video embedding, which is used in standard cross-modal contrastive learning with text labels. Empirical assessments on video benchmarks (Kinetics-400, UCF-101, HMDB-51) show that this simple frame-level pooling and fine-tuning delivers performance competitive with or superior to architectures with explicit temporal modeling, especially in zero-/few-shot settings, with lower computational complexity and better throughput.
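The frame-pooling baseline can be summarized in a few lines. Below is a hedged sketch assuming a CLIP-style model object exposing `encode_image` and `encode_text` (as in the OpenAI CLIP and OpenCLIP APIs), with mean pooling over the temporal axis.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def video_text_similarity(clip_model, frames, text_tokens):
    """Frame-wise CLIP encoding, mean-pooled over time, compared against class-prompt text embeddings.

    frames: (T, 3, H, W) preprocessed video frames
    text_tokens: tokenized class prompts, e.g. 'a video of a person {action}'
    """
    frame_feats = clip_model.encode_image(frames)                        # (T, D) per-frame embeddings
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)            # temporal mean pooling -> (D,)
    text_feats = F.normalize(clip_model.encode_text(text_tokens), dim=-1)  # (C, D)
    return video_feat @ text_feats.t()                                   # (C,) cosine similarities
```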
To further optimize for low-data regimes, a “bridge and prompt” strategy is deployed wherein CLIP is first fine-tuned on a video dataset (“bridge” phase), and then only lightweight prompt tokens are learned (“prompt” phase), avoiding the risks of overfitting through full model adaptation while still aligning vision-language representations for the video domain.
4. Robustness, Security, and Calibration Enhancements
Robustness to adversarial inputs and data poisoning is a critical dimension for fine-tuned CLIP models—both as encoders for downstream vision-LLMs (VLMs) and as stand-alone classifiers. Unsupervised adversarial fine-tuning (FARE) (Schlarmann et al., 19 Feb 2024) minimizes the deviation between original and adversarial embeddings under $\ell_\infty$-bounded perturbations: $\mathcal{L}_{\mathrm{FARE}}(x) = \max_{\|\delta\|_\infty \le \varepsilon}\, \big\|\phi(x+\delta) - \phi_{\mathrm{orig}}(x)\big\|_2^2$, ensuring that the cosine similarities crucial for vision–language inference remain stable under attack.
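A minimal sketch of this unsupervised objective follows, with a basic PGD inner loop approximating the worst-case perturbation; the step count, step size, and squared-ℓ₂ deviation are plausible defaults rather than the exact published recipe.

```python
import torch

def fare_loss(vision_encoder, frozen_encoder, images, eps=4/255, alpha=1/255, steps=10):
    """Embedding deviation between the tuned encoder on adversarial inputs and the frozen original encoder."""
    with torch.no_grad():
        target = frozen_encoder(images)                      # original CLIP embeddings (fixed anchor)

    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):                                   # PGD to find a worst-case perturbation
        dev = (vision_encoder(images + delta) - target).pow(2).sum(dim=-1).mean()
        grad, = torch.autograd.grad(dev, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Outer minimization: keep adversarial embeddings close to the original clean embeddings.
    return (vision_encoder(images + delta.detach()) - target).pow(2).sum(dim=-1).mean()
```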
Such adversarially “immunized” CLIP vision modules can then be swapped into VLMs (e.g., LLaVA, OpenFlamingo) to preclude stealthy manipulation without any retraining of the language or multi-modal heads.
CleanCLIP (Bansal et al., 2023) addresses data poisoning (backdoor) attacks by combining the standard multimodal contrastive objective with unimodal self-supervision. By reinforcing each modality’s representation independently, CleanCLIP weakens spurious, trigger-induced alignments, reducing attack success rates by an order of magnitude and restoring benign accuracy to near baseline levels.
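Conceptually, the CleanCLIP objective can be sketched as the sum of the usual cross-modal InfoNCE and within-modality InfoNCE terms computed over augmented views; the augmentation source and loss weighting below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings with matching rows as positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def cleanclip_style_loss(img_emb, txt_emb, img_emb_aug, txt_emb_aug, lam=1.0):
    """Multimodal contrastive term plus unimodal self-supervision for each modality."""
    multimodal = info_nce(img_emb, txt_emb)                  # standard CLIP alignment
    unimodal = info_nce(img_emb, img_emb_aug) + info_nce(txt_emb, txt_emb_aug)
    return multimodal + lam * unimodal                       # unimodal terms weaken trigger-induced shortcuts
```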
Calibration under open-vocabulary, prompt-based fine-tuning is another pressing reliability issue. Both Distance-Aware Calibration (DAC) (Wang et al., 7 Feb 2024) and Contrast-Aware Calibration (CAC) (Lv et al., 31 Jan 2025) provide post-hoc reweighting of the softmax logits. DAC scales the temperature based on the textual “distance” between a novel class's embedding and the set of base classes, using a “textual deviation” ratio that measures proximity to the base-class distribution; a lower ratio means greater distance and hence reduced confidence, directly tackling overconfidence on unseen classes.
CAC calibrates using the L₁ difference between the logits of the fine-tuned and original CLIP models, deriving a scaling weight from this difference that is adaptively squared when it falls outside defined thresholds. This approach reduces both under- and overconfidence, improving ECE, ACE, MCE, and PIECE across both train and unseen classes, and applies generically without retraining (Lv et al., 31 Jan 2025).
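The post-hoc character of these methods can be illustrated with a DAC-style sketch in which a per-class deviation ratio reweights the logits before the softmax. The specific choice of mean cosine similarity to base-class text embeddings as the deviation measure, and the simple multiplicative reweighting, are assumptions made for illustration; the published functional forms differ in detail.

```python
import torch.nn.functional as F

def textual_deviation_ratio(novel_text_embeds, base_text_embeds):
    """Assumed proximity measure: mean cosine similarity of each novel-class prompt to the base-class prompts."""
    novel = F.normalize(novel_text_embeds, dim=-1)
    base = F.normalize(base_text_embeds, dim=-1)
    return (novel @ base.t()).mean(dim=-1)                   # (C_novel,); lower = farther from the base classes

def distance_aware_reweight(logits, ratio):
    """Post-hoc logit reweighting: classes far from the base distribution get down-scaled scores, reducing confidence."""
    return logits * ratio                                    # broadcast over the class dimension, applied before softmax
```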
5. Parameter-Efficient and Modular Fine-Tuning Strategies
Recent studies have revisited classic model fine-tuning in CLIP, showing that selectively tuning only small subsets of parameters can uncover strong downstream adaptation performance while preserving the vast majority of pre-trained knowledge.
ClipFit (Li et al., 25 Sep 2024) tunes only the bias terms of the projection layer within the text encoder feed-forward networks (FFNs) and all LayerNorm parameters in the image encoder, regularized by a knowledge-distillation loss against the frozen pre-trained model. This approach achieves over 7% absolute improvement in average harmonic-mean accuracy for base-to-new transfer, with parameter overhead several orders of magnitude smaller than adapter- or prompt-based methods, and better preserves generalization under distribution shift (Li et al., 25 Sep 2024).
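In code, this kind of selective tuning amounts to freezing everything and re-enabling gradients only for the chosen tensors. The name-matching patterns below (`mlp.c_proj.bias` for the text-FFN projection bias, `ln_*` for image-encoder LayerNorms) follow OpenAI CLIP module naming and are assumptions that would need adjusting for other implementations.

```python
def select_clipfit_params(clip_model):
    """Freeze CLIP, then unfreeze only text-FFN projection biases and image-encoder LayerNorm parameters."""
    for p in clip_model.parameters():
        p.requires_grad = False

    trainable = []
    for name, p in clip_model.named_parameters():
        in_text = name.startswith("transformer.")             # text tower in OpenAI CLIP naming (assumed)
        in_image = name.startswith("visual.")                  # image tower
        if in_text and name.endswith("mlp.c_proj.bias"):       # bias of the FFN projection layer
            p.requires_grad = True
            trainable.append(name)
        elif in_image and ".ln_" in name:                       # LayerNorm weights and biases (ln_1, ln_2, ln_pre, ln_post)
            p.requires_grad = True
            trainable.append(name)
    return trainable
```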
ProLIP (Fahes et al., 7 Oct 2024) fine-tunes only the final visual embedding projection matrix $W$ via cross-entropy over the few-shot samples, with a Frobenius-norm penalty tying it to the pre-trained weights $W_0$: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\,\|W - W_0\|_F^2$. This regularization ensures stability, preventing overfitting even under limited supervision and supporting validation-free adaptation. ProLIP consistently outperforms prompt-tuning and adapter-based baselines on 11 few-shot and cross-domain benchmarks, matching or exceeding state-of-the-art with orders-of-magnitude fewer tunable parameters.
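A sketch of the corresponding training step is given below, assuming `W` is the visual projection matrix (e.g., `visual.proj` in OpenAI CLIP) and `W0` its frozen pre-trained copy; the temperature and penalty weight are illustrative.

```python
import torch.nn.functional as F

def prolip_step(W, W0, image_feats, text_class_embeds, labels, lam=1e-2, tau=0.01):
    """Cross-entropy over few-shot samples plus a Frobenius-norm pull toward the pre-trained projection."""
    img = F.normalize(image_feats @ W, dim=-1)                 # project pre-projection visual features, then normalize
    txt = F.normalize(text_class_embeds, dim=-1)               # frozen class-prompt embeddings
    logits = img @ txt.t() / tau
    ce = F.cross_entropy(logits, labels)
    frob = (W - W0).pow(2).sum()                               # ||W - W0||_F^2
    return ce + lam * frob
```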
6. Pseudolabels, Synthetic Texts, and Unsupervised Strategies
Label scarcity is mitigated by harnessing CLIP's inherent zero-shot capacities via pseudolabels and data-efficient tuning. A unified framework has emerged wherein CLIP first pseudo-labels large sets of unlabeled images using its own zero-shot predictions, then refines learnable prompts on these samples—a process shown to provide improvements of 15 to 28 points in several semi-supervised and unsupervised paradigms (Menghini et al., 2023). The core technique alternates prompt optimization on the pseudolabeled samples with balanced top-K re-selection of pseudolabels, producing a “Robin Hood effect” that raises accuracy for underperforming classes and mitigates the inherent bias of conventional pseudolabeling.
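A hedged sketch of the balanced top-K selection step: for each class, only the K unlabeled images with the highest zero-shot confidence for that class are pseudolabeled, so rare or underperforming classes are guaranteed representation. The data layout and value of K are illustrative.

```python
import torch

def balanced_topk_pseudolabels(zero_shot_probs, k=16):
    """Select, per class, the k most confident unlabeled samples and assign them that class as pseudolabel.

    zero_shot_probs: (N, C) CLIP zero-shot class probabilities for N unlabeled images.
    Returns (sample_indices, pseudolabels), each of length ~ C * k.
    """
    n, c = zero_shot_probs.shape
    idx_list, label_list = [], []
    for cls in range(c):
        top = torch.topk(zero_shot_probs[:, cls], k=min(k, n)).indices   # most confident images for this class
        idx_list.append(top)
        label_list.append(torch.full((top.numel(),), cls, dtype=torch.long))
    return torch.cat(idx_list), torch.cat(label_list)
```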
LatteCLIP (Cao et al., 10 Oct 2024) extends this paradigm by generating multi-level synthetic texts for each image (class-, individual-, and group-level descriptions) via large multimodal models (LMMs), then fusing them with robust per-class prototypes through a dynamic weighting mechanism. Training uses a momentum update of the prototypes to reduce instability from noisy, hallucinated texts. The method yields an average top-1 accuracy gain of roughly 4.74 percentage points over zero-shot CLIP and outperforms alternative unsupervised fine-tuning methods by significant margins across diverse domain-specific tasks.
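The prototype stabilization can be sketched as an exponential-moving-average update of per-class prototypes from the current batch's (noisy) fused features; the momentum value and feature source are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def update_prototypes(prototypes, batch_feats, batch_classes, momentum=0.99):
    """EMA update of per-class prototypes to damp noise from hallucinated synthetic descriptions.

    prototypes: (C, D) current class prototypes; batch_feats: (B, D); batch_classes: (B,) class ids.
    """
    for cls in batch_classes.unique():
        mean_feat = batch_feats[batch_classes == cls].mean(dim=0)
        prototypes[cls] = momentum * prototypes[cls] + (1 - momentum) * mean_feat
    return F.normalize(prototypes, dim=-1)
```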
7. Targeted Knowledge Manipulation: Unlearning and User-Guided Adaptation
Fine-tuned CLIP models can be post-hoc manipulated to forget targeted subgroups (e.g., biased or harmful content) while preserving general utility—critical for responsible deployment. The three-stage method (Zhang et al., 3 Jun 2025) involves:
- Forgetting via relative Fisher-information criteria and LoRA-based tuning of the chosen layers, minimizing a forgetting objective on the targeted set;
- Reminding by distillation, aligning intermediate statistics on retained samples with the pre-trained model's BatchNorm parameters;
- Restoring via model merging (parameter interpolation) to maximize zero-shot accuracy on a calibration split.
This approach enables selective unlearning at sub-class granularity, even in the absence of pre-training data, and outperforms classic unlearning approaches by a large margin—maintaining zero-shot and cross-domain generalization.
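The restoring stage reduces to simple parameter interpolation between the unlearned and pre-trained checkpoints, with the coefficient chosen to maximize zero-shot accuracy on a calibration split. A minimal sketch follows; the state-dict handling, candidate grid, and the `eval_zero_shot` callback are assumptions.

```python
def merge_models(pretrained_state, unlearned_state, alpha):
    """Linear interpolation of parameters: alpha=0 keeps the pre-trained model, alpha=1 the unlearned one."""
    return {k: (1 - alpha) * pretrained_state[k] + alpha * unlearned_state[k].to(pretrained_state[k].dtype)
            for k in pretrained_state}

def pick_alpha(pretrained_state, unlearned_state, eval_zero_shot, alphas=(0.25, 0.5, 0.75, 1.0)):
    """Grid-search the interpolation coefficient on a calibration split (eval_zero_shot returns accuracy)."""
    return max(alphas, key=lambda a: eval_zero_shot(merge_models(pretrained_state, unlearned_state, a)))
```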
Additionally, interactive user-guided fine-tuning (e.g., CLIP-Branches (Lülf et al., 19 Jun 2024)) enables on-the-fly relevance feedback in retrieval, quantizing embeddings and employing ensemble decision branches for rapid iterative re-ranking and expansion, supporting large-scale indexing without re-scanning massive catalogs.
Fine-tuned CLIP models encompass a rich methodological space, combining geometric reasoning, context regularization, adversarial robustness, prompt/pseudolabel iteration, parameter-efficient adaptation, and controllable knowledge manipulation. Empirical and theoretical findings converge to establish that carefully engineered fine-tuning—guided by explicit alignment, calibration, and regularization criteria—can transform pre-trained CLIP models into robust, efficient, and versatile tools for domain-adaptive and trustworthy multimodal reasoning.