Zero-Shot Accuracy in CLIP Models

Updated 14 July 2025
  • Zero-shot accuracy in CLIP models is the ability to classify images into unseen categories by measuring the similarity between image and text embeddings.
  • Advanced prompt engineering, including descriptive and diversified strategies, significantly enhances the model's generalization and fine-grained recognition.
  • Proxy learning and cross-modal attention techniques boost performance and robustness by aligning visual and textual modalities effectively.

Zero-shot accuracy in CLIP models refers to the capability of these vision-language models to perform classification or other tasks on target classes or prompts not seen during training, leveraging the alignment between images and language learned from large-scale pre-training. In zero-shot settings, CLIP predicts target classes by measuring similarity between an image’s embedding and the embeddings of text prompts representing each candidate class, a process that relies on CLIP’s ability to generalize over the structure of multimodal data. The following sections provide a detailed survey of methodologies, performance analysis, augmentation strategies, robustness, and recent advances as established by core research in the field.

1. Fundamentals of Zero-Shot Accuracy in CLIP Models

CLIP models achieve zero-shot classification by encoding both an input image and candidate class descriptions (usually via natural language prompts) into a shared embedding space and computing a similarity measure—typically a dot product or cosine similarity. The class whose prompt yields the highest similarity is assigned to the input image. This methodology allows CLIP to generalize to tasks and categories not present in its pre-training data, provided that the target classes can be adequately captured via text prompts.

Formally, for an image $x$ and a set of $K$ candidate class prompts $\{z_j\}_{j=1}^K$, the predicted class is:

$y = \arg\max_j \; z_j^\top x$

where $x$ and $z_j$ are the normalized embeddings from CLIP’s visual and text encoders, respectively. This formula serves as the basic inference procedure across zero-shot CLIP pipelines (2203.07190).
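
As a concrete illustration, the sketch below implements this argmax rule with the open_clip package. It is a minimal example, not a pipeline from the cited works; the model tag, class names, prompt template, and image path are placeholders.

```python
# Minimal zero-shot CLIP classification sketch (model tag, classes, and image path are illustrative).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["cat", "dog", "car"]                   # candidate classes (example)
prompts = [f"a photo of a {c}" for c in class_names]  # naive prompt template

with torch.no_grad():
    # Text side: one normalized embedding z_j per class prompt.
    z = model.encode_text(tokenizer(prompts))
    z = z / z.norm(dim=-1, keepdim=True)

    # Image side: normalized embedding x for the query image.
    image = preprocess(Image.open("query.jpg")).unsqueeze(0)
    x = model.encode_image(image)
    x = x / x.norm(dim=-1, keepdim=True)

    # Predicted class: argmax_j z_j^T x (cosine similarity on unit-norm embeddings).
    sims = x @ z.T                                    # shape (1, K)
    pred = class_names[sims.argmax(dim=-1).item()]

print(pred)
```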

2. Prompt Engineering and Semantic Representation

The quality and design of text prompts have a marked impact on zero-shot accuracy in CLIP. Several methodologies have addressed the shortcomings of naive prompt choices:

  • Descriptive and Hierarchical Prompts: Zero-shot accuracy can be significantly improved by employing more descriptive natural language templates or incorporating hierarchies drawn from resources such as WordNet. Augmenting prompt templates with parent or child class descriptions captures both general and specific semantics, helping to resolve ambiguity and improve recognition of fine-grained classes (2212.01758).
  • Prompt Diversification: Recent work explores the use of sets of prompt variants—including synonyms and visually descriptive phrases—to span the lexical variability of natural language. The Synonymous Semantic Space ($S^3$) approach generates multiple synonymous descriptions for each class with LLMs and constructs a compact topological subspace for each class in the text embedding space. A variety of point-to-space similarity metrics are then used, with the point-to-local-center metric providing robust improvements. This strategy demonstrates increased stability and accuracy, especially for fine-grained and open-vocabulary segmentation tasks (2412.04925). A simplified ensembling sketch appears at the end of this section.
  • Prompt Distribution Learning: Instead of relying on a single prototype text embedding per class, some frameworks (e.g., Frolic) assume a Gaussian mixture model over prototypes to capture intra-class textual variability. This approach includes learning a covariance matrix of the text prototypes using only the unlabeled downstream data, and then performing Gaussian discriminant analysis for classification (2410.19294).
  • Auto-tuned Prompt Weighting: AutoCLIP introduces per-image adaptive reweighting of the template prompts via a gradient-based update in the similarity space, optimizing for the most informative prompt compositions for each instance (2309.16414).

Prompt selection and design thus remain critical levers for zero-shot performance.
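
The sketch below shows the simplest form of these ideas: embed several templates per class, average the normalized embeddings into a per-class "local center," and classify against those centroids. It is a minimal stand-in assuming open_clip; the templates and class names are illustrative, and it does not reproduce the full $S^3$, Frolic, or AutoCLIP constructions.

```python
# Prompt-ensembling sketch: average normalized text embeddings over several templates
# per class, then classify against the per-class centroid. Templates are illustrative.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

templates = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a low-resolution photo of a {}.",
    "an illustration of a {}.",
]
class_names = ["sparrow", "goldfinch", "wren"]        # fine-grained classes (example)

with torch.no_grad():
    proxies = []
    for name in class_names:
        tokens = tokenizer([t.format(name) for t in templates])
        z = model.encode_text(tokens)
        z = z / z.norm(dim=-1, keepdim=True)          # normalize each template embedding
        centroid = z.mean(dim=0)                      # average over templates ("local center")
        proxies.append(centroid / centroid.norm())
    proxies = torch.stack(proxies)                    # (K, d): one proxy per class

# `proxies` replaces the single-prompt text embeddings in the argmax rule above.
```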

3. Proxy Learning and Alignment Strategies

The efficacy of zero-shot classification in CLIP is bounded by the alignment between its text and vision embedding spaces, yet empirical and theoretical analyses have highlighted an inherent modality gap:

  • Intra-Modal Proxy Learning: It has been demonstrated that the optimal classifier proxy for visual tasks need not coincide with the text embedding/description as used in CLIP’s dot-product objective. Instead, learning vision-only proxies using unlabeled target data—guided by refined pseudo-labels from the text proxy—can significantly close this gap. InMaP, a convex optimization routine operating on CLIP-extracted embeddings, has been shown to improve ImageNet zero-shot accuracy from 77.02% to 80.21% for ViT-L/14@336 models, with runtimes of under a minute on a single GPU (2310.19752). A minimal sketch of this idea appears after this list.
  • Online and Batch Adaptation: Extensions of proxy learning to both online and transductive (batch) settings have yielded further improvements. OnZeta proposes real-time (online) adaptation of label probabilities and class proxies as data arrive sequentially, providing convergence guarantees and boosting classic CLIP zero-shot accuracy by 3%–4% on multiple benchmarks (2408.13320). Transductive batch methods (e.g., EM-Dirichlet) leverage the statistics of a batch of unlabeled data, modeling embeddings as sampled from Dirichlet distributions for each class and using joint inference to improve zero-shot accuracy by nearly 20% on ImageNet (2405.18437).
  • Cross-Modal and Parameter-Free Attention: Other research integrates the cross-modal information already present in CLIP. CALIP introduces a parameter-free bidirectional attention between visual and text features, allowing direct cross-modal interaction before the matching step. This approach leads to up to ~4% gains on some datasets and is training-free (2209.14169).
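
The following is a minimal sketch of pseudo-label-guided vision-proxy learning, assuming pre-extracted, L2-normalized CLIP embeddings and a simple EM-style loop. It is not the exact InMaP convex optimization, and the temperature and iteration count are illustrative.

```python
# Sketch of pseudo-label-guided vision-proxy learning on pre-extracted CLIP embeddings.
# X: (N, d) L2-normalized image embeddings of unlabeled target data;
# Z: (K, d) L2-normalized text proxies. tau and n_iters are illustrative values.
# This is a simplified EM-style loop, not the exact InMaP optimization.
import torch

def learn_vision_proxies(X, Z, tau=0.01, n_iters=5):
    proxies = Z.clone()
    for _ in range(n_iters):
        # E-step: soft pseudo-labels from similarity to the current proxies.
        probs = torch.softmax(X @ proxies.T / tau, dim=-1)   # (N, K)
        # M-step: each proxy becomes the pseudo-label-weighted mean of image embeddings.
        proxies = probs.T @ X                                 # (K, d)
        proxies = proxies / proxies.norm(dim=-1, keepdim=True)
    return proxies

# Usage: preds = (X_test @ learn_vision_proxies(X_unlabeled, Z).T).argmax(dim=-1)
```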

4. Empirical Performance and Comparative Analysis

Empirical studies underline both the strengths and remaining challenges of zero-shot CLIP models:

  • Raw Zero-Shot Accuracy: On core benchmarks like ImageNet, CLIP’s zero-shot accuracy has risen from the mid-50s with initial architectures to 81.8% on large models trained using budget-efficient inverse scaling strategies (e.g., CLIPA-v2) (2306.15658).
  • Impact of Fine-Tuning: Parameter-efficient strategies that restrict tuning to normalization and bias parameters (BiNor) offer competitive performance in few-shot scenarios and often enhance, rather than diminish, zero-shot transfer. Full model fine-tuning is generally discouraged due to overfitting risks in data-constrained regimes (2203.07190).
  • Category-Dependent Performance: Despite strong aggregate accuracy, CLIP can show 0% class-wise accuracy on certain categories (e.g., 10 labels on ImageNet). The Class-wise Matching Margin (CMM) metric effectively diagnoses such cases by quantifying the separation of correct-vs-confused class similarities; a rough sketch of such a margin appears after the table below. Augmenting prompts for these categories can improve their accuracy, raising the worst-10 classes from 0% up to 5.2% without labeled validation (2310.03324).
  • Specialization and Ensembling: Organizing model hubs of task-specific expert models annotated with semantic model labels and using optimization-based head selection (CHCO) allows for the combination or replacement of zero-shot CLIP predictions, further improving accuracy in a scalable and modular fashion (2408.11449).
| Method/Strategy | Main Impact | Typical Accuracy Gain |
|---|---|---|
| Prompt Engineering | Improved alignment and class disambiguation | 2–4% |
| Proxy Learning / InMaP | Vision-aligned classifier with pseudo-label refinement | ~3% (e.g., 77% → 80%) |
| Parameter-Efficient Fine-Tuning | Robust few-shot learning, enhanced zero-shot retention | 1–2% |
| Transductive / Online Adaptation | Improved batch/online inference via data aggregation | Up to 20% (batch) |
| Model Label Ensembling | Selective expert model reuse for increased domain coverage | Varies |
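
As a rough illustration of the class-wise diagnostic mentioned above, the label-free sketch below scores each class by the average gap between the top-1 and top-2 prompt similarities of images predicted as that class. This follows the verbal description here and may differ from the paper's exact CMM formulation.

```python
# Label-free sketch of a class-wise matching margin: for each class, take the images whose
# top-1 prediction is that class and measure the gap between their top-1 and top-2
# prompt similarities. A small average gap flags categories CLIP tends to confuse.
# This is an assumed formulation, not necessarily the exact metric of the cited work.
import torch

def class_wise_matching_margin(sims, num_classes):
    """sims: (N, K) image-to-class-prompt similarities for unlabeled images."""
    top2 = sims.topk(2, dim=1).values          # (N, 2): best and runner-up similarity
    margins = top2[:, 0] - top2[:, 1]
    preds = sims.argmax(dim=1)
    out = torch.full((num_classes,), float("nan"))
    for c in range(num_classes):
        mask = preds == c
        if mask.any():
            out[c] = margins[mask].mean()
    return out                                 # low values flag likely low-accuracy classes
```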

5. Robustness, Limitations, and Transfer

Zero-shot CLIP models exhibit notable advances in robustness but remain vulnerable under several practical conditions:

  • Robustness to Distribution Shift: CLIP generally performs favorably under natural distribution shifts (e.g., ImageNet-V2, ObjectNet), with effective robustness occasionally surpassing supervised models. However, its robustness substantially drops under synthetic corruptions and adversarial perturbations, such as typographic attacks or white-box perturbations, wherein accuracy may decrease by 34.7% relative to clean inputs (2403.10499).
  • Role of Data Overlap: Analysis suggests that robustness gains under “natural” shifts may be inflated by data overlap between pre-training and test sets. When overlapping examples are excluded, performance drops, challenging attributions of generalization solely to language supervision.
  • Bias Correction and Distributional Priors: Methods such as Frolic introduce label-free bias correction mechanisms and distribution learning over prototype embeddings, automatically rebalancing predictions to correct for pre-training class imbalance and achieving up to 2.6% average accuracy gains across diverse datasets (2410.19294). CLIPPR similarly adapts zero-shot outputs (for classification and regression) using priors over the label distribution to bridge gaps between training and target data (2212.00784). A simple prior-correction sketch follows this list.
  • Augmentation for Small Object and 3D Tasks: Performance degrades when target objects occupy small image regions or when extending to 3D shape recognition. Guided cropping strategies such as GC-CLIP utilize zero-shot object detectors (e.g., OWL-ViT) for target localization, providing marked gains in such settings (2309.06581). For 3D objects, methods like MV-CLIP combine multi-view aggregation and hierarchical prompts to enable zero-shot recognition, achieving 84.44% on ModelNet40 without retraining (2311.18402).
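
The sketch below shows a simple prior-matching correction in the spirit of these bias-correction methods: zero-shot probabilities are reweighted by an assumed target label prior divided by the model's own average predicted distribution on unlabeled data. It is not the exact Frolic or CLIPPR algorithm; the temperature and prior are placeholders.

```python
# Simple prior-matching correction (illustrative, not the exact Frolic/CLIPPR procedure):
# reweight zero-shot class probabilities by a target label prior divided by the model's
# average predicted distribution on the unlabeled data.
import torch

def prior_corrected_predictions(sims, target_prior, tau=0.01):
    """sims: (N, K) image-to-prompt similarities; target_prior: (K,) assumed label distribution."""
    probs = torch.softmax(sims / tau, dim=-1)                 # raw zero-shot probabilities
    marginal = probs.mean(dim=0)                              # model's implicit label marginal
    adjusted = probs * (target_prior / marginal.clamp_min(1e-8))
    adjusted = adjusted / adjusted.sum(dim=-1, keepdim=True)  # renormalize per image
    return adjusted.argmax(dim=-1)
```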

6. Training-Free and LMM-Augmented Approaches

Recent advances further minimize the need for re-training and enlist large multimodal models (LMMs) to support and accelerate zero-shot CLIP workflows:

  • Two-Stage LMM Augmentation (TLAC): This approach queries an LMM (e.g., Gemini) for an image description, then uses the CLIP text encoder to embed both the description and all class names, assigning the final label based on maximal similarity. A second query is invoked when semantic drift is detected (e.g., a synonym returned instead of the dataset label). TLAC, a purely training-free pipeline, achieved 83.44% accuracy on base-to-novel splits and outperformed the prior state of the art by 6.75% on several benchmarks (2503.12206).
  • Conformal Prediction for Reliability Assessment: For high-stakes or sensitive applications, conformal prediction methods assess whether the similarity and conformity of a test caption to training captions are sufficient for model trust. Statistical guarantees accompany the procedure, giving a bounded risk of invalidating in-distribution captions (2210.15805).
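
A minimal split-conformal sketch of this reliability check, assuming caption embeddings are already extracted and L2-normalized: nonconformity is taken as one minus the maximum cosine similarity to a reference set of training captions, and test captions are flagged when they exceed the calibrated quantile. The score and threshold choices are illustrative, not the exact procedure of the cited work.

```python
# Split-conformal reliability sketch for captions (score and threshold choices are illustrative).
import torch

def conformal_threshold(cal_emb, ref_emb, alpha=0.1):
    """cal_emb: (M, d) in-distribution calibration captions; ref_emb: (R, d) training captions.
    Both are L2-normalized embeddings; alpha is the tolerated risk level."""
    scores = 1.0 - (cal_emb @ ref_emb.T).max(dim=1).values    # nonconformity scores
    m = scores.numel()
    # Finite-sample conformal quantile: ceil-style correction (m + 1)(1 - alpha) / m, capped at 1.
    return torch.quantile(scores, min(1.0, (m + 1) * (1 - alpha) / m))

def is_trustworthy(test_emb, ref_emb, threshold):
    """Flag test captions whose nonconformity stays within the calibrated threshold."""
    score = 1.0 - (test_emb @ ref_emb.T).max(dim=1).values
    return score <= threshold        # True: within the conformal in-distribution region
```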

7. Outlook and Future Directions

Zero-shot accuracy in CLIP models has undergone marked improvements through prompt engineering, proxy and distributional alignment, model ensembling, and batch/online inference innovations. Several challenges remain:

  • Improving robustness to synthetic, adversarial, and distribution shifts beyond data overlap artifacts.
  • Developing unsupervised, training-free methods that adapt to new tasks, domains, and long-tailed data in real-time.
  • Extending zero-shot reasoning to structured prediction tasks (e.g., segmentation, 3D recognition) through semantic space expansion and hierarchical aggregation.
  • Establishing interpretable, reliable confidence measures that enable deployment in risk-sensitive and regulatory environments.
  • Combining foundation models, specialized expert networks, and large multimodal models in flexible, plug-and-play zero-shot solutions.

The synthesis of alignment techniques, hybrid probabilistic and geometric modeling (e.g., $S^3$ (2412.04925), Frolic (2410.19294)), and continual adaptation promises continued progress toward more universally capable and robust zero-shot vision-language systems.
