Zero-shot and Fine-Tuning Evaluations
- Zero-shot evaluations assess a model's out-of-the-box generalization by testing pretrained models on tasks without further adaptation.
- Fine-tuning evaluations adapt models to target tasks using supervised objectives, enhancing in-domain accuracy while potentially reducing out-of-distribution (OOD) robustness.
- Comparative studies highlight the specialization–robustness trade-off, with methods like VRF and StarFT balancing in-domain gains against distribution shifts.
Zero-shot and fine-tuning evaluations are central methodologies for quantifying and improving the performance and robustness of modern large-scale models—especially vision-language models and LLMs—across a range of in-domain and distribution-shifted tasks. Zero-shot evaluation probes the out-of-the-box generalization properties of models trained on broad, weakly or self-supervised objectives, while fine-tuning tailors these models to target tasks or domains, typically at the cost of altering their original robustness patterns. A growing body of research elucidates the mechanisms governing the transfer, degradation, and retention of both generalization and specialization under these two paradigms.
1. Foundations of Zero-Shot Evaluation
Zero-shot evaluation quantifies how well a pretrained model performs on downstream tasks without any gradient-based adaptation to those tasks. In vision-language models such as CLIP, the zero-shot protocol involves encoding each downstream class as a text prompt (e.g., “a photo of a [class]”), then selecting labels for a test image by maximizing the cosine similarity between the image embedding and each prompt embedding, possibly followed by softmax normalization: p(y = c | x) = exp(cos(f(x), g(t_c)) / τ) / Σ_{c′} exp(cos(f(x), g(t_{c′})) / τ), where f and g are the image and text encoders, t_c is the prompt for class c, and τ is a temperature. This zero-shot mapping exploits the model’s alignment of disparate modalities in a joint space. In LLMs, zero-shot evaluation typically involves providing the model with task instructions or classification labels via natural language prompt templates and evaluating its predictions without any further adaptation (Wei et al., 2021).
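The CLIP-style protocol above can be sketched in a few lines of NumPy; the embeddings and the temperature value of 0.01 here are illustrative stand-ins for real encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Zero-shot label selection: cosine similarity between an image
    embedding and one text-prompt embedding per class, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per class
    logits = sims / temperature            # temperature-scaled logits
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs
```

No gradients are involved: the only task-specific input is the set of prompt embeddings, which is what makes the evaluation "zero-shot."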
Key metrics include top-1 accuracy for classification tasks, AUROC for multi-label setups, and task-appropriate scores (e.g., macro-F₁, BLEU, ROUGE) for structured prediction and generation (Jang et al., 2022, Vogt-Lowell et al., 2023, Hadeliya et al., 2024).
2. Objectives and Protocols for Fine-Tuning Evaluation
Fine-tuning evaluation measures the efficacy of adapting a pretrained model to a new, often smaller, labeled dataset by optimizing a supervised objective, such as cross-entropy for classification, InfoNCE for contrastive learning, or other domain-specific losses (Zhu et al., 2024, Kim et al., 2024). Two main regimes are prevalent:
Full fine-tuning updates all model parameters. Parameter-efficient fine-tuning (PEFT)—such as LoRA, adapters, or prompt learning—modifies only a small subset while freezing the backbone, enhancing efficiency and sometimes OOD retention (Farhadzadeh et al., 29 May 2025, Kim et al., 2024).
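As a concrete illustration of the PEFT regime, the following sketch mimics a LoRA-style linear layer, assuming the standard low-rank update W + (α/r)·BA with the backbone weight W frozen; the rank and scaling values are illustrative, not taken from any cited method:

```python
import numpy as np

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: the frozen weight W is
    augmented by a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                # frozen backbone weight
        d_out, d_in = W.shape
        self.A = rng.normal(0, 0.01, (r, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, r))             # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        # Effective weight is the frozen W plus the scaled low-rank update.
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because B is zero-initialized, the adapted layer reproduces the frozen layer exactly at the start of fine-tuning; only A and B would receive gradients, which is the source of PEFT's efficiency.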
The typical experimental protocol involves splitting the target dataset into train/val/test (or group/id/ood) splits. Models are further evaluated on out-of-distribution (OOD) splits corresponding to distribution shifts, such as geographical domains (WILDS), synthetic corruptions, or spurious-cue-controlled shifts (Waterbirds, PACS, DomainNet) (Vogt-Lowell et al., 2023, Nam et al., 2024, Kim et al., 19 May 2025).
Quantitative metrics include in-domain (ID) accuracy, OOD accuracy, worst-group (WG) and average-group metrics for group-shifted distributions, and calibration-related metrics such as Expected Calibration Error (ECE) and negative log-likelihood (NLL) under distribution shift (Kim et al., 19 May 2025, Nam et al., 2024, Zhu et al., 2024).
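Of the calibration metrics above, ECE is straightforward to compute from held-out predictions; a minimal equal-width-binning sketch (the 10-bin choice is illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by top-class confidence, then average the
    |accuracy - mean confidence| gap per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece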
3. Comparative Performance and Trade-Offs: Zero-Shot vs Fine-Tuning
Pretrained vision-language models and LLMs often exhibit strong generalization in zero-shot evaluation, especially on benchmarks with synthetic or modest distribution shifts (Vogt-Lowell et al., 2023, Kim et al., 2024). However, as shown across modalities and domains, fine-tuning can both improve in-domain accuracy and degrade zero-shot/OOD robustness (Kim et al., 19 May 2025, Zhu et al., 2024, Nam et al., 2024). This “specialization–robustness” trade-off is consistently observed:
| Method | In-domain Acc (%) | OOD Acc (%) | Robustness Gain (Δ) |
|---|---|---|---|
| Zero-shot (CLIP ViT-B/16) | 68.3 | 58.4 | Baseline |
| E2E-FT | 81.3 | 53.7 | −4.7 (vs ZS) |
| Ensemble FT+ZS (OSE) | 82.2 | 60.2 | +1.8 (vs ZS) |
| VRF | 82.3 | 61.8 | +3.4 (vs ZS) |
Sample ID/OOD accuracies on ImageNet and its distribution shifts (Zhu et al., 2024)
Fine-tuning with standard objectives can lead to “forgetting” of OOD robustness and the acquisition of spurious correlations that are aligned with the target domain but detrimental elsewhere (Kim et al., 19 May 2025). Modern robust fine-tuning objectives—including StarFT's spurious textual alignment regularization (Kim et al., 19 May 2025), variance-reduction ensembles (VRF) (Zhu et al., 2024), Lipsum-FT's random-text alignment (Nam et al., 2024), and R-Adapter with self-ensembling (Kim et al., 2024)—explicitly target the preservation of zero-shot robustness, demonstrating measurable gains on OOD benchmarks (OOD accuracy increases of up to +5.6 percentage points over standard fine-tuning (Kim et al., 2024)).
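The simplest of these robustness-preserving mechanisms—ensembling the outputs of the zero-shot and fine-tuned models—can be sketched as follows. The fixed interpolation weight w is a hypothetical simplification; methods such as VRF instead choose the weight adaptively per sample:

```python
import numpy as np

def ensemble_predict(zs_logits, ft_logits, w=0.5):
    """Output-space ensemble: interpolate zero-shot and fine-tuned
    class probabilities, with weight w on the fine-tuned model."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    probs = (1 - w) * softmax(zs_logits) + w * softmax(ft_logits)
    return probs.argmax(axis=-1)
```

Setting w = 0 recovers the zero-shot model and w = 1 the fine-tuned one; intermediate values trade in-domain accuracy against OOD robustness along the spectrum the table above illustrates.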
4. Predictive Measures and Analysis Frameworks
Recent advances introduce predictive metrics—based on zero-shot embeddings—that anticipate both the extent of possible ID gain and OOD forgetting from fine-tuning, without requiring target-task supervision (Niss et al., 2024). The Inter-Intra Modal Measure (IIMM) combines the average cosine similarity between image embeddings and incorrect text-label embeddings with the average intra-image embedding similarity. Empirically, tasks with high IIMM scores yield greater fine-tuning gains but also larger OOD degradation; IIMM thus provides a reliable, computationally light prior on the expected learning/forgetting dynamics across multiple architectures and fine-tuning regimes.
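A simplified sketch of an IIMM-style statistic, assuming it averages the two terms just described (mean image-to-incorrect-label similarity and mean pairwise intra-image similarity); consult Niss et al. (2024) for the exact formulation:

```python
import numpy as np

def iimm_like_score(image_embs, text_embs, labels):
    """Simplified inter-intra modal measure: average of
    (a) mean image-to-incorrect-label cosine similarity and
    (b) mean pairwise intra-image cosine similarity."""
    I = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    T = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = I @ T.T                        # image-by-class similarities
    n = sims.shape[0]
    mask = np.ones_like(sims, dtype=bool)
    mask[np.arange(n), labels] = False    # drop each correct-label entry
    inter = sims[mask].mean()             # similarity to wrong labels
    intra = (I @ I.T)[np.triu_indices(n, k=1)].mean()  # image clustering
    return 0.5 * (inter + intra)
```

Datasets whose image embeddings cluster tightly and sit close to incorrect label prompts score higher, which—per the empirical finding above—predicts both larger fine-tuning gains and larger OOD forgetting.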
Other approaches, such as zero-shot meta-learning surrogates, use meta-features and historical performance data to recommend models and hyperparameters for a new dataset, balancing expected fine-tuning efficiency against overfitting risk (Öztürk et al., 2022).
5. Innovations in Robust Fine-Tuning Methodology
Robust fine-tuning strategies are increasingly necessary to reconcile the efficiency and accuracy of downstream adaptation with the generalization and out-of-domain performance of foundation models:
- StarFT applies KL divergence alignment with zero-shot outputs on “spuriosity-injected” text prompts generated by prompting LLMs about likely confounders (background, texture, resolution), mitigating the acquisition of spurious features (Kim et al., 19 May 2025).
- Lipsum-FT preserves the vision-language joint energy landscape by regularizing fine-tuned logits on random “lipsum” token sequences, maintaining generic text/image alignment (Nam et al., 2024).
- Variance Reduction Fine-tuning (VRF) computes adaptive sample-wise ensemble weights between zero-shot and fine-tuned models using k-NN distances in the feature space to a set of “zero-shot failure” training images, effectively reducing predictive error variance under shift (Zhu et al., 2024).
- R-Adapter adds lightweight parameter-efficient modules to both transformer modalities, using random adapter dropping as an in-model self-ensemble mechanism together with a multi-positive margin InfoNCE loss, yielding simultaneous ID and OOD accuracy gains with minimal parameter overhead (Kim et al., 2024).
- Zero-Shot PEFT Transfer (ProLoRA) enables the transfer of LoRA adapters between diffusion models with no retraining by subspace and null-space projection, preserving the action of the source adapter in the target weight space (Farhadzadeh et al., 29 May 2025).
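Several of the methods above (StarFT, Lipsum-FT) share a common ingredient: a divergence penalty anchoring the fine-tuned model's outputs to zero-shot reference outputs. A minimal sketch of such a KL-alignment term, shown here on ordinary logits rather than the spuriosity-injected or random "lipsum" prompts each method actually uses:

```python
import numpy as np

def kl_alignment_loss(ft_logits, zs_logits):
    """KL(zero-shot || fine-tuned) over class probabilities: penalizes the
    fine-tuned model for drifting away from the zero-shot reference."""
    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(zs_logits)   # reference (zero-shot) distribution
    q = softmax(ft_logits)   # fine-tuned distribution
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```

Added to the supervised task loss with a weight λ, this term is zero when the fine-tuned outputs match the zero-shot outputs and grows as they diverge, directly trading specialization against retention of zero-shot behavior.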
Emerging ablation protocols, such as varying strength/type of spuriosity, hyperparameter λ and adapter rank, or evaluating on diverse group-shifted and OOD datasets, clarify the robustness properties and limitations of each technique (Kim et al., 19 May 2025, Kim et al., 2024).
6. Extensions across Modalities and Specialized Domains
Zero-shot and fine-tuning evaluations have broad applicability:
- Medical imaging: Label-free and relaxed positive pair fine-tuning can significantly improve zero-shot classification of rare or multi-label pathologies, with gains exceeding board-certified radiologist performance on certain tasks (Jang et al., 2022).
- Cross-lingual transfer: Character-level noise during BERT fine-tuning facilitates transfer to dialects/languages with high lexical overlap and short OOV tokens, without degrading source accuracy (Srivastava et al., 2023).
- Text classification: Self-training on confident model predictions sharpens class separation, allowing plug-and-play adaptation of zero-shot entailment models without annotation (Gera et al., 2022).
- Instruction and chain-of-thought tuning: For LLMs, instruction tuning and CoT-augmented fine-tuning can transform large pretrained models into effective zero-shot/few-shot learners, often surpassing even larger purely zero-shot baselines (Wei et al., 2021, Kim et al., 2023). Evaluating such models requires universal metrics capable of cross-task and cross-format comparison, frequently leveraging LLM-based reward scorers (Faysse et al., 2023).
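The self-training recipe above hinges on a confidence-filtered pseudo-labeling step, which can be sketched as follows; `predict_proba` and the 0.9 threshold are illustrative placeholders for the zero-shot entailment model and a tuned cutoff:

```python
def select_pseudo_labels(texts, predict_proba, threshold=0.9):
    """Self-training selection step: keep only examples whose top predicted
    class probability exceeds the confidence threshold, and use the argmax
    class as the pseudo-label for the next fine-tuning round."""
    selected = []
    for text in texts:
        probs = predict_proba(text)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((text, best))
    return selected
```

Iterating this selection and fine-tuning on the resulting pseudo-labels sharpens class separation without any human annotation, which is what makes the adaptation plug-and-play.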
7. Evaluation, Diagnostics, and Future Outlook
Robust evaluation protocols should report both in-domain and OOD/group-shifted metrics, track calibration under distribution shift, and include the zero-shot baseline as a reference point whenever fine-tuning is performed (Zhu et al., 2024, Kim et al., 19 May 2025). Interpretability tools—such as IIMM or zero-shot surrogate models—can inform model selection, hyperparameter choices, and the anticipated forgetting/learning dynamics. When deploying robust fine-tuning recipes, practitioners should select or combine loss functions and parameter constraints that explicitly promote cross-domain alignment (e.g., energy gaps, KL or MSE alignment on auxiliary prompts) and iteratively validate on diverse held-out shifts (Nam et al., 2024, Kim et al., 19 May 2025, Kim et al., 2024).
A plausible implication is that, as foundation models are increasingly deployed in evolving, heterogeneous modalities (vision, language, medical, multilingual), seamless integration of zero-shot diagnostics, robust and efficient adapter/fine-tuning designs, and universal evaluation metrics will become the standard for rigorous model assessment and long-term generalization (Niss et al., 2024, Zhu et al., 2024, Faysse et al., 2023, Farhadzadeh et al., 29 May 2025).