Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

Published 1 May 2026 in cs.CL and cs.CV | (2605.00326v1)

Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.

Authors (3)

Summary

The paper demonstrates that semantically equivalent prompts can induce significant variability in unsafe probability scores for zero-shot VLM safety classification.
The study evaluates seven VLM families across safety benchmarks, showing that averaging prompt scores reduces negative log-likelihood and improves calibration error.
The work highlights that prompt-induced variance is a key diagnostic for prediction fragility and calls for mean prompt aggregation as a reliable baseline.

Problem Motivation and Scope

This work investigates the reliability of zero-shot binary safety classification with vision-language models (VLMs), focusing on how semantically equivalent prompt reformulations can destabilize first-token probability scores. It questions the largely unexamined assumption that a single prompt yields stable, deployable probability estimates for safety decisions, and shows empirically that the assumption fails even when the binary label is constrained to a fixed output position. The analysis spans seven publicly available VLM families and two multimodal safety benchmarks, UnsafeBench and HoliSafe-Bench, all evaluated zero-shot with frozen models.
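To make "first-token probability score" concrete, the sketch below shows one common way to turn the logits at the first output position into a binary unsafe probability. It assumes the score is renormalized over just two label tokens; this summary does not state the paper's exact extraction rule, and the function name and logit values are hypothetical.

```python
import math

def unsafe_probability(logit_unsafe: float, logit_safe: float) -> float:
    """Two-way softmax over the label-token logits at the first output position.

    Assumption: the score is renormalized over only the two label tokens.
    """
    m = max(logit_unsafe, logit_safe)  # subtract the max for numerical stability
    e_unsafe = math.exp(logit_unsafe - m)
    e_safe = math.exp(logit_safe - m)
    return e_unsafe / (e_unsafe + e_safe)

# Hypothetical logits for the same sample under two paraphrased prompts.
print(unsafe_probability(2.0, 1.0))  # about 0.73
print(unsafe_probability(0.5, 1.5))  # about 0.27
```

With these made-up logits the same sample lands on opposite sides of a 0.5 threshold, which is the kind of instability the study measures.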

Prompt Fragility: Empirical Evidence

Semantically equivalent prompts can assign materially different unsafe probabilities to the same image-query pair. The primary diagnostic is the cross-prompt standard deviation $\sigma_i$ of the unsafe probability for sample $i$. Aggregated over all model-dataset pairs, samples in the highest $\sigma_i$ decile show both more prompt-level label disagreement and higher classification error than samples in the lowest decile, so cross-prompt variance tracks the fragility of the safety prediction.

Figure 1: Prompt-induced score fragility in zero-shot binary safety classification: higher cross-prompt variance is associated with increased disagreement and error rates on individual samples.
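The per-sample diagnostic can be sketched in a few lines. The code assumes a population standard deviation and a 0.5 decision threshold for prompt-level labels; neither choice is stated in this summary, and the function names and scores are illustrative.

```python
import statistics

def cross_prompt_std(scores: list[float]) -> float:
    """sigma_i: standard deviation of one sample's unsafe probabilities across prompts.

    Assumption: population standard deviation (divides by the number of prompts).
    """
    return statistics.pstdev(scores)

def prompt_disagreement(scores: list[float], threshold: float = 0.5) -> bool:
    """True if at least two prompts give different thresholded labels."""
    labels = {s >= threshold for s in scores}
    return len(labels) > 1

# Hypothetical unsafe probabilities for two samples under three prompts.
stable = [0.90, 0.92, 0.88]
fragile = [0.15, 0.55, 0.80]
print(cross_prompt_std(stable), prompt_disagreement(stable))
print(cross_prompt_std(fragile), prompt_disagreement(fragile))
```

The second sample has both a larger $\sigma_i$ and a label flip across prompts, matching the association described above.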

Analysis of the D10 $-$ D1 decile gaps (highest-$\sigma_i$ decile minus lowest-$\sigma_i$ decile) shows the same fragility pattern in every model family considered (Figure 2). The pattern persists across prompt families (strict label-only versus label-with-explanation/continuation), which argues against it being an artifact of prompt style.

Figure 2: Cross-family fragility gaps between the highest- and lowest-$\sigma_i$ deciles: error and disagreement rates are elevated in the high-variance decile regardless of model family or benchmark.
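A D10 $-$ D1 gap of this kind can be computed as in the following sketch. It assumes deciles are formed by sorting samples on $\sigma_i$ and taking the bottom and top tenth; how the paper handles ties and uneven decile sizes is not given here, and `decile_gap` is a hypothetical helper.

```python
def decile_gap(sigma: list[float], is_error: list[bool]) -> float:
    """Error rate in the highest-sigma decile minus error rate in the lowest-sigma decile."""
    n = len(sigma)
    k = max(1, n // 10)  # decile size, at least one sample
    order = sorted(range(n), key=lambda i: sigma[i])  # indices, ascending sigma
    lowest, highest = order[:k], order[-k:]

    def rate(indices: list[int]) -> float:
        return sum(is_error[i] for i in indices) / len(indices)

    return rate(highest) - rate(lowest)

# Synthetic example: errors concentrated in the highest-variance samples.
sigma = [i / 100 for i in range(100)]
is_error = [i >= 90 for i in range(100)]
print(decile_gap(sigma, is_error))
```

Passing a per-sample disagreement flag instead of `is_error` gives the corresponding disagreement gap.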

Mean Prompt Aggregation as a Reliability Baseline

To mitigate per-prompt instability, the paper proposes the mean prompt ensemble, which simply averages scores across a family of semantically equivalent prompts, as a training-free, label-free aggregation baseline. Across the 14 model-dataset evaluation pairs, the ensemble achieves lower negative log-likelihood (NLL) on all 14 and lower expected calibration error (ECE) on 12, compared with both train-selected and randomly selected single-prompt baselines. These results are robust to the prompt selection protocol, the ECE binning scheme, and model scale.

Figure 3: Delta improvements (mean ensemble vs. train-selected prompt) for reliability (NLL/ECE) and ranking (AUROC/AUPRC); positive values denote improved calibration or ranking.
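The mean ensemble and the NLL metric are simple enough to sketch directly. The data layout (`prompt_scores[p][i]` is prompt `p`'s unsafe probability for sample `i`) and the clipping constant are assumptions made for illustration, and the scores are made up.

```python
import math

def mean_ensemble(prompt_scores: list[list[float]]) -> list[float]:
    """Average the unsafe probability over prompts, separately for each sample."""
    n_prompts = len(prompt_scores)
    return [sum(column) / n_prompts for column in zip(*prompt_scores)]

def nll(probs: list[float], labels: list[int], eps: float = 1e-12) -> float:
    """Mean binary negative log-likelihood; label 1 means unsafe."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)

# Hypothetical scores: 3 prompts x 3 samples.
scores = [[0.9, 0.2, 0.6], [0.7, 0.4, 0.3], [0.8, 0.3, 0.9]]
labels = [1, 0, 1]
ensemble = mean_ensemble(scores)
print(nll(ensemble, labels), [nll(s, labels) for s in scores])
```

Because the log loss is convex, the NLL of the averaged score can never exceed the average NLL of the individual prompts. That is a general property of the metric rather than a result of the paper; it does not by itself guarantee beating the best single prompt, which is what the 14/14 comparison tests empirically.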

Reliability diagrams show that mean-ensemble scores are consistently better calibrated than single-prompt scores across benchmarks (see Figure 4).
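For reference, a minimal ECE computation is sketched below. It assumes ten equal-width bins over the predicted unsafe probability and compares each bin's mean prediction with its observed unsafe rate; other ECE variants (top-label confidence, equal-mass bins) exist, and this summary does not say which the paper uses.

```python
def ece(probs: list[float], labels: list[int], n_bins: int = 10) -> float:
    """Expected calibration error on the unsafe-class probability (equal-width bins)."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[b].append((p, y))
    n = len(labels)
    total = 0.0
    for items in bins:
        if not items:
            continue
        mean_prob = sum(p for p, _ in items) / len(items)
        unsafe_rate = sum(y for _, y in items) / len(items)
        total += (len(items) / n) * abs(unsafe_rate - mean_prob)
    return total

# A bin predicting 0.25 with one unsafe sample in four is perfectly calibrated.
print(ece([0.25, 0.25, 0.25, 0.25], [1, 0, 0, 0]))
```

The per-bin pairs of mean prediction and observed rate are the points plotted in a reliability diagram.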

Ranking vs. Score Reliability

Mean prompt aggregation consistently improves probability reliability (NLL, ECE), but its effect on ranking metrics (AUROC, AUPRC) is less uniform. AUPRC gains hold against both the median and the mean of the 15-prompt distribution (13/14 pairs), whereas AUROC gains are weaker (9/14 pairs against the median). Reliable probability estimates and good ranking are therefore related but distinct properties.
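The distinction between ranking and score reliability can be seen in a small sketch. AUROC is computed here from pairwise comparisons, and AUPRC is approximated by average precision with ties broken by sort order; the paper's exact AUPRC estimator is not stated in this summary, and the scores are invented.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Probability that a random unsafe sample outscores a random safe one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores: list[float], labels: list[int]) -> float:
    """Mean of precision-at-rank taken at each unsafe sample, in descending score order."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
squashed = [s ** 2 for s in scores]  # monotone transform: same ordering, different probabilities
print(auroc(scores, labels), auroc(squashed, labels))
print(average_precision(scores, labels), average_precision(squashed, labels))
```

Both metrics depend only on the ordering of scores, so a monotone distortion leaves them unchanged while altering NLL and ECE. This is why an aggregation method can improve reliability without a matching gain in ranking.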

Prompt Aggregation Versus Labeled Calibration

A strictly training-free mean ensemble is competitive with, and often better than, standard labeled post-hoc calibration (temperature scaling, Platt scaling, isotonic regression) applied to a selected single prompt, as measured by NLL. Stacking post-hoc calibration on top of the mean ensemble yields further NLL and ECE improvements (11–12/14 pairs). Notably, calibration alone cannot remove prompt-induced fragility that originates at the score extraction stage.

Figure 5: Improvement in NLL as a function of the number of top-k prompts averaged for the mean ensemble; most benefit accrues within the first few prompts.
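A curve like the one in Figure 5 could be produced along the following lines. The sketch assumes prompts are ranked by NLL on a labeled train split and that the top k are then averaged on the test split; the ranking criterion is an assumption, and the function name and data are hypothetical.

```python
import math

def nll(probs: list[float], labels: list[int], eps: float = 1e-12) -> float:
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)

def top_k_mean_curve(train_scores, train_labels, test_scores, test_labels):
    """Test NLL of the mean over the k best prompts (by train NLL), for k = 1..number of prompts."""
    order = sorted(range(len(train_scores)),
                   key=lambda p: nll(train_scores[p], train_labels))
    curve = []
    for k in range(1, len(order) + 1):
        chosen = [test_scores[p] for p in order[:k]]
        ensemble = [sum(column) / k for column in zip(*chosen)]
        curve.append(nll(ensemble, test_labels))
    return curve

# Hypothetical scores: 3 prompts, 2 train samples, 2 test samples.
train_scores = [[0.6, 0.4], [0.9, 0.1], [0.2, 0.8]]
test_scores = [[0.7, 0.3], [0.8, 0.2], [0.4, 0.6]]
print(top_k_mean_curve(train_scores, [1, 0], test_scores, [1, 0]))
```

The first entry is the train-selected single-prompt baseline and the last is the full mean ensemble, so the curve shows how quickly averaging pays off as prompts are added.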

Selective Prediction and Calibration Under Coverage Constraints

Cross-prompt variance $\sigma_i$ is a natural abstention signal for selective prediction (abstain on high-variance samples), but its usefulness is mixed and depends on the evaluation metric and the model. Cross-prompt entropy and margin-based uncertainty signals occasionally rank the hardest samples better. The core finding is that cross-prompt variance works better as a diagnostic of prediction fragility than as an optimal abstention mechanism.

Figure 6: Retained error curves in the high-coverage regime demonstrate the marginal benefits of different uncertainty signals for selective prediction.
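A retained-error point at a given coverage can be computed as in this sketch: keep the most certain fraction of samples under some uncertainty signal and measure the error rate on what remains. The rounding rule for the retained count and the name `retained_error` are assumptions; any signal ($\sigma_i$, entropy, negative margin) can be passed as `uncertainty`.

```python
def retained_error(uncertainty: list[float], is_error: list[bool], coverage: float) -> float:
    """Error rate on the `coverage` fraction of samples with the lowest uncertainty."""
    n = len(uncertainty)
    keep = max(1, round(coverage * n))  # number of samples not abstained on
    order = sorted(range(n), key=lambda i: uncertainty[i])  # most certain first
    kept = order[:keep]
    return sum(is_error[i] for i in kept) / keep

# Synthetic example: the two highest-uncertainty samples are the only errors.
uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
is_error = [False] * 8 + [True] * 2
print(retained_error(uncertainty, is_error, 1.0))
print(retained_error(uncertainty, is_error, 0.8))
```

Sweeping `coverage` over the high-coverage range and repeating for each signal yields curves that can be compared directly.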

Theoretical Implications and Recommendations

Prompt-induced score fragility in zero-shot binary safety classification has both theoretical and practical implications. Theoretically, the study calls into question the validity of using a first-token probability as a deployable decision score without prompt averaging: probability instability is an emergent property of the interface between model and prompt, independent of overall accuracy. Practically, a mean-ensemble baseline should be reported alongside traditional metrics (AUROC, AUPRC), and cross-prompt variance should serve as an auditable reliability diagnostic for downstream systems that rely on probabilistic thresholds.

The work also frames prompt-family variation as a stress-test methodology for reliability analysis, positioning prompt engineering as a principled component of evaluation rather than merely a tuning heuristic for single-prompt performance.

Limitations and Future Directions

The analysis is scoped to binary safety classification tasks operating in a model-frozen, inference-only regime. Extensions to structured outputs, more complex classification spaces, or generative tasks are outside the present scope. The investigated aggregation techniques are restricted to training-free or simple post-hoc variants; further exploration of learned or Bayesian aggregation, as well as more sophisticated abstention policies, remains open.

Expansion to more diverse multimodal settings, adaptive prompt selection, and real-world prevalence conditions is needed to generalize the findings. Furthermore, whether prompt-family variance transfers as a reliability signal to non-binary or non-safety tasks remains empirically unresolved.

Conclusion

This study establishes that zero-shot first-token probabilities used in VLM binary safety classification are unstable under semantically equivalent prompt reformulation, which challenges the deployment of single-prompt decision pipelines. Mean prompt aggregation is a robust, label-free baseline that improves score reliability and ought to be adopted as a standard evaluation practice. Cross-prompt variance serves both as a diagnostic of prediction fragility and as a tool for systematic reliability auditing in zero-shot multimodal safety classification.

Figure 4: Reliability diagrams: mean-ensemble scores are consistently closer to perfect calibration (diagonal) than single-prompt scores across all four core evaluation settings.