Zero-Shot PS+ Prompting
- Zero-shot PS+ prompting is a framework that enhances model performance by decomposing problems and optimizing prompts without any labeled data.
- It leverages unsupervised signals like language model perplexity, sensitivity scores, and prompt consistency to select and refine prompts across textual and multimodal tasks.
- Empirical evaluations show that PS+ strategies significantly reduce calculation and missing-step errors, improving accuracy in reasoning, classification, and multimodal challenges.
Zero-shot PS+ Prompting is a family of annotation-free prompting strategies for LLMs and vision-LLMs (VLMs) that emphasize explicit problem decomposition, stepwise reasoning, or prompt optimization—without any labeled data. These methods, unified by the “PS+” principle, systematically structure model inputs to enhance zero-shot transfer, stability, and performance across textual and multimodal tasks. Canonical PS+ approaches include Plan-and-Solve Plus (PS+), Perplexity Selection (Perplection), prompt consistency distillation, dynamic instance-level prompt rewriting, sensitivity-weighted ensembling, and visually-adaptive prompt retrieval.
1. Conceptual Foundations of Zero-Shot PS+ Prompting
Zero-shot PS+ Prompting is characterized by strategies that augment standard zero-shot prompting with an explicit, generally stepwise, procedural or evaluative scaffold, but without leveraging any downstream task labels or exemplars. Unlike vanilla zero-shot prompts—which might append a single instruction (“Let’s think step by step”) or use an uncalibrated template—PS+ prompts aim to systematically regularize, select, or rewrite prompts based on unsupervised signals (perplexity, consistency, prompt sensitivity, or visual alignment).
A paradigmatic example is the Plan-and-Solve Plus (PS+) prompt (Wang et al., 2023), which for a reasoning task instructs the LLM to:
```text
Q: <problem statement>
A: Let's first understand the problem, extract relevant variables and their
corresponding numerals, and devise a complete plan. Then, let's carry out the
plan, calculate intermediate variables (pay attention to correct numerical
calculation and commonsense), solve the problem step by step, and show the
answer.
```
This template enforces modularized processing—variable extraction, planning, calculation—within a single, label-free interaction.
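The scaffold above is just string construction around the problem statement; a minimal sketch (the helper name `build_ps_plus_prompt` is ours, not from the paper):

```python
# Sketch of assembling a Plan-and-Solve Plus (PS+) prompt (Wang et al., 2023).
# The trigger text follows the template quoted above; `build_ps_plus_prompt`
# is a hypothetical helper name, not an API from the paper.

PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and "
    "their corresponding numerals, and devise a complete plan. Then, let's "
    "carry out the plan, calculate intermediate variables (pay attention to "
    "correct numerical calculation and commonsense), solve the problem step "
    "by step, and show the answer."
)

def build_ps_plus_prompt(problem: str) -> str:
    """Wrap a raw problem statement in the label-free PS+ scaffold."""
    return f"Q: {problem}\nA: {PS_PLUS_TRIGGER}"

prompt = build_ps_plus_prompt("A store sold 3 pens at $2 each. Total revenue?")
```

The same string would then be sent to the LLM as a single zero-shot query; no exemplars or labels are involved.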
2. Core Methodologies
Several methodological instantiations of PS+ emerge in the literature:
2.1. Plan-and-Solve Plus (PS+)
PS+ prompts decompose reasoning tasks into explicit planning, variable extraction, subgoal execution, and careful calculation (Wang et al., 2023). The prompt explicitly instructs the model to enumerate all relevant variables, devise a solution strategy, compute all intermediate results, and present the final answer with all steps shown. This hierarchical scaffold is empirically shown to reduce both “missing-step errors” and “calculation errors” compared to “Let’s think step by step” style CoT prompts.
2.2. Perplexity Selection (Perplection)
Perplection (Lu et al., 2022) selects among a pool of candidate cloze-style prompts for zero-shot classification by screening for the lowest mean LLM perplexity over a set of unlabeled in-domain examples. The hypothesis is that lower perplexity identifies templates whose format aligns better with the pretraining distribution, thus yielding higher zero-shot accuracy. The Perplection procedure involves iterating over all candidate templates, generating prompted inputs with each label word, computing mean sequence perplexity, and selecting the template with the lowest average value.
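The selection loop reduces to scoring every template with a language model and taking the argmin. A minimal sketch, assuming a caller-supplied `log_prob_fn` (a stand-in for a real LM scorer that returns per-token log-probabilities of a string):

```python
import math

# Sketch of Perplexity Selection (Perplection, Lu et al., 2022): choose the
# cloze template whose prompted inputs have the lowest mean LM perplexity
# over unlabeled in-domain examples. `log_prob_fn` is an assumed interface,
# not an API from the paper.

def mean_perplexity(template, examples, label_words, log_prob_fn):
    """Average sequence perplexity over all (example, label word) fillings."""
    ppls = []
    for x in examples:
        for y in label_words:
            token_logps = log_prob_fn(template.format(text=x, label=y))
            ppls.append(math.exp(-sum(token_logps) / len(token_logps)))
    return sum(ppls) / len(ppls)

def select_template(templates, examples, label_words, log_prob_fn):
    """Return the candidate template with minimal mean perplexity."""
    return min(
        templates,
        key=lambda t: mean_perplexity(t, examples, label_words, log_prob_fn),
    )
```

Note that every label word is filled in for every example, so the score reflects the template's fluency under the pretraining distribution rather than any downstream label.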
2.3. Prompt Consistency Distillation
Prompt consistency (Zhou et al., 2022) regularizes a model to produce stable outputs across multiple semantically-equivalent prompts. Using only unlabeled data, the model is fine-tuned with a swarm-distillation loss, encouraging consistent predictions regardless of natural-language template, followed by ensembling at inference. This approach is fully zero-shot and leverages diversity in prompt phrasing to improve generalization.
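The regularizer penalizes disagreement between the model's predictive distributions under paraphrased templates. A simplified sketch using a pairwise symmetric KL (a stand-in for the paper's swarm-distillation loss, not its exact form):

```python
import math
from itertools import combinations

# Simplified sketch of a prompt-consistency objective in the spirit of
# Zhou et al. (2022): given one predictive label distribution per paraphrased
# prompt for the same unlabeled input, penalize pairwise disagreement.

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (lists of probs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(distributions):
    """Mean symmetric KL over all pairs of per-prompt label distributions."""
    pairs = list(combinations(distributions, 2))
    return sum(kl(p, q) + kl(q, p) for p, q in pairs) / len(pairs)
```

During fine-tuning this loss is minimized over unlabeled inputs only; at inference the paraphrased prompts are ensembled.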
2.4. Instance-level Prompt Optimization
Methods such as PRomPTed (Srivastava et al., 2023) use a meta-optimization loop, with an LLM-in-the-loop that critiques and adaptively rewrites the prompt for each test instance. The meta model iteratively rewrites prompts based on their output until a “correct” solution is detected by the meta-LLM itself. This approach consistently surpasses static zero-shot CoT, standard output self-refinement, and few-shot baselines, and generalizes across task types, LLM backbones, and even when the meta-LLM is weaker than the solver-LLM.
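The control flow of such a meta-optimization loop is compact. A minimal sketch, where `solve` and `critique_and_rewrite` are assumed stand-ins for calls to the solver LLM and the meta-LLM (the function names are ours, not PRomPTed's API):

```python
# Sketch of instance-level prompt rewriting in the spirit of PRomPTed
# (Srivastava et al., 2023). `solve(prompt)` stands in for the solver LLM;
# `critique_and_rewrite(prompt, output)` stands in for the meta-LLM, which
# either accepts the output or returns a rewritten prompt for this instance.

def optimize_instance(task_input, solve, critique_and_rewrite, max_rounds=3):
    prompt = task_input
    output = solve(prompt)
    for _ in range(max_rounds):
        verdict, new_prompt = critique_and_rewrite(prompt, output)
        if verdict == "accept":      # meta-LLM judges the solution correct
            break
        prompt = new_prompt          # adapt the prompt, not the output
        output = solve(prompt)
    return output
```

The key design choice, reflected above, is that the meta-LLM edits the *prompt* rather than the solver's *output*, which is what distinguishes this from standard output self-refinement.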
2.5. Prompt Sensitivity Plus (PS⁺) and Bias-corrected Prompt Ensembling
PS⁺ prompting for sentiment classification (Chakraborty et al., 2023) expands a base prompt into thousands of variants using unsupervised paraphrasing, position shuffling, and subordination. Each variant is ranked via a data-free sensitivity score—favoring prompts whose predictions flip under label polarity swaps and remain robust to synonym replacement. In contrast, bias-corrected prompt weighting in image classification (Allingham et al., 2023) ranks candidate prompts via a bias-subtracted signal (average per-image class-max logit minus average prompt-class dot product) over unlabeled data, to address pretraining-induced pathologies in prompt selection and stabilize zero-shot ensembles.
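The bias-corrected score described above (average per-image class-max logit minus average prompt-class logit) can be sketched directly on a logit tensor; softmax normalization of the scores into ensemble weights is our assumption about the final weighting step:

```python
import numpy as np

# Sketch of bias-corrected prompt weighting in the spirit of Allingham et al.
# (2023). `logits` has shape (prompts, images, classes), computed on
# unlabeled images; each prompt's score is its average per-image class-max
# logit minus its average logit over all classes (the bias term).

def bias_corrected_weights(logits):
    max_term = logits.max(axis=2).mean(axis=1)       # avg per-image class-max
    bias_term = logits.mean(axis=(1, 2))             # avg prompt-class logit
    scores = max_term - bias_term
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax-normalize
    return weights

def ensemble_logits(logits):
    """Weighted zero-shot ensemble over the prompt axis -> (images, classes)."""
    return np.tensordot(bias_corrected_weights(logits), logits, axes=1)
```

Subtracting the bias term is what counteracts prompts that score high merely because they inflate all class logits uniformly, a pretraining-induced pathology in naive max-logit selection.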
3. Evaluation Protocols and Quantitative Impact
Zero-shot PS+ frameworks have been validated across domains:
- Reasoning/Math: On GSM8K, SVAMP, MultiArith, and similar datasets, PS+ yields a mean accuracy of 76.7%, outperforming Zero-Shot-CoT (70.4%) and matching or exceeding few-shot CoT (77.6%) (Wang et al., 2023). It particularly suppresses calculation and missing-step errors.
- Text Classification: Perplection delivers 4.8–15 percentage point gains on Chinese and English sentiment/topic tasks compared to random prompt selection, and outperforms Zero-PET and NSP-BERT (the latter using a small labeled dev set) on several datasets (Lu et al., 2022).
- LLM Feedback Tasks: In code feedback generation, explicit PS+ prompts (CoT, ToT, ReAct) increase precision by 8–12 points relative to vanilla instructions, with only minor reduction in error-identification rates (Ippisch et al., 2024).
- Instance-level Prompt Optimization: PRomPTed improves over naive zero-shot (57.5%) and Zero-Shot-CoT (70.2%) to reach 76.4% accuracy across a suite of reasoning, QA, and extraction tasks with GPT-4 as both task and meta model (Srivastava et al., 2023).
- Prompt Consistency: Swarm distillation elevates median accuracy by up to +10.6 points on NLI and completion datasets with no labeled data, by ensembling over just 4–8 paraphrased prompts per task (Zhou et al., 2022).
- Image/Text Matching: Bias-corrected prompt weighting boosts CLIP zero-shot top-1 accuracy from 77.0% (equal-weight 80 prompts) to 77.4% on ImageNet, with similar relative gains on domain-shifted and fine-grained datasets (Allingham et al., 2023).
4. Algorithmic and Implementation Details
PS+ methods typically involve one or more of the following algorithmic stages:
- Template Pool Construction: Manually or automatically (e.g., via T5 or masked-LM augmentation) assemble a moderate-sized set (8–15) of prompt candidates.
- Unlabeled Data Utilization: Use 20–50 representative unlabeled samples from the target domain for screening or prompt-consistency training. Pool diversity is critical: near-duplicate templates compress score variance and dilute the selection signal.
- Prompt Scoring: Compute an unsupervised proxy, e.g., mean LLM perplexity (Perplection), prompt-sensitivity score (flip vs. synonym), or bias-corrected softmax score for image prompts.
- Selection or Weighting: Select the template with minimal perplexity (Perplection), maximal sensitivity score (ZS-SC), or use a normalized ensemble weighted by corrected scores (PS+ ensemble).
- Structured Prompt Design: For stepwise PS+ prompts, enforce an explicit chain-of-thought or modular plan-execute structure, often explicitly referencing intermediate variables and steps.
- Adapter/Continued Pretraining Option: In some settings (e.g., (Wu et al., 2022)), inject a trainable soft prompt matrix during multitask pretraining to improve promptability both for zero- and few-shot transfer.
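The prompt-sensitivity proxy mentioned under "Prompt Scoring" (reward prediction flips under label-polarity swaps, penalize flips under synonym replacement) can be sketched against any zero-shot classifier interface; `classify` here is an assumed stand-in, not an API from the cited work:

```python
# Sketch of a data-free prompt-sensitivity score (cf. Chakraborty et al.,
# 2023): a good prompt should flip its prediction when label keywords are
# swapped for their antonyms, yet stay stable under synonym replacement.
# `classify(prompt, x)` is an assumed zero-shot classifier call.

def sensitivity_score(prompt, inputs, flip_prompt, synonym_prompt, classify):
    """Flip rate under polarity swap minus flip rate under synonym swap."""
    base = [classify(prompt, x) for x in inputs]
    flipped = [classify(flip_prompt, x) for x in inputs]
    syn = [classify(synonym_prompt, x) for x in inputs]
    flip_rate = sum(b != f for b, f in zip(base, flipped)) / len(inputs)
    instability = sum(b != s for b, s in zip(base, syn)) / len(inputs)
    return flip_rate - instability
```

Because both terms are computed from the model's own predictions on unlabeled inputs, the score requires no annotations, matching the annotation-free premise of the whole PS+ family.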
5. Theoretical and Empirical Justification
The robust generalization of zero-shot PS+ approaches is underpinned by both empirical evidence and theoretical analysis:
- PAC-Bayes Analysis: For discrete prompt engineering in vision-LLMs, PAC-Bayes bounds using an LLM prior are remarkably tight, often within a few percentage points of true test error. This theoretical guarantee explains why even exhaustively prompt-tuned classifiers rarely overfit, as the prompt class is small and regularized (Akinwande et al., 2023).
- Sensitivity Metrics: Prompt selection by sensitivity to label-keyword flipping and robustness to synonyms correlates strongly with empirical accuracy (Chakraborty et al., 2023).
- Prompt Pool Diversity: Gains saturate for pools of 4–8 diverse prompts; larger pools bring little benefit but dilute identifiability.
- Ensembling and Consistency: Ensembling over regularized prompt-consistent models or bias-corrected prompt weights provides robustness to prompt perturbations, label bias, and pretraining artifacts (Zhou et al., 2022, Allingham et al., 2023).
- Error-Type Analysis: Structured PS+ prompts reduce calculation and missing-step errors but have less effect on semantic-misunderstanding errors, which are more often tied to underlying model capacity than to prompt structure (Wang et al., 2023).
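The PAC-Bayes argument above can be made concrete with a point-mass posterior on the selected prompt $c$ against an LLM prior $P_{\mathrm{LM}}$; the specific constants below follow the standard Langford–Seeger form and are our assumption, not necessarily the cited paper's exact statement:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size n,
% for a point-mass posterior on prompt c and LLM prior P_LM:
\mathrm{kl}\!\left(\hat{R}_S(c) \,\middle\|\, R_D(c)\right)
  \;\le\; \frac{-\log P_{\mathrm{LM}}(c) + \ln\frac{2\sqrt{n}}{\delta}}{n}
```

Here $\hat{R}_S(c)$ and $R_D(c)$ are the empirical and population errors of the classifier induced by prompt $c$; since $\mathrm{KL}(Q\|P) = -\log P_{\mathrm{LM}}(c)$ for a point mass, fluent high-prior prompts yield tight bounds, which is why prompt selection over a small, LM-plausible pool rarely overfits.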
6. Extensions to Multimodal and Instance-aware Settings
Zero-shot PS+ principles generalize beyond language-only settings:
- Visual Adaptive Prompting: Vision-language architectures incorporate both dynamic text and visual prompt selection, where an image-conditioned soft prompt repository and prefix-tuning are jointly trained to encourage rich compositional generalization (Stein et al., 2025).
- Interaction-aware Video Action Detection: For spatio-temporal action detection, interaction-aware PS+ prompting aligns person-centric visual-context features with refined per-label text embeddings using cross-attention blocks and instance-level prompt adaptation, significantly improving performance over frozen or unadapted CLIP baselines (Huang et al., 2023).
- Instance-level LLM Feedback: Meta-LLMs that critique and rewrite prompts at the instance level—without any labels—substantially improve task accuracy and coverage across reasoning, extraction, code, and safety-oriented tasks (Srivastava et al., 2023).
7. Practical Recommendations and Limitations
Key guidelines drawn from benchmarked PS+ approaches include:
- Favor prompt pools with moderate diversity and matched length, balancing coverage and stability.
- For annotation-free prompt selection, mean LM perplexity or data-free sensitivity scores can both serve as reliable proxies for downstream utility (Lu et al., 2022, Chakraborty et al., 2023).
- Avoid prompt chaining that rigidly enumerates every input dimension; over-prescription may confuse the model. Stepwise, modular reasoning structures are preferred (Ippisch et al., 2024).
- Fine-tuning soft prompt adapters during multitask continued pretraining consistently improves zero-shot promptability; meta-learning approaches confer less relative benefit and are destabilized by task heterogeneity (Wu et al., 2022).
- Limit the number of prompt candidates to maximize discriminative selection power; test-time ensembling across top-k candidates can smooth idiosyncratic errors.
- For multimodal tasks, retrieval-augmented (image- or video-conditioned) prompt selection and fusion is critical to obtain robust compositional and contextual generalization.
- PS+ approaches are constrained by foundational model capabilities and may not mitigate errors stemming from semantic misunderstanding, dataset leakage, or cross-domain drift.
Zero-shot PS+ Prompting thus constitutes a set of robust, interpretable, and empirically substantiated techniques for leveraging large models in settings where no labeled data is available, with methodological breadth ranging from LLM perplexity screening to explicit problem-solving decomposition and instance-specific prompt optimization (Lu et al., 2022, Wang et al., 2023, Srivastava et al., 2023, Ippisch et al., 2024, Chakraborty et al., 2023, Allingham et al., 2023, Zhou et al., 2022, Wu et al., 2022, Stein et al., 2025, Huang et al., 2023, Akinwande et al., 2023).