Few-Shot Prompting Strategies

Updated 26 August 2025
  • Few-shot prompting strategies are techniques that guide large pre-trained models to perform tasks using a small number of labeled examples and in-context learning.
  • They employ automated template generation, dynamic demonstration selection, and reinforcement learning to optimize prompt stability and accuracy.
  • These methods are applied across various domains, including language, vision, and multimodal tasks, to improve performance under low-resource conditions.

Few-shot prompting strategies are a class of methodologies in which large pre-trained models—such as LLMs, vision-language models, or encoder–decoder architectures—are guided to perform downstream tasks given only a handful (often 1–32) of labeled examples per class or task. These techniques leverage the pre-trained knowledge and in-context learning abilities of the underlying models by embedding example–task pairs (“shots”) directly into prompts, with little or no fine-tuning. Few-shot prompting has become an essential paradigm for real-world applications where labeled data are scarce, offering significant advances in domains spanning text, images, and multimodal data.

1. Reformulation of Tasks through Prompting

At the foundation of few-shot prompting lies the reformulation of downstream tasks (classification, regression, generation, etc.) into forms naturally aligned with the original pre-training objectives of LLMs—most commonly, masked language modeling (MLM) or next-token prediction.

  • For classification or regression, the input x is embedded into a template T(x) containing a masked token [MASK], and the model predicts the probability of each class y through the likelihood of the associated label word(s) appearing at [MASK] (a minimal scoring sketch appears at the end of this section). The classification probability is computed as:

p(y \mid x) = \frac{\exp(\mathbf{w}_{M(y)} \cdot \mathbf{h}_{[\text{MASK}]})}{\sum_{y' \in Y} \exp(\mathbf{w}_{M(y')} \cdot \mathbf{h}_{[\text{MASK}]})}

where \mathbf{w}_{M(y)} is the embedding of the mapped label word for label y, and \mathbf{h}_{[\text{MASK}]} is the model’s hidden state at the mask position (Gao et al., 2020).

  • For regression, the output value is modeled as an interpolation between two anchor label words, which serve as the poles of a mixture over continuous outputs (Gao et al., 2020).

This paradigm aligns the fine-tuning process with pre-training, allowing models to generalize from limited data.
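
To make the label-word scoring concrete, the following is a minimal sketch, assuming a BERT-style masked language model accessed through Hugging Face transformers; the template ("It was [MASK].") and the verbalizer mapping are illustrative choices, not those of any cited paper.

```python
# Hedged sketch of prompt-based classification via label words at [MASK].
# Assumptions (not from a specific paper): a BERT-style MLM, a hand-written
# template, and an illustrative verbalizer {label -> label word}.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

verbalizer = {"positive": "great", "negative": "terrible"}  # illustrative mapping

def classify(text: str) -> str:
    # Template T(x): embed the input and append a masked "It was [MASK]." slot.
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)

    # Locate the [MASK] position and score each label word there.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    label_ids = {
        label: tokenizer.convert_tokens_to_ids(word)
        for label, word in verbalizer.items()
    }
    scores = {label: logits[0, mask_pos, tid].item() for label, tid in label_ids.items()}

    # Softmax over label words only, mirroring p(y|x) in the equation above.
    probs = torch.softmax(torch.tensor(list(scores.values())), dim=0)
    return max(zip(scores.keys(), probs.tolist()), key=lambda kv: kv[1])[0]

print(classify("A gripping, beautifully shot film."))  # expected: "positive"
```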

2. Automated Prompt and Label Mapping

Manual prompt engineering—designing natural-language templates and selecting label tokens—has proven critical but labor-intensive and highly sensitive to minor variations. Consequently, automated strategies for prompt and label-word selection have been developed:

  • Automatic Template Generation: Methods such as those in LM-BFF utilize a pre-trained T5 encoder-decoder for span-filling to generate candidate templates from few-shot examples. These are decoded (e.g., beam search with wide beams), refined using the few-shot data, and the best-performing templates are selected (Gao et al., 2020).
  • Automatic Label Word Selection: Statistical algorithms compute, for each class y, the average probability of each vocabulary token being produced at [MASK] when prompted with instances of y. The token maximizing the average class probability z_i^v across the training set is selected, with multi-label strategies (assigning multiple tokens per class) providing additional robustness and supervision, especially in extremely low-shot settings (Wang et al., 2022); a selection sketch appears at the end of this section. The mapping function is:

p(y \mid x) = \sum_{v \in \mathcal{S}(y)} p([\text{MASK}] = v \mid x')

where \mathcal{S}(y) denotes the selected set of tokens for class y.

Multi-label prompting, label-word deduplication, and ranking heuristics based on zero-shot or few-shot dev accuracy further increase performance and interpretability.
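
As a concrete illustration of automatic label-word selection and multi-label mapping, a minimal sketch follows, assuming the same BERT-style MLM, a fixed template, and a hypothetical two-example support set; all names and hyperparameters are placeholders.

```python
# Hedged sketch of automatic label-word selection and multi-label mapping.
# Assumptions: a BERT-style MLM, a fixed template, and a tiny labeled
# support set `train_set` of (text, label) pairs.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_probs(text: str) -> torch.Tensor:
    """Return the vocabulary distribution at the [MASK] slot of the template."""
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return torch.softmax(logits[0, pos], dim=-1)  # (vocab_size,)

def select_label_words(train_set, top_k=3):
    """Pick the top-k vocabulary tokens per class by average [MASK] probability."""
    by_class = {}
    for text, label in train_set:
        by_class.setdefault(label, []).append(mask_probs(text))
    return {
        label: torch.stack(ps).mean(dim=0).topk(top_k).indices.tolist()
        for label, ps in by_class.items()
    }

def score(text: str, label_words) -> str:
    """Multi-label mapping: sum [MASK] probabilities over each class's token set."""
    p = mask_probs(text)
    return max(label_words, key=lambda lab: sum(p[t].item() for t in label_words[lab]))

train_set = [("A gripping, beautiful film.", "pos"), ("Dull and overlong.", "neg")]
label_words = select_label_words(train_set)
print(score("A charming, well-acted story.", label_words))
```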

3. Demonstration and Context Management

How example demonstrations are incorporated is critical for few-shot learning efficacy:

  • Contextual Demonstration Selection: Rather than concatenating randomly chosen examples, dynamic and selective sampling strategies use semantic similarity (e.g., SBERT embeddings) to pair each input with the demonstrations most analogous in content (Gao et al., 2020). For n-way classification, one demonstration per class is concatenated, chosen among the top-k most contextually similar candidates (see the sketch at the end of this section).
  • Prototype-based Prompting (vision-language models): Task-level prompts (universal for the dataset) are efficient but not adaptive; instance-level prompts are adaptive but inefficient. Prototype-based methods, such as PTP (Zhang et al., 2022), cluster image features to select K prototypes {P_k} and learn a corresponding set of prompt prototypes {T_k}. For a given image x, cluster soft assignments sim(x, P_k) determine the weighted usage of each prompt:

\mathrm{Prob}(x, c) = \sum_{k=1}^{K} \mathrm{sim}(x, P_k) \cdot \mathrm{Prob}_{T_k}(x, c)

addressing the trade-off between generalization and efficiency.

  • Chain-of-Thought and Program-based Demonstrations: For tasks like math word problem solving, using in-context exemplars that include reasoning traces—expressed as either programs or step-by-step CoT answers—and retrieving the most similar ones yields state-of-the-art accuracy (Jie et al., 2023). Answers are verified by program execution, providing robust instance-level validation.
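
A minimal sketch of similarity-based demonstration selection follows, assuming sentence-transformers for SBERT-style embeddings; the demonstration pool, model name, and prompt format are illustrative.

```python
# Hedged sketch of contextual demonstration selection: pick the most similar
# demonstration per class (assumes sentence-transformers; pool is illustrative).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled demonstration pool of (text, label) pairs.
pool = [
    ("The film was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("An instant classic.", "positive"),
    ("A tedious, joyless slog.", "negative"),
]

def build_prompt(query: str, per_class: int = 1) -> str:
    """Concatenate the most similar demonstration(s) per class before the query."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    demos = []
    for label in {lab for _, lab in pool}:
        cands = [(t, lab) for t, lab in pool if lab == label]
        cand_embs = encoder.encode([t for t, _ in cands], convert_to_tensor=True)
        sims = util.cos_sim(q_emb, cand_embs)[0]          # similarity to each candidate
        top = sims.argsort(descending=True)[:per_class]   # top-k per class
        demos.extend(cands[i] for i in top.tolist())
    shots = "\n".join(f"Review: {t}\nSentiment: {lab}" for t, lab in demos)
    return f"{shots}\nReview: {query}\nSentiment:"

print(build_prompt("A charming, well-acted story."))
```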

4. Stability, Robustness, and Optimization

Few-shot prompting can suffer from high variance arising from prompt initialization, demonstration selection, and run-to-run idiosyncrasies. Several research efforts have introduced meta-strategies to address these challenges:

  • Ensembling: Ensembling predictions across multiple random runs and/or multiple prompt templates (ENSEMBLE_pred), or ensembling model parameters (ENSEMBLE_para), systematically reduces run-to-run variability and increases performance stability (Köksal et al., 2022).
  • Active Learning for Demonstration Selection: Data selection using prompt-pair KL divergence (measuring how much different prompts disagree), coupled with k-means clustering for diversity, ensures the chosen examples are both informative and representative (Köksal et al., 2022).
  • Contrastive Learning for Prompt Optimization: StablePT uses input separation between hard and soft prompts and a supervised contrastive loss to enforce class-aware representations, addressing issues of initialization noise and run instability (Liu et al., 30 Apr 2024):

L_{CL} = -\frac{1}{b} \sum_{i=1}^{b} \mathbf{1}_{y_i = y_j} \log \left( \frac{\exp(\mathrm{sim}(H_{sp,i}, H_{sp,j})/T)}{\sum_{k} \exp(\mathrm{sim}(H_{sp,i}, H_{sp,k})/T)} \right)

with H_{sp} denoting the soft prompt embedding and T the temperature (a generic PyTorch sketch of such a loss appears at the end of this section).

  • Policy Gradient and RL-Based Selection: RL-based frameworks (DP₂O (Li et al., 2023), POEM (Do et al., 14 Aug 2024)) use policy networks (small, parameter-efficient MLPs) and episodic memory to select or order prompts and few-shot examples for each query, optimizing on reward signals such as accuracy or uncertainty metrics. Example ordering in the prompt is chosen by maximizing an episodic-memory-weighted sum of observed rewards for similar past prompt configurations, with similarity computed using vector embeddings.
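
For concreteness, the following is a generic PyTorch sketch of a supervised contrastive loss over soft-prompt representations, in the spirit of the L_CL term above; it is not the StablePT implementation, and the in-batch positive/negative construction is an assumption.

```python
# Hedged sketch of a supervised contrastive loss over soft-prompt embeddings.
# Generic formulation; not the exact StablePT code.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(h_sp: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """h_sp: (b, d) soft-prompt representations; labels: (b,) class ids."""
    h = F.normalize(h_sp, dim=-1)          # cosine similarity via dot products
    sim = h @ h.t() / temperature          # (b, b) scaled pairwise similarities
    b = h.size(0)
    self_mask = torch.eye(b, dtype=torch.bool, device=h.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # Denominator sums over all non-self pairs; numerator keeps positives only.
    denom = torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                            dim=1, keepdim=True)
    log_prob = sim - denom
    pos_per_anchor = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_per_anchor
    return loss.mean()

# Toy usage with random embeddings and labels.
h = torch.randn(8, 32)
y = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
print(supervised_contrastive_loss(h, y).item())
```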

5. Task- and Domain-Agnostic Methodologies

Task-agnostic frameworks extend few-shot prompting effectiveness across diverse NLP and multimodal tasks without task-specific engineering:

  • Unified Prompt Tuning (UPT) (Wang et al., 2022) trains PLMs using a Prompt-Options-Verbalizer (POV) triple to condition the model on a wide variety of prompt forms and output options. An auxiliary knowledge-enhanced masked language modeling loss further improves generalizability, reducing sensitivity to prompt format and template.
  • Contextual Few-Shot Prompting with Retrieval (Kumar et al., 15 Mar 2025): For tasks such as Text-to-SQL, relevant demonstrations are retrieved dynamically based on embedding similarity (e.g., via FAISS and vector databases) and integrated with few-shot template prompting (see the sketch after this list). This improves both latency and accuracy and facilitates robust deployment in systems with role-based access and data-security constraints.
  • Cross-Lingual Few-Shot Prompting (Toukmaji, 9 Mar 2024): Direct prompting in the target (low-resource) language with in-context examples outperforms translation-based or additional language-adaptive pre-training approaches, preserves the PLM’s instruction-following capabilities, and avoids degradation from over-fine-tuning.
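
A minimal sketch of embedding-based demonstration retrieval for Text-to-SQL, assuming FAISS and sentence-transformers; the demonstration pool and prompt format are illustrative rather than the cited system's.

```python
# Hedged sketch of retrieval-augmented few-shot prompting for Text-to-SQL.
# Assumes FAISS and sentence-transformers; the demo pool is illustrative.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# (question, gold SQL) demonstrations, indexed once at startup.
demos = [
    ("How many employees are in each department?",
     "SELECT department, COUNT(*) FROM employees GROUP BY department;"),
    ("List customers who placed orders in 2024.",
     "SELECT DISTINCT c.name FROM customers c JOIN orders o ON o.customer_id = c.id "
     "WHERE strftime('%Y', o.created_at) = '2024';"),
]

demo_vecs = encoder.encode([q for q, _ in demos], normalize_embeddings=True)
index = faiss.IndexFlatIP(demo_vecs.shape[1])   # inner product == cosine on unit vectors
index.add(np.asarray(demo_vecs, dtype="float32"))

def build_prompt(question: str, k: int = 2) -> str:
    q = encoder.encode([question], normalize_embeddings=True).astype("float32")
    _, idx = index.search(q, k)                 # retrieve the k most similar demos
    shots = "\n\n".join(f"Q: {demos[i][0]}\nSQL: {demos[i][1]}" for i in idx[0])
    return f"{shots}\n\nQ: {question}\nSQL:"

print(build_prompt("How many orders did each customer place?"))
```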

6. Advances in Few-Shot Prompting for Vision and Multimodal Tasks

Recent work extends few-shot prompting to visual and multimodal domains:

  • Knowledge Prompting for Video Action Recognition (Shi et al., 2022): Leverages action descriptions generated by a large language model or extracted automatically as “text proposals” to prompt a visual encoder (e.g., CLIP). The resulting matching scores form rich, transferable semantic representations for each video frame, which are then aggregated by a temporal modeling network combining convolutions and self-attention (see the sketch after this list).
  • Semantic Prompts for Visual Feature Extraction (Chen et al., 2023): Incorporates text-based class or label embeddings into the visual Transformer at both spatial (via self-attention) and channel (via MLP modulation) dimensions, optimizing the feature extractor itself for class specificity—results show consistent 3–4% improvements in 1-shot image recognition benchmarks compared to classifier-head semantic fusion.
  • VQA Prompt Design and Chain-of-Thought Reasoning (Awal et al., 2023): Structured question templates, image-caption augmentation, and text-only in-context demonstrations together optimize few-shot VQA performance. Chain-of-thought reasoning and self-consistency-based answer selection are shown to require careful balancing to avoid accuracy loss.
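
A minimal sketch of CLIP-based text-proposal matching in the spirit of knowledge prompting; the proposals, checkpoint, and downstream use are illustrative assumptions, and the temporal aggregation network is omitted.

```python
# Hedged sketch of knowledge prompting with CLIP: score a frame against a set
# of text proposals and use the similarity vector as a semantic feature.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Illustrative, language-generated action descriptions ("text proposals").
text_proposals = [
    "a person slicing vegetables with a knife",
    "a person dribbling a basketball",
    "a person playing an acoustic guitar",
]

def frame_semantic_features(frame: Image.Image) -> torch.Tensor:
    """Return per-proposal matching scores for one video frame."""
    inputs = processor(text=text_proposals, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # (1, num_proposals): image-text similarity logits; the softmax vector can
    # feed a temporal model over the frame sequence downstream.
    return out.logits_per_image.softmax(dim=-1)

frame = Image.new("RGB", (224, 224))  # stand-in for a decoded video frame
print(frame_semantic_features(frame))
```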

7. Beyond Manual Engineering: Automated, Meta- and Synthetic Data Strategies

Several strategies reduce the need for hand-crafted prompts or curated datasets:

  • Prompt Pooling and RL-Matched Discrete Prompting (Li et al., 2023): GPT-4-driven multi-round dialogue alignment creates diverse, readable prompt sets, which are then optimized per-instance via policy gradient methods guided by entropy-based reward metrics.
  • Synthetic Data Generation via Prompting (Schmidt et al., 15 May 2024): LMs are prompted to produce additional labeled examples given a candidate context and answer entity (e.g., from NER), using decoding strategies such as beam search and nucleus sampling, followed by QA-consistency filtering (see the sketch after this list). This approach narrows the gap with full-data regimes and generalizes across extractive QA benchmarks.
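
A minimal sketch of prompt-driven synthetic example generation with nucleus sampling and a QA-consistency filter; the models, prompt, and filtering rule are illustrative assumptions, not the cited recipe.

```python
# Hedged sketch: prompt an LM to write questions for a given context/answer
# pair, then keep only pairs a QA model answers consistently.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = "Marie Curie won the Nobel Prize in Physics in 1903."
answer_entity = "Marie Curie"

# Nucleus sampling produces several diverse question candidates.
prompt = (f"Context: {context}\n"
          f"Write a question whose answer is \"{answer_entity}\".\nQuestion:")
candidates = generator(prompt, max_new_tokens=30, do_sample=True,
                       top_p=0.9, num_return_sequences=5)

kept = []
for cand in candidates:
    question = cand["generated_text"][len(prompt):].split("\n")[0].strip()
    if not question:
        continue
    # Consistency filter: keep the pair only if a QA model recovers the answer.
    pred = qa(question=question, context=context)
    if (answer_entity.lower() in pred["answer"].lower()
            or pred["answer"].lower() in answer_entity.lower()):
        kept.append({"context": context, "question": question,
                     "answer": answer_entity})

print(kept)
```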

8. Key Takeaways and Outlook

The landscape of few-shot prompting strategies has advanced rapidly with innovations in automated prompt/template discovery, dynamic and context-aware demonstration selection, memory- and retrieval-augmented generation, reinforcement learning-based optimization, and architectures that explicitly model stability and robustness. When paired with mid-sized PLMs, these strategies consistently outperform conventional fine-tuning under low-resource conditions and even sophisticated in-context learning with very large models. Furthermore, meta-cognitive prompting (reflection and positive reinforcement) and pairwise frameworks such as MetricPrompt demonstrate that reframing tasks in alignment with the pre-training objectives of LMs can bolster few-shot performance even further.

A summary table organizing representative strategies, their mechanisms, and their distinguishing features:

| Strategy/Class | Core Mechanism | Distinguishing Feature |
| --- | --- | --- |
| Prompt-based fine-tuning (Gao et al., 2020) | MLM reformulation, softmax over [MASK], automatic template/label mapping | Task-agnostic, dynamic demos, up to 30% gain |
| Multi-label automatic mapping (Wang et al., 2022) | Summed [MASK] label probabilities per class | Noise robustness, no manual engineering |
| Active ensemble/AL (Köksal et al., 2022) | Multi-prompt ensembling, prompt-pair KL selection | Stability, reduced run variance |
| Prototype-based prompting (Zhang et al., 2022) | Latent clustering, prompt prototypes | Parameter- vs. instance-level adaptivity |
| Contrastive StablePT (Liu et al., 30 Apr 2024) | Input separation, contrastive loss | Robustness to prompt/initialization noise |
| RL/episodic prompt optimization (Li et al., 2023; Do et al., 14 Aug 2024) | Policy gradient, memory, reward-based ordering | Performance-driven prompt matching |
| Retrieval-augmented, contextual (Kumar et al., 15 Mar 2025) | Embedding-based retrieval for context | Scalability, improved accuracy/latency |
| Meta-cognition/reflective (Ji et al., 2023) | Model self-reflection, positive feedback | Improved mapping/generalization |

Few-shot prompting continues to be a highly active research area, with significant opportunities for advancing its foundations, stability, and applicability across a growing range of modalities and domains.