Few-Shot Prompting Strategies

Updated 26 August 2025
  • Few-shot prompting strategies are techniques that guide large pre-trained models to perform tasks using a small number of labeled examples and in-context learning.
  • They employ automated template generation, dynamic demonstration selection, and reinforcement learning to optimize prompt stability and accuracy.
  • These methods are applied across various domains, including language, vision, and multimodal tasks, to improve performance under low-resource conditions.

Few-shot prompting strategies are a class of methodologies in which large pre-trained models—such as LLMs, vision-language models, or encoder–decoder architectures—are guided to perform downstream tasks given only a handful (often 1–32) of labeled examples per class or task. These techniques leverage the pre-trained knowledge and in-context learning abilities of the underlying models by embedding example–task pairs (“shots”) directly into prompts, with little or no fine-tuning. Few-shot prompting has become an essential paradigm for real-world applications where labeled data are scarce, offering significant advances in domains spanning text, images, and multimodal data.

1. Reformulation of Tasks through Prompting

At the foundation of few-shot prompting lies the reformulation of downstream tasks (classification, regression, generation, etc.) into forms naturally aligned with the original pre-training objectives of LLMs—most commonly, masked language modeling (MLM) or next-token prediction.

  • For classification or regression, the input x is embedded into a template T(x) containing a masked token [MASK], and the model predicts the probability of each class y through the likelihood of the associated label word(s) appearing at [MASK] (a minimal scoring sketch appears at the end of this section). The classification probability is computed as:

p(y \mid x) = \frac{\exp(\mathbf{w}_{M(y)} \cdot \mathbf{h}_{[\text{MASK}]})}{\sum_{y' \in Y} \exp(\mathbf{w}_{M(y')} \cdot \mathbf{h}_{[\text{MASK}]})}

where \mathbf{w}_{M(y)} is the embedding of the mapped label word for label y, and \mathbf{h}_{[\text{MASK}]} is the model’s hidden state at the mask position (Gao et al., 2020).

  • For regression, the output value is modeled as an interpolation between two anchor label words, which serve as the poles of a mixture over continuous outputs (Gao et al., 2020).

This paradigm aligns the fine-tuning process with pre-training, allowing models to generalize from limited data.
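
To make the label-word scoring concrete, the following is a minimal sketch, assuming a BERT-style masked language model accessed through Hugging Face transformers; the template ("It was [MASK].") and the verbalizer mapping are illustrative choices, not those of any cited paper.

```python
# Hedged sketch of prompt-based classification via label words at [MASK].
# Assumptions (not from a specific paper): a BERT-style MLM, a hand-written
# template, and an illustrative verbalizer {label -> label word}.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

verbalizer = {"positive": "great", "negative": "terrible"}  # illustrative mapping

def classify(text: str) -> str:
    # Template T(x): embed the input and append a masked "It was [MASK]." slot.
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)

    # Locate the [MASK] position and score each label word there.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    label_ids = {
        label: tokenizer.convert_tokens_to_ids(word)
        for label, word in verbalizer.items()
    }
    scores = {label: logits[0, mask_pos, tid].item() for label, tid in label_ids.items()}

    # Softmax over label words only, mirroring p(y|x) in the equation above.
    probs = torch.softmax(torch.tensor(list(scores.values())), dim=0)
    return max(zip(scores.keys(), probs.tolist()), key=lambda kv: kv[1])[0]

print(classify("A gripping, beautifully shot film."))  # expected: "positive"
```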

2. Automated Prompt and Label Mapping

Manual prompt engineering—designing natural-language templates and selecting label tokens—has proven critical but labor-intensive and highly sensitive to minor variations. Consequently, automated strategies for prompt and label-word selection have been developed:

  • Automatic Template Generation: Methods such as those in LM-BFF utilize a pre-trained T5 encoder-decoder for span-filling to generate candidate templates from few-shot examples. These are decoded (e.g., beam search with wide beams), refined using the few-shot data, and the best-performing templates are selected (Gao et al., 2020).
  • Automatic Label Word Selection: Statistical algorithms compute, for each class y, the average probability of each vocabulary token being produced at [MASK] when prompted with instances of y. The token maximizing the average class probability z_i^v across the training set is selected, with multi-label strategies (assigning multiple tokens per class) providing additional robustness and supervision, especially in extremely low-shot settings (Wang et al., 2022); a selection sketch appears at the end of this section. The mapping function is:

p(y \mid x) = \sum_{v \in \mathcal{S}(y)} p([\text{MASK}] = v \mid x')

where \mathcal{S}(y) denotes the selected set of tokens for class y.

Multi-label prompting, label-word deduplication, and ranking heuristics based on zero-shot or few-shot dev accuracy further increase performance and interpretability.
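
As a concrete illustration of automatic label-word selection and multi-label mapping, a minimal sketch follows, assuming the same BERT-style MLM, a fixed template, and a hypothetical two-example support set; all names and hyperparameters are placeholders.

```python
# Hedged sketch of automatic label-word selection and multi-label mapping.
# Assumptions: a BERT-style MLM, a fixed template, and a tiny labeled
# support set `train_set` of (text, label) pairs.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def mask_probs(text: str) -> torch.Tensor:
    """Return the vocabulary distribution at the [MASK] slot of the template."""
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return torch.softmax(logits[0, pos], dim=-1)  # (vocab_size,)

def select_label_words(train_set, top_k=3):
    """Pick the top-k vocabulary tokens per class by average [MASK] probability."""
    by_class = {}
    for text, label in train_set:
        by_class.setdefault(label, []).append(mask_probs(text))
    return {
        label: torch.stack(ps).mean(dim=0).topk(top_k).indices.tolist()
        for label, ps in by_class.items()
    }

def score(text: str, label_words) -> str:
    """Multi-label mapping: sum [MASK] probabilities over each class's token set."""
    p = mask_probs(text)
    return max(label_words, key=lambda lab: sum(p[t].item() for t in label_words[lab]))

train_set = [("A gripping, beautiful film.", "pos"), ("Dull and overlong.", "neg")]
label_words = select_label_words(train_set)
print(score("A charming, well-acted story.", label_words))
```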

3. Demonstration and Context Management

How example demonstrations are incorporated is critical for few-shot learning efficacy:

  • Contextual Demonstration Selection: Rather than concatenating randomly chosen examples, dynamic and selective sampling strategies use semantic similarity (e.g., SBERT embeddings) to pair each input with the demonstrations most analogous in content (Gao et al., 2020). For n-way classification, one demonstration per class is concatenated, chosen among the top-k most contextually similar candidates (see the sketch at the end of this section).
  • Prototype-based Prompting (vision-language models): Task-level prompts (universal for the dataset) are efficient but not adaptive; instance-level prompts are adaptive but inefficient. Prototype-based methods, such as PTP (Zhang et al., 2022), cluster image features to select K prototypes {P_k} and learn a corresponding set of prompt prototypes {T_k}. For a given image x, cluster soft assignments sim(x, P_k) determine the weighted usage of each prompt:

\mathrm{Prob}(x, c) = \sum_{k=1}^{K} \mathrm{sim}(x, P_k) \cdot \mathrm{Prob}_{T_k}(x, c)

addressing the trade-off between generalization and efficiency.

  • Chain-of-Thought and Program-based Demonstrations: For tasks like math word problem solving, using in-context exemplars that include reasoning traces—expressed as either programs or step-by-step CoT answers—and retrieving the most similar ones yields state-of-the-art accuracy (Jie et al., 2023). Answers are verified by program execution, providing robust instance-level validation.
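
A minimal sketch of similarity-based demonstration selection follows, assuming sentence-transformers for SBERT-style embeddings; the demonstration pool, model name, and prompt format are illustrative.

```python
# Hedged sketch of contextual demonstration selection: pick the most similar
# demonstration per class (assumes sentence-transformers; pool is illustrative).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled demonstration pool of (text, label) pairs.
pool = [
    ("The film was a delight from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("An instant classic.", "positive"),
    ("A tedious, joyless slog.", "negative"),
]

def build_prompt(query: str, per_class: int = 1) -> str:
    """Concatenate the most similar demonstration(s) per class before the query."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    demos = []
    for label in {lab for _, lab in pool}:
        cands = [(t, lab) for t, lab in pool if lab == label]
        cand_embs = encoder.encode([t for t, _ in cands], convert_to_tensor=True)
        sims = util.cos_sim(q_emb, cand_embs)[0]          # similarity to each candidate
        top = sims.argsort(descending=True)[:per_class]   # top-k per class
        demos.extend(cands[i] for i in top.tolist())
    shots = "\n".join(f"Review: {t}\nSentiment: {lab}" for t, lab in demos)
    return f"{shots}\nReview: {query}\nSentiment:"

print(build_prompt("A charming, well-acted story."))
```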

4. Stability, Robustness, and Optimization

Few-shot prompting can suffer from high variance arising from prompt initialization, demonstration selection, and run-to-run idiosyncrasies. Several research efforts have introduced meta-strategies to address these challenges:

  • Ensembling: Ensembling predictions across multiple random runs and/or multiple prompt templates (ENSEMBLE_pred), or ensembling model parameters (ENSEMBLE_para), systematically reduces run-to-run variability and increases performance stability (Köksal et al., 2022).
  • Active Learning for Demonstration Selection: Data selection using prompt-pair KL divergence (measuring how much different prompts disagree), coupled with k-means clustering for diversity, ensures the chosen examples are both informative and representative (Köksal et al., 2022).
  • Contrastive Learning for Prompt Optimization: StablePT uses input separation between hard and soft prompts and a supervised contrastive loss to enforce class-aware representations, addressing issues of initialization noise and run instability (Liu et al., 30 Apr 2024):

L_{CL} = -\frac{1}{b} \sum_{i=1}^{b} \mathbf{1}_{y_i = y_j} \log \left( \frac{\exp(\mathrm{sim}(H_{sp,i}, H_{sp,j})/T)}{\sum_{k} \exp(\mathrm{sim}(H_{sp,i}, H_{sp,k})/T)} \right)

with H_{sp} denoting the soft prompt embedding and T the temperature (a generic PyTorch sketch of such a loss appears at the end of this section).

  • Policy Gradient and RL-Based Selection: RL-based frameworks (DP₂O (Li et al., 2023), POEM (Do et al., 14 Aug 2024)) use policy networks (small, parameter-efficient MLPs) and episodic memory to select or order prompts and few-shot examples for each query, optimizing on reward signals such as accuracy or uncertainty metrics. Example ordering in the prompt is chosen by maximizing an episodic-memory-weighted sum of observed rewards for similar past prompt configurations, with similarity computed using vector embeddings.
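
For concreteness, the following is a generic PyTorch sketch of a supervised contrastive loss over soft-prompt representations, in the spirit of the L_CL term above; it is not the StablePT implementation, and the in-batch positive/negative construction is an assumption.

```python
# Hedged sketch of a supervised contrastive loss over soft-prompt embeddings.
# Generic formulation; not the exact StablePT code.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(h_sp: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """h_sp: (b, d) soft-prompt representations; labels: (b,) class ids."""
    h = F.normalize(h_sp, dim=-1)          # cosine similarity via dot products
    sim = h @ h.t() / temperature          # (b, b) scaled pairwise similarities
    b = h.size(0)
    self_mask = torch.eye(b, dtype=torch.bool, device=h.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    # Denominator sums over all non-self pairs; numerator keeps positives only.
    denom = torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                            dim=1, keepdim=True)
    log_prob = sim - denom
    pos_per_anchor = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_per_anchor
    return loss.mean()

# Toy usage with random embeddings and labels.
h = torch.randn(8, 32)
y = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
print(supervised_contrastive_loss(h, y).item())
```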

5. Task- and Domain-Agnostic Methodologies

Task-agnostic frameworks extend few-shot prompting effectiveness across diverse NLP and multimodal tasks without task-specific engineering:

  • Unified Prompt Tuning (UPT) (Wang et al., 2022) trains PLMs using a Prompt-Options-Verbalizer (POV) triple to condition the model on a wide variety of prompt forms and output options. An auxiliary knowledge-enhanced masked language modeling loss further improves generalizability, reducing sensitivity to prompt format and template.
  • Contextual Few-Shot Prompting with Retrieval (Kumar et al., 15 Mar 2025): For tasks such as Text-to-SQL, relevant demonstrations are retrieved dynamically based on embedding similarity (e.g., via FAISS and vector databases) and integrated with few-shot template prompting (see the sketch after this list). This improves both latency and accuracy and facilitates robust deployment in systems with role-based access and data-security constraints.
  • Cross-Lingual Few-Shot Prompting (Toukmaji, 9 Mar 2024): Direct prompting in the target (low-resource) language with in-context examples outperforms translation-based or additional language-adaptive pre-training approaches, preserves the PLM’s instruction-following capabilities, and avoids degradation from over-fine-tuning.
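
A minimal sketch of embedding-based demonstration retrieval for Text-to-SQL, assuming FAISS and sentence-transformers; the demonstration pool and prompt format are illustrative rather than the cited system's.

```python
# Hedged sketch of retrieval-augmented few-shot prompting for Text-to-SQL.
# Assumes FAISS and sentence-transformers; the demo pool is illustrative.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# (question, gold SQL) demonstrations, indexed once at startup.
demos = [
    ("How many employees are in each department?",
     "SELECT department, COUNT(*) FROM employees GROUP BY department;"),
    ("List customers who placed orders in 2024.",
     "SELECT DISTINCT c.name FROM customers c JOIN orders o ON o.customer_id = c.id "
     "WHERE strftime('%Y', o.created_at) = '2024';"),
]

demo_vecs = encoder.encode([q for q, _ in demos], normalize_embeddings=True)
index = faiss.IndexFlatIP(demo_vecs.shape[1])   # inner product == cosine on unit vectors
index.add(np.asarray(demo_vecs, dtype="float32"))

def build_prompt(question: str, k: int = 2) -> str:
    q = encoder.encode([question], normalize_embeddings=True).astype("float32")
    _, idx = index.search(q, k)                 # retrieve the k most similar demos
    shots = "\n\n".join(f"Q: {demos[i][0]}\nSQL: {demos[i][1]}" for i in idx[0])
    return f"{shots}\n\nQ: {question}\nSQL:"

print(build_prompt("How many orders did each customer place?"))
```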

6. Advances in Few-Shot Prompting for Vision and Multimodal Tasks

Recent work extends few-shot prompting to visual and multimodal domains:

  • Knowledge Prompting for Video Action Recognition (Shi et al., 2022): Leverages action descriptions generated by a large language model or extracted automatically as “text proposals” to prompt a visual encoder (e.g., CLIP). The resulting matching scores form rich, transferable semantic representations for each video frame, which are then aggregated by a temporal modeling network combining convolutions and self-attention (see the sketch after this list).
  • Semantic Prompts for Visual Feature Extraction (Chen et al., 2023): Incorporates text-based class or label embeddings into the visual Transformer at both spatial (via self-attention) and channel (via MLP modulation) dimensions, optimizing the feature extractor itself for class specificity—results show consistent 3–4% improvements in 1-shot image recognition benchmarks compared to classifier-head semantic fusion.
  • VQA Prompt Design and Chain-of-Thought Reasoning (Awal et al., 2023): Structured question templates, image-caption augmentation, and text-only in-context demonstrations together optimize few-shot VQA performance. Chain-of-thought reasoning and self-consistency-based answer selection are shown to require careful balancing to avoid accuracy loss.
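
A minimal sketch of CLIP-based text-proposal matching in the spirit of knowledge prompting; the proposals, checkpoint, and downstream use are illustrative assumptions, and the temporal aggregation network is omitted.

```python
# Hedged sketch of knowledge prompting with CLIP: score a frame against a set
# of text proposals and use the similarity vector as a semantic feature.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Illustrative, language-generated action descriptions ("text proposals").
text_proposals = [
    "a person slicing vegetables with a knife",
    "a person dribbling a basketball",
    "a person playing an acoustic guitar",
]

def frame_semantic_features(frame: Image.Image) -> torch.Tensor:
    """Return per-proposal matching scores for one video frame."""
    inputs = processor(text=text_proposals, images=frame,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # (1, num_proposals): image-text similarity logits; the softmax vector can
    # feed a temporal model over the frame sequence downstream.
    return out.logits_per_image.softmax(dim=-1)

frame = Image.new("RGB", (224, 224))  # stand-in for a decoded video frame
print(frame_semantic_features(frame))
```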

7. Beyond Manual Engineering: Automated, Meta- and Synthetic Data Strategies

Several strategies reduce the need for hand-crafted prompts or curated datasets:

  • Prompt Pooling and RL-Matched Discrete Prompting (Li et al., 2023): GPT-4-driven multi-round dialogue alignment creates diverse, readable prompt sets, which are then optimized per-instance via policy gradient methods guided by entropy-based reward metrics.
  • Synthetic Data Generation via Prompting (Schmidt et al., 15 May 2024): LMs are prompted to produce additional labeled examples given a candidate context and answer entity (e.g., from NER), using decoding strategies such as beam search and nucleus sampling, followed by QA-consistency filtering (see the sketch after this list). This approach narrows the gap with full-data regimes and generalizes across extractive QA benchmarks.
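
A minimal sketch of prompt-driven synthetic example generation with nucleus sampling and a QA-consistency filter; the models, prompt, and filtering rule are illustrative assumptions, not the cited recipe.

```python
# Hedged sketch: prompt an LM to write questions for a given context/answer
# pair, then keep only pairs a QA model answers consistently.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = "Marie Curie won the Nobel Prize in Physics in 1903."
answer_entity = "Marie Curie"

# Nucleus sampling produces several diverse question candidates.
prompt = (f"Context: {context}\n"
          f"Write a question whose answer is \"{answer_entity}\".\nQuestion:")
candidates = generator(prompt, max_new_tokens=30, do_sample=True,
                       top_p=0.9, num_return_sequences=5)

kept = []
for cand in candidates:
    question = cand["generated_text"][len(prompt):].split("\n")[0].strip()
    if not question:
        continue
    # Consistency filter: keep the pair only if a QA model recovers the answer.
    pred = qa(question=question, context=context)
    if (answer_entity.lower() in pred["answer"].lower()
            or pred["answer"].lower() in answer_entity.lower()):
        kept.append({"context": context, "question": question,
                     "answer": answer_entity})

print(kept)
```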

8. Key Takeaways and Outlook

The landscape of few-shot prompting strategies has advanced rapidly with innovations in automated prompt/template discovery, dynamic and context-aware demonstration selection, memory- and retrieval-augmented generation, reinforcement learning-based optimization, and architectures that explicitly model stability and robustness. When paired with mid-sized PLMs, these strategies consistently outperform conventional fine-tuning under low-resource conditions and even sophisticated in-context learning with very large models. Furthermore, meta-cognitive prompting (reflection and positive reinforcement) and pairwise frameworks such as MetricPrompt demonstrate that reframing tasks in alignment with the pre-training objectives of LMs can bolster few-shot performance even further.

A summary table organizing representative strategies, their mechanisms, and their distinguishing features:

| Strategy/Class | Core Mechanism | Distinguishing Feature |
| --- | --- | --- |
| Prompt-based fine-tuning (Gao et al., 2020) | MLM reformulation, softmax over [MASK], automatic template/label mapping | Task-agnostic, dynamic demos, up to 30% gain |
| Multi-label automatic mapping (Wang et al., 2022) | Summed [MASK] label probabilities per class | Noise robustness, no manual engineering |
| Active ensemble/AL (Köksal et al., 2022) | Multi-prompt ensembling, prompt-pair KL selection | Stability, reduced run variance |
| Prototype-based prompting (Zhang et al., 2022) | Latent clustering, prompt prototypes | Parameter- vs. instance-level adaptivity |
| Contrastive StablePT (Liu et al., 30 Apr 2024) | Input separation, contrastive loss | Robustness to prompt/initialization noise |
| RL/episodic prompt optimization (Li et al., 2023; Do et al., 14 Aug 2024) | Policy gradient, memory, reward-based ordering | Performance-driven prompt matching |
| Retrieval-augmented, contextual (Kumar et al., 15 Mar 2025) | Embedding-based retrieval for context | Scalability, improved accuracy/latency |
| Meta-cognition/reflective (Ji et al., 2023) | Model self-reflection, positive feedback | Improved mapping/generalization |

Few-shot prompting continues to be a highly active research area, with significant opportunities for advancing its foundations, stability, and applicability across a growing range of modalities and domains.