Verbalized Sampling (VS) in LLMs
- Verbalized Sampling (VS) is a method that prompts language models to enumerate candidate outputs with explicit probability scores, addressing mode collapse and sampling bias.
- It employs structured prompts to generate multiple responses paired with numerical confidence estimates, enabling calibrated probabilistic reasoning and diverse generation.
- Empirical results demonstrate that VS improves semantic diversity, calibration, and output fairness, making it useful for creative tasks and probabilistic simulations.
Verbalized Sampling (VS) is a prompting-based approach for LLMs that elicits explicit probability distributions or confidence scores over sets of candidate outputs, directly in natural language. Unlike traditional direct sampling, which simply decodes top sequences under the model distribution, VS instructs the LLM to enumerate possible options and to verbalize a numerical probability, confidence, or plausibility score for each. This methodology is designed to combat issues such as mode collapse and sampling bias introduced during post-training alignment, and to support calibrated probabilistic reasoning and diverse generation across various tasks (Zhang et al., 1 Oct 2025, Wang et al., 29 Sep 2025, Xiao et al., 11 Jun 2025).
1. Theoretical Motivation and Origins
A central theoretical motivation for VS lies in mitigating typicality bias and mode collapse, phenomena emerging from post-training preference optimization (such as RLHF and RLVR). Empirical and cognitive findings reveal a strong tendency of annotators and reward models to favor “typical” or familiar outputs, which shifts the induced policy toward the reference model and amplifies output concentration. Formally, if the perceived reward is modeled as

$$r(x, y) = r_{\text{true}}(x, y) + \alpha \log \pi_{\text{ref}}(y \mid x), \qquad \alpha > 0,$$

then the optimal post-RLHF policy is

$$\pi^{*}(y \mid x) \propto \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta} \exp\!\big(r_{\text{true}}(x, y)/\beta\big)$$

(Zhang et al., 1 Oct 2025). Because the exponent $1 + \alpha/\beta$ exceeds 1, the reference distribution is sharpened, and this results in mode collapse when many outputs tie under $r_{\text{true}}$.
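The sharpening effect can be illustrated numerically. Below is a minimal sketch of the optimal post-RLHF policy described above; the reference distribution and reward values are made up for illustration:

```python
import math

def rlhf_policy(pi_ref, r_true, alpha, beta):
    """Optimal post-RLHF policy: pi*(y) ∝ pi_ref(y)^(1 + alpha/beta) * exp(r_true(y)/beta)."""
    w = [p ** (1 + alpha / beta) * math.exp(r / beta) for p, r in zip(pi_ref, r_true)]
    z = sum(w)
    return [x / z for x in w]

# Three candidate outputs that tie under the true reward.
pi_ref = [0.5, 0.3, 0.2]
r_true = [1.0, 1.0, 1.0]

# Without typicality bias (alpha = 0), the policy recovers pi_ref.
unbiased = rlhf_policy(pi_ref, r_true, alpha=0.0, beta=1.0)

# With typicality bias (alpha > 0), the exponent exceeds 1 and probability
# mass concentrates on the reference model's mode — i.e., mode collapse.
biased = rlhf_policy(pi_ref, r_true, alpha=2.0, beta=1.0)
```

With these toy numbers the biased exponent is 3, pushing the leading candidate's mass from 0.5 to roughly 0.78 even though all candidates are equally rewarded.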
VS circumvents this by separating distribution learning from preference and requiring the LLM to articulate and assign probabilities to multiple candidates simultaneously, thereby diluting typicality bias and enabling recovery of pre-trained diversity. The approach also supports probabilistic calibration in reasoning and decision-making, addressing overconfidence and suggestibility in confidence estimation (Wang et al., 29 Sep 2025).
2. Core Methodologies of Verbalized Sampling
The VS paradigm is operationalized via explicit prompt engineering that transforms single-response tasks into groupwise or distributional queries. Key methodologies include:
- Distribution-level Prompting: Rather than requesting a single output, the LLM is prompted, e.g., “Generate 5 jokes about coffee and their probabilities.” Each response is paired with a verbalized probability, confidence score, or plausibility judgment. Common output formats use JSON arrays with fields for text and probability, with normalization performed post hoc if necessary (Zhang et al., 1 Oct 2025).
- Probability Elicitation: Probabilities may be explicitly requested (e.g., as decimals or percentages), in terms of verbalizable confidence (“How likely is this answer correct, as a percent?”), or via binary judgments interpreted probabilistically (Wang et al., 29 Sep 2025).
- Distractor-based Calibration: The DiNCo (Distractor-Normalized Coherence) protocol generates task-relevant distractors alongside the main claim and calibrates the verbalized confidence of the main claim by normalizing it against plausibility scores for the distractors, optionally reweighting based on semantic uniqueness and logical incompatibility (Wang et al., 29 Sep 2025).
- Rejection-based Sampling: Verbalized Rejection Sampling (VRS) adapts classical rejection sampling by prompting the LLM to accept or reject candidate samples based on explicit reasoning about proposal and target probabilities, reducing sampling bias and improving statistical faithfulness, especially in stochastic simulation tasks (Xiao et al., 11 Jun 2025).
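Distribution-level prompting reduces, on the consumer side, to parsing and renormalizing the model's self-reported scores. A minimal sketch, assuming a reply formatted as a JSON array with `text` and `probability` fields (the reply string here is a hypothetical model output, not taken from the cited papers):

```python
import json

def parse_vs_response(raw: str) -> list[tuple[str, float]]:
    """Parse a verbalized-sampling reply formatted as a JSON array of
    {"text": ..., "probability": ...} objects and renormalize the
    self-reported probabilities post hoc so they sum to one."""
    items = json.loads(raw)
    total = sum(float(it["probability"]) for it in items)
    if total <= 0:  # fall back to uniform if the model gave no usable scores
        return [(it["text"], 1.0 / len(items)) for it in items]
    return [(it["text"], float(it["probability"]) / total) for it in items]

# Hypothetical reply to: "Generate 3 jokes about coffee and their probabilities."
raw = '''[
  {"text": "Why did the coffee file a police report? It got mugged.", "probability": 0.5},
  {"text": "Decaf? No thanks, I like my coffee with consequences.", "probability": 0.3},
  {"text": "Espresso yourself.", "probability": 0.1}
]'''
dist = parse_vs_response(raw)  # scores 0.5/0.3/0.1 are renormalized to sum to 1
```

Renormalization matters in practice because verbalized probabilities rarely sum to exactly one.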
3. Algorithmic Implementations and Inference Recipes
VS protocols are realizable in both “white-box” and “black-box” settings:
- White-box (logit access/beam search): The LLM is used to decode high-probability completions and alternative beams, with candidate sets constructed via beam search or prefix manipulation. Verbalized confidences are elicited and normalized as per the relevant protocol (e.g., DiNCo).
- Black-box (prompt-based): Candidate responses and distractors are generated via repeated prompting, often with temperature or nucleus sampling for diversity. Confidence or probability statements are solicited in pure text, parsed, and aggregated post hoc (Wang et al., 29 Sep 2025, Zhang et al., 1 Oct 2025).
Pseudocode paradigms across VS instances share common steps:
- Generate candidate responses using prompts tailored to the specific task (e.g., QA, creative writing).
- For each candidate, elicit a model-generated probability or confidence score using an explicit instruction.
- Normalize or reweight scores as needed (e.g., ensure probabilities sum to one, discount redundant or entailed distractors).
- Sample from, or aggregate over, the probability-weighted set for downstream use, such as selecting the most likely or a diverse set of responses.
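The steps above can be sketched as a generic pipeline. The `generate` and `elicit_prob` callables stand in for LLM calls and are stubbed here with toy functions; the pipeline shape, not the stubs, is the point:

```python
import random

def vs_pipeline(generate, elicit_prob, prompt, k=5, seed=0):
    """Generic VS recipe: generate candidates, elicit a score per
    candidate, normalize, then sample from the weighted set."""
    candidates = generate(prompt, k)                        # step 1: candidate responses
    scores = [elicit_prob(prompt, c) for c in candidates]   # step 2: verbalized scores
    z = sum(scores)
    probs = [s / z for s in scores]                         # step 3: normalize to sum to 1
    rng = random.Random(seed)
    choice = rng.choices(candidates, weights=probs, k=1)[0]  # step 4: probability-weighted draw
    return candidates, probs, choice

# Toy stubs in place of real LLM calls (illustrative only).
gen = lambda prompt, k: [f"answer-{i}" for i in range(k)]
score = lambda prompt, c: 1.0 + int(c[-1])  # pretend later answers get higher scores
cands, probs, pick = vs_pipeline(gen, score, "toy prompt", k=3)
```

In a real deployment, step 4 may instead select the top-probability candidate or a diverse subset, depending on downstream needs.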
In probabilistic reasoning and calibration tasks (DiNCo), further integration with self-consistency sampling and natural language inference (NLI) networks may be used to blend confidence derived from both generation and validation distributions, producing a calibrated final estimate (Wang et al., 29 Sep 2025).
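The blending step can be sketched as a convex combination of a distractor-normalized verbalized confidence and a self-consistency agreement rate. The exact DiNCo weighting is defined in Wang et al.; the form and the mixing weight `lam` below are illustrative assumptions:

```python
def blended_confidence(verbalized_conf, distractor_confs, sc_agreement, lam=0.5):
    """Illustrative blend of distractor-normalized verbalized confidence
    with self-consistency agreement (lam and this form are assumptions,
    not the exact DiNCo rule).
    - verbalized_conf: model's stated confidence in the main claim
    - distractor_confs: stated plausibilities of the generated distractors
    - sc_agreement: fraction of self-consistency samples agreeing with the claim
    """
    normalized = verbalized_conf / (verbalized_conf + sum(distractor_confs))
    return lam * normalized + (1 - lam) * sc_agreement

# A claim rated 0.9 against distractors rated 0.6 and 0.3, with 70% self-consistency.
conf = blended_confidence(0.9, [0.6, 0.3], sc_agreement=0.7)
```

Normalizing against distractors is what tempers the raw verbalized score: here the stated 0.9 is discounted to 0.5 before blending, counteracting the saturation typical of unnormalized confidences.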
4. Empirical Results and Evaluation Metrics
VS demonstrates improvements across several evaluation axes, including diversity, calibration, and bias reduction:
- Diversity Gains: In creative writing, dialogue, and QA, VS increases both semantic and lexical diversity by 1.6–2.1× relative to direct prompting. Semantic diversity is quantified as one minus the average pairwise cosine similarity of response embeddings, so higher values indicate greater output variety (Zhang et al., 1 Oct 2025).
- Calibration and Bias: In stochastic tasks (e.g., coin flipping, random number generation), VRS achieves KL divergence from the target or uniform distribution as low as 0.027 (vs. 0.926 direct), and reduces sum of TV distances (STVD) by 48–73% across models (Xiao et al., 11 Jun 2025).
- Confidence Calibration: DiNCo achieves an Expected Calibration Error (ECE) of approximately 0.097 (vs. ~0.24 for non-VS baselines) and demonstrates lower saturation in confidence assignments (Δ₀=0.998), indicating finer discrimination between output probabilities (Wang et al., 29 Sep 2025).
- Task Quality: VS maintains or modestly improves downstream accuracy and human-judged quality, while enhancing diversity and fair coverage across knowledge and synthetic data tasks (Zhang et al., 1 Oct 2025).
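The metrics above have standard definitions. A self-contained sketch of discrete KL divergence and binned Expected Calibration Error, with made-up toy inputs:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_calibration_error(confs, correct, n_bins=10):
    """Standard binned ECE: |bin accuracy - bin mean confidence|,
    weighted by the fraction of samples in each bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    n = len(confs)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Toy example: a "fair coin" simulator that actually outputs heads 65% of the time.
bias = kl_divergence([0.5, 0.5], [0.65, 0.35])  # positive, since the sampler is biased
```

These are the quantities behind the reported numbers: lower KL/ECE means the model's verbalized distribution tracks the target distribution or its own accuracy more faithfully.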
5. Case Studies: DiNCo and Verbalized Rejection Sampling
DiNCo (Distractor-Normalized Coherence) exemplifies VS for answer calibration in LLMs. Main steps include: generating distractors (alternative, plausible but incorrect claims); eliciting and normalizing verbalized confidence across the candidate set (with semantic and logical downweighting for entailed or non-contradictory alternatives); integrating “validator” and “generator” consistency via self-consistency sampling; and outputting a blended, calibrated confidence (Wang et al., 29 Sep 2025).
VRS (Verbalized Rejection Sampling) applies VS to unbiased sampling for distributions (notably Bernoulli and categorical). The LLM is prompted to reason about accept/reject steps by referencing explicitly stated target and proposal probabilities and outputs an accept/reject verdict. Empirically, VRS achieves substantial reduction in sampling bias and TV distance versus direct sampling, demonstrating near-diagonal alignment with the true target distribution (Xiao et al., 11 Jun 2025).
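For reference, the classical accept/reject criterion that VRS asks the LLM to verbalize looks as follows for a discrete target. This is a sketch of textbook rejection sampling, not the VRS prompt itself; in VRS the numeric comparison is replaced by the model's reasoned accept/reject verdict:

```python
import random

def rejection_sample(target, proposal, support, M, rng):
    """Classical rejection sampling: draw y ~ proposal, accept with
    probability target(y) / (M * proposal(y)), where M bounds the ratio."""
    while True:
        i = rng.choices(range(len(support)), weights=proposal, k=1)[0]
        if rng.random() < target[i] / (M * proposal[i]):
            return support[i]

rng = random.Random(0)
# Sample a biased Bernoulli (P(heads) = 0.2) using a uniform proposal.
target = [0.2, 0.8]
proposal = [0.5, 0.5]
M = max(t / q for t, q in zip(target, proposal))  # = 1.6, the envelope constant
draws = [rejection_sample(target, proposal, ["heads", "tails"], M, rng)
         for _ in range(2000)]
```

The empirical heads frequency converges to the target 0.2, which is the statistical faithfulness VRS aims to recover from an LLM that, when sampled directly, would exhibit biased frequencies.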
6. Limitations, Practical Guidelines, and Extensions
VS introduces a compute and latency trade-off due to its multi-response paradigm (e.g., “k” responses per query). Its benefits scale with model capability: larger, more capable models with richer knowledge benefit more, while smaller models may struggle to produce well-calibrated or meaningfully diverse distributions (Zhang et al., 1 Oct 2025). Open questions remain regarding optimal probability formats, diversity–quality trade-off tuning, and extension to multimodal or highly structured tasks.
Practical recommendations include:
- Use “VS-Standard” prompts (“Generate candidates with probabilities”) for creative and QA tasks.
- For tasks prioritizing calibration, use distractor-based normalization protocols.
- Normalize all self-reported probabilities and reweight or sample accordingly per downstream needs.
- Combine VS with existing decoding strategies (temperature, top-p) for maximal effect.
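One way to combine a verbalized distribution with temperature-style control is to reshape the self-reported probabilities before sampling. Applying a temperature on top of VS output is an illustrative combination, not a protocol prescribed by the cited papers:

```python
def temperature_reweight(probs, tau):
    """Reshape a verbalized distribution with temperature tau:
    tau > 1 flattens it (more diversity), tau < 1 sharpens it
    (more typicality). Assumes probs already sum to one."""
    w = [p ** (1.0 / tau) for p in probs]
    z = sum(w)
    return [x / z for x in w]

flat = temperature_reweight([0.7, 0.2, 0.1], tau=2.0)   # flatter than the input
sharp = temperature_reweight([0.7, 0.2, 0.1], tau=0.5)  # sharper than the input
```

This gives two independent diversity knobs: the decoding temperature used while generating candidates, and the reweighting applied to the verbalized distribution itself.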
A plausible implication is that inference-time VS methods provide a new axis for controllable diversity, calibration, and bias correction in LLM generation pipelines, distinct from and complementary to algorithmic or model-level interventions.
7. Relation to Broader Research and Future Directions
VS situates itself at the intersection of post-training alignment, calibrated reasoning, sampling-based diversity, and cognitive modeling of LLM evaluation. It provides a data-centric, training-free mechanism for addressing mode collapse, aligning natural language capabilities with probabilistic reasoning, and enabling new downstream applications in random simulation, creative generation, and trustworthy AI (Zhang et al., 1 Oct 2025, Wang et al., 29 Sep 2025, Xiao et al., 11 Jun 2025).
Future directions include systematic fairness and bias analyses, optimizing protocol-specific hyperparameters, and extending VS to multimodal domains and more complex structured output spaces. Ongoing research explores its potential in reward model calibration and as an inference-time instrument for improved RL sample efficiency.