Natural-Language Verbalized Rejection Sampling

Updated 22 April 2026

Natural-Language VRS uses explicit natural-language prompts to carry out rejection sampling with LLMs, with the aim of reducing bias in sampled outcomes.
It replaces direct sampling with a two-step process in which a candidate proposal is followed by a verbalized acceptance decision.
Empirical results indicate that VRS can reduce the summed total-variation distance by over 50% across the tested models, improving calibration.

A Natural-Language Verbalized Rejection Sampling (VRS) system uses LLMs to implement classical rejection sampling entirely through natural-language prompts. By verbalizing the accept/reject logic of rejection sampling in prompts, VRS lets LLMs produce stochastic samples that follow a specified probability distribution more closely than direct prompting does, without access to internal logits or sampling mechanisms. This matters for applications that need reliable, controllable stochasticity (Monte Carlo simulation, agent-based modeling, and randomized protocols for AI systems), where naive prompting of an LLM may otherwise introduce significant sampling bias due to the model’s output distribution or tokenization artifacts (Xiao et al., 11 Jun 2025).

1. Formal Definition and Problem Context

The central problem VRS addresses is the alignment of an LLM’s empirical output distribution with a target distribution specified by the user. Consider the canonical example of Bernoulli sampling: for $p \in [0,1]$, one seeks independent samples $X_i \sim \mathrm{Bernoulli}(p)$ via natural-language instructions to the LLM. In practice, prompting even advanced models yields a measurable deviation $\varepsilon = \bigl| \Pr_{\rm LLM}[X = 1] - p \bigr|$ from the target probability $p$. VRS builds on standard rejection sampling but moves every component (the proposal, the acceptance criterion, and the accept/reject step) into explicit natural-language prompts (Xiao et al., 11 Jun 2025).
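The deviation $\varepsilon$ can be estimated from a batch of collected outputs. The following is a minimal sketch, assuming the LLM replies have already been parsed into 0/1 values; the function name and the sample list are hypothetical. For a Bernoulli target, this absolute difference equals the total-variation distance between the empirical and target distributions.

```python
def empirical_deviation(samples, p):
    """Return |p_hat - p| for a list of 0/1 samples and a target Bernoulli(p)."""
    if not samples:
        raise ValueError("need at least one sample")
    p_hat = sum(samples) / len(samples)  # empirical frequency of 1s
    return abs(p_hat - p)


# Hypothetical parsed outputs for a target of p = 0.25 (4 ones out of 10).
samples = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
deviation = empirical_deviation(samples, 0.25)
print(f"estimated deviation: {deviation:.3f}")
```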

2. Core Methodology: Verbalized Rejection Sampling Algorithm

A VRS system replaces naive direct sampling prompts with a two-step protocol:

Proposal Sampling: The LLM is prompted to generate a candidate sample $x$ according to a known proposal distribution $Q$. For Bernoulli sampling, $Q$ is typically $\mathrm{Bern}(q)$ with $q=0.5$.
Acceptance Criterion (Verbalized): The LLM is then prompted, in natural language, to compute the acceptance probability $A(x) = P(x)/(M Q(x))$, where $P$ is the target distribution, $Q$ the proposal, and $M$ the rejection constant. The prompt requires the LLM to reason step by step, compute $A(x)$, and make an explicit accept/reject ('T' or 'F') decision.
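
As a worked instance of the acceptance probability for the Bernoulli case with $q = 0.5$, assume the smallest valid rejection constant is used (the choice of $M$ is an assumption here, not something the text above fixes):

$$
M = \max_{x \in \{0,1\}} \frac{P(x)}{Q(x)} = 2\max(p,\, 1-p), \qquad
A(1) = \frac{p}{\max(p,\, 1-p)}, \qquad
A(0) = \frac{1-p}{\max(p,\, 1-p)}.
$$

For example, with $p = 0.25$ this gives $M = 1.5$, $A(1) = 1/3$, and $A(0) = 1$. The overall acceptance rate is $0.5 \cdot \tfrac{1}{3} + 0.5 \cdot 1 = \tfrac{2}{3} = 1/M$, and the accepted samples equal 1 with probability $(0.5 \cdot \tfrac{1}{3}) / \tfrac{2}{3} = 0.25$, as required.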

The algorithm can be summarized (with prompt examples abstracted for technical clarity):

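Propose: Prompt the LLM for a candidate $x \sim Q$.
Evaluate: Prompt the LLM to compute $A(x) = P(x)/(M Q(x))$ step by step and to answer 'T' (accept) or 'F' (reject).
Decide: If the answer is 'T', return $x$; otherwise discard $x$ and return to the Propose step.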

The prompt asks the LLM to write out each mathematical operation and the rationale for its decision in chain-of-thought form, as enforced in the empirical studies (Xiao et al., 11 Jun 2025).
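
The control flow of the two-step protocol can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the cited study: `propose` and `accept` are hypothetical callables standing in for the two LLM prompts, and `make_ideal_stubs` replaces them with a pseudo-random generator that behaves as an ideal sampler would, so the loop can be run without a model.

```python
import random
from typing import Callable


def vrs_sample(
    propose: Callable[[], int],
    accept: Callable[[int], bool],
    max_tries: int = 1000,
) -> int:
    """Repeat propose-then-accept until a candidate is accepted."""
    for _ in range(max_tries):
        x = propose()      # step 1: candidate from the proposal Q
        if accept(x):      # step 2: verbalized accept/reject decision
            return x
    raise RuntimeError("no candidate accepted within max_tries")


def make_ideal_stubs(p: float, rng: random.Random):
    """Stand-ins for the two LLM prompts (assumption: ideal behavior)."""
    q = 0.5
    m = max(p, 1 - p) / q  # smallest valid rejection constant

    def propose() -> int:
        return 1 if rng.random() < q else 0

    def accept(x: int) -> bool:
        target = p if x == 1 else 1 - p
        proposal = q if x == 1 else 1 - q
        return rng.random() < target / (m * proposal)

    return propose, accept


rng = random.Random(0)
propose, accept = make_ideal_stubs(0.25, rng)
draws = [vrs_sample(propose, accept) for _ in range(20000)]
freq = sum(draws) / len(draws)
print(f"frequency of 1s: {freq:.3f}")
```

In an actual deployment the two stubs would be replaced by functions that send a prompt to the model and parse its reply; the loop itself would stay the same.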

3. Theoretical Properties, Bias and Error Analysis

The key distinction between direct sampling and VRS lies in bias correction. Direct sampling simply prompts for a value distributed according to the target $P$ and so inherits the LLM’s inherent sampling bias. In contrast, VRS structures the acceptance step so that, under mild assumptions that the bias in the LLM’s accept/reject decision is bounded, the total-variation distance between the realized law of the VRS procedure and the target distribution is reduced relative to direct sampling (Xiao et al., 11 Jun 2025). Whenever the condition of that bound holds, VRS outperforms direct sampling in total-variation error. This result generalizes classical rejection-sampling guarantees to the LLM-as-black-box setting, accommodating the fact that all computations are expressed and executed in natural language.
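
To see how errors in the two steps propagate to the output, one can compute the law of accepted samples directly. The sketch below is an elementary calculation, not the bound from the cited paper; the parameter names `q_hat`, `a1`, and `a0` (the probability the model actually proposes 1, and the probabilities with which it actually accepts 1 and 0) are introduced here for illustration.

```python
def realized_prob_one(q_hat: float, a1: float, a0: float) -> float:
    """P(output = 1) when 1 is proposed w.p. q_hat and accepted w.p. a1,
    while 0 is proposed w.p. 1 - q_hat and accepted w.p. a0."""
    accept_one = q_hat * a1
    accept_zero = (1.0 - q_hat) * a0
    total = accept_one + accept_zero
    if total == 0.0:
        raise ValueError("acceptance probability is zero")
    return accept_one / total


p = 0.25
scale = max(p, 1 - p)

# Exact proposal and exact acceptance probabilities recover the target.
ideal = realized_prob_one(0.5, p / scale, (1 - p) / scale)

# A skewed proposal (0.6 instead of 0.5) with exact acceptance probabilities.
skewed = realized_prob_one(0.6, p / scale, (1 - p) / scale)

print(f"ideal: {ideal:.4f}, skewed proposal: {skewed:.4f}")
```

Comparing `abs(realized_prob_one(...) - p)` against the deviation of direct sampling is one way to explore when the two-step procedure helps.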

4. Prompt Engineering and Empirical Results

Prompt design is central. Empirical studies explored the impact of various prompt phrasings on the sum of TV distances (STVD) across models. For Llama-3.1-70B, GPT-4.1-nano, DeepSeekV3, and Qwen-2.5-72B, the best VRS prompt phrasing reduced STVD by over 50% compared to direct sampling (e.g., Llama-3.1, direct: 15.73, VRS: 6.2) (Xiao et al., 11 Jun 2025). Calibration plots showed that VRS outputs cluster near the ideal 45-degree calibration line, whereas direct sampling consistently deviates (overestimating or underestimating depending on $p$).

Prompt specification best practices include explicit step-by-step computation, strict output format requirements, supplying rejection constants in-prompt if needed, and minimizing “chain-of-thought” verbosity to maximize LLM compliance and efficiency.
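
These practices can be illustrated with a small prompt builder and a strict reply parser. The wording of the prompt, the `Decision:` output convention, and both function names are assumptions made for this sketch; they are not the prompts used in the cited study.

```python
import re


def build_acceptance_prompt(p: float, q: float, m: float, x: int) -> str:
    """Assemble an accept/reject prompt. The wording is illustrative only."""
    return (
        f"Target P: Bernoulli with P(1) = {p:.4g} and P(0) = {1 - p:.4g}.\n"
        f"Proposal Q: Bernoulli with Q(1) = {q:.4g} and Q(0) = {1 - q:.4g}.\n"
        f"Rejection constant: M = {m:.4g}.\n"  # constant supplied in-prompt
        f"Candidate sample: x = {x}.\n"
        "Compute A(x) = P(x) / (M * Q(x)) step by step, then accept x with "
        "probability A(x).\n"
        "End with exactly one final line: 'Decision: T' to accept or "
        "'Decision: F' to reject."
    )


# A reply is valid only if it contains exactly one well-formed decision line.
_DECISION = re.compile(r"^Decision:\s*([TF])\s*$", re.MULTILINE)


def parse_decision(reply: str) -> bool:
    """Return True for accept, False for reject; raise on malformed replies."""
    matches = _DECISION.findall(reply)
    if len(matches) != 1:
        raise ValueError("expected exactly one 'Decision: T' or 'Decision: F' line")
    return matches[0] == "T"


prompt = build_acceptance_prompt(p=0.25, q=0.5, m=1.5, x=1)
accepted = parse_decision("A(1) = 0.25 / (1.5 * 0.5) = 0.333\nDecision: F")
print(prompt)
print("accepted:", accepted)
```

Raising on malformed replies, rather than guessing, keeps format violations from silently biasing the accepted samples; the caller can retry the prompt instead.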

Model	Direct Sampling (STVD)	VRS (Best Phrasing, STVD)
Llama-3.1	15.73	6.2
GPT-4.1-nano	21.00	6.2–6.6
DeepSeekV3	20.30	6.66
Qwen-2.5	20.27	5.47
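
The relative reductions implied by the table can be checked with a few lines of arithmetic. The sketch below uses only the values listed above; for GPT-4.1-nano it takes the upper end of the reported range (6.6), which gives the most conservative reduction.

```python
# (direct STVD, VRS STVD) per model, copied from the table above.
stvd = {
    "Llama-3.1": (15.73, 6.2),
    "GPT-4.1-nano": (21.00, 6.6),  # upper end of the reported 6.2 to 6.6 range
    "DeepSeekV3": (20.30, 6.66),
    "Qwen-2.5": (20.27, 5.47),
}

# Relative reduction in percent: 100 * (direct - vrs) / direct.
reductions = {
    model: 100.0 * (direct - vrs) / direct
    for model, (direct, vrs) in stvd.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct:.1f}% lower STVD with VRS")
```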

5. Extensions, Limitations, and Generalizations

VRS generalizes naturally to categorical targets, where the target $P$ is a distribution over a finite set of classes. It can, in principle, be extended to continuous distributions by discretizing the event space and prompting the LLM to reason about real-valued densities, although this remains challenging. Efficiency deteriorates as the rejection constant $M$ increases (an exact rejection sampler accepts a fraction $1/M$ of proposals, so a larger $M$ means a lower acceptance rate), and the procedure relies on the LLM’s ability to track and compare numeric ratios correctly.
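
For a categorical target, the rejection constant and the per-class acceptance probabilities follow the same formula as in the Bernoulli case. The sketch below assumes a uniform proposal and the smallest valid $M$; the function name and the example distributions are hypothetical.

```python
def acceptance_table(target: dict, proposal: dict):
    """Return (M, {x: A(x)}) with M = max_x P(x) / Q(x)."""
    if set(target) != set(proposal):
        raise ValueError("target and proposal must cover the same classes")
    if any(proposal[x] <= 0 for x in proposal):
        raise ValueError("proposal must give every class positive probability")
    m = max(target[x] / proposal[x] for x in target)
    accept = {x: target[x] / (m * proposal[x]) for x in target}
    return m, accept


target = {"a": 0.5, "b": 0.25, "c": 0.25}
proposal = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # uniform proposal (assumption)

m, accept = acceptance_table(target, proposal)
expected_acceptance_rate = 1.0 / m  # fraction of proposals an exact sampler keeps

print(f"M = {m:.3f}, acceptance rate = {expected_acceptance_rate:.3f}")
print(accept)
```

The more the target concentrates on a few classes relative to the proposal, the larger $M$ becomes and the more prompt rounds each accepted sample costs.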

Limitations include:

Dependence on the LLM’s numerical and logical reasoning fidelity within the natural-language interface.
No current open-source implementation of VRS for high-dimensional or continuous distributions.
Theoretical analysis fully developed only for Bernoulli targets (Xiao et al., 11 Jun 2025).

Suggested future directions include verbalized MCMC protocols (verbal chain-of-thought for Markov chains), integrating external random functions (e.g., code tool calls), and adaptive proposal selection where the LLM actively refines the proposal $Q$.

6. Applications and Broader Implications

VRS improves the stochastic reliability of LLM-driven simulations, agent-based reasoning, and randomized protocols where unbiased sampling is critical but direct access to the LLM’s logits is unavailable. It enables more controllable and trustworthy AI decision workflows and can serve as a blueprint for further embedding traditional algorithmic primitives into natural-language-driven modeling (Xiao et al., 11 Jun 2025).

The empirical reductions in bias and the framework’s adaptability across models (without prompt over-engineering or access to model internals) make VRS a strong candidate technique for LLM-based stochastic computation. As “algorithm verbalization” gains prominence, VRS stands as a clear instance of a classical probabilistic method transferred to an LLM-centric pipeline.

References

Xiao et al. (2025). Flipping Against All Odds: Reducing LLM Coin Flip Bias via Verbalized Rejection Sampling.
