Verbalized Probability Distribution
- Verbalized probability distributions are mappings from natural language expressions to probability values that bridge internal model states and human-interpretable outputs.
- They use calibration techniques like temperature scaling and the invert-softmax trick to reduce overconfidence and improve reliability.
- Applications include eliciting model uncertainty, modeling human subjective confidence with fuzzy sets, and enhancing decision-making with better uncertainty estimates.
A verbalized probability distribution is a mapping from linguistic expressions or explicit model outputs in natural language (e.g., confidence scores, ranges, or categorical labels) to probability values or, more generally, to probability distributions over outcomes. This construct serves as a bridge between internal model states (typically unobservable token-wise probabilities or hidden logits) and the external, human-interpretable confidence or uncertainty statements produced by black-box models or human subjects. In contemporary research, the idea of verbalized probability distributions is central to (i) eliciting well-calibrated uncertainty estimates from LLMs via prompt-based methods, (ii) calibrating or regularizing these distributions for decision-making, and (iii) modeling human subjective uncertainty using categorical verbal expressions formalized as fuzzy sets.
1. Formal Definition and Elicitation Methods
Let $\mathcal{Y} = \{y_1, \dots, y_K\}$ denote a discrete outcome space (e.g., sentiment labels, multiple-choice answers). A verbalized probability distribution over $\mathcal{Y}$ is a vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_K)$ where $\hat{p}_k \geq 0$, $\sum_{k=1}^{K} \hat{p}_k = 1$, and each $\hat{p}_k$ is either (a) directly output by an LLM in response to a prompt (“Please provide a probability for each class”), or (b) inferred from a verbal or categorical statement (“high confidence,” “likely,” etc.) using a pre-defined mapping $m$ from expressions to $[0,1]$.
The elicitation workflow for LLMs is as follows (Wang et al., 9 Oct 2024):
- Prompt the model to supply a full distribution, e.g., as a JSON/Python dict $\{y_k: \hat{p}_k\}$, ensuring $\hat{p}_k \geq 0$ and $\sum_k \hat{p}_k = 1$.
- Alternatively, elicit scalar confidences (“probability score,” percentage, graded label) per answer, then aggregate or calibrate these into a full or partial distribution (Yang et al., 19 Dec 2024).
- For human subjects, verbal expressions (e.g., “unlikely,” “possible,” “certain”) are mapped to probability intervals or membership functions via calibration procedures (Zimmer, 2013).
A fundamental assumption in the LLM setting is that the observed $\hat{p}$ is an explicit proxy for the model’s internal categorical softmax over logits: $\hat{p}_k \approx \exp(z_k) / \sum_j \exp(z_j)$, where $z \in \mathbb{R}^K$ is the hidden logit vector.
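The elicitation step can be made concrete with a short sketch. The following is a minimal illustration, assuming a hypothetical `query_llm` helper that sends a prompt string to a black-box model and returns its raw text completion; the prompt wording and JSON schema are illustrative, not prescribed by the cited works.

```python
import json

def elicit_verbalized_distribution(question, labels, query_llm):
    """Elicit and validate a verbalized categorical distribution.

    `query_llm` is a hypothetical caller-supplied function that sends a
    prompt string to a black-box model and returns its raw text completion.
    """
    prompt = (
        f"Question: {question}\n"
        f"Please provide a probability for each class in {list(labels)} "
        "as a JSON object mapping label to probability. "
        "Probabilities must be non-negative and sum to 1."
    )
    dist = json.loads(query_llm(prompt))  # e.g. {"positive": 0.8, "negative": 0.2}

    probs = [max(float(dist.get(y, 0.0)), 0.0) for y in labels]
    total = sum(probs)
    if total == 0:
        raise ValueError("model returned a degenerate distribution")
    # Renormalize: verbalized outputs often sum to slightly more or less than 1.
    return {y: p / total for y, p in zip(labels, probs)}
```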
2. Calibration, Post-Processing, and the Invert-Softmax Trick
Verbalized probability distributions are frequently overconfident and poorly calibrated—average confidence typically exceeds empirical accuracy (Wang et al., 9 Oct 2024, Yang et al., 19 Dec 2024, Lin et al., 2022). Calibration remedies include:
- Temperature Scaling (TS): Let $z \in \mathbb{R}^K$ be the logits (normally hidden); then $p_k = \exp(z_k/T) / \sum_j \exp(z_j/T)$ for a temperature $T > 0$. For verbalized probabilities (i.e., no access to $z$), naïvely applying TS as $\mathrm{softmax}(\hat{p}/T)$ leads to a “re-softmax” phenomenon, producing higher-entropy (flattened) distributions strictly bounded between $1/(K-1+e)$ and $e/(K-1+e)$, thereby artificially lowering confidence and distorting class probabilities (Wang et al., 9 Oct 2024).
- Invert-Softmax Trick: Provided only verbalized probabilities $\hat{p}$, recover proxy logits $\tilde{z}_k = \log \hat{p}_k$, yielding a logit vector equal to the true logits up to an additive constant. This operation preserves class ordering and enables downstream TS or Platt scaling to be applied correctly (Wang et al., 9 Oct 2024); see the sketch after this list.
- Post-Hoc Calibration (TS/Platt): Fit the temperature $T$ (for TS) or regression coefficients (for Platt) on a held-out set to minimize negative log-likelihood or expected calibration error.
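A minimal NumPy sketch of the invert-softmax trick followed by temperature scaling; the grid-search fitting routine is an illustrative stand-in for whatever optimizer the cited work actually uses.

```python
import numpy as np

def invert_softmax(p, eps=1e-12):
    """Recover proxy logits from verbalized probabilities: log p equals the
    true logits up to an additive constant, so class ordering is preserved."""
    return np.log(np.clip(p, eps, 1.0))

def temperature_scale(p, T):
    """TS in proxy-logit space: invert-softmax, divide by T, re-normalize."""
    z = invert_softmax(p) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(P, y, grid=np.linspace(0.25, 4.0, 64)):
    """Grid-search T on a held-out set by minimizing negative log-likelihood.

    P: (N, K) array of verbalized distributions; y: (N,) integer labels.
    """
    def nll(T):
        q = temperature_scale(P, T)
        return -np.mean(np.log(q[np.arange(len(y)), y] + 1e-12))
    return min(grid, key=nll)
```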
Empirical findings indicate that invert-softmax followed by TS yields optimal calibration (sharpened, more realistic confidence curves), reduces ECE by 30–50%, and aligns average confidence with accuracy across datasets of differing class cardinality (e.g., IMDB, K=2: ECE 3.7%→3.1%; MASSIVE, K=60: ECE 14.5%→7.2%) (Wang et al., 9 Oct 2024).
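For reference, ECE can be computed with equal-width confidence bins as the bin-weighted mean gap between accuracy and confidence; binning details (e.g., number of bins) may differ from those used in the cited evaluations.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin weight) * |acc - conf|.

    confidences: top-class probabilities in [0, 1]; correct: 0/1 indicators.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:  # place exact zeros in the first bin
            mask |= confidences == 0.0
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```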
3. Representation Formats: Scalars, Categorical Labels, and Fuzzy Sets
Verbalized probability distributions exist both as explicit numeric vectors and as mappings from natural language categories:
- Numeric scores (LLMs): Probabilities can be supplied as real numbers (0–1), percentages (0–100%), or probability bins. Prompt template selection and response format substantially control calibration quality (Yang et al., 19 Dec 2024).
- Categorical/verbal labels (Humans & LLMs): Human studies represent subjective uncertainty as a family of fuzzy sets $\{\tilde{A}_i\}$ over $[0,1]$, each corresponding to a verbal label (“likely,” “unlikely,” etc.), typically modeled as trapezoidal membership functions. For a subject, the elicitation/calibration protocol estimates the four-parameter support $(a_i, b_i, c_i, d_i)$ of each $\tilde{A}_i$, then uses nearest-label or mixture decoding for direct mapping (Zimmer, 2013); see the trapezoid sketch below.
- Possibility distributions: In possibility theory, a subject’s (or model’s) uncertainty is represented as a possibility density $\pi: [0,1] \to [0,1]$, yielding a fuzzy analogue of a probability distribution.
Empirical work demonstrates pronounced variance and asymmetry in interpretation for verbal probability phrases, with only extreme endpoints (e.g., “always,” “never,” “certain,” “impossible”) being consistently mapped across individuals and contexts (Willems et al., 2019).
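A minimal sketch of trapezoidal membership functions and nearest-label decoding; the four-parameter supports below are illustrative placeholders, not the empirically calibrated values from Zimmer (2013).

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 1 on [b, c], linear ramps on [a, b] and [c, d]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

# Illustrative four-parameter supports (NOT empirically calibrated values).
LABELS = {
    "unlikely": (0.00, 0.05, 0.20, 0.35),
    "possible": (0.25, 0.40, 0.60, 0.75),
    "likely":   (0.60, 0.75, 0.90, 0.97),
    "certain":  (0.90, 0.98, 1.00, 1.00),
}

def nearest_label(p):
    """Decode a numeric probability to the verbal label of maximal membership."""
    return max(LABELS, key=lambda name: trapezoid(p, *LABELS[name]))
```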
4. Verbalized Probability Distributions in LLMs: Confidence, Knowledge Boundaries, and Diversity
LLMs can be prompted to output their own uncertainty estimates, either as class distributions, scalar confidences, or confidence buckets:
- Verbalized Confidence Scores: Prompt engineering (e.g., explicitly requesting “probability that your answer is correct”) can robustly elicit scalar confidences, with advanced instructions and few-shot examples further reducing calibration error for larger models to ECE ≈0.07 (Yang et al., 19 Dec 2024).
- Knowledge Boundary Estimation: Probabilistic (token-wise log-probabilities) and verbalized (natural-language self-report) confidence estimates diverge: probabilistic signals are quantitatively superior (better alignment, lower overconfidence), but verbalized outputs are more direct and require no in-domain threshold tuning. Fine-grained verbal scales (e.g., 3–5 point confidence buckets) marginally improve calibration but remain less granular than logit-derived probabilities (Ni et al., 19 Aug 2024).
- Diversity and Generative Distribution: Verbalized Sampling (VS) prompts LLMs to output a full distribution over responses (e.g., list of candidate poems, each with a verbalized probability), enabling inference-time recovery of the model’s pre-alignment generative diversity and mitigating mode collapse (Zhang et al., 1 Oct 2025).
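A minimal sketch of the Verbalized Sampling idea, again assuming the hypothetical `query_llm` helper from above; the prompt and response schema are illustrative. Sampling from the stated distribution, rather than greedily taking the highest-probability candidate, is what restores output diversity.

```python
import json
import random

def verbalized_sample(task, query_llm, k=5, rng=random):
    """Elicit k candidates with verbalized probabilities, then sample one.

    `query_llm` is the same hypothetical black-box helper as above;
    the JSON schema is illustrative.
    """
    prompt = (
        f"{task}\n"
        f"Generate {k} candidate responses. Return a JSON list of objects "
        'with keys "text" and "probability"; the probabilities should sum to 1.'
    )
    candidates = json.loads(query_llm(prompt))
    texts = [c["text"] for c in candidates]
    weights = [max(float(c["probability"]), 0.0) for c in candidates]
    # Sampling (instead of taking the argmax) is what recovers diversity.
    return rng.choices(texts, weights=weights, k=1)[0]
```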
| Format | Source Type | Distribution Representation |
|---|---|---|
| Categorical probs | LLM | Vector $\hat{p} = (\hat{p}_1, \dots, \hat{p}_K)$ over $K$ classes |
| Scalar confidence | LLM/Human | $c \in [0,1]$ or percent |
| Verbal category | Human/LLM | Fuzzy set / bin (probability interval) |
5. Verbalized Probabilistic Graphical Modeling and Compositional Reasoning
Verbalized probability distributions also appear as core primitives in higher-order reasoning settings:
- Natural-language Bayesian networks: Latent factors (e.g., “Relevance,” “Knowledge Quality”) and observed evidence are formalized as nodes, with prior and conditional distributions elicited stepwise in natural language, then numerically combined to obtain marginal or conditional distributions (Huang et al., 8 Jun 2024).
- Sampling and Simulation: Verbalized Rejection Sampling operationalizes classic sampling schemes (e.g., for Bernoulli variables) in natural language, allowing LLMs to simulate stochastic procedures and output faithful samples according to the target distribution, empirically reducing bias by more than 50% in calibrated metrics (Xiao et al., 11 Jun 2025).
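A numeric analogue of rejection sampling for a Bernoulli target from a Bernoulli proposal; the cited work carries out the corresponding steps in natural language with the LLM acting as the sampler, so this sketch shows only the underlying accept/reject logic.

```python
import random

def rejection_sample_bernoulli(p_target, q_proposal, rng=random):
    """One Bernoulli(p_target) draw via Bernoulli(q_proposal) proposals.

    Accept x ~ q with probability p(x) / (M * q(x)), where
    M = max_x p(x)/q(x) makes every acceptance ratio <= 1.
    Requires 0 < q_proposal < 1.
    """
    M = max(p_target / q_proposal, (1 - p_target) / (1 - q_proposal))
    while True:
        x = 1 if rng.random() < q_proposal else 0
        p_x = p_target if x == 1 else 1.0 - p_target
        q_x = q_proposal if x == 1 else 1.0 - q_proposal
        if rng.random() < p_x / (M * q_x):
            return x
```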
6. Human Interpretation: Variability and Fuzzy Partitioning
Human use of verbal probability expressions is characterized by substantial excess variance and asymmetry:
- Population studies show that for most verbal phrases, interquartile ranges span 30–40 percentage points, except at extreme ends (e.g., “always,” “never”). Non-complementary mapping of opposites (e.g., “likely” + “unlikely” ≠ 1) is universal (Willems et al., 2019).
- Sequential calibration of an individual’s phrase set via empirical tasks allows the construction of a personalized fuzzy partition of $[0,1]$, forming a pseudo-probability density that captures subject-specific semantics (Zimmer, 2013).
- Fuzzy arithmetic and possibility theory support the propagation and combination of these verbalized distributions through logical conjunctions and hypothesis chains, making the entire reasoning process transparent and psychometrically well-founded.
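As a sketch of the conjunction step, possibility densities can be combined by pointwise minimum over a discretized $[0,1]$ grid; this reuses the `trapezoid` helper from the Section 3 sketch and treats min-conjunction as the standard non-interactive combination rule of possibility theory, which may differ in detail from the cited formulations.

```python
def conjunction(pi1, pi2, n=101):
    """Pointwise-min conjunction of two possibility densities on [0, 1].

    pi1, pi2: callables mapping a probability value to a possibility degree.
    Returns the grid and the combined density sampled on that grid.
    """
    grid = [i / (n - 1) for i in range(n)]
    return grid, [min(pi1(x), pi2(x)) for x in grid]

# e.g., intersecting "likely" with "possible" via the trapezoids above:
# grid, pi = conjunction(lambda x: trapezoid(x, 0.60, 0.75, 0.90, 0.97),
#                        lambda x: trapezoid(x, 0.25, 0.40, 0.60, 0.75))
```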
7. Limitations, Current Challenges, and Future Directions
A number of open challenges remain:
- Calibration under Shift: Both LLMs and humans exhibit degradation in calibration when subject to content or domain shift (e.g., from math to history, or from unique-answer to multi-answer questions). Calibration curves become less reliable, and existing mappings may need retraining or fine-tuning (Lin et al., 2022).
- Mapping Granularity: Current mapping schemes (discrete bins, scalar percentages) sacrifice expressiveness for ease of parsing; more sophisticated, smoother mappings (e.g., via scoring-rule-based RLHF) are underexplored (Lin et al., 2022, Yang et al., 19 Dec 2024).
- Inter-subject and Contextual Variability: Individual and context-driven variability constrains the reliability of pure verbal expressions. Mixed verbal-numeric reporting, or explicit calibration of verbal scales, is recommended for high-stakes applications (Willems et al., 2019).
- Prompt Optimization: Automated search over prompt templates and direct model fine-tuning or RLHF for verbalized calibration are identified as promising directions to close the gap between white-box (token-probability) and black-box (verbalized) uncertainty reporting (Yang et al., 19 Dec 2024).
In summary, verbalized probability distributions provide a unified methodological and analytical substrate for quantifying, communicating, and post-processing uncertainty in both human and machine reasoning. Advances in prompt engineering, calibration strategies, and formalization continue to increase the practical utility and reliability of these constructs in LLM-based systems and human-machine interfaces (Wang et al., 9 Oct 2024, Yang et al., 19 Dec 2024, Lin et al., 2022, Xiao et al., 11 Jun 2025, Huang et al., 8 Jun 2024, Zhang et al., 1 Oct 2025, Zimmer, 2013, Willems et al., 2019, Ni et al., 19 Aug 2024).