Verbalized Confidence Scores in LLMs

Updated 5 March 2026

Verbalized confidence scores are natural language expressions that quantify a model's uncertainty alongside its predictions.
They are calibrated using metrics like ECE and Brier Score to ensure reported confidence aligns with empirical accuracy.
Applications include enhancing decision-making in RAG systems, dialogue agents, and other high-stakes, uncertainty-sensitive environments.

Verbalized confidence scores are scalar or categorical expressions of uncertainty produced in natural language alongside a model’s predictions, typically to communicate the system’s own estimate of probability of correctness. In LLMs and related neural architectures, these scores replace or supplement internal "white-box" uncertainty measures such as logits or entropy with interpretable, user-facing outputs such as "Confidence: 73%". Methods for eliciting, calibrating, and exploiting verbalized confidence have become central for reliable deployment in high-stakes applications, particularly as these scores serve as both an uncertainty quantification mechanism and a user interface element. Calibration, robustness, and grounding of confidence expression—especially under conditions of input noise, adversarial attacks, or strategic risk—are active areas of research.

1. Formulation and Generation of Verbalized Confidence

Verbalized confidence is most commonly implemented as a numeric probability (e.g., "I am 80% confident that...") or as a verbal descriptor mapped to a probability (e.g., "Highly likely" → 0.9) (Yang et al., 2024, Tian et al., 2023). In contemporary LLMs and agentic pipelines, the standard pattern is to prompt the model to output its answer and, in a distinct field, its confidence. Examples of prompt protocols include:

Closed-form scalar: "Confidence: [0-100]%"
Probability distribution: "Probability that each choice is correct: {A: 0.6, B: 0.3, ...}"
Linguistic descriptions: "How likely is it? Options: Almost certain, Highly likely, ..." (with mappings to [0,1])

Both open-weight and closed-weight models (e.g., Llama, GPT-4) support natural-language confidence generation via templated prompts (Yang et al., 2024, Sun et al., 2024, Ou et al., 27 Oct 2025).

In settings such as retrieval-augmented generation (RAG), verbalized confidence is defined as a model-predicted scalar $\hat c \in [0,1]$ jointly output with answer $\hat a$ , i.e., $(\hat a, \hat c) = f_\theta(q, P)$ , where $q$ is a query and $P$ contains evidence passages (Liu et al., 16 Jan 2026). In multi-turn or agentic workflows, confidence is usually recorded after all reasoning and tool use—a single final number reflecting cumulative epistemic uncertainty (Ou et al., 27 Oct 2025).

For classification or dialogue tasks, verbalized confidence often appears as a sub-token or explicit number in the output JSON or text blob; post-parsing yields the usable score (Sun et al., 2024). In all cases, the reported confidence is model-generated and not an internal probability unless specifically calibrated.

2. Calibration: Metrics, Losses, and Empirical Findings

Calibration asks that, over all instances where the model says "X% confident," empirical accuracy matches $X\%$ (Tian et al., 2023, Yang et al., 2024). The main evaluation metrics are:

Expected Calibration Error (ECE):

$\mathrm{ECE} = \sum_{m=1}^M \frac{|B_m|}{N} |\mathrm{acc}(B_m) - \mathrm{conf}(B_m)|$

where $B_m$ is a confidence bin, $\mathrm{acc}(B_m)$ the empirical accuracy, and $\mathrm{conf}(B_m)$ the mean model-reported confidence in $\hat a$ 0.

Brier Score:

$\hat a$ 1

proper for binary and multiclass calibration (Li et al., 26 Aug 2025, Ou et al., 27 Oct 2025, Wang et al., 2024).

Discrimination (AUROC):

The area under the ROC curve of confidence as a predictor of correctness.

In practice, vanilla prompting of LLMs yields significant overconfidence, especially on difficult tasks or under irrelevant/contradictory context (Liu et al., 16 Jan 2026, Yang et al., 2024). For example, off-the-shelf LLMs show ECE values of $\hat a$ 2 in noisy retrieval, with calibrated frameworks such as NAACL (Liu et al., 16 Jan 2026) reducing ECE by $\hat a$ 3 absolute. Standard prompt methods (e.g., "confidence: [0-100]%") are sensitive to wording, with more explicit instructions and few-shot examples yielding better calibration for large models (ECE $\hat a$ 4 in best cases) (Yang et al., 2024).

Calibration can be improved by:

Proper scoring rule-based fine-tuning (e.g., tokenized Brier loss in ConfTuner (Li et al., 26 Aug 2025))
Temperature scaling, typically applied post-hoc to verbalized distributions (Wang et al., 2024)
Contrastive objectives enforcing answer-conditioning (ADVICE (Seo et al., 13 Oct 2025))
Synthetic noise injection or distractor normalization (Liu et al., 16 Jan 2026, Wang et al., 29 Sep 2025)
Reinforcement learning for long-form generative calibration (LoVeC (Zhang et al., 29 May 2025))

Well-calibrated models facilitate selective abstention, cascaded fallback, and downstream uncertainty-aware decision making, although most state-of-the-art LLMs remain overconfident in absolute terms (Li et al., 26 Aug 2025, Seo et al., 13 Oct 2025, Wang et al., 12 Jan 2026).

3. Methodologies for Enhancing Verbal Confidence

Recent research has developed a variety of fine-tuning and inference-time strategies to improve verbalized confidence calibration and utility.

Supervised Fine-Tuning with Proper Scoring Rules: ConfTuner introduces a tokenized Brier loss, shown to be a proper scoring rule for discrete verbalized tokens, incentivizing truthful self-estimation (Li et al., 26 Aug 2025).
Noise-Aware Calibration: NAACL defines principled rules (conflict independence, noise invariance, parametric fallback) for how confidences should behave under contradictory, irrelevant, or absent retrievals; these rules guide supervised fine-tuning via Brier loss (Liu et al., 16 Jan 2026).
Distractor-Normalized Coherence (DINCO): To counter suggestibility bias, DINCO normalizes confidence on the main claim by aggregated confidence over self-generated distractors, thus penalizing overconfidence in the absence of unique content (Wang et al., 29 Sep 2025).
Steered Prompting and Consistency Aggregation (SteerConf): By systematic prompt steering (e.g., "be very cautious"/"be very confident") and aggregating confidence-consistency signals, models can mitigate both overconfidence and instability in calibration (Zhou et al., 4 Mar 2025).
Answer-Dependency Enforcement: ADVICE explicitly trains models to ground their confidence in their chosen answer, penalizing answer-independence through contrastive JSD and margin losses. This form of answer-groundedness significantly reduces overconfidence (Seo et al., 13 Oct 2025).
Long-Form Generation with RL: LoVeC applies on-policy and off-policy RL algorithms (GRPO, DPO) to train LLMs to append confidence tags per sentence, with rewards based on alignment to factuality labels provided by oracle verifiers (Zhang et al., 29 May 2025).

Design choices include format (scalar vs. categorical), prompt sequence (single- or multi-stage), use of chain-of-thought for deeper uncertainty exploration (Podolak et al., 28 May 2025, Wang et al., 18 Nov 2025), and whether confidence spans only the final answer or intermediate steps (Sun et al., 2024, Jang et al., 4 Jun 2025).

4. Robustness, Trustworthiness, and Pitfalls

Verbalized confidence remains vulnerable to several sources of unreliability:

Suggestibility and Surface Mimicry: Models may anchor confidence on prompt structure or input claims, not on genuine content evidence—a phenomenon exacerbated by overexposure to generic "confidence-cue" expressions during training (TracVC (Xia et al., 15 Jan 2026)). Larger models can overfit stylistic certainty rather than content-grounded uncertainty.
Adversarial Manipulation: Targeted perturbations (token substitutions, character-level bugs, prompt triggers) can significantly alter reported confidence without meaningfully changing the question, undermining reliability. Existing defense mechanisms such as perplexity filters, guardrails, and paraphrasing are largely ineffective (Obadinma et al., 9 Jul 2025).
Calibration Saturation and Overconfidence: Standard verbalized confidence methods often collapse to coarse, saturated predictions (e.g., 0.9 or 1.0), limiting utility as ranking signals or for thresholding (Wang et al., 29 Sep 2025). Methods like DINCO and multi-option verbalized probability distributions (Wang et al., 18 Nov 2025) increase granularity and allow finer thresholding.
Disconnect from Strategic Decision-Making: Even with well-calibrated verbalized uncertainty, LLMs may not leverage their own confidence to adjust policies under task-specific risk, such as abstaining when error costs rise—highlighted by large utility/regret gaps in risk-sensitive evaluation frameworks (RiskEval (Wang et al., 12 Jan 2026)).

Recommendations to address these pitfalls include adversarially-informed calibration objectives, content-grounded fine-tuning, hybrid elicitation combining multiple uncertainty signals, and explicit abstention or selective prediction heads (Obadinma et al., 9 Jul 2025, Xia et al., 15 Jan 2026, Wang et al., 12 Jan 2026).

5. Applications and Practical Considerations

Verbalized confidence is directly integrated into a spectrum of statistical, interactive, and multi-agent applications:

Information Retrieval and RAG Systems: Enables fall-back to parametric knowledge under noisy retrieval (NAACL (Liu et al., 16 Jan 2026)); mitigates overconfidence driven by retrieval noise.
Dialogue State Tracking and Conversational Agents: Used to flag uncertain slot-value assignments and trigger user confirmation or fallback (Sun et al., 2024).
Tool-Using and Agentic LLMs: Facilitates dynamic compute allocation and low-overhead test-time scaling in multi-turn scenarios (BrowseConf (Ou et al., 27 Oct 2025)).
Cascade Architectures: Powers automatic rejection routing, self-correction loops, or cost-sensitive fallback to stronger models, improving both performance and efficiency (Li et al., 26 Aug 2025).
Long-Form Content Generation: Sentence-level confidence tags support fine-grained hallucination detection and credibility assessment (Zhang et al., 29 May 2025).
ASR Systems: Word-level confidence from models such as Whisper enables robust downstream error filtering in speech-to-text (Aggarwal et al., 19 Feb 2025).

Implementers should attend to prompt design (explicit numerical probability preferred over linguistic gradations), enforce consistent answer/confidence formats, and apply reliability metrics (ECE, Brier, AUROC) on deployment distributions. Fine-tuned calibration (e.g., via ConfTuner) offers significant gains for critical deployments at low resource cost (Li et al., 26 Aug 2025).

6. Limitations, Open Problems, and Future Directions

Despite substantial progress, significant limitations and research directions remain:

Generalization to Open-Ended and Long-Form Tasks: Most results are for closed-form or short-answer scenarios; transfer to creative or subjective settings is not well-understood (Yang et al., 2024, Zhang et al., 29 May 2025).
Prompt Fragility and Format Adherence: LLMs frequently produce extraneous text, incorrect formats, or coarsely discretized values; supervised fine-tuning or prompt anchoring may mitigate these effects (Wang et al., 2024).
Calibration of Abstention and Action Policies: Current models’ confidence is not directly actionable for selective abstention under dynamic risk, necessitating joint training of uncertainty and policy heads (Wang et al., 12 Jan 2026).
Content Grounding: Increased focus is needed on aligning confidence generation mechanisms with content evidence rather than superficial stylistics or training frequency artifacts (Xia et al., 15 Jan 2026).
Multi-Model Uncertainty and Agent Collaboration: Integration of confidence signals in multi-agent or consensus settings is largely unexplored; combining verbalized signals with logit-based or entropy-based uncertainty remains an open research topic (Yang et al., 2024).
Adversarial Defenses and Monitoring: No effective adversarial defense exists for pure verbalized confidence; robust UQ may require continuous monitoring and hybridization with internal or probe-based uncertainty estimates (Obadinma et al., 9 Jul 2025).

Future work is likely to focus on richer, compositional calibration objectives, dynamic prompt adaptation, multi-turn and multi-agent confidence aggregation, and alignment of confidence-driven policies with real-world risk expectations.

References

(Tian et al., 2023): Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from LLMs Fine-Tuned with Human Feedback
(Yang et al., 2024): On Verbalized Confidence Scores for LLMs
(Sun et al., 2024): Confidence Estimation for LLM-Based Dialogue State Tracking
(Li et al., 26 Aug 2025): ConfTuner: Training LLMs to Express Their Confidence Verbally
(Zhou et al., 4 Mar 2025): SteerConf: Steering LLMs for Confidence Elicitation
(Seo et al., 13 Oct 2025): ADVICE: Answer-Dependent Verbalized Confidence Estimation
(Liu et al., 16 Jan 2026): NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems
(Zhang et al., 29 May 2025): Reinforcement Learning for Better Verbalized Confidence in Long-Form Generation
(Xia et al., 15 Jan 2026): Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
(Wang et al., 29 Sep 2025): Calibrating Verbalized Confidence with Self-Generated Distractors
(Wang et al., 18 Nov 2025): Don't Miss the Forest for the Trees: In-Depth Confidence Estimation for LLMs via Reasoning over the Answer Space
(Obadinma et al., 9 Jul 2025): On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks
(Aggarwal et al., 19 Feb 2025): Adopting Whisper for Confidence Estimation
(Ou et al., 27 Oct 2025): BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
(Podolak et al., 28 May 2025): Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs
(Wang et al., 12 Jan 2026): Are LLM Decisions Faithful to Verbal Confidence?
(Wang et al., 2024): Calibrating Verbalized Probabilities for LLMs
(Jang et al., 4 Jun 2025): Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision