Verbalized Confidence in AI Models

Updated 3 July 2026

Verbalized confidence is a natural-language estimate of an AI’s self-assessed reliability expressed as numeric or ordinal scores.
It plays a critical role in calibration by aligning predicted confidence with empirical accuracy across various models and modalities.
Research focuses on improving calibration through methods like supervised fine-tuning, proper scoring rules, and mechanistic activation steering.

Verbalized confidence is the explicit, natural-language estimate of an AI model’s own reliability, output as a probability or ordinal score alongside its main answer. It serves as a user-facing signal of uncertainty and is increasingly central to the deployment, evaluation, and self-assessment capabilities of large language models and vision-language models. Verbalized confidence spans use cases from single-turn question answering to long-form generation, and from multi-turn web agents to multimodal systems, with calibration and interpretability being core challenges.

1. Definitions, Elicitation, and Formalization

Verbalized confidence is defined as a model’s natural-language or structured output estimating its subjective probability that its answer is correct. Prompts range from direct numerical queries—“I am X% confident…” (Ou et al., 27 Oct 2025, Li et al., 26 Aug 2025)—to ordinal or categorical spans (e.g., “certain”/“uncertain” (Ni et al., 2024, Ding et al., 26 Aug 2025)), and, in some cases, to word- or token-level scores in tasks such as machine translation (Marashian et al., 15 Jun 2026). Verbalized confidence may be elicited as:

Numeric scores on an integer or percentage scale (commonly [0,100], but with evidence that [0,20] improves metacognitive efficiency (Dai, 10 Mar 2026)).
Qualitative scales (“not certain” to “very certain”) mapped onto numeric intervals (Marashian et al., 15 Jun 2026).
Distributions over multiple answer candidates (verbalized probability distributions) (Wang et al., 18 Nov 2025).
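As a rough illustration of the first two elicitation formats, the sketch below parses a percentage such as "I am 80% confident" from a response and, failing that, maps a qualitative label onto the midpoint of a numeric interval. The label set, the interval boundaries, and the function name `parse_confidence` are assumptions made for this example and are not taken from any of the cited papers.

```python
import re

# Hypothetical ordinal labels and numeric intervals (assumed for illustration).
ORDINAL_INTERVALS = {
    "not certain": (0.0, 0.25),
    "somewhat certain": (0.25, 0.5),
    "fairly certain": (0.5, 0.75),
    "very certain": (0.75, 1.0),
}

# Matches the first percentage in the text, e.g. "80%" or "72.5 %".
_PERCENT = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")


def parse_confidence(text):
    """Return a confidence in [0, 1] parsed from a model response, or None."""
    match = _PERCENT.search(text)
    if match:
        value = float(match.group(1)) / 100.0
        return min(max(value, 0.0), 1.0)  # clamp out-of-range values
    lowered = text.lower()
    # Try longer labels first so a short label cannot shadow a longer one.
    for label in sorted(ORDINAL_INTERVALS, key=len, reverse=True):
        if label in lowered:
            low, high = ORDINAL_INTERVALS[label]
            return (low + high) / 2.0  # interval midpoint
    return None
```

Returning `None` when nothing can be parsed keeps unparseable responses visible to the caller instead of silently assigning them a default confidence.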

In vision-language or multimodal contexts, verbalized confidence typically refers to binary labels (e.g., “certain”/“uncertain”), explicit probability estimates, or numeric scores accompanying object-centric or holistic answers (Zhao et al., 21 Apr 2025, Ding et al., 26 Aug 2025, Xuan et al., 26 May 2025). Elicitation strategies include augmenting answer prompts, two-step self-assessment, or more elaborate multi-stage pipelines (e.g., Visual Confidence-Aware Prompting (Xuan et al., 26 May 2025)).

2. Calibration, Metrics, and Empirical Observations

Calibration refers to the alignment between the stated confidence and empirical correctness: for example, if a model claims to be 80% confident on a set of queries, it should be correct about 80% of the time. Standard evaluation metrics for verbalized confidence include:

Expected Calibration Error (ECE):

$\mathrm{ECE} = \sum_{m=1}^M \frac{|B_m|}{N}|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)|$

where $B_m$ are bins by predicted confidence, and $\mathrm{acc}(B_m)$, $\mathrm{conf}(B_m)$ are bin accuracy and average confidence, respectively (Li et al., 26 Aug 2025, Ou et al., 27 Oct 2025, Wang et al., 18 Nov 2025, Xuan et al., 26 May 2025, Miao et al., 26 Mar 2026, Zong et al., 28 Oct 2025, Li et al., 12 May 2026).

Brier Score: quadratic penalty between predicted confidence and actual correctness (Li et al., 26 Aug 2025, Seo et al., 13 Oct 2025, Zhao et al., 21 Apr 2025, Wang et al., 18 Nov 2025).
Area Under the ROC Curve (AUROC): discrimination between correct and incorrect answers based on confidence (Wang et al., 18 Nov 2025, Ou et al., 27 Oct 2025, Li et al., 26 Aug 2025).
Metacognitive efficiency (meta-$d'$): compares type-2 signal detection sensitivity (how well confidence separates correct from incorrect answers) to type-1 accuracy (Dai, 10 Mar 2026); higher values indicate more informative verbalized confidence.
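The first three metrics above can be computed directly from paired lists of stated confidences (in [0, 1]) and binary correctness labels. The following is a minimal sketch, assuming equal-width confidence bins for ECE and the pairwise-ranking formulation of AUROC with ties counted as one half; the function names and the default of ten bins are choices made for this example, and individual papers may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of |B_m|/N * |acc(B_m) - conf(B_m)|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between stated confidence and correctness."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)


def auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than an incorrect one."""
    positives = [c for c, h in zip(confidences, correct) if h]
    negatives = [c for c, h in zip(confidences, correct) if not h]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))
```

ECE and the Brier score are sensitive to the absolute level of the stated confidence, whereas AUROC depends only on how confidence ranks correct answers against incorrect ones, so a model can score well on one and poorly on the other.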

Other relevant metrics include alignment (fraction of cases in which stated confidence matches correctness), overconfidence (rate of incorrect yet “confident” predictions), and conservativeness (rate of correct yet “uncertain” outputs) (Ni et al., 2024, Ding et al., 26 Aug 2025).
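For binary “certain”/“uncertain” outputs, these three rates reduce to simple counts. The sketch below assumes that all three are normalized by the total number of examples, so that they sum to one; the cited works may use different denominators, and the function name is hypothetical.

```python
def binary_confidence_rates(confident, correct):
    """Compute alignment, overconfidence, and conservativeness rates.

    confident: iterable of booleans, True if the model said it was certain.
    correct:   iterable of booleans, True if the answer was right.
    All rates are fractions of the total number of examples (an assumption).
    """
    pairs = [(bool(c), bool(h)) for c, h in zip(confident, correct)]
    n = len(pairs)
    aligned = sum(1 for c, h in pairs if c == h)          # certain+right or uncertain+wrong
    overconfident = sum(1 for c, h in pairs if c and not h)  # certain but wrong
    conservative = sum(1 for c, h in pairs if h and not c)   # right but uncertain
    return {
        "alignment": aligned / n,
        "overconfidence": overconfident / n,
        "conservativeness": conservative / n,
    }
```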

Table: Representative Calibration Results

Model/Domain	Baseline ECE	Calibrated ECE	Discrimination (AUROC unless noted)	Ref.
LLaMA-3.1-8B QA	0.4803	0.0405	0.6884 → 0.7383	(Li et al., 26 Aug 2025)
Qwen-VL (VQA, POPE)	0.5699	0.4225	AUC up +0.05–0.10	(Zhao et al., 21 Apr 2025)
Mistral-7B QA	0.6767	0.1027	0.5198 → 0.7907	(Li et al., 26 Aug 2025)
Llama-3-8B (MMLU)	0.170	0.025 (ORCE)	0.034 → 0.477 (Spearman)	(Li et al., 12 May 2026)

Verbalized confidence is almost universally overconfident in the absence of explicit calibration. For instance, on multi-turn web search, the highest self-reported confidence bins may have actual accuracies of only 53–60% (Ou et al., 27 Oct 2025). Overconfidence is especially acute for binary or discretized scales and in vanilla settings (Ni et al., 2024, Ding et al., 26 Aug 2025, Marashian et al., 15 Jun 2026).

3. Mechanistic Analysis and Causes of Miscalibration

Multiple studies dissect the internal mechanisms underlying verbalized confidence and its misalignment with correctness:

Reasoning–Confidence Decoupling: Mechanistic interpretability analyses reveal that the internal direction encoding true accuracy is nearly orthogonal to that used for verbalized confidence (Miao et al., 26 Mar 2026). Prompting for confidence during or after reasoning (chain-of-thought) can induce a “Reasoning Contamination Effect,” where the act of reasoning further disconnects the verbalized confidence from the model’s internal uncertainty.
Overconfidence Circuits: Causal tracing identifies compact “confidence-inflation” circuits in mid–late transformer layers whose activations specifically bias the model toward reporting high confidence, especially on incorrect answers (Zhao et al., 1 Apr 2026). Targeted ablation or steering of these modules substantially mitigates overconfidence, reducing ECE by 85–97% in some cases.
Answer-Independence: LLMs often verbalize confidence independently of their generated answer (P(C | q, a) ≈ P(C | q)), leading to confidence scores that do not meaningfully differentiate between correct and incorrect responses (Seo et al., 13 Oct 2025).
Token/Anchor Bias: Confidence reporting is strongly shaped by frequent anchor tokens (e.g., 50, 70, 100), with 78–92% of responses clustering on these values under a [0,100] scale (Dai, 10 Mar 2026). Compressing or shifting the confidence scale alters metacognitive efficiency, but bias towards certain round numbers persists even under non-standard or irregular ranges.
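Anchor clustering of the kind described in the last item can be checked on any set of collected confidence scores. The sketch below measures the share of integer scores that fall on a chosen anchor set; the default anchors simply reuse the example values 50, 70, and 100 from the text, and both function names are hypothetical.

```python
from collections import Counter


def anchor_share(scores, anchors=(50, 70, 100)):
    """Fraction of integer confidence scores that land exactly on an anchor value."""
    counts = Counter(scores)
    on_anchor = sum(counts[a] for a in anchors)  # missing anchors count as zero
    return on_anchor / len(scores)


def top_values(scores, k=3):
    """The k most frequently reported confidence values with their counts."""
    return Counter(scores).most_common(k)
```

A high anchor share suggests that the reported numbers carry less resolution than the nominal scale implies, which is worth knowing before binning them for a calibration metric.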

4. Methodologies for Improving Calibration

Several approaches have been developed to address verbalized confidence miscalibration:

Supervised Fine-Tuning (SFT): Training with ground-truth or proxy confidence labels, sometimes derived from self-consistency across sampled generations, effectively improves calibration for single- and multi-turn tasks (Li et al., 26 Aug 2025, Jang et al., 4 Jun 2025, Seo et al., 13 Oct 2025).
Proper Scoring Rules: ConfTuner introduces the tokenized Brier score, a discrete proper scoring rule on the set of output confidence tokens, with strong theoretical and empirical calibration properties (Li et al., 26 Aug 2025).
Order-Aware Reinforcement Learning: ORCE decouples answer and confidence generation and optimizes rank-based objectives (e.g., Spearman correlation between confidence and estimated correctness) via direct preference optimization, achieving state-of-the-art calibration and failure prediction without degrading task accuracy (Li et al., 12 May 2026).
Distributional Reasoning: Eliciting verbalized confidence as a full probability distribution over alternatives (instead of a single score) improves calibration, depth of reasoning, and alignment with human uncertainty analysis, particularly on complex or ambiguous tasks (Wang et al., 18 Nov 2025).
Critique-Based Training: CritiCal fine-tunes models on teacher-generated natural language critiques of confidence, surpassing both self-critique baselines and zero-shot methods, and even its own teacher model (GPT-4o) in complex tasks (Zong et al., 28 Oct 2025).
Distractor-Normalized Coherence (DiNCo): By normalizing verbalized confidence across mutually exclusive distractors and blending this with self-consistency, DiNCo delivers finer-grained, less saturated, and better-calibrated confidence estimates, outperforming both vanilla and sampling-based baselines (Wang et al., 29 Sep 2025).
Activation Steering: Mechanistic steering of transformer activations (e.g., at confidence-reporting tokens) can bring verbalized confidence into tight alignment with internal accuracy estimates, reducing calibration error by orders of magnitude without retraining (Miao et al., 26 Mar 2026, Zhao et al., 1 Apr 2026).
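Two of the ingredients mentioned above are simple enough to sketch without a model: deriving a proxy confidence label from agreement among sampled generations, and rescaling verbalized confidences over mutually exclusive candidates so that they sum to one. The code below is only an illustration of those two ideas under simplifying assumptions; it is not the training recipe of any cited SFT method, nor the DiNCo algorithm itself, and the function names are hypothetical.

```python
from collections import Counter


def self_consistency_confidence(sampled_answers):
    """Return the majority answer and the fraction of samples that agree with it.

    The agreement rate can serve as a proxy confidence label when
    ground-truth confidence targets are unavailable.
    """
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def normalize_over_candidates(raw_confidences):
    """Rescale confidences for mutually exclusive candidates to sum to 1.

    raw_confidences: dict mapping each candidate answer to the confidence
    the model verbalized for it when asked about that candidate alone.
    """
    total = sum(raw_confidences.values())
    if total <= 0:
        # No usable signal: fall back to a uniform distribution.
        return {cand: 1.0 / len(raw_confidences) for cand in raw_confidences}
    return {cand: value / total for cand, value in raw_confidences.items()}
```

The normalization step makes saturation visible: if a model reports 0.9 for one candidate but also 0.6 and 0.3 for two incompatible alternatives, the rescaled confidence for the first candidate is only about 0.5.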

5. Empirical Insights Across Domains and Modalities

Verbalized confidence has been systematically evaluated across a range of domains:

Multi-turn and Agentic LLMs: Prompting web agents to express confidence at the end of long action sequences produces signals strongly correlated with answer accuracy. Test-time scaling policies leveraging this confidence can drastically reduce compute and token usage while maintaining accuracy (Ou et al., 27 Oct 2025).
Vision-Language Models (VLMs): Verbalized confidence in VLMs and LVLMs is generally more overconfident and less reliable than probabilistic or consistency-based confidence. Calibration improves with reasoning-centric architectures or explicit semantic perturbation during training (Zhao et al., 21 Apr 2025, Xuan et al., 26 May 2025, Ding et al., 26 Aug 2025).
Translation: Word- and token-level verbalized confidence performs comparably to internal probability-based metrics on fine-grained error detection and calibration, but there is little correlation between internal and verbalized signals—indicating that self-assessed correctness is partly independent from token-level competition (Marashian et al., 15 Jun 2026).
Self-Verification: Fine-tuning with scalar confidence supervision on reasoning tasks triggers emergent behavior where the model generates longer, more cautious chains of thought for low-confidence problems, linking self-reported uncertainty with internal checking dynamics (Jang et al., 4 Jun 2025).
Training Data Provenance: Content-groundedness analyses show that larger models may derive confidence from generic, context-free affirmation data, rather than content-relevant reasoning, especially if pretraining includes many “confident” statements decoupled from underlying facts (Xia et al., 15 Jan 2026).
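The test-time scaling idea from the first item above can be pictured as a confidence-gated retry loop: stop as soon as an attempt reports confidence at or above a threshold, and otherwise keep the most confident attempt within a fixed budget. This is a minimal sketch of that general pattern, not the specific policy of the cited work; `run_attempt`, the threshold of 0.8, and the budget of four attempts are hypothetical.

```python
def confidence_gated_answer(run_attempt, threshold=0.8, max_attempts=4):
    """Retry until an attempt reports confidence >= threshold or the budget runs out.

    run_attempt: callable taking the attempt index and returning
                 (answer, confidence), with confidence in [0, 1].
    Returns (best_answer, best_confidence, attempts_used).
    """
    best_answer, best_conf = None, -1.0
    attempts_used = 0
    for attempt in range(max_attempts):
        answer, conf = run_attempt(attempt)
        attempts_used += 1
        if conf > best_conf:
            best_answer, best_conf = answer, conf  # keep the most confident attempt
        if conf >= threshold:
            break  # confident enough: stop spending compute
    return best_answer, best_conf, attempts_used
```

Because uncalibrated verbalized confidence tends to be inflated, a threshold like this is only meaningful if it is chosen against measured accuracy per confidence level rather than taken at face value.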

6. Limitations and Future Challenges

Persistent limitations include:

Overconfidence and Ceiling Compression: Across architectures and tasks, verbalized confidence is systematically overoptimistic, especially for complex or reasoning-focused workloads (Bhattacharyya et al., 8 May 2026, Seo et al., 13 Oct 2025, Ding et al., 26 Aug 2025).
Weak Correlation with Internal Signals: Empirical correlation between verbalized confidence and internal token probabilities is low (Spearman’s ρ≈0.1–0.2), with many “confidently wrong” outputs (Ni et al., 2024, Marashian et al., 15 Jun 2026, Zhang et al., 12 Dec 2025).
Anchoring and Discretization: Discrete confidence scales and anchor-token preference produce biased and low-sensitivity scores; irregular ranges or increased granularity only partially mitigate this effect (Dai, 10 Mar 2026).
Answer-Independence: Without explicit answer-conditioning objectives, confidence outputs may ignore the predicted answer, severely harming calibration (Seo et al., 13 Oct 2025).
Reliance on Natural Language Prompts: In settings with compositional or free-form prompting, prompt engineering effects can overtake intended calibration improvements; robustness to prompt choice remains underexplored (Wang et al., 18 Nov 2025, Zong et al., 28 Oct 2025).
Model/Task Dependency: Calibration method effectiveness varies with model architecture, training regime, and task type—order-aware or critique-based objectives offer the best robustness but still face edge cases (Zhang et al., 12 Dec 2025, Li et al., 12 May 2026).

Several open directions appear promising:

Hybrid, Model-Aware Calibration: Combining internally aligned and externally calibrated signals (e.g., DiNCo and activation steering) may yield optimal trade-offs.
Dynamic and Multidimensional Self-Assessment: Augmenting confidence with competence- or effort-oriented self-reports outperforms confidence alone, especially for reasoning-intensive tasks (Bhattacharyya et al., 8 May 2026).
Scale and Context Tuning: Treating confidence scale and bin design as tunable parameters in evaluation and deployment contexts is critical for valid uncertainty assessment (Dai, 10 Mar 2026).
Grounded Training: Incentivizing content-grounded confidence—perhaps through data attribution or representational objectives—may reduce superficial overconfidence (Xia et al., 15 Jan 2026, Seo et al., 13 Oct 2025).

7. Practical Recommendations and Deployment Implications

For practitioners implementing systems that rely on model self-assessment:

Use proper scoring rules (e.g., ConfTuner) and decoupled, order-aware objectives (e.g., ORCE) to maximize calibration without sacrificing answer accuracy (Li et al., 26 Aug 2025, Li et al., 12 May 2026).
Apply answer-dependent fine-tuning (e.g., ADVICE) to enforce groundedness of confidence estimates (Seo et al., 13 Oct 2025).
Combine verbalized with internal or sample-based uncertainty metrics (e.g., DiNCo, self-consistency) for safety-critical decisions (Wang et al., 29 Sep 2025, Ou et al., 27 Oct 2025).
When using confidence to drive a compute-allocation policy, apply structured thresholds and test-time scaling, which yield substantial efficiency gains in multi-turn contexts (Ou et al., 27 Oct 2025, Jang et al., 4 Jun 2025).
For vision-language or translation models, supplement verbalized confidence with complementary signals (semantic perturbation for VLMs; error-aware calibration for MT) (Zhao et al., 21 Apr 2025, Marashian et al., 15 Jun 2026).
Prefer coarser, well-tuned scales ([0, 20]) over the standard [0, 100], and always visualize confidence histograms before deploying calibration-sensitive applications (Dai, 10 Mar 2026).
Consider multidimensional self-assessment (e.g., effort, ability, reflection) rather than relying on confidence alone, especially in heterogeneous workloads (Bhattacharyya et al., 8 May 2026).
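The recommendation to inspect confidence histograms before deployment needs no plotting library. The sketch below bins confidences in [0, 1] into equal-width bins and renders a text histogram; the function name, the default of ten bins, and the bar width are choices made for this example.

```python
def confidence_histogram(confidences, n_bins=10, width=40):
    """Return per-bin counts and a text rendering of the confidence distribution."""
    counts = [0] * n_bins
    for conf in confidences:
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, count in enumerate(counts):
        # Scale each bar relative to the fullest bin; empty input draws no bars.
        bar = "#" * (round(width * count / peak) if peak else 0)
        lines.append(f"{i / n_bins:.1f}-{(i + 1) / n_bins:.1f} | {bar} {count}")
    return counts, "\n".join(lines)
```

A histogram with almost all mass in the top one or two bins is a quick visual sign of the ceiling compression discussed earlier.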

Overall, while verbalized confidence offers a universally accessible surface-level uncertainty signal, reliable deployment hinges on rigorous calibration, answer grounding, and (where feasible) leveraging internal uncertainty representations in tandem with explicit self-reports.