Verbal Confidence in AI Systems

Updated 4 July 2026

Verbal confidence is the external expression of an AI model’s reliability, providing numeric, percentage, or categorical estimates of output trustworthiness.
It supports mechanisms such as calibration, abstention, and self-correction, yet shows dissociations in metacognitive sensitivity and decision-making alignment.
Multiple elicitation and evaluation methods, including Likert scales and decision-theoretic measures, help improve reliability, calibration, and behavioral utility.

Verbal confidence is the explicit expression of an AI system’s confidence in its own output, typically as a number, percentage, category, or natural-language statement attached to an answer. In the recent literature, it is treated as a black-box-accessible uncertainty signal that can support calibration, abstention, self-correction, model cascade, and trust calibration in settings where token logits or hidden states are unavailable. At the same time, work across LLMs, VLMs, RAG systems, machine translation, and long-form generation shows that verbal confidence is not a unitary property: calibration, answer-grounding, metacognitive sensitivity, robustness, and behavioral faithfulness can come apart sharply, so a model may verbalize uncertainty in a statistically useful way while still failing to use that uncertainty appropriately in decision making (Jang et al., 4 Jun 2025, Dai, 10 Mar 2026, Wang et al., 12 Jan 2026).

1. Definition and conceptual scope

In this literature, verbal confidence is not merely an internal probability proxy but an externally reported estimate of reliability. One influential distinction states that uncertainty concerns the input alone, $p(\cdot \mid q)$, whereas confidence concerns both the input and the generated answer, $p(\cdot \mid q,a)$; under this view, true confidence should be answer-conditioned rather than answer-independent (Seo et al., 13 Oct 2025). This distinction matters because many observed failure modes arise precisely when the model’s verbal confidence does not condition strongly on its own answer.

Several elicitation formats have been studied. In chain-of-thought settings, a model may produce a structured triple

$\langle \text{think} \rangle r \langle / \text{think} \rangle,\quad \langle \text{answer} \rangle a \langle / \text{answer} \rangle,\quad \langle \text{confidence} \rangle c \langle / \text{confidence} \rangle,$

where only the confidence span is supervised (Jang et al., 4 Jun 2025). Other work uses categorical labels, Likert scales, or direct numeric outputs such as integers in $[0,100]$, decimals in $[0,1]$, or six-level qualitative labels mapped to $\{0,0.2,0.4,0.6,0.8,1.0\}$ (Marashian et al., 15 Jun 2026). In machine translation, verbal confidence has even been elicited at word or token granularity rather than only at the whole-answer level, reflecting the fact that confidence-relevant errors can occur at different spans (Marashian et al., 15 Jun 2026).
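The elicitation formats above can be handled by a small parser that extracts the three tagged spans and maps numeric or categorical confidence onto $[0,1]$. The following is a minimal sketch under stated assumptions: the literal `<think>`, `<answer>`, and `<confidence>` tag spellings, the six label names, and the function names are all hypothetical; only the numeric targets $\{0,0.2,\dots,1.0\}$ and the $[0,100]$ and $[0,1]$ ranges come from the text.

```python
import re

# Hypothetical six-level label wording (an assumption); only the numeric
# targets {0, 0.2, 0.4, 0.6, 0.8, 1.0} are taken from the text.
LIKERT_TO_SCORE = {
    "no chance": 0.0,
    "unlikely": 0.2,
    "somewhat unlikely": 0.4,
    "somewhat likely": 0.6,
    "likely": 0.8,
    "certain": 1.0,
}

# Matches <think>...</think>, <answer>...</answer>, <confidence>...</confidence>.
TAG = re.compile(r"<(think|answer|confidence)>(.*?)</\1>", re.DOTALL)


def parse_response(text):
    """Return the think, answer and confidence spans (None when missing)."""
    fields = {"think": None, "answer": None, "confidence": None}
    for name, body in TAG.findall(text):
        fields[name] = body.strip()
    return fields


def normalize_confidence(raw):
    """Map a raw confidence string to [0, 1]; return None if unparseable."""
    if raw is None:
        return None
    raw = raw.strip().lower()
    if raw in LIKERT_TO_SCORE:
        return LIKERT_TO_SCORE[raw]
    is_percent = raw.endswith("%")
    try:
        value = float(raw.rstrip("%"))
    except ValueError:
        return None
    # Values above 1 are read as a 0-100 scale. A bare "1" is ambiguous
    # (1% or 100%); this sketch reads it as 1.0 on the [0, 1] scale.
    if is_percent or value > 1.0:
        value /= 100.0
    return value if 0.0 <= value <= 1.0 else None
```

For example, `normalize_confidence(parse_response(text)["confidence"])` turns a response ending in `<confidence>85</confidence>` into a probability-like score. In practice the scale should be fixed by the prompt rather than guessed from the value, which avoids the ambiguity noted in the comment.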

A recurring theme is that verbal confidence is intended to report correctness likelihood rather than token-sequence likelihood. This distinction is explicit in work contrasting verbal confidence with internal certainty signals such as token probability or entropy: token probabilities can reflect competition among alternative surface forms, while verbal confidence is meant to self-report correctness of the produced answer or translation (Marashian et al., 15 Jun 2026). A plausible implication is that verbal confidence belongs conceptually to metacognitive readout rather than ordinary next-token prediction.

2. Evaluation criteria and psychometric measurement

The dominant evaluation framework has been calibration: whether stated confidence matches empirical correctness. The standard metric is Expected Calibration Error,

$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\left|\mathrm{acc}(B_m)-\mathrm{conf}(B_m)\right|,$

usually with $M=10$ bins, often alongside Brier score and AUROC (Liu et al., 16 Jan 2026). In RiskEval, these are supplemented by abstention rate, answered accuracy, average utility, normalized utility, AUARC, and regret, reflecting a shift from purely descriptive calibration to decision-theoretic evaluation (Wang et al., 12 Jan 2026).
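As an illustration of the metrics named above, the sketch below computes ECE with equal-width bins, the Brier score, and AUROC from paired lists of stated confidence and binary correctness. It is a plain-Python sketch, not the evaluation code of any cited paper; the function names and the choice of right-open equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum_m |B_m|/N * |acc(B_m) - conf(B_m)| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Bin m covers [m/n_bins, (m+1)/n_bins); confidence 1.0 joins the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        conf = sum(c for c, _ in members) / len(members)
        acc = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(acc - conf)
    return ece


def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def auroc(confidences, correct):
    """Probability that a correct item gets higher confidence than an
    incorrect one (ties count one half). None if one class is empty."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A model that answers four items at confidence 0.9 and gets three right has accuracy 0.75 in its only occupied bin, so its ECE is 0.15. The same example also shows why ECE and AUROC can disagree: with a single confidence value, calibration error is well defined but discrimination is at chance.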

A second family of measures evaluates metacognitive sensitivity rather than bin-wise calibration. “Rescaling Confidence” uses meta-$d'$ and metacognitive efficiency,

$M_{\mathrm{ratio}}=\frac{\text{meta-}d'}{d'},$

to ask how well confidence separates correct from incorrect answers independently of raw task difficulty or response bias (Dai, 10 Mar 2026). This is motivated in part by the observation that verbal confidence often becomes heavily discretized, making ECE unstable under concentrated histograms.

A third line of work treats verbal confidence as a psychometric signal that should pass minimal validity screening before any downstream use. In a pre-registered study on 3–9B open-weight instruction-tuned models, a model-condition cell was immediately classified Invalid if the confidence signal had fewer than 3 distinct values or if more than 95% of binarized responses fell in a single category; the same study also defined three diagnostic indices (Cacioli, 24 Apr 2026). This framework treats saturation as a validity failure rather than merely a calibration defect.
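The two screening rules quoted above are simple enough to sketch directly. In the sketch below, the thresholds (3 distinct values, 95%) come from the text, while the binarization cut-off of 0.5, the function name, and the returned fields are assumptions made for illustration; the cited study may binarize differently.

```python
def screen_confidence(confidences, binarize_at=0.5, min_distinct=3, max_share=0.95):
    """Apply two minimal validity checks to one model-condition cell.

    binarize_at is an assumed cut-off: confidence >= binarize_at counts as
    "high", anything lower as "low".
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    distinct = len(set(confidences))
    high = sum(1 for c in confidences if c >= binarize_at)
    share = max(high, len(confidences) - high) / len(confidences)

    reasons = []
    if distinct < min_distinct:
        reasons.append("fewer than %d distinct values" % min_distinct)
    if share > max_share:
        reasons.append("more than %.0f%% of binarized responses in one category"
                       % (100 * max_share))
    return {
        "valid": not reasons,  # True only means these two checks passed
        "distinct_values": distinct,
        "max_category_share": share,
        "reasons": reasons,
    }
```

A cell in which every response is 1.0 fails both checks, which matches the text's point that saturation is treated as a validity failure before calibration is even measured.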

Taken together, these measurement schemes imply that verbal confidence can be assessed at several non-equivalent levels: average calibration, item-level discrimination, ranking quality, abstention utility, and psychometric validity. A system may perform acceptably on one level while failing on another.

3. Faithfulness to correctness, abstention, and action

The most explicit formalization of the confidence-faithfulness problem is RiskEval, which turns confidence into a decision-theoretic abstention test. For each input, a model chooses whether to answer or abstain, reports a confidence, and receives one utility for a correct answer, a penalty for an incorrect answer, and a fixed utility for abstention. Under the model’s own belief about correctness, the Bayes-optimal policy is to answer only when that belief exceeds a threshold determined by these utilities.

If verbal confidence is behaviorally meaningful, increasing the penalty for errors should increase abstention. Instead, frontier models evaluated on HLE, GPQA Diamond, and GSM8K showed a strong dissociation: reported confidence stayed almost flat as the penalty increased, abstention rates were largely insensitive to penalty, and average utility often became strongly negative under high penalties, a pattern described as utility collapse (Wang et al., 12 Jan 2026).

RiskEval therefore argues that calibration alone is insufficient. Its policy-consistency metric measures whether the model’s actual answer/abstain choice matches the Bayes-optimal action implied by its own confidence, while normalized regret measures the decision error in probability space (Wang et al., 12 Jan 2026). Post-hoc application of the Bayes-optimal threshold to the stated confidence consistently improved utility, indicating that the confidence signal was often informative but not behaviorally used by the model.
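To make the decision-theoretic argument concrete, the sketch below works through one assumed parameterization: utility $+1$ for a correct answer, $-\lambda$ for an incorrect answer, and $0$ for abstention. With belief $p$ that the answer is correct, answering has expected utility $p - \lambda(1-p)$, which exceeds $0$ exactly when $p > \lambda/(1+\lambda)$. These utility values, the function names, and the simple agreement rate are illustrative assumptions, not RiskEval's exact definitions.

```python
def bayes_threshold(penalty, reward=1.0, abstain_utility=0.0):
    """Confidence above which answering beats abstaining.

    Answer iff p*reward - (1-p)*penalty > abstain_utility, which rearranges
    to p > (penalty + abstain_utility) / (reward + penalty).
    """
    return (penalty + abstain_utility) / (reward + penalty)


def optimal_action(confidence, penalty):
    """Action implied by the stated confidence under the assumed utilities."""
    return "answer" if confidence > bayes_threshold(penalty) else "abstain"


def policy_agreement(confidences, actions, penalty):
    """Share of items where the actual action equals the implied action."""
    matches = sum(optimal_action(c, penalty) == a
                  for c, a in zip(confidences, actions))
    return matches / len(actions)


def average_utility(actions, correct, penalty):
    """Mean utility: +1 correct answer, -penalty wrong answer, 0 abstain."""
    total = 0.0
    for action, is_correct in zip(actions, correct):
        if action == "answer":
            total += 1.0 if is_correct else -penalty
    return total / len(actions)


# A model that always answers, evaluated under a high penalty.
confidences = [0.95, 0.60, 0.80]
correct = [True, False, False]
always_answer = ["answer"] * 3
post_hoc = [optimal_action(c, penalty=9.0) for c in confidences]
```

With $\lambda = 9$ the threshold is $0.9$, so only the first item should be answered. The always-answer policy agrees with the implied policy on one of three items and averages $(1 - 9 - 9)/3$ utility, whereas thresholding the same stated confidences post hoc averages $1/3$. This mirrors the pattern described in the text: the confidence values carry usable information that the model's own behavior ignores.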

A related but distinct diagnosis is answer-independence. ADVICE tests whether the verbalized confidence distribution actually changes across different candidate answers to the same question, using Jensen–Shannon Divergence between answer-conditioned and answer-marginalized confidence distributions. The reported JSD values were mostly near zero, which the paper interprets as evidence that current verbal confidence is often nearly answer-independent; this answer-independence is presented as a key factor behind overconfidence (Seo et al., 13 Oct 2025).
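The divergence used in this diagnosis can be sketched for discrete confidence levels as follows. The code computes Jensen–Shannon divergence in bits between two histograms over a shared set of confidence levels; how the answer-marginalized distribution is formed (here, nothing more than a second histogram supplied by the caller) and the function names are assumptions, not the ADVICE implementation.

```python
import math
from collections import Counter


def histogram(values, support):
    """Empirical distribution of `values` over the ordered `support`."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in support]


def kl_bits(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd_bits(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)
```

If a model reports the same spread of confidence values whichever answer it is shown, the answer-conditioned and marginal histograms coincide and the divergence is zero, which is the near-zero pattern the text describes.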

Recent work also challenges the assumption that verbal confidence primarily tracks correctness at all. In a two-stage abstention paradigm, verbal confidence predicted the later commit/abstain decision substantially better than whether the answer was correct. Across all 46 verbal-confidence cells, verbal confidence showed a clear average decision-truth gap, while calibrated token log-probabilities showed a near-zero decision-truth gap and stronger correctness discrimination. After residualizing verbal confidence on calibrated log-probabilities, the residual still predicted abstention strongly but its link to correctness fell to near chance, leading to the interpretation that verbal confidence is better understood as a behavior-facing readout of commit-readiness than as a direct correctness estimate (Kumaran, 28 Jun 2026).

These results collectively undercut a common misconception: a model that can produce a numerically calibrated confidence score is not thereby guaranteed to possess confidence-faithful decision policies, answer-grounded confidence, or correctness-tracking self-reports.

4. Internal computation and mechanistic interpretations

Mechanistic work has shifted the discussion from surface elicitation to internal computation. “How do LLMs Compute Verbal Confidence” argues that verbal confidence is not computed just-in-time when the model is asked to rate itself. Instead, answer tokens are read to form a confidence signal, that signal is cached at the post-answer-newline token (PANL), and the confidence-colon token (CC) later retrieves it for verbalization. Attention blocking, activation steering, patching, noising, and activation swap converge on the same pathway: confidence representations emerge at answer-adjacent positions before appearing at the verbalization site, and the resulting signal is not reducible to token log-probabilities (Kumaran et al., 18 Mar 2026).

A second mechanistic result concerns geometry. “Closing the Confidence-Faithfulness Gap” reports that both internal calibration signals and verbalized confidence are linearly encoded, but their probe directions are nearly orthogonal across layers and models. Under pure confidence prompting, these directions show weak positive alignment; under joint solve-and-rate prompting, the relationship can flip strongly negative, a phenomenon named the Reasoning Contamination Effect. The same work introduces a two-stage adaptive steering pipeline that first reads the model’s internal accuracy estimate and then steers a separate confidence-only pass so that verbalized confidence matches that estimate, substantially improving calibration alignment (Miao et al., 26 Mar 2026).

Circuit-level analysis further localizes inflated confidence. “Wired for Overconfidence” introduces Target-Set Logit Difference (TSLD) as a differentiable proxy for preference for high-confidence versus low-confidence outputs and identifies a compact Confidence Mover Circuit concentrated in middle-to-late-layer MLP blocks and attention heads. Across Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct on PopQA, MMLU, and NQOpen, the top 3,000 edges, a small fraction of all edges in either model, captured most of the overconfidence effect, and targeted activation steering or mean ablation on the top-10 components substantially improved ECE and Brier score (Zhao et al., 1 Apr 2026).

At the training-data level, TracVC asks where verbalized confidence comes from. It retrieves content-related and confidence-related training examples, estimates their influence using cosine similarity between gradients, and summarizes relative grounding with a content-over-confidence ratio.

For OLMo2-13B, this ratio was consistently below 1, which the paper interprets as evidence that verbalized confidence is frequently influenced by confidence-related data that is lexically unrelated to the query; by contrast, OLMo2-7B and OLMo-7B were mostly above 1 (Xia et al., 15 Jan 2026). This suggests that some models may learn the rhetoric of certainty more readily than content-grounded conditions for justified certainty.

5. Alignment and training methods

A substantial line of work attempts to improve verbal confidence through fine-tuning, reinforcement learning, or prompt-only scaffolds. Confidence-Supervised Fine-Tuning (CSFT) uses self-consistency-derived labels on GSM8K, supervising only the confidence span while leaving reasoning unlabelled. The target is constructed from multiple sampled generations via their empirical accuracy, which is then discretized into a fixed set of confidence levels. On GSM8K, LLaMA3.2-3B-Instruct improved on ACC, AUROC, ECE, and Brier Score; low-confidence queries also triggered longer, more self-checking chain-of-thought traces, with CSFT generations showing self-verification more often than zero-shot generations (Jang et al., 4 Jun 2025).
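The label construction described for CSFT can be sketched as follows: sample several generations per question, take the fraction that match the reference answer, and snap that fraction to a grid of confidence levels. The number of samples, the 11-level grid, exact string matching, and the function name are all assumptions for illustration, not the paper's settings.

```python
import math


def self_consistency_target(sampled_answers, reference, n_levels=11):
    """Discretized empirical accuracy of the samples, used as a confidence label.

    n_levels=11 gives the grid 0.0, 0.1, ..., 1.0 (an assumed discretization).
    Answers are compared by exact string match, also an assumption.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    accuracy = sum(a == reference for a in sampled_answers) / len(sampled_answers)
    steps = n_levels - 1
    # Round half up; Python's built-in round() rounds halves to even instead.
    level = math.floor(accuracy * steps + 0.5)
    return level / steps


samples = ["12", "12", "12", "15", "12", "12", "12", "12"]
target = self_consistency_target(samples, reference="12")
```

Seven of the eight samples match the reference, so the empirical accuracy is 0.875 and the label snaps to 0.9. Only this label would be supervised; the reasoning and answer tokens are left as the model produced them.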

ConfTuner replaces proxy confidence labels with a tokenized Brier score, which it proves is a proper scoring rule for verbalized confidence. Trained on HotpotQA and evaluated on HotpotQA, TriviaQA, StrategyQA, GSM8K, and TruthfulQA, it reports improvements in both ECE and AUROC, while also improving downstream self-correction and model-cascade behavior (Li et al., 26 Aug 2025).
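The idea of a Brier score defined over confidence tokens, and what it means for such a loss to be a proper scoring rule, can be illustrated with a small sketch. The form used here, the probability-weighted squared error $\sum_i p_i (c_i - y)^2$ over confidence levels $c_i$ with token probabilities $p_i$ and outcome $y \in \{0,1\}$, is one plausible reading and is labeled as an assumption; it is not claimed to be ConfTuner's exact definition.

```python
def tokenized_brier(token_probs, levels, correct):
    """Assumed form: sum_i p_i * (c_i - y)^2 over confidence tokens.

    token_probs[i] is the model's probability of emitting confidence level
    levels[i]; correct is the 0/1 outcome for the answer.
    """
    y = 1.0 if correct else 0.0
    return sum(p * (c - y) ** 2 for p, c in zip(token_probs, levels))


def expected_loss(stated, true_rate):
    """Expected Brier loss of always stating `stated` when answers are
    correct with probability `true_rate`."""
    return true_rate * (1.0 - stated) ** 2 + (1.0 - true_rate) * stated ** 2


levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
best_level = min(levels, key=lambda c: expected_loss(c, true_rate=0.7))
```

Because the expected loss equals $(c - q)^2 + q(1-q)$ for stated confidence $c$ and true correctness rate $q$, it is minimized by reporting $c = q$; on the grid above with $q = 0.7$, `best_level` is the 0.7 level. That is the sense in which a proper scoring rule rewards honest confidence rather than uniformly high or low values.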

ORCE addresses a different failure mode: joint optimization of answer and confidence can let calibration objectives interfere with answer accuracy. It therefore decouples answer generation from confidence estimation, conditions confidence on a fixed question-answer pair, estimates prompt-level reliability with a Monte Carlo surrogate, and optimizes rank-based objectives via DPO. On MMLU, for example, ECE for Llama-3 8B is reduced, and accuracy is largely preserved by construction because the answer model is not altered by confidence optimization (Li et al., 12 May 2026).

ADVICE explicitly targets answer-groundedness. It constructs training triplets from TriviaQA and optimizes a combined language-modeling, JSD, and margin loss to ensure that the confidence distributions for correct and wrong answers differ and that correct answers receive higher expected confidence. It reports improved calibration on TriviaQA, MMLU, LogiQA, and often SciQ while preserving task performance, and answer masking shows that ADVICE drops confidence appropriately when answer tokens are removed whereas the default model remains overconfident (Seo et al., 13 Oct 2025).

Prompt-only methods remain important where model weights cannot be changed. I-CALM combines explicit confidence elicitation, reward framing for answer-versus-abstain decisions, and lightweight norms emphasizing truthfulness, humility, and responsibility. On GPT-5 mini with PopQA, Pure Eval, Scheme B, and Scheme B plus norms are compared at their respective coverage levels, while forced-answer performance stays similar across schemes. The effect is therefore selective answering rather than a change in underlying forced-answer competence (Zong et al., 5 Apr 2026).

Not all alignment attempts succeed. A pre-registered attempt to distill 10-sample self-consistency into a single-pass verbal readout on Gemma 3 4B-it failed when a modal filter restricted training to items with correct modal answers: AUROC dropped because target entropy collapsed. An exploratory rescue that removed the filter and trained on all 2,000 calibration items was evaluated by AUROC on held-out TriviaQA and reduced the ceiling rate, but the learned confidence was effectively binary, with 494 items at one value and 498 at another, so the result was explicitly described as discriminative rather than continuously calibrated (Cacioli, 27 Apr 2026).

6. Domain-specific variants and multimodal extensions

The verbal-confidence literature is no longer confined to short-form text QA. In retrieval-augmented generation, NAACL studies noise-aware verbal confidence calibration under contradictory and irrelevant retrieval. It formalizes three rules (Conflict Independence, Noise Invariance, and Parametric Fallback) and trains on roughly 2,000 HotpotQA-derived trajectories filtered for passage judgment accuracy, rule adherence, Brier-score alignment, and class balance. Across StrategyQA, HotpotQA, Natural Questions, and Bamboogle, NAACL improves ECE both in-domain and out-of-domain, whereas baseline RAG calibration was often poor, with high average ECE (Liu et al., 16 Jan 2026).

In machine translation, verbal confidence is elicited per token or per word through five methods: List, Word_Numeric, Word_Likert, Token_Numeric, and Token_Likert. Reliability is evaluated by fine-grained error detection and calibration against human Error Span Annotations from the WMT2024 General Machine Translation shared task. Verbalized and internal methods perform similarly overall, though entropy is consistently the best calibration method, Likert methods usually calibrate better than Numeric methods, and there is little to no correlation between internal and verbalized confidence. The paper therefore concludes that internal certainty and verbalized confidence are largely distinct signals in MT (Marashian et al., 15 Jun 2026).

Vision-LLMs introduce modality-specific complications. A broad evaluation of verbalized calibration across MMMU-Pro, VideoMMMU, Visual SimpleQA, MathVista, MathVision, and IsoBench finds that most VLMs are overconfident, with instruction-tuned and text-reasoning models often showing high ECE. Vision-centric reasoning models calibrate better: in the general setting, o3 achieves lower ECE on MMMU-Pro and MathVision than o1, GPT-4.1, and Qwen2.5-VL 7B. The same work proposes Visual Confidence-Aware Prompting, under which Qwen2.5-VL 7B improves in both ECE and accuracy on IsoBench (Xuan et al., 26 May 2025).

A complementary VLM line focuses on object-level confidence. CSP constructs semantically perturbed training data by using GroundingDINO and SAM to localize key object regions, applies Gaussian noise only to the masked region, maps perturbation severity to target confidence, and trains with supervised fine-tuning followed by SimPO. On POPE and AMBER, it reports marked improvements, such as Qwen2-VL on POPE adversarial improving in accuracy, F1, and ECE (Zhao et al., 21 Apr 2025).

Long-form generation introduces sentence-level rather than answer-level confidence. LoVeC trains models to append a numerical confidence score to each generated sentence and evaluates free-form tagging and iterative tagging. Using DPO, ORPO, and GRPO, it reports that LoVeC-DPO and LoVeC-GRPO outperform LUQ and other baselines across WildHallu, Bios, and PopQA. For Llama-3-8B-Instruct on WildHallu in free-form tagging, the comparison between LUQ, LoVeC-DPO, and LoVeC-GRPO is reported on BS, ECE-M, and SC (Zhang et al., 29 May 2025).

7. Limitations, robustness, and open problems

One of the clearest methodological lessons is that scale design is not neutral. Under the standard $0$–$100$ prompt, responses concentrate heavily on just three round-number values, and the entropy of the response distribution falls far below the $\log_2 101 \approx 6.66$ bits of a uniform 101-point distribution. Manipulating granularity, boundary placement, and range regularity shows that an alternative scale design consistently improves metacognitive efficiency over its comparison scale, whereas aggressive boundary compression degrades performance because models fail to semantically redistribute confidence across the compressed range (Dai, 10 Mar 2026).
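The concentration statistics discussed here are straightforward to compute from a list of stated confidence values. The sketch below measures the Shannon entropy of the response histogram in bits and the share of responses covered by the three most frequent values, and compares against the entropy of a uniform distribution over the 101 integers from 0 to 100. The function names and the example responses are illustrative assumptions, not data from the cited study.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def top_k_share(values, k=3):
    """Fraction of responses covered by the k most frequent values."""
    counts = Counter(values)
    return sum(c for _, c in counts.most_common(k)) / len(values)


# Entropy of a uniform distribution over the integers 0..100.
UNIFORM_101_BITS = math.log2(101)

# Hypothetical responses clustered on round numbers.
responses = [90] * 50 + [80] * 25 + [95] * 25
```

For the hypothetical responses above, the three values cover every response and the entropy is 1.5 bits, compared with roughly 6.66 bits for the uniform case. A histogram this concentrated also leaves most ECE bins empty, which is one reason bin-wise calibration estimates become unstable.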

Small open-weight instruction-tuned models may fail even more fundamentally under minimal elicitation. In a pre-registered psychometric screen on 524 TriviaQA items under numeric and categorical elicitation, all seven confirmatory instruct models were classified Invalid on numeric confidence, with responses piling up at the ceiling; categorical elicitation did not rescue validity and instead disrupted task performance in six of seven models, reducing their accuracy (Cacioli, 24 Apr 2026). This result does not imply that internal uncertainty representations are absent, but it does imply that minimal verbal elicitation can fail to preserve them at the output interface.

Adversarial robustness is another open problem. A comprehensive attack study introduces perturbation-based confidence attacks and universal jailbreak-style triggers, showing that verbal confidence can be reduced by semantic-preserving modifications and that answer changes often accompany the confidence drop. ConfidenceTriggers and ConfidenceTriggers-AutoDAN can each lower average confidence in some cases, and common defenses such as perplexity filters, LLM-Guard, paraphrase defense, and SmoothLLM are described as largely ineffective or even counterproductive (Obadinma et al., 9 Jul 2025).

Across these debates, several misconceptions are now difficult to sustain. Verbal confidence is not interchangeable with token log-probability confidence; it is not guaranteed to be answer-grounded; it is not automatically faithful to abstention or commit decisions; and its informativeness depends on elicitation format, scale design, task structure, and sometimes identifiable internal circuits. The literature therefore increasingly treats verbal confidence not as a single scalar property of a model, but as an interface phenomenon whose reliability depends on the interaction among model, prompt, task, and evaluation regime (Kumaran, 28 Jun 2026, Wang et al., 12 Jan 2026).