- The paper investigates using large language models (LLMs) to verbalize confidence scores as a simple method for quantifying uncertainty, defining metrics like calibration, informativeness, and meaningfulness for evaluation.
- Extensive experiments across 10 datasets and 11 diverse LLMs show that the reliability of verbalized confidence scores highly depends on model capacity and specific prompt design.
- Simple prompts work best for smaller LLMs, while advanced techniques combining different strategies significantly improve calibration for larger models, achieving deviations as low as approximately 7-10%.
The paper investigates a low-overhead uncertainty quantification method for LLMs that directly leverages the model's output by having it verbalize a confidence score as part of its generated tokens. The work is motivated by the need for trustworthiness in LLM-based systems deployed in consumer and agent applications, where traditional uncertainty quantification methods based on multiple sampling, internal logits, or external proxy models present limitations in terms of prompt- and model-agnosticism or computational cost. The authors rigorously evaluate verbalized confidence scores with respect to calibration, informativeness, and meaningfulness, introducing a flexible prompt interface that asks the LLM to output both an answer and a corresponding confidence score indicating its self-assessed correctness.
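Consuming such a prompt interface amounts to parsing an answer and a numeric confidence out of the generated text. The "Answer: / Confidence:" format below is an assumed convention for illustration, not the paper's exact interface:

```python
import re

def parse_answer_and_confidence(text):
    """Extract the answer and verbalized confidence from a response of the
    assumed form 'Answer: <text>' followed by 'Confidence: <0..1>'.
    Returns (None, None) if either field is missing."""
    answer = re.search(r"Answer:\s*(.+)", text)
    conf = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    if not (answer and conf):
        return None, None
    return answer.group(1).strip(), float(conf.group(1))
```

A response such as `"Answer: Paris\nConfidence: 0.92"` would parse to `("Paris", 0.92)`; unstructured output falls back to `(None, None)`, which a caller can treat as a refusal to self-assess.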
Methodological Contributions
- Uncertainty Partitioning
The paper proposes an intuitive partitioning of LLM uncertainty into three components: input uncertainty (variability due to prompt formulation), model uncertainty (intrinsic limitations related to LLM capacity and training), and output uncertainty (uncertainty in the answer correctness or adherence to task constraints). Although the study focuses solely on output uncertainty in terms of objective correctness, this partitioning underpins the evaluation design.
- Formal Definition of Calibration
Calibration is defined as the match between the estimated probability (the confidence score) and the empirical correctness of the model's response. Mathematically, if $C = \mathrm{UQ}(X, Y)$ is the confidence score for prompt $X$ and response $Y = \mathrm{LLM}(X)$, a perfectly calibrated uncertainty quantifier satisfies
$$\Pr(Y \text{ is correct} \mid C = c) = c,$$
with the expected calibration error (ECE) computed over $M$ bins approximated by
$$\mathrm{ECE} \approx \sum_{m=1}^{M} \frac{|B_m|}{n} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|,$$
where each bin $B_m$ groups responses with similar confidence scores and $n$ is the total number of responses. This formulation leverages metrics similar to those used in deep learning calibration studies.
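The binned ECE computation can be sketched in a few lines; the function below is a minimal illustration of the standard estimator, not the paper's implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average gap between mean confidence
    conf(B_m) and empirical accuracy acc(B_m) within each bin B_m."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins [lo, hi); the last bin also includes 1.0
        in_bin = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece
```

For example, three responses with confidences `[0.9, 0.9, 0.1]` and correctness `[1, 1, 0]` yield an ECE of 0.1: each occupied bin is off by 0.1, so the weighted average gap is 0.1.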
- Informativeness and Meaningfulness Metrics
- Informativeness: Evaluated through metrics such as the number of distinct confidence scores and the variance of the confidence distribution.
- Meaningfulness: Assessed via the Kullback-Leibler divergence between the confidence distribution for a specific dataset and that of a composite dataset spanning diverse task difficulties. This measures how well the verbalized scores reflect the differences across datasets.
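A minimal sketch of how such metrics might be computed over empirical score distributions follows; the exact estimators (binning, smoothing) used in the paper may differ, and the small epsilon for the KL term is an assumption to avoid division by zero:

```python
from collections import Counter
import math

def informativeness(scores):
    """Two simple informativeness proxies: the number of distinct
    confidence scores and the variance of the score distribution."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return len(set(scores)), var

def kl_divergence(p_scores, q_scores, eps=1e-9):
    """KL(P || Q) between two empirical confidence distributions,
    computed over the union of observed score values; eps smoothing
    avoids log(0) for values seen in only one sample."""
    support = set(p_scores) | set(q_scores)
    p_counts, q_counts = Counter(p_scores), Counter(q_scores)
    kl = 0.0
    for s in support:
        p = p_counts[s] / len(p_scores) + eps
        q = q_counts[s] / len(q_scores) + eps
        kl += p * math.log(p / q)
    return kl
```

Meaningfulness would then compare a dataset's confidence distribution against that of the composite dataset: identical distributions give a divergence near zero, while a dataset-specific shift in the scores gives a strictly positive divergence.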
Experimental Evaluation
- Datasets and Models
An extensive benchmarking study is conducted over 10 datasets designed with closed-ended, objective multiple-choice questions (and short answer formats) to ensure clearly defined correctness. The datasets span domains from science and commonsense reasoning to trivia, and tasks are characterized by factors like domain type (closed vs. open) and prompt context (closed-book vs. open-book). The evaluation employs 11 diverse LLMs, including open-source models from families such as Gemma1.1, Llama3, and Qwen1.5, as well as closed-source models from OpenAI's GPT series, allowing investigation of model capacity effects on confidence calibration.
- Prompt Engineering
The study systematically varies several dimensions of the confidence-elicitation prompt:
- Score Range: Instructing the LLM to output confidence scores using different numeric or categorical scales (e.g., percentages, decimals between 0 and 1, letter grades, or verbal descriptions).
- Score Formulation: Variations in wording (e.g., "confidence score" vs. "probability that your answer is correct") to assess sensitivity in elicited responses.
- Advanced Description: Incorporating elaborate instructions regarding how to factor in uncertainty related to prompt vagueness, task difficulty, and knowledge availability.
- Few-shot Prompting: The incorporation of one or several examples in the prompt to influence the calibration behavior.
- Other Aspects: Additional techniques such as asking for the "best guess" or ranking multiple top responses, even integrating chain-of-thought prompts in some configurations.
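Several of these dimensions can be combined in a single template. The sketch below illustrates one such combination (score range as a 0-to-1 probability, the "probability that your answer is correct" formulation, and an advanced description of uncertainty sources); the exact wording is an assumption, not the paper's template:

```python
# Illustrative confidence-elicitation template; the wording is assumed,
# not taken verbatim from the paper.
QA_WITH_CONFIDENCE = """\
Answer the question below. Then state the probability (a number between
0 and 1) that your answer is correct, taking into account any vagueness
in the question, the difficulty of the task, and gaps in your knowledge.

Question: {question}

Respond in exactly this format:
Answer: <your answer>
Confidence: <probability between 0 and 1>
"""

def build_prompt(question: str) -> str:
    """Fill the template with a concrete question."""
    return QA_WITH_CONFIDENCE.format(question=question)
```

Few-shot variants would prepend one or more worked examples in the same Answer/Confidence format before the target question.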
- Findings Across Different LLM Capacities
The reliability of verbalized confidence scores, as measured by calibration (with ECE values) as well as informativeness and meaningfulness, was shown to depend both on the model capacity and the prompt design:
- For Tiny LLMs (e.g., Gemma1.1-7B, Llama3-8B, Qwen1.5-7B):
Simpler prompt formulations, especially those employing the "probscore" formulation, result in better calibrated and more informative confidence responses. Complex methods, such as elaborate few-shot examples or ranking multiple responses, tend to degrade performance on these models.
- For Large LLMs (e.g., models with 70B+ parameters such as Llama3-70B, Qwen1.5-32B/72B/110B, GPT-3.5/Turbo and GPT-4o variants):
A combination of advanced prompt techniques (advanced description, probscore formulation, and few-shot prompting) yields significant improvements in calibration. In fact, the "combo" method achieved an average deviation of about 7% from the empirical accuracy, indicating that the verbalized confidence scores were closer to the true correctness probability.
Notable Quantitative Insights
- The experimental results demonstrate that larger models achieve an ECE around 0.1 when using optimized prompting methods, meaning the verbalized confidence scores deviate by approximately 10% on average from the actual accuracy.
- For large LLMs, combining multiple prompt strategies (i.e., the "combo" method) exhibits marked qualitative and quantitative improvements in calibration compared to baseline prompting methods, with clear shifts in the confidence score distribution and calibration curves.
Discussion and Limitations
- The study underscores that verbalized confidence scores can serve as a simple, prompt- and model-agnostic approach for uncertainty quantification in LLMs, but their reliability is highly sensitive to the specific prompt design.
- Limitations arise from the focus on objective correctness in closed-ended questions, leaving open how well these approaches generalize to open-ended or subjective tasks where "correctness" is less clearly defined.
- The metrics for informativeness and meaningfulness proposed by the authors are novel, and while they provide insights into the diversity and dataset sensitivity of the scores, further validation of these metrics in broader contexts may be necessary.
Overall, the paper presents a comprehensive study that balances rigorous metric definitions, extensive experimental evaluations, and practical prompt engineering insights, making it a valuable reference for improving uncertainty quantification in LLMs through verbalized confidence scores.