- The paper investigates using large language models (LLMs) to verbalize confidence scores as a simple method for quantifying uncertainty, defining metrics like calibration, informativeness, and meaningfulness for evaluation.
- Extensive experiments across 10 datasets and 11 diverse LLMs show that the reliability of verbalized confidence scores highly depends on model capacity and specific prompt design.
- Simple prompts work best for smaller LLMs, while advanced techniques combining different strategies significantly improve calibration for larger models, achieving deviations as low as approximately 7-10%.
The paper investigates a low-overhead uncertainty quantification method for LLMs that directly leverages the model's output by having it verbalize a confidence score as part of its generated tokens. The work is motivated by the need for trustworthiness in LLM-based systems deployed in consumer and agent applications, where traditional uncertainty quantification methods based on multiple sampling, internal logits, or external proxy models present limitations in terms of prompt- and model-agnosticism or computational cost. The authors rigorously evaluate verbalized confidence scores with respect to calibration, informativeness, and meaningfulness, introducing a flexible prompt interface that asks the LLM to output both an answer and a corresponding confidence score indicating its self-assessed correctness.
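Consuming such a prompt interface amounts to parsing an answer and a numeric confidence out of the generated text. The "Answer: / Confidence:" format below is an assumed convention for illustration, not the paper's exact interface:

```python
import re

def parse_answer_and_confidence(text):
    """Extract the answer and verbalized confidence from a response of the
    assumed form 'Answer: <text>' followed by 'Confidence: <0..1>'.
    Returns (None, None) if either field is missing."""
    answer = re.search(r"Answer:\s*(.+)", text)
    conf = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    if not (answer and conf):
        return None, None
    return answer.group(1).strip(), float(conf.group(1))
```

A response such as `"Answer: Paris\nConfidence: 0.92"` would parse to `("Paris", 0.92)`; unstructured output falls back to `(None, None)`, which a caller can treat as a refusal to self-assess.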
Methodological Contributions
- Uncertainty Partitioning
The paper proposes an intuitive partitioning of LLM uncertainty into three components: input uncertainty (variability due to prompt formulation), model uncertainty (intrinsic limitations related to LLM capacity and training), and output uncertainty (uncertainty in the answer correctness or adherence to task constraints). Although the study focuses solely on output uncertainty in terms of objective correctness, this partitioning underpins the evaluation design.
- Formal Definition of Calibration
Calibration is defined as the match between the estimated probability (the confidence score) and the empirical correctness of the model's response. Mathematically, if $C = \mathrm{UQ}(X, Y)$ is the confidence score for prompt $X$ and response $Y = \mathrm{LLM}(X)$, a perfectly calibrated uncertainty quantifier satisfies
$$\Pr(Y \text{ is correct} \mid C = c) = c,$$
with the expected calibration error (ECE) computed over $M$ bins approximated by
$$\mathrm{ECE} \approx \sum_{m=1}^{M} \frac{|B_m|}{n} \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|,$$
where each bin $B_m$ groups responses with similar confidence scores and $n$ is the total number of responses. This formulation leverages metrics similar to those used in deep learning calibration studies.
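The binned ECE computation can be sketched in a few lines; the function below is a minimal illustration of the standard estimator, not the paper's implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: the weighted average gap between mean confidence
    conf(B_m) and empirical accuracy acc(B_m) within each bin B_m."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins [lo, hi); the last bin also includes 1.0
        in_bin = (confidences >= lo) & ((confidences < hi) | (hi == 1.0))
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += (in_bin.sum() / n) * gap
    return ece
```

For example, three responses with confidences `[0.9, 0.9, 0.1]` and correctness `[1, 1, 0]` yield an ECE of 0.1: each occupied bin is off by 0.1, so the weighted average gap is 0.1.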
- Informativeness and Meaningfulness Metrics
- Informativeness: Evaluated through metrics such as the number of distinct confidence scores and the variance of the confidence distribution.
- Meaningfulness: Assessed via the Kullback-Leibler divergence between the confidence distribution for a specific dataset and that of a composite dataset spanning diverse task difficulties. This measures how well the verbalized scores reflect the differences across datasets.
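A minimal sketch of how such metrics might be computed over empirical score distributions follows; the exact estimators (binning, smoothing) used in the paper may differ, and the small epsilon for the KL term is an assumption to avoid division by zero:

```python
from collections import Counter
import math

def informativeness(scores):
    """Two simple informativeness proxies: the number of distinct
    confidence scores and the variance of the score distribution."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return len(set(scores)), var

def kl_divergence(p_scores, q_scores, eps=1e-9):
    """KL(P || Q) between two empirical confidence distributions,
    computed over the union of observed score values; eps smoothing
    avoids log(0) for values seen in only one sample."""
    support = set(p_scores) | set(q_scores)
    p_counts, q_counts = Counter(p_scores), Counter(q_scores)
    kl = 0.0
    for s in support:
        p = p_counts[s] / len(p_scores) + eps
        q = q_counts[s] / len(q_scores) + eps
        kl += p * math.log(p / q)
    return kl
```

Meaningfulness would then compare a dataset's confidence distribution against that of the composite dataset: identical distributions give a divergence near zero, while a dataset-specific shift in the scores gives a strictly positive divergence.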
Experimental Evaluation
- Datasets and Models
An extensive benchmarking study is conducted over 10 datasets designed with closed-ended, objective multiple-choice questions (and short answer formats) to ensure clearly defined correctness. The datasets span domains from science and commonsense reasoning to trivia, and tasks are characterized by factors like domain type (closed vs. open) and prompt context (closed-book vs. open-book). The evaluation employs 11 diverse LLMs, including open-source models from families such as Gemma1.1, Llama3, and Qwen1.5, as well as closed-source models from OpenAI's GPT series, allowing investigation of model capacity effects on confidence calibration.
- Prompt Engineering
The study systematically varies several dimensions of the confidence-elicitation prompt:
- Score Range: Instructing the LLM to output confidence scores using different numeric or categorical scales (e.g., percentages, decimals between 0 and 1, letter grades, or verbal descriptions).
- Score Formulation: Variations in wording (e.g., "confidence score" vs. "probability that your answer is correct") to assess sensitivity in elicited responses.
- Advanced Description: Incorporating elaborate instructions regarding how to factor in uncertainty related to prompt vagueness, task difficulty, and knowledge availability.
- Few-shot Prompting: The incorporation of one or several examples in the prompt to influence the calibration behavior.
- Other Aspects: Additional techniques such as asking for the "best guess" or ranking multiple top responses, even integrating chain-of-thought prompts in some configurations.
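Several of these dimensions can be combined in a single template. The sketch below illustrates one such combination (score range as a 0-to-1 probability, the "probability that your answer is correct" formulation, and an advanced description of uncertainty sources); the exact wording is an assumption, not the paper's template:

```python
# Illustrative confidence-elicitation template; the wording is assumed,
# not taken verbatim from the paper.
QA_WITH_CONFIDENCE = """\
Answer the question below. Then state the probability (a number between
0 and 1) that your answer is correct, taking into account any vagueness
in the question, the difficulty of the task, and gaps in your knowledge.

Question: {question}

Respond in exactly this format:
Answer: <your answer>
Confidence: <probability between 0 and 1>
"""

def build_prompt(question: str) -> str:
    """Fill the template with a concrete question."""
    return QA_WITH_CONFIDENCE.format(question=question)
```

Few-shot variants would prepend one or more worked examples in the same Answer/Confidence format before the target question.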
- Findings Across Different LLM Capacities
The reliability of verbalized confidence scores, as measured by calibration (with ECE values) as well as informativeness and meaningfulness, was shown to depend both on the model capacity and the prompt design:
- For Tiny LLMs (e.g., Gemma1.1-7B, Llama3-8B, Qwen1.5-7B):
Simpler prompt formulations, especially those employing the "probscore" formulation, result in better calibrated and more informative confidence responses. Complex methods, such as elaborate few-shot examples or ranking multiple responses, tend to degrade performance on these models.
- For Large LLMs (e.g., models with 70B+ parameters such as Llama3-70B, Qwen1.5-32B/72B/110B, GPT-3.5/Turbo and GPT-4o variants):
A combination of advanced prompt techniques (advanced description, probscore formulation, and few-shot prompting) yields significant improvements in calibration. In fact, the "combo" method achieved an average deviation of about 7% from the empirical accuracy, indicating that the verbalized confidence scores were closer to the true correctness probability.
Notable Quantitative Insights
- The experimental results demonstrate that larger models achieve an ECE around 0.1 when using optimized prompting methods, meaning the verbalized confidence scores deviate by approximately 10% on average from the actual accuracy.
- For large LLMs, combining multiple prompt strategies (i.e., the "combo" method) exhibits marked qualitative and quantitative improvements in calibration compared to baseline prompting methods, with clear shifts in the confidence score distribution and calibration curves.
Discussion and Limitations
- The study underscores that verbalized confidence scores can serve as a simple, prompt- and model-agnostic approach for uncertainty quantification in LLMs, but their reliability is highly sensitive to the specific prompt design.
- Limitations arise from the focus on objective correctness in closed-ended questions, leaving open how well these approaches generalize to open-ended or subjective tasks where "correctness" is less clearly defined.
- The metrics for informativeness and meaningfulness proposed by the authors are novel, and while they provide insights into the diversity and dataset sensitivity of the scores, further validation of these metrics in broader contexts may be necessary.
Overall, the paper presents a comprehensive study that balances rigorous metric definitions, extensive experimental evaluations, and practical prompt engineering insights, making it a valuable reference for improving uncertainty quantification in LLMs through verbalized confidence scores.