The paper investigates a low-overhead uncertainty quantification method for LLMs that directly leverages the model’s output by having it verbalize a confidence score as part of its generated tokens. The work is motivated by the need for trustworthiness in LLM-based systems deployed in consumer and agent applications, where traditional uncertainty quantification methods based on multiple sampling, internal logits, or external proxy models present limitations in terms of prompt- and model-agnosticism or computational cost. The authors rigorously evaluate verbalized confidence scores with respect to calibration, informativeness, and meaningfulness, introducing a flexible prompt interface that asks the LLM to output both an answer and a corresponding confidence score indicating its self-assessed correctness.
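For illustration, such a prompt interface might be implemented roughly as follows; the template wording, the pipe-delimited output format, and the parsing helper are hypothetical sketches, not the paper's actual templates.

```python
# Hypothetical elicitation prompt and parser; wording and format are illustrative only.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Give your answer, followed by a confidence score between 0 and 1 that "
    "reflects the probability your answer is correct.\n"
    "Format: Answer: <answer> | Confidence: <score>"
)

def parse_response(text):
    """Split a response like 'Answer: Paris | Confidence: 0.9' into its parts."""
    answer_part, conf_part = text.split("|")
    answer = answer_part.replace("Answer:", "").strip()
    confidence = float(conf_part.replace("Confidence:", "").strip())
    return answer, confidence
```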
Methodological Contributions
- Uncertainty Partitioning The paper proposes an intuitive partitioning of LLM uncertainty into three components: input uncertainty (variability due to prompt formulation), model uncertainty (intrinsic limitations related to LLM capacity and training), and output uncertainty (uncertainty in the answer correctness or adherence to task constraints). Although the paper focuses solely on output uncertainty in terms of objective correctness, this partitioning underpins the evaluation design.
- Formal Definition of Calibration Calibration is defined in terms of matching the estimated probability (confidence score) with the empirical correctness of the model's response. Mathematically, if $C = \mathrm{UQ}(X, Y)$ is the confidence score for prompt $X$ and response $Y = \mathrm{LLM}(X)$, a perfectly calibrated uncertainty quantifier satisfies
$$\mathbb{P}(Y \text{ is correct} \mid C = c) = c \quad \text{for all } c \in [0, 1],$$
with the expected calibration error (ECE) approximated over $B$ bins as
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N} \,\bigl| \mathrm{acc}(S_b) - \mathrm{conf}(S_b) \bigr|,$$
where each bin $S_b$ groups responses with similar confidence scores, $\mathrm{acc}(S_b)$ denotes the empirical accuracy within the bin, and $\mathrm{conf}(S_b)$ the mean confidence. This formulation leverages metrics similar to those used in deep learning calibration studies. (A code sketch of ECE, together with the metrics below, follows this list.)
- Informativeness and Meaningfulness Metrics
- Informativeness: Evaluated through metrics such as the number of distinct confidence scores and the variance of the confidence distribution.
- Meaningfulness: Assessed via the Kullback-Leibler divergence between the confidence distribution for a specific dataset and that of a composite dataset spanning diverse task difficulties. This measures how well the verbalized scores reflect the differences across datasets.
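The calibration, informativeness, and meaningfulness metrics above can all be computed from collections of elicited confidence scores paired with correctness labels. Below is a minimal Python sketch; the equal-width binning, the rounding granularity used to count distinct scores, and the additive smoothing in the KL term are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: size-weighted average of |bin accuracy - bin mean confidence|."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(conf), 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # last bin includes the right edge so that a score of exactly 1.0 is counted
        mask = (conf >= lo) & (conf <= hi) if i == n_bins - 1 else (conf >= lo) & (conf < hi)
        if mask.any():
            ece += (mask.sum() / n) * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def informativeness(conf, decimals=2):
    """Diversity of the elicited scores: number of distinct values and variance."""
    c = np.round(np.asarray(conf, float), decimals)
    return {"n_distinct": int(len(np.unique(c))), "variance": float(np.var(c))}

def meaningfulness_kl(dataset_conf, composite_conf, n_bins=10, eps=1e-8):
    """KL divergence between a dataset's confidence histogram and the histogram
    over a composite dataset spanning diverse task difficulties."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    p, _ = np.histogram(dataset_conf, bins=edges)
    q, _ = np.histogram(composite_conf, bins=edges)
    p = (p + eps) / (p + eps).sum()  # smooth to avoid division by zero
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```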
Experimental Evaluation
- Datasets and Models An extensive benchmarking study is conducted over 10 datasets designed with closed-ended, objective multiple-choice questions (and short-answer formats) to ensure clearly defined correctness. The datasets span domains from science and commonsense reasoning to trivia, and tasks are characterized by factors such as domain type (closed vs. open) and prompt context (closed-book vs. open-book). The evaluation employs 11 diverse LLMs, including open-source models from families such as Gemma1.1, Llama3, and Qwen1.5, as well as closed-source models from OpenAI's GPT series, allowing investigation of how model capacity affects confidence calibration.
- Prompt Engineering The study systematically varies how confidence scores are elicited along several prompt dimensions (illustrated in the sketch after this list):
- Score Range: Instructing the LLM to output confidence scores using different numeric or categorical scales (e.g., percentages, decimals between 0 and 1, letter grades, or verbal descriptions).
- Score Formulation: Variations in wording (e.g., “confidence score” vs. “probability that your answer is correct”) to assess sensitivity in elicited responses.
- Advanced Description: Incorporating elaborate instructions regarding how to factor in uncertainty related to prompt vagueness, task difficulty, and knowledge availability.
- Few-shot Prompting: The incorporation of one or several examples in the prompt to influence the calibration behavior.
- Other Aspects: Additional techniques such as asking for the “best guess” or ranking multiple top responses, even integrating chain-of-thought prompts in some configurations.
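A sketch of how such prompt variants could be composed is shown below; the function name, argument names, and instruction strings are hypothetical placeholders, not the paper's actual templates.

```python
# Illustrative composition of the prompt variants described above (all strings hypothetical).
def build_confidence_prompt(question, score_range="percent", formulation="probscore",
                            advanced=False, few_shot_examples=None):
    range_hint = {
        "percent": "a confidence between 0% and 100%",
        "decimal": "a confidence between 0.0 and 1.0",
        "letter": "a letter grade from A (certain) to F (guessing)",
    }[score_range]
    if formulation == "probscore":
        ask = f"State the probability that your answer is correct as {range_hint}."
    else:
        ask = f"State a confidence score for your answer as {range_hint}."
    parts = []
    if few_shot_examples:   # optional worked examples with answers and scores
        parts.extend(few_shot_examples)
    if advanced:            # elaborate instructions on sources of uncertainty
        parts.append("Consider prompt vagueness, task difficulty, and whether the "
                     "required knowledge is likely covered by your training data.")
    parts += [f"Question: {question}", "Give your best-guess answer.", ask]
    return "\n".join(parts)
```

In this view, the paper's "combo" setting corresponds to enabling the advanced description, the probscore formulation, and few-shot examples simultaneously.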
- Findings Across Different LLM Capacities
The reliability of verbalized confidence scores, as measured by calibration (ECE), informativeness, and meaningfulness, was shown to depend on both model capacity and prompt design:
- For Tiny LLMs (e.g., Gemma1.1-7B, Llama3-8B, Qwen1.5-7B):
Simpler prompt formulations, especially those employing the "probscore" formulation, result in better calibrated and more informative confidence responses. Complex methods, such as elaborate few-shot examples or ranking multiple responses, tend to degrade performance on these models.
- For Large LLMs (e.g., models with 70B+ parameters such as Llama3-70B, Qwen1.5-32B/72B/110B, GPT-3.5/Turbo and GPT-4o variants):
A combination of advanced prompt techniques (advanced description, probscore formulation, and few-shot prompting) yields significant improvements in calibration. In fact, the "combo" method achieved an average deviation of about 7% from the empirical accuracy, indicating that the verbalized confidence scores were closer to the true correctness probability.
Notable Quantitative Insights
- The experimental results demonstrate that larger models achieve an ECE around 0.1 when using optimized prompting methods, meaning the verbalized confidence scores deviate by approximately 10% on average from the actual accuracy.
- For large LLMs, combining multiple prompt strategies (i.e., the “combo” method) exhibits marked qualitative and quantitative improvements in calibration compared to baseline prompting methods, with clear shifts in the confidence score distribution and calibration curves.
Discussion and Limitations
- The paper underscores that verbalized confidence scores can serve as a simple, prompt- and model-agnostic approach for uncertainty quantification in LLMs, but their reliability is highly sensitive to the specific prompt design.
- Limitations arise from the focus on objective correctness in closed-ended questions, leaving open how well these approaches generalize to open-ended or subjective tasks where “correctness” is less clearly defined.
- The metrics for informativeness and meaningfulness proposed by the authors are novel, and while they provide insights into the diversity and dataset sensitivity of the scores, further validation of these metrics in broader contexts may be necessary.
Overall, the paper presents a comprehensive study that balances rigorous metric definitions, extensive experimental evaluation, and practical prompt engineering insights, making it a valuable reference for improving uncertainty quantification in LLMs through verbalized confidence scores.