Strategies for Eliciting Calibrated Confidence Scores from Fine-Tuned LLMs
The paper "Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback" by Katherine Tian et al. examines how to elicit calibrated confidence scores from LLMs fine-tuned with reinforcement learning from human feedback (RLHF). The authors address a significant concern in deploying LLMs for real-world applications: whether these models can reliably express confidence in their predictions. This capability is crucial for building systems that defer low-confidence predictions to human experts, enhancing trustworthiness and safety.
Key Findings
- Calibration Challenges in RLHF-LMs: Fine-tuning LLMs with human feedback improves adherence to user instructions but often impairs calibration. The paper shows that widely used RLHF-LMs such as ChatGPT and GPT-4 exhibit poorly calibrated confidence scores when those scores are derived from the models' conditional (token-level) probabilities, typically erring toward overconfidence.
- Verbalized Confidence Scores: The authors elicit confidence by having the model state it in words, i.e., output a probability as text alongside its answer (a prompt sketch appears after this list). Across the TriviaQA, SciQ, and TruthfulQA benchmarks, these verbalized confidences are better calibrated than the models' conditional probabilities, yielding a relative reduction of about 50% in expected calibration error (ECE; see the second sketch after this list).
- Improvement Techniques: Inspired by psychological research showing that considering alternative outcomes mitigates overconfidence, the paper also prompts LLMs to produce multiple candidate answers before assigning confidence scores (the multi-answer prompt in the sketch below); this further improves calibration.
- Numerical Results: Comparative analyses reveal clear contrasts in calibration across models. For instance, ChatGPT's verbalized scores lower expected calibration error substantially relative to its raw conditional probabilities, and the same pattern holds for models such as Claude and Llama-2-70B, though the size of the gain varies.
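To make the elicitation strategies above concrete, here is a minimal sketch of the two prompt styles in Python. The wording is paraphrased rather than the paper's exact templates, and `query_model` is a hypothetical placeholder for whatever chat-completion API is in use.

```python
import re

# Paraphrased single-answer "verbalized confidence" prompt (illustrative,
# not the paper's exact template): ask for an answer plus a probability.
VERBALIZED_PROMPT = (
    "Answer the question, then state the probability (between 0 and 1) "
    "that your answer is correct.\n"
    "Question: {question}\n"
    "Answer and probability:"
)

# Paraphrased multi-answer variant: eliciting several candidates first is
# the "consider the alternatives" strategy that further improves calibration.
MULTI_ANSWER_PROMPT = (
    "Give your {k} best guesses for the question below, one per line, each "
    "followed by the probability that it is correct.\n"
    "Question: {question}\n"
    "Guesses:"
)

def parse_guess(line: str):
    """Extract (answer, probability) from a line like 'Paris: 0.85'."""
    match = re.match(r"\s*(.+?)[:,]\s*([01](?:\.\d+)?)\s*$", line)
    if match is None:
        return None
    return match.group(1).strip(), float(match.group(2))

# Usage, with query_model as a hypothetical stand-in for an LLM API call:
#   reply = query_model(MULTI_ANSWER_PROMPT.format(k=4, question="..."))
#   guesses = [g for g in map(parse_guess, reply.splitlines()) if g]
```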
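The expected calibration error cited above partitions predictions into confidence bins and averages the gap between accuracy and mean confidence per bin, weighted by bin size. Below is a minimal sketch of the standard equal-width-binned ECE; the bin count of 10 is an illustrative default, not necessarily the paper's exact setup.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-binned ECE: bin-weighted mean of |accuracy - confidence|.

    confidences: predicted probabilities in [0, 1]
    correct:     1 if the corresponding prediction was right, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence to one of n_bins equal-width bins; 1.0 falls
    # into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the bin's share of data
    return ece

# Example: four answers with verbalized confidences, three of them correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```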
Implications and Future Directions
- Practical Applications: Better calibration is instrumental for AI reliability in high-stakes domains such as healthcare and finance. The proposed techniques could be integrated into AI systems so that low-confidence predictions are flagged or deferred, improving decision-making where the cost of an overconfident error is high.
- Model Training and Optimization: The disparity across model families (e.g., GPT vs. Claude) in verbalized-calibration quality suggests that training protocols themselves could be tuned: fine-tuning strategies might be designed to strengthen a model's innate ability to verbalize accurate confidence scores.
- Domain Generalization: Although these techniques show promise on fact-based tasks, their adaptation to reasoning-intensive settings remains to be explored. The paper lays foundational work; understanding domain-specific calibration behavior is essential for broader applicability.
- Extended Analysis: While the paper handles the calibration of short-form answers effectively, subsequent studies might examine calibration for long-form text generation, especially where models support human-centric decision-making workflows.
In sum, the research shows how calibration shortfalls in RLHF-tuned LLMs can be overcome, making a substantial contribution to AI reliability. It opens several avenues for further work on AI safety, calibration methods, and model interpretability, moving toward the deployment of more autonomous and trustworthy AI systems.