How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering (2012.00955v2)

Published 2 Dec 2020 in cs.CL

Abstract: Recent works have shown that language models (LMs) capture different types of knowledge regarding facts or common sense. However, because no model is perfect, they still fail to provide appropriate answers in many cases. In this paper, we ask the question "how can we know when language models know, with confidence, the answer to a particular query?" We examine this question from the point of view of calibration, the property of a probabilistic model's predicted probabilities actually being well correlated with the probabilities of correctness. We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated, finding the answer is a relatively emphatic no. We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness through fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs. Experiments on a diverse range of datasets demonstrate the effectiveness of our methods. We also perform analysis to study the strengths and limitations of these methods, shedding light on further improvements that may be made in methods for calibrating LMs. We have released the code at https://github.com/jzbjyb/lm-calibration.

On the Calibration of Language Models for Question Answering

The paper "How Can We Know When LLMs Know? On the Calibration of LLMs for Question Answering" presents a critical analysis of the calibration properties of LLMs (LMs) concerning their application to question answering (QA) tasks. It investigates the degree to which these models' probability estimates align with the actual likelihood of correctness, a key issue for their reliability, especially in domains demanding high stakes, such as healthcare.

The authors evaluate three prominent LMs (T5, BART, and GPT-2), focusing on whether these models' probabilistic predictions for QA accurately reflect the likelihood of correctness. Their empirical evaluation reveals that, despite high accuracy, these models are poorly calibrated: their confidence scores do not correlate well with the probability of being correct, which could lead to unreliable deployments of such models in real-world applications.
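
As an illustration of how such miscalibration can be quantified (not the paper's released evaluation code), expected calibration error (ECE) buckets predictions by confidence and averages the gap between each bucket's accuracy and its mean confidence. A minimal sketch, assuming arrays of per-prediction confidences and correctness indicators:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bucket predictions by confidence, then average the
    |accuracy - mean confidence| gap, weighted by bucket size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of samples in bucket
    return ece

# Overconfident predictions (high confidence, mixed correctness) give a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.8, 0.85], [1, 0, 0, 1]))
```

A well-calibrated model drives this value toward zero; the paper's finding is that the raw LM scores sit far from that ideal.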

To address these calibration discrepancies, the paper outlines a set of strategies divided into fine-tuning and post-hoc methods. Fine-tuning approaches use softmax- or margin-based objective functions to adjust the parameters of LMs to better align their confidence scores with the likelihood of correctness. Post-hoc methods operate without altering the LM parameters, instead manipulating confidence values through temperature-based scaling or employing feature-based regressors like decision trees to recalibrate confidence estimates.
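
As a concrete illustration of the post-hoc direction, temperature scaling divides the model's candidate-answer scores by a scalar T fitted on held-out data before the softmax; it leaves the ranking of answers unchanged and only sharpens or flattens the confidence distribution. The sketch below is illustrative rather than the released implementation, and the function names and input shapes are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(scores, temperature=1.0):
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(dev_scores, dev_labels):
    """Fit a single temperature T > 0 on a dev set by minimizing the
    negative log-likelihood of the correct candidate answers.

    dev_scores: (n_examples, n_candidates) raw LM scores per candidate
    dev_labels: (n_examples,) index of the correct candidate
    """
    def nll(log_t):
        t = np.exp(log_t)                    # parameterize T > 0
        probs = softmax(dev_scores, t)
        picked = probs[np.arange(len(dev_labels)), dev_labels]
        return -np.log(picked + 1e-12).mean()

    result = minimize_scalar(nll, bounds=(-3, 3), method="bounded")
    return float(np.exp(result.x))

# At test time: softmax(test_scores, fit_temperature(dev_scores, dev_labels))
```

Because dividing by a positive temperature is monotone, accuracy is untouched; only the confidence estimates move, which is why this family of methods can fix overconfidence without hurting task performance.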

The efficacy of these approaches is validated across a diverse set of QA datasets. Fine-tuning methods show potential in improving calibration without deteriorating accuracy. Meanwhile, post-hoc techniques like temperature scaling also yield improvement by refining the distribution of predicted probabilities, effectively counteracting the observed overconfidence of LMs.

A unique aspect of this research involves exploring LM-specific interventions, such as paraphrasing candidate answers and augmenting inputs with additional context via retrieval. These methods leverage the models' sensitivity to linguistic variations and input data to further enhance calibration and accuracy.
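
One assumed (not taken from the released code) way to picture how such interventions interact with confidence estimation is to average the model's normalized candidate scores across several variants of the input, e.g. paraphrased questions or inputs augmented with retrieved passages. The `score_fn` and `variants_of` interfaces below are hypothetical stand-ins for an LM scorer and a paraphrase/retrieval step:

```python
import numpy as np

def aggregated_confidence(score_fn, question, candidates, variants_of):
    """Average normalized candidate scores over variants of the input.

    score_fn(question, candidates) -> raw score per candidate (hypothetical LM scorer)
    variants_of(question) -> list of paraphrased / retrieval-augmented inputs
    """
    all_probs = []
    for q in variants_of(question):
        scores = np.asarray(score_fn(q, candidates), dtype=float)
        exp = np.exp(scores - scores.max())
        all_probs.append(exp / exp.sum())      # normalize over candidates
    return np.mean(all_probs, axis=0)          # average across input variants
```

The intuition is that confidence which survives surface-level rewording or added evidence is more trustworthy than confidence tied to one particular phrasing.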

The broader implications of this paper underscore how essential reliable calibration is for deploying language models in practical scenarios. The research points toward more dependable AI systems and advocates further exploration of calibration across tasks and model configurations. Future work might explore finer-grained calibration for specific domains or user interactions, and examine the impact of confidence estimation on downstream decision-making and user agency.

In sum, this paper provides a rigorous exploration of calibration in language models, a key factor in their trustworthiness for QA applications. Its findings and methodologies have immediate applicability and also suggest iterative paths toward more reliable and accountable AI models in complex, real-world environments.

Authors (4)
  1. Zhengbao Jiang (25 papers)
  2. Jun Araki (11 papers)
  3. Haibo Ding (11 papers)
  4. Graham Neubig (342 papers)
Citations (353)