Language Models (Mostly) Know What They Know (2207.05221v4)

Published 11 Jul 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We study whether LLMs can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

Citations (571)


Summary

The paper shows that language models can be trained to predict P(IK), the probability that they know the answer to a question, although calibration of P(IK) degrades on new tasks.
The study finds that larger models are well calibrated on multiple-choice and true/false questions when the questions are presented in the right format, for example with the answer options visible.
The research shows that letting a model consider many of its own samples improves its ability to judge whether one specific proposed answer is correct.

Exploring the Self-Evaluation Capabilities of LLMs

Introduction

The paper by Kadavath et al. addresses a crucial question in the development of AI systems, particularly large language models (LMs): can these systems evaluate the validity of their own outputs? The research investigates how accurately LMs can determine whether they "know" the answers to questions posed to them. This capability matters for honesty, because a model that recognizes the limits of its knowledge can provide more reliable and trustworthy outputs. The analysis begins by assessing the calibration of LMs on multiple-choice and True/False questions. It then explores self-evaluation of the models' own sampled answers and their ability to predict whether they can answer a question correctly, introducing "P(IK)," the probability that a model "knows" the answer.

Calibration and Self-Evaluation

The research finds that larger LMs are well calibrated on a diverse set of multiple-choice and True/False questions: when a calibrated model assigns a given probability to its answers, roughly that fraction of those answers turns out to be correct. This calibration depends on the format in which questions are presented. The paper highlights that showing the lettered answer options explicitly in the prompt significantly improves calibration. Calibration also improves with model size, which suggests that it is tied to overall model capability.
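One common way to quantify the calibration described above is the expected calibration error (ECE). The following is a minimal sketch, not the paper's evaluation code: the function name, the equal-width binning, and the default of 10 bins are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Illustrative helper (hypothetical, not from the paper). `confidences`
    holds the probability the model assigned to its chosen answer, and
    `correct` records whether that answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        return 0.0

    # Group predictions into equal-width confidence bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, c))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A perfectly calibrated set of predictions gives a value near zero, while a model that is always fully confident but right only half the time gives 0.5.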

Moving beyond calibration, the investigation extends to a model's ability to evaluate its own outputs on open-ended tasks. The model first proposes an answer and is then asked for the probability, termed P(True) in the paper, that this answer is correct. Self-evaluation improves further when the model is shown many of its own samples before judging one specific answer. This suggests that seeing a range of candidate answers (akin to brainstorming) gives the model useful evidence about how reliable any single answer is.
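The P(True) procedure can be sketched as a prompt template plus a normalization step. This is a hedged illustration: the exact prompt wording below is an assumption loosely modeled on the True/False framing described above, and `build_p_true_prompt` and `p_true_from_logprobs` are hypothetical helper names. Obtaining the two log-probabilities from an actual model is left out.

```python
import math
from typing import Sequence


def build_p_true_prompt(
    question: str,
    proposed_answer: str,
    brainstormed: Sequence[str] = (),
) -> str:
    """Format a self-evaluation prompt (wording is an assumption)."""
    lines = [f"Question: {question}"]
    if brainstormed:
        # Optionally show other samples from the same model first.
        lines.append("Here are some brainstormed ideas:")
        lines.extend(f"  {sample}" for sample in brainstormed)
    lines.append(f"Proposed Answer: {proposed_answer}")
    lines.append("Is the proposed answer:")
    lines.append(" (A) True")
    lines.append(" (B) False")
    lines.append("The proposed answer is:")
    return "\n".join(lines)


def p_true_from_logprobs(logp_true: float, logp_false: float) -> float:
    """Renormalize the two option log-probabilities into P(True).

    Computed as a numerically stable sigmoid of the log-odds.
    """
    diff = logp_true - logp_false
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

For example, if the model puts probability 0.6 on option (A) and 0.2 on option (B), the renormalized P(True) is 0.75.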

Predicting "Knowing"

Perhaps the most intriguing aspect of this research is whether models can be trained to predict P(IK), the probability that they know the answer to a question, without reference to any particular proposed answer. The authors find that models perform well at distinguishing questions they can answer from those they cannot. This ability generalizes only partially across tasks, however, and the predicted P(IK) is poorly calibrated on tasks not seen during training. P(IK) also responds sensibly to context: it rises when relevant source material is included in the prompt and when hints toward the solution of a mathematical word problem are provided, indicating that the model registers when additional context makes a question answerable.
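To train or evaluate a P(IK) predictor, one needs a target value for each question. A natural choice, assumed here rather than taken from the text above, is the fraction of the model's own sampled answers that are correct. The helper names and the exact-match correctness check in this sketch are hypothetical simplifications.

```python
from typing import Callable, Sequence


def exact_match(answer: str, reference: str) -> bool:
    """Simplistic correctness check (assumption): case-insensitive match."""
    return answer.strip().lower() == reference.strip().lower()


def empirical_p_ik(
    sampled_answers: Sequence[str],
    reference: str,
    is_correct: Callable[[str, str], bool] = exact_match,
) -> float:
    """Estimate P(IK) for one question as the share of correct samples."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    n_correct = sum(1 for a in sampled_answers if is_correct(a, reference))
    return n_correct / len(sampled_answers)
```

With four samples of which three match the reference, the estimate is 0.75. Comparing such estimates with and without source material in the prompt is one way to check the context sensitivity described above.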

Implications and Future Developments

The findings have both practical and theoretical implications. Practically, a model that can evaluate its own answers and predict what it knows could decline to answer or flag uncertainty, which supports more reliable and transparent AI systems. Theoretically, the results add to our understanding of how these models represent and assess their own knowledge.

Looking ahead, open questions include how these capabilities scale across model sizes and how they are affected by different training conditions. In particular, the authors hope to investigate how honesty generalizes to models trained on objectives other than the imitation of human writing.

In conclusion, the paper by Kadavath et al. shows that LMs mostly know what they know: they are well calibrated in the right format, can evaluate their own sampled answers, and can be trained to predict whether they know an answer, with calibration on new tasks as the main weakness. The authors present these results as groundwork for training more honest models that recognize and admit the limits of their knowledge.
