Language Models (Mostly) Know What They Know

Published Jul 11, 2022 in cs.CL, cs.AI, and cs.LG


We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.


  • The study investigates the self-evaluation capabilities of Language Models (LMs), focusing on their ability to assess the validity of their own outputs.

  • Larger LMs are well calibrated on multiple-choice and true/false questions, provided the questions are presented in the right format.

  • The paper introduces P(IK), the probability that a model 'knows' the answer to a question. Models trained to predict it perform well and partially generalize across tasks, though their P(IK) calibration degrades on new tasks.

  • Implications include the potential for creating more reliable and transparent AI systems, with a call for further exploration in model scaling and beyond language tasks.


The study by Kadavath et al. addresses a question central to building trustworthy Language Models (LMs): can a model evaluate the validity of its own outputs? The research investigates how accurately LMs can determine whether they "know" the answers to questions posed to them. This capability underpins honesty in AI systems, since a model that recognizes the limits of its knowledge can flag unreliable answers instead of asserting them. The analysis begins by assessing the calibration of language models on multiple-choice questions. It then examines models' self-evaluation on True/False tasks and their ability to predict whether they can answer a question correctly, introducing "P(IK)," the probability that a model "knows" the answer.

Calibration and Self-Evaluation

The research shows that large LMs are well calibrated on a variety of multiple-choice questions: the probabilities they assign to answer options track how often those options are actually correct. This holds only when questions are given in an appropriate format. The study highlights that explicitly showing the lettered answer options in the prompt significantly improves calibration. Calibration also improves with model size, which suggests that it depends on overall model capability.
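One common way to quantify the calibration described above is expected calibration error (ECE). The summary does not say which metric the authors report, so treat the choice of ECE, the equal-width binning, and the function name `expected_calibration_error` as assumptions made for illustration. A minimal sketch:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average |accuracy - confidence|
    over the bins, weighting each bin by the share of predictions in it.

    confidences: probabilities in [0, 1] assigned to the chosen answers.
    correct: booleans saying whether each chosen answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, bool(c)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model scores 0: if it answers four questions with confidence 0.75 and gets three right, accuracy equals confidence in that bin. Larger values mean the stated probabilities are further from observed accuracy.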

Moving beyond calibration, the investigation extends to a model's ability to self-evaluate its outputs. Here the model first proposes an answer and then estimates the probability, termed P(True) in the study, that this answer is correct. Self-evaluation improved when models were shown several of their own sampled answers before judging one specific answer. This suggests that seeing a range of candidate answers (akin to brainstorming) before committing to a probability makes the model's judgment more accurate.
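The P(True) procedure can be sketched in two steps: wrap a proposed answer in a true/false question, then turn the model's log-probabilities for the two options into a single probability. The prompt wording, the helper names `build_self_eval_prompt` and `p_true`, and the decision to renormalize over only the two options are illustrative assumptions, not the paper's confirmed format. The log-probabilities themselves would come from whatever model is being evaluated.

```python
import math


def build_self_eval_prompt(question, proposed_answer, other_samples=()):
    """Assemble a hypothetical true/false self-evaluation prompt.

    other_samples optionally lists further answers sampled from the same
    model, shown before the one being judged.
    """
    parts = [f"Question: {question}"]
    if other_samples:
        parts.append("Other sampled answers:")
        parts.extend(f"- {sample}" for sample in other_samples)
    parts.append(f"Proposed Answer: {proposed_answer}")
    parts.append("Is the proposed answer:")
    parts.append(" (A) True")
    parts.append(" (B) False")
    parts.append("The proposed answer is:")
    return "\n".join(parts)


def p_true(logprob_true, logprob_false):
    """Renormalize two option log-probabilities into P(True).

    Equivalent to exp(lt) / (exp(lt) + exp(lf)), written as a sigmoid of
    the difference so that very negative log-probabilities do not underflow.
    """
    diff = logprob_true - logprob_false
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

For example, if the model puts probability 0.6 on the "True" option and 0.2 on "False" (with the rest on other tokens), `p_true(math.log(0.6), math.log(0.2))` gives 0.75.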

Predicting "Knowing"

Perhaps the most intriguing aspect of this research is whether models can be trained to predict P(IK), the probability that they know the answer to a question, without reference to any particular proposed answer. The authors find that models perform well at distinguishing questions they can answer from those they cannot. This ability partially generalizes across tasks, although P(IK) calibration suffers on tasks the model was not trained on. Predicted P(IK) also rises appropriately when relevant source material is placed in the context, or when hints toward the solution of a mathematical word problem are provided, which indicates sensitivity to whether the available context makes a question answerable.
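To make the idea of a P(IK) target concrete, the sketch below estimates "how often the model gets this question right" from a batch of its own sampled answers, and scores predicted P(IK) values against those estimates with the Brier score. Both choices are assumptions for illustration: the summary does not specify how training labels are constructed or how predictions are scored, and the names `ik_target` and `brier_score` and the case-insensitive exact-match grader are hypothetical.

```python
def ik_target(sampled_answers, reference, is_correct=None):
    """Fraction of sampled answers judged correct for one question.

    is_correct is a grading function (answer, reference) -> bool; the
    default is a case-insensitive exact match, which is only a stand-in
    for a task-appropriate grader.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    if is_correct is None:
        def is_correct(answer, ref):
            return answer.strip().lower() == ref.strip().lower()
    hits = sum(1 for answer in sampled_answers if is_correct(answer, reference))
    return hits / len(sampled_answers)


def brier_score(predicted, targets):
    """Mean squared difference between predicted P(IK) and the targets.

    Lower is better; 0 means every prediction matched its target exactly.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not predicted:
        raise ValueError("need at least one prediction")
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)
```

With four samples `["Paris", "paris", "Lyon", "Paris"]` and reference `"Paris"`, `ik_target` returns 0.75. Comparing such targets before and after adding source material or hints to the context is one way to check that a predicted P(IK) moves in the right direction.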

Implications and Future Developments

The implications of these findings are manifold. Practically, the ability of LMs to self-evaluate and predict their knowledge accurately opens new avenues for creating more reliable and transparent AI systems. Theoretically, it pushes the boundary of understanding how these models process, evaluate, and apply knowledge.

Looking ahead, the researchers acknowledge several limitations, including the need to further investigate how these capabilities scale across models of varying sizes and are affected by different training conditions. Moreover, understanding how these self-evaluation capabilities translate to models trained on tasks beyond language is an area ripe for exploration.

In conclusion, the study by Kadavath et al. makes significant strides in understanding the self-evaluation capabilities of language models. It not only sheds light on how models can become more transparent and reliable but also sets the stage for future research aimed at creating AI systems capable of recognizing and admitting the limits of their knowledge.
