The Internal State of an LLM Knows When It's Lying (2304.13734v2)

Published 26 Apr 2023 in cs.CL, cs.AI, and cs.LG

Abstract: While LLMs have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Experiments demonstrate that given a set of test sentences, of which half are true and half false, our trained classifier achieves an average of 71% to 83% accuracy labeling which sentences are true versus false, depending on the LLM base model. Furthermore, we explore the relationship between our classifier's performance and approaches based on the probability assigned to the sentence by the LLM. We show that while LLM-assigned sentence probability is related to sentence truthfulness, this probability is also dependent on sentence length and the frequencies of words in the sentence, resulting in our trained classifier providing a more reliable approach to detecting truthfulness, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.

Introduction

LLMs have reshaped the landscape of natural language understanding and generation with their ability to perform well on diverse tasks. Yet a critical problem persists: LLMs often produce statements that are inaccurate or outright false, presented with a veneer of confidence. To address this, the authors introduce a strategy for discerning the truthfulness of statements, whether provided to or generated by an LLM, by leveraging the model's internal states: a classifier trained on the LLM's hidden-layer activations.

Methodology

The method, Statement Accuracy Prediction based on LLM Activations (SAPLMA), feeds the activation values from an LLM's hidden layers to a classifier that predicts the veracity of a statement. It goes beyond surface statistics such as word frequency and sentence length, which are insufficient on their own for separating true from false information. The classifier is trained on a purpose-built dataset of true and false statements spanning six content areas, so that its performance is not tied to a single domain of knowledge. A key element of the training protocol is that the classifier is trained on topics distinct from the one being evaluated, ensuring general applicability and avoiding the pitfalls of topic-specific training (see the sketch below).
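As a concrete illustration, the following is a minimal sketch of this kind of probing setup using a Hugging Face causal LM and a small feedforward classifier with a leave-one-topic-out split. The model name, probed layer, and classifier architecture are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: probe an LLM's hidden states for statement truthfulness (SAPLMA-style).
# Assumptions: model name, probed layer, and classifier shape are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

MODEL_NAME = "facebook/opt-350m"  # placeholder; the paper evaluates larger base LLMs
LAYER = -4                        # hidden layer to probe; the optimal layer varies by model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def statement_activation(text: str) -> torch.Tensor:
    """Return the chosen hidden layer's activation at the statement's last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

def leave_one_topic_out(data, held_out_topic):
    """Train on all topics except `held_out_topic`, test on the held-out topic.

    `data` is a list of (statement, label, topic) triples with label 1 = true, 0 = false.
    """
    train = [(s, y) for s, y, t in data if t != held_out_topic]
    test = [(s, y) for s, y, t in data if t == held_out_topic]
    X_train = torch.stack([statement_activation(s) for s, _ in train]).numpy()
    X_test = torch.stack([statement_activation(s) for s, _ in test]).numpy()
    y_train = [y for _, y in train]
    y_test = [y for _, y in test]

    clf = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```

Holding out an entire topic, rather than a random split, is what forces the probe to rely on a general truthfulness signal in the activations instead of memorized topic-specific cues.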

Performance and Findings

The paper reports that SAPLMA consistently outperforms several baseline models at classifying statements as true or false, reaching accuracies of 71% to 83% depending on the base LLM. This is a marked improvement over few-shot prompting approaches, whose accuracy hovers around 56%. The evaluation also shows that some hidden layers are more informative than others for this prediction, but no single layer is best across models, suggesting that the optimal layer for SAPLMA varies with the specific LLM in use.
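For context on the probability-based comparison discussed in the abstract, a statement's LLM-assigned likelihood can be summarized as its mean token log-probability, as in the hedged sketch below (which reuses `tokenizer` and `model` from the previous sketch). The exact normalization used in the paper's baselines may differ.

```python
# Sketch: a sentence-probability baseline score for comparison.
# Mean token log-probability is one plausible normalization; it remains sensitive
# to sentence length and word frequency, which is the weakness the paper highlights.
import torch
import torch.nn.functional as F

def mean_token_logprob(text: str) -> float:
    """Average log-probability the LLM assigns to the statement's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = inputs["input_ids"][:, 1:]           # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```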

Implications and Future Directions

SAPLMA's promise lies in its potential to flag and correct LLM outputs before they reach end-users, thereby improving the LLM's trustworthiness. The methodology taps the knowledge already encoded within an LLM to assess its own outputs. Future research may extend to other LLMs, multi-language implementations, and human-interaction studies that measure user trust. Also on the agenda are examining how hidden activations evolve over the course of text generation and handling multi-line input.

The accompanying true-false dataset is an asset in itself, giving researchers a valuable resource for continued exploration in this vein. As for limitations, SAPLMA's threshold for classifying a statement as true may need calibration (one possible calibration approach is sketched below), and the approach currently centers on English, warranting cross-linguistic evaluation in subsequent work.
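One simple way to calibrate such a threshold, sketched here as an assumption rather than the authors' procedure, is to sweep candidate cut-offs on a held-out calibration set (the names `clf`, `X_cal`, and `y_cal` are hypothetical).

```python
# Sketch: calibrate the decision threshold on held-out data instead of a fixed 0.5 cut-off.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels):
    """Pick the classification threshold that maximizes accuracy on calibration data."""
    cal_probs = np.asarray(cal_probs)
    cal_labels = np.asarray(cal_labels)
    candidates = np.linspace(0.05, 0.95, 19)
    accs = [np.mean((cal_probs >= t).astype(int) == cal_labels) for t in candidates]
    return float(candidates[int(np.argmax(accs))])

# Hypothetical usage with the classifier from the earlier sketch:
# probs = clf.predict_proba(X_cal)[:, 1]
# threshold = calibrate_threshold(probs, y_cal)
```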

Conclusion

The research adds a compelling chapter to the ongoing narrative of LLM refinement. It exposes an internal assessment mechanism within LLMs that could be harnessed for verifying the veracity of generated content. As discussions around ethical AI practice and misinformation grow, tools like SAPLMA stand out as significant steps towards responsible LLM deployment.

Authors (2)
  1. Amos Azaria (33 papers)
  2. Tom Mitchell (27 papers)
Citations (237)