The Internal State of an LLM Knows When It's Lying (2304.13734v2)

Published 26 Apr 2023 in cs.CL, cs.AI, and cs.LG

Abstract: While LLMs have shown exceptional performance in various tasks, one of their most prominent drawbacks is generating inaccurate or false information with a confident tone. In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements. This includes both statements provided to the LLM, and statements that the LLM itself generates. Our approach is to train a classifier that outputs the probability that a statement is truthful, based on the hidden layer activations of the LLM as it reads or generates the statement. Experiments demonstrate that given a set of test sentences, of which half are true and half false, our trained classifier achieves an average of 71% to 83% accuracy labeling which sentences are true versus false, depending on the LLM base model. Furthermore, we explore the relationship between our classifier's performance and approaches based on the probability assigned to the sentence by the LLM. We show that while LLM-assigned sentence probability is related to sentence truthfulness, this probability is also dependent on sentence length and the frequencies of words in the sentence, resulting in our trained classifier providing a more reliable approach to detecting truthfulness, highlighting its potential to enhance the reliability of LLM-generated content and its practical applicability in real-world scenarios.

Introduction

LLMs have reshaped the landscape of natural language understanding and generation with their ability to perform well on diverse tasks. Yet a critical problem persists: LLMs often produce statements that are inaccurate or outright false, presented with a veneer of confidence. To address this, the authors introduce a strategy for discerning the truthfulness of statements, whether provided to or generated by an LLM, by leveraging the model's internal states: a classifier trained on the LLM's hidden-layer activations.

Methodology

The method, Statement Accuracy Prediction based on LLM Activations (SAPLMA), feeds the activation values from an LLM's hidden layers to a classifier that predicts the veracity of a statement. It goes beyond surface statistics such as word frequency and sentence length, which are insufficient on their own for separating true from false information. The classifier is trained on a purpose-built dataset of true and false statements spanning six content areas, so that its performance is not tied to a single domain of knowledge. A key element of the training protocol is that the classifier is trained on topics distinct from the one being evaluated, ensuring general applicability and avoiding the pitfalls of topic-specific training (see the sketch below).
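As a concrete illustration, the following is a minimal sketch of this kind of probing setup using a Hugging Face causal LM and a small feedforward classifier with a leave-one-topic-out split. The model name, probed layer, and classifier architecture are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: probe an LLM's hidden states for statement truthfulness (SAPLMA-style).
# Assumptions: model name, probed layer, and classifier shape are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

MODEL_NAME = "facebook/opt-350m"  # placeholder; the paper evaluates larger base LLMs
LAYER = -4                        # hidden layer to probe; the optimal layer varies by model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def statement_activation(text: str) -> torch.Tensor:
    """Return the chosen hidden layer's activation at the statement's last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[LAYER][0, -1, :]  # shape: (hidden_dim,)

def leave_one_topic_out(data, held_out_topic):
    """Train on all topics except `held_out_topic`, test on the held-out topic.

    `data` is a list of (statement, label, topic) triples with label 1 = true, 0 = false.
    """
    train = [(s, y) for s, y, t in data if t != held_out_topic]
    test = [(s, y) for s, y, t in data if t == held_out_topic]
    X_train = torch.stack([statement_activation(s) for s, _ in train]).numpy()
    X_test = torch.stack([statement_activation(s) for s, _ in test]).numpy()
    y_train = [y for _, y in train]
    y_test = [y for _, y in test]

    clf = MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=300)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```

Holding out an entire topic, rather than a random split, is what forces the probe to rely on a general truthfulness signal in the activations instead of memorized topic-specific cues.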

Performance and Findings

The paper reports that SAPLMA consistently outperforms several baseline models at classifying statements as true or false, reaching accuracies of 71% to 83% depending on the base LLM. This is a marked improvement over few-shot prompting approaches, whose accuracy hovers around 56%. The evaluation also shows that some hidden layers are more informative than others for this prediction, but no single layer is best across models, suggesting that the optimal layer for SAPLMA varies with the specific LLM in use.
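For context on the probability-based comparison discussed in the abstract, a statement's LLM-assigned likelihood can be summarized as its mean token log-probability, as in the hedged sketch below (which reuses `tokenizer` and `model` from the previous sketch). The exact normalization used in the paper's baselines may differ.

```python
# Sketch: a sentence-probability baseline score for comparison.
# Mean token log-probability is one plausible normalization; it remains sensitive
# to sentence length and word frequency, which is the weakness the paper highlights.
import torch
import torch.nn.functional as F

def mean_token_logprob(text: str) -> float:
    """Average log-probability the LLM assigns to the statement's tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = inputs["input_ids"][:, 1:]           # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```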

Implications and Future Directions

SAPLMA's promise lies in its potential to flag and correct LLM outputs before they reach end-users, thereby improving the LLM's trustworthiness. The methodology taps the knowledge already encoded within an LLM to assess its own outputs. Future research may extend to other LLMs, multi-language implementations, and human-interaction studies that measure user trust. Also on the agenda are examining how hidden activations evolve over the course of text generation and handling multi-line input.

The accompanying true-false dataset is an asset in itself, giving researchers a valuable resource for continued exploration in this vein. As for limitations, SAPLMA's threshold for classifying a statement as true may need calibration (one possible calibration approach is sketched below), and the approach currently centers on English, warranting cross-linguistic evaluation in subsequent work.
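One simple way to calibrate such a threshold, sketched here as an assumption rather than the authors' procedure, is to sweep candidate cut-offs on a held-out calibration set (the names `clf`, `X_cal`, and `y_cal` are hypothetical).

```python
# Sketch: calibrate the decision threshold on held-out data instead of a fixed 0.5 cut-off.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels):
    """Pick the classification threshold that maximizes accuracy on calibration data."""
    cal_probs = np.asarray(cal_probs)
    cal_labels = np.asarray(cal_labels)
    candidates = np.linspace(0.05, 0.95, 19)
    accs = [np.mean((cal_probs >= t).astype(int) == cal_labels) for t in candidates]
    return float(candidates[int(np.argmax(accs))])

# Hypothetical usage with the classifier from the earlier sketch:
# probs = clf.predict_proba(X_cal)[:, 1]
# threshold = calibrate_threshold(probs, y_cal)
```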

Conclusion

The research adds a compelling chapter to the ongoing narrative of LLM refinement. It exposes an internal assessment mechanism within LLMs that could be harnessed for verifying the veracity of generated content. As discussions around ethical AI practice and misinformation grow, tools like SAPLMA stand out as significant steps towards responsible LLM deployment.

Authors (2)
  1. Amos Azaria (33 papers)
  2. Tom Mitchell (27 papers)
Citations (237)