- The paper demonstrates that LLMs encode self-knowledge about entities in latent directions of their internal representations, and that this self-knowledge directly influences whether they provide accurate responses or hallucinate details.
- It employs sparse autoencoders to identify latent directions that effectively separate known from unknown entities, with high separation scores reflecting stronger factual associations.
- The research shows that manipulating these latent directions can steer model behavior to either answer, refuse to answer, or generate hallucinations, paving the way for safer, more reliable systems.
This paper is important because it digs into one of the key challenges facing LLMs: their tendency to generate factually incorrect information (often called hallucinations) instead of refusing to answer when they lack the necessary information. Understanding why these models sometimes hallucinate and sometimes refuse is crucial for improving their reliability, especially in applications where accuracy matters.
Below is a detailed explanation of the main ideas and methods presented in the work:
Background and Motivation
LLMs are very good at generating fluent text, but they sometimes produce outputs that are false or misleading. This happens when the model is asked about an entity (like a movie, athlete, or city) that it does not have strong factual information about. The paper seeks to understand what happens inside these models when they decide whether to provide an answer, hallucinate details, or simply refuse to answer.
- The authors explore the idea that models may have a form of self-knowledge—they internally represent whether they “know” a particular entity.
- This self-knowledge influences whether the model produces a factual answer, a hallucination, or a refusal.
The Role of Sparse Autoencoders
To uncover these internal mechanisms, the paper uses sparse autoencoders (SAEs). SAEs are tools that help break down complex internal representations into simpler, interpretable components. In this paper, they are used to find specific “directions” in the model’s internal representation space that correspond to whether an entity is known or unknown.
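To make this concrete, here is a minimal sketch of how a trained SAE encoder could be applied to one of the model's hidden states to obtain sparse latent activations. The dimensions, the `W_enc`/`b_enc` parameter names, and the ReLU encoder form are illustrative assumptions, not the authors' exact setup.

```python
import torch

# Illustrative sizes (smaller than a real SAE would use): d_model is the
# LLM's hidden size, d_sae is the much larger number of SAE latents.
d_model, d_sae = 1024, 16384

# Parameters of a trained sparse autoencoder; random placeholders stand in
# for weights that would normally be loaded from disk.
W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)

def sae_encode(hidden_state: torch.Tensor) -> torch.Tensor:
    """Map a residual-stream vector to sparse latent activations.

    The ReLU zeroes out most latents, so only a handful remain active;
    each active latent corresponds to one candidate "direction".
    """
    return torch.relu(hidden_state @ W_enc + b_enc)

# In the paper's setting, hidden_state would come from the model's residual
# stream while it processes the entity mention; a random vector stands in here.
hidden_state = torch.randn(d_model)
latents = sae_encode(hidden_state)

active_ids = (latents > 0).nonzero().squeeze(-1)
print(f"{active_ids.numel()} of {d_sae} latents are active on this prompt")
```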
- Latent Directions: Think of the model’s internal state as a long list of numbers. Some specific patterns in these numbers can tell us if the model is confident about an entity or if it is uncertain. The paper identifies two types of these directions:
- One that becomes active when the model recognizes a well-known entity.
- Another that activates when the model encounters an entity it does not recognize.
- Separation Score: The authors compute a score for each latent direction by measuring how often it activates on known entities versus unknown entities. In simple terms, if a component fires much more often when the entity is known than when it is unknown, it gets a high separation score. One way they express this is by comparing the frequency of activation on each type:
$s_{\text{known}} = f_{\text{known}} - f_{\text{unknown}}$
- Here, $f_{\text{known}}$ is the fraction of prompts about known entities on which the latent fires, and $f_{\text{unknown}}$ is the corresponding fraction for unknown entities.
- A high value of $s_{\text{known}}$ suggests a strong association with known entities.
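Expressed in code, the separation score is just a difference of firing frequencies. The sketch below assumes the latent's activation has already been recorded on a batch of prompts about known and unknown entities; the array names and the zero firing threshold are illustrative.

```python
import numpy as np

def separation_score(acts_known: np.ndarray, acts_unknown: np.ndarray,
                     threshold: float = 0.0) -> float:
    """Compute s_known = f_known - f_unknown for one latent direction.

    acts_known / acts_unknown hold the latent's activation value on prompts
    about known and unknown entities; an activation above `threshold`
    counts as the latent "firing".
    """
    f_known = float(np.mean(acts_known > threshold))
    f_unknown = float(np.mean(acts_unknown > threshold))
    return f_known - f_unknown

# Toy example: a latent that fires on most known entities but few unknown ones.
acts_known = np.array([0.9, 1.2, 0.0, 0.7, 0.8])
acts_unknown = np.array([0.0, 0.1, 0.0, 0.0, 0.0])
print(separation_score(acts_known, acts_unknown))  # 0.8 - 0.2 = 0.6
```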
Experiments and Findings
The research team built a dataset using four types of entities: players, movies, cities, and songs. They measured how often different latent directions activated when the model processed prompts about these entities. The key findings include:
- Causal Influence on Behavior: The authors showed that by “steering” the model, that is, by artificially increasing or decreasing the activation of these latent directions in its hidden states, they could control whether the model provided an answer, refused to answer, or hallucinated information (a minimal sketch of this kind of intervention follows this list).
- For example, when the latent direction associated with unknown entities was boosted, the model almost always refused to answer questions about the entity.
- Conversely, when the latent linked with known entities was increased on a prompt about an unknown entity, the model was more likely to hallucinate details it did not have.
- Attention Mechanisms: The paper also explored how these latent directions affect attention within the model. In LLMs, attention is the mechanism that determines which parts of the input to focus on. The paper found that the latent direction for known entities increases the model’s attention on the entity token (the part of the input referring to the entity), helping it extract relevant information. On the other hand, activating the unknown-entity latent reduces this attention, so the model extracts factual information about the entity less reliably.
- Uncertainty Representation: Beyond simply recognizing known versus unknown, the work identified additional latent directions that correlate with how uncertain the model is when generating an answer. These “uncertainty directions” can predict when the model is likely to provide a wrong or hallucinated answer even if it does not outright refuse to respond.
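As a rough illustration of the steering intervention mentioned above, the sketch below adds a scaled copy of a latent's direction to the model's residual-stream activations. The assumption that steering works by adding the latent's direction at a chosen layer, and all names and values here, are illustrative rather than the authors' exact implementation.

```python
import torch

def steer_hidden_states(hidden: torch.Tensor,
                        direction: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    """Add a scaled steering direction to the residual stream.

    hidden:    activations of shape (batch, seq_len, d_model)
    direction: the chosen latent's direction in model space, shape (d_model,)
    alpha:     steering strength; positive boosts the latent's effect,
               negative suppresses it
    """
    return hidden + alpha * direction / direction.norm()

# Stand-ins for real activations and a real latent direction.
d_model = 1024
hidden = torch.randn(1, 12, d_model)      # one prompt, 12 tokens
unknown_dir = torch.randn(d_model)        # "unknown entity" latent direction

# Boosting the unknown-entity direction should push the model toward refusal;
# boosting the known-entity direction on an unfamiliar entity encourages hallucination.
steered = steer_hidden_states(hidden, unknown_dir, alpha=8.0)
```

In practice, a function like this would typically be registered as a forward hook on the chosen transformer layer so that the modified activations flow through the rest of the forward pass and influence the generated answer.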
Implications and Recommendations
Understanding these internal mechanisms has several practical benefits:
- Improving Model Reliability: By identifying the internal signals that indicate whether the model “knows” an entity, researchers and developers can design interventions to improve the model’s accuracy and reduce hallucinations. For instance, adjusting how the model handles unknown entities might lead it to refuse appropriately rather than fabricate details.
- Designing Safer Systems: In applications like healthcare or legal advice, having models that can correctly decide not to answer when uncertain is essential. The ability to steer the model’s behavior by manipulating these latent directions provides a potential pathway to safer, more dependable systems.
- Further Research: The use of sparse autoencoders provides a window into the high-dimensional internal workings of LLMs. This approach opens up avenues for future work to explore other aspects of self-knowledge and uncertainty in models.
Concluding Summary
In summary, the paper shows that LLMs internally encode a form of self-knowledge about the entities they are asked about using certain latent directions. By using sparse autoencoders to identify these directions and then steering them, the researchers demonstrated that it is possible to causally influence whether a model will provide an answer, hallucinate, or refuse to answer a question. This insight not only aids in understanding how LLMs work internally but also points the way toward practical techniques for improving the factual reliability and safety of these models.
This detailed breakdown should give you a clearer picture of the research, its methods, and its significance without requiring a deep technical background.