Investigating the Role of Retrieval Heads in LLMs for Long-context Information Processing
Introduction
This paper presents a systematic study aimed at understanding the role and functioning of a specific type of attention head, referred to as "retrieval heads," within transformer-based LLMs that handle long-context data. Attention mechanisms in such models enable the retrieval of relevant information across extended passages of text. By conducting experiments across four model families, six model scales, and three types of finetuning, the researchers identify properties of retrieval heads that significantly affect how LLMs access and use information from long input sequences.
Detection of Retrieval Heads
The researchers developed a method to detect retrieval heads based on a specific behavior: the frequency with which a head copies tokens from the input to the output during generation. The detection procedure uses the Needle-in-a-Haystack test, which embeds a short piece of exact information (the "needle") at a random position within a long distractor context and asks the model to retrieve it. The outcome is a "retrieval score," which quantifies how often a head copies the relevant tokens during the generation process. This score allowed the identification of retrieval heads among the many attention heads in each model, showing that typically fewer than 5% of heads in any tested model displayed significant retrieval behavior.
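The sketch below illustrates the scoring idea described above; it is a minimal illustration, not the authors' code. It assumes you have already recorded, for each decoding step, the generated token and each head's most-attended input position (names such as `decode_trace` and `top_attended_position` are hypothetical).

```python
from collections import defaultdict

def retrieval_scores(decode_trace, needle_positions, needle_tokens):
    """Illustrative retrieval-score computation (a sketch, not the paper's code).

    A head "copies" at a decoding step if its most-attended input position
    lies inside the needle AND the token generated at that step equals the
    token stored at that position.
    """
    copied = defaultdict(set)  # head id -> needle positions it has copied
    for step in decode_trace:  # one record per generated token
        for head, pos in step["top_attended_position"].items():
            if pos in needle_positions and step["generated_token"] == needle_tokens[pos]:
                copied[head].add(pos)
    # Score: fraction of needle tokens the head copied; heads that never
    # copy simply do not appear in the result (implicit score of 0).
    return {head: len(positions) / len(needle_positions)
            for head, positions in copied.items()}
```

In practice such a score would be averaged over many needle contents and insertion depths, and heads whose average exceeds a fixed threshold would be labeled retrieval heads.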
Characteristics of Retrieval Heads
The paper outlines several key characteristics of retrieval heads:
- Universal and Sparse Existence: Retrieval heads are a sparse yet universal phenomenon among tested LLMs capable of handling long contexts, regardless of the model architecture or training specifics.
- Dynamic Activation: Which heads activate depends largely on the context; some heads are active on nearly every retrieval task, while others activate only in specific scenarios.
- Intrinsic to Models: Intriguingly, these heads appear inherently in models, even those pretrained on shorter contexts, and they continue to function similarly after extensive further training, such as context-length extension or chat finetuning.
Impact on Model Outputs
Extensive experiments demonstrate that the activation of retrieval heads is crucial for the model's ability to retrieve accurate information from long passages. When these heads are masked out or otherwise not fully operational, performance on retrieval tasks degrades sharply, and the model often hallucinates: it produces fluent but incorrect content instead of copying the target information. These heads also play a significant role in more complex reasoning tasks, such as chain-of-thought processes, where the model needs to maintain and access a trail of prior thoughts or inputs to generate coherent and contextually faithful outputs.
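One concrete way to run such an ablation is to zero the contributions of selected heads just before the attention output projection. The sketch below uses a PyTorch forward pre-hook; the module path (`self_attn.o_proj`), the attribute `num_heads`, and the `retrieval_heads_by_layer` mapping are assumptions that vary across architectures and library versions.

```python
import torch

def ablate_heads_pre_hook(head_indices, num_heads):
    """Forward pre-hook for an attention block's output projection.

    Zeros the per-head activations of the chosen heads before they are
    mixed back into the residual stream. Assumes the projection input has
    shape (batch, seq_len, num_heads * head_dim).
    """
    def hook(module, args):
        x = args[0]
        b, s, d = x.shape
        x = x.reshape(b, s, num_heads, d // num_heads).clone()
        x[:, :, head_indices, :] = 0.0  # ablate the chosen heads
        return (x.reshape(b, s, d),)
    return hook

# Hypothetical usage with a Hugging Face-style decoder; layer and head ids
# come from the retrieval-score pass, and attribute names differ by model.
# for layer, heads in retrieval_heads_by_layer.items():
#     attn = model.model.layers[layer].self_attn
#     attn.o_proj.register_forward_pre_hook(
#         ablate_heads_pre_hook(list(heads), attn.num_heads))
```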
Implications and Future Directions
The findings from this research underscore the critical role of retrieval heads in managing long-context data within LLMs and highlight potential pathways for enhancing model design. Considerations for future work include exploring mechanisms to train these model components more explicitly during training phases or devising model architectures that enable more efficient or robust information retrieval. Moreover, understanding the specific workings of these retrieval heads could lead to better strategies for model compression and deployment, particularly in reducing the storage and computation overhead of maintaining long context windows (for example, the key-value cache) in operational settings.
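To make the compression direction concrete, here is a toy sketch of a per-head cache policy under the assumption that non-retrieval heads rarely need distant tokens: retrieval heads keep their full key-value cache while other heads keep only a recent window. This illustrates the idea rather than a method from the paper; the cache layout and the `retrieval_heads` mapping are assumptions.

```python
def prune_kv_cache(kv_cache, retrieval_heads, recent_window=256):
    """Toy per-head KV-cache policy (illustrative only).

    `kv_cache[layer]` is assumed to be a (keys, values) pair of tensors
    shaped (batch, num_heads, seq_len, head_dim); `retrieval_heads` maps
    layer index -> set of head indices whose cache is preserved in full.
    """
    pruned = []
    for layer, (keys, values) in enumerate(kv_cache):
        keep = retrieval_heads.get(layer, set())
        k, v = keys.clone(), values.clone()
        for h in range(k.shape[1]):
            if h not in keep:
                # Zeroing stands in for eviction; a real implementation
                # would store ragged per-head caches to actually save memory.
                k[:, h, :-recent_window, :] = 0.0
                v[:, h, :-recent_window, :] = 0.0
        pruned.append((k, v))
    return pruned
```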
Conclusion
The paper consolidates our understanding of how a specific type of attention head, the retrieval head, contributes fundamentally to a model's ability to process and utilize long-sequence data efficiently. It paves the way for refined model architectures and training strategies that could further harness the power of LLMs in applications requiring the processing of extensive textual data.