
Attention Sorting Combats Recency Bias In Long Context Language Models (2310.01427v1)

Published 28 Sep 2023 in cs.CL and cs.AI

Abstract: Current LLMs often fail to incorporate long contexts efficiently during generation. We show that a major contributor to this issue are attention priors that are likely learned during pre-training: relevant information located earlier in context is attended to less on average. Yet even when models fail to use the information from a relevant document in their response, they still pay preferential attention to that document compared to an irrelevant document at the same position. We leverage this fact to introduce "attention sorting": perform one step of decoding, sort documents by the attention they receive (highest attention going last), repeat the process, generate the answer with the newly sorted context. We find that attention sorting improves performance of long context models. Our findings highlight some challenges in using off-the-shelf LLMs for retrieval augmented generation.

Analysis of "Attention Sorting Combats Recency Bias In Long Context LLMs"

In "Attention Sorting Combats Recency Bias In Long Context LLMs," Alexander Peysakhovich and Adam Lerer examine a critical issue in applying LLMs: making effective use of long context windows. Through controlled experiments, they identify a recency bias in these models' attention patterns: on average, models attend more to recent tokens than to earlier, potentially relevant ones, which hurts tasks that require integrating information dispersed across the context.

Primary Contributions

The authors introduce a method called "attention sorting" to address the challenges LLMs face in retrieval augmented generation (RAG) tasks. The technique involves three steps (a minimal code sketch follows the list):

  1. Attention Evaluation: Perform a single decoding step and measure how much attention each document in the context receives, revealing where the model focuses.
  2. Reordering: Sort the documents by attention score, placing the most-attended documents last, then generate a response with the reordered context.
  3. Iterative Refinement: Repeat the sort-and-decode step several times so that relevant documents migrate toward the end of the context, where the model attends most, improving accuracy on context-rich tasks.
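
The following is a minimal sketch of this loop against a generic Hugging Face decoder-only model. The model name, the "Document:"/"Question:" prompt template, the three-round default, and the layer- and head-averaged attention aggregation are all illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any decoder-only HF model with attention outputs; the
# paper's exact models and prompts differ.
MODEL = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def doc_attention_scores(question, docs):
    """Attention mass that the last prompt position (the query that
    produces the first generated token) places on each document span."""
    doc_texts = [f"Document: {d}\n" for d in docs]  # template is an assumption
    prompt = "".join(doc_texts) + f"Question: {question}\nAnswer:"
    enc = tok(prompt, return_tensors="pt")

    # Approximate each document's token span (ignores merges at boundaries).
    spans, start = [], 1 if tok.bos_token_id is not None else 0
    for t in doc_texts:
        n = len(tok(t, add_special_tokens=False)["input_ids"])
        spans.append((start, start + n))
        start += n

    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
    # Average over layers and heads, take the last query position's row.
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]
    return [att[s:e].sum().item() for s, e in spans]

def attention_sort(question, docs, rounds=3):
    """One decoding step, sort ascending by attention (most-attended last),
    repeat; then generate normally with the reordered context."""
    for _ in range(rounds):
        scores = doc_attention_scores(question, docs)
        docs = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0])]
    return docs  # pass the reordered docs to standard generation
```

Sorting ascending places the highest-attention documents last, where a recency-biased model attends most; repeating the step lets relevant documents that start mid-ranked climb toward the end of the context.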

Key Experimental Findings

The paper evaluates on SynthWiki, a synthetic long-context extractive QA dataset in which a model must extract a specific fact from a pool of documents. Because the articles concern fictitious subjects, answers cannot be recalled from pre-training data, which isolates the effects of retrieval and attention mechanics.

Significant findings include:

  • Model Performance Degradation: Adding distractor documents predictably reduces accuracy across both open-source and proprietary models.
  • Attention Bias Patterns: Models consistently attend more to documents placed later in the context, a marked recency bias (a probe sketch follows this list).
  • Effectiveness of Attention Sorting: Applying attention sorting yields substantial accuracy gains, especially in long-context settings, counteracting the biased attention allocation.
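
One hedged way to visualize that bias, reusing `doc_attention_scores` from the sketch above, is to slide a single relevant document through every slot among fixed distractors and record the attention it receives at each position. This mirrors the spirit of the paper's positional analysis, not its exact protocol:

```python
def position_attention_profile(question, relevant_doc, distractors):
    """Hypothetical probe: attention received by the same relevant document
    when inserted at each position among fixed distractor documents."""
    profile = []
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [relevant_doc] + distractors[pos:]
        profile.append(doc_attention_scores(question, docs)[pos])
    # Under recency bias, values in `profile` grow toward later positions.
    return profile
```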

Implications

The results offer a practical, inference-time mitigation for positional biases that current LLMs acquire during pre-training. Such biases hurt tasks that require comprehension of long contexts, such as multi-document QA and summarization.

Theoretical Impact and Future Directions

From a theoretical perspective, the paper underscores the importance of understanding the attention patterns LLMs learn and adapting them to tasks whose context structure differs from the pre-training distribution. It motivates future work on aligning LLM training objectives more closely with real-world RAG workloads, for example by jointly fine-tuning retrieval and generation.

In conclusion, "Attention Sorting Combats Recency Bias In Long Context LLMs" presents a compelling argument for refining context manipulation strategies in LLMs. As AI applications continue to evolve, understanding and optimizing how models incorporate vast contexts will remain integral to their success. The insights from this paper provide a foundational step towards achieving better context management in these complex systems.

Authors (2)
  1. Alexander Peysakhovich (22 papers)
  2. Adam Lerer (30 papers)
Citations (31)