
Attention Sorting Combats Recency Bias In Long Context Language Models (2310.01427v1)

Published 28 Sep 2023 in cs.CL and cs.AI

Abstract: Current LLMs often fail to incorporate long contexts efficiently during generation. We show that a major contributor to this issue are attention priors that are likely learned during pre-training: relevant information located earlier in context is attended to less on average. Yet even when models fail to use the information from a relevant document in their response, they still pay preferential attention to that document compared to an irrelevant document at the same position. We leverage this fact to introduce "attention sorting": perform one step of decoding, sort documents by the attention they receive (highest attention going last), repeat the process, generate the answer with the newly sorted context. We find that attention sorting improves performance of long context models. Our findings highlight some challenges in using off-the-shelf LLMs for retrieval augmented generation.

Analysis of "Attention Sorting Combats Recency Bias In Long Context LLMs"

In "Attention Sorting Combats Recency Bias In Long Context LLMs," Alexander Peysakhovich and Adam Lerer examine a critical issue in applying LLMs: making effective use of long context windows. Through controlled experiments, they identify a recency bias in these models' attention patterns: on average, models attend more to recent tokens than to earlier, potentially relevant ones, which hurts tasks that require integrating information dispersed across the context.

Primary Contributions

The authors introduce a method called "attention sorting" to address the challenges LLMs face in retrieval augmented generation (RAG) tasks. The technique involves three steps (a minimal code sketch follows the list):

  1. Attention Evaluation: Perform a single decoding step and measure how much attention each document in the context receives, revealing where the model focuses.
  2. Reordering: Sort the documents by attention score, placing the most-attended documents last, then generate a response with the reordered context.
  3. Iterative Refinement: Repeat the sort-and-decode step several times so that relevant documents migrate toward the end of the context, where the model attends most, improving accuracy on context-rich tasks.
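
The following is a minimal sketch of this loop against a generic Hugging Face decoder-only model. The model name, the "Document:"/"Question:" prompt template, the three-round default, and the layer- and head-averaged attention aggregation are all illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any decoder-only HF model with attention outputs; the
# paper's exact models and prompts differ.
MODEL = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def doc_attention_scores(question, docs):
    """Attention mass that the last prompt position (the query that
    produces the first generated token) places on each document span."""
    doc_texts = [f"Document: {d}\n" for d in docs]  # template is an assumption
    prompt = "".join(doc_texts) + f"Question: {question}\nAnswer:"
    enc = tok(prompt, return_tensors="pt")

    # Approximate each document's token span (ignores merges at boundaries).
    spans, start = [], 1 if tok.bos_token_id is not None else 0
    for t in doc_texts:
        n = len(tok(t, add_special_tokens=False)["input_ids"])
        spans.append((start, start + n))
        start += n

    with torch.no_grad():
        out = model(**enc, output_attentions=True)
    # out.attentions: tuple of (batch, heads, seq, seq) tensors, one per layer.
    # Average over layers and heads, take the last query position's row.
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]
    return [att[s:e].sum().item() for s, e in spans]

def attention_sort(question, docs, rounds=3):
    """One decoding step, sort ascending by attention (most-attended last),
    repeat; then generate normally with the reordered context."""
    for _ in range(rounds):
        scores = doc_attention_scores(question, docs)
        docs = [d for _, d in sorted(zip(scores, docs), key=lambda p: p[0])]
    return docs  # pass the reordered docs to standard generation
```

Sorting ascending places the highest-attention documents last, where a recency-biased model attends most; repeating the step lets relevant documents that start mid-ranked climb toward the end of the context.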

Key Experimental Findings

The paper evaluates on SynthWiki, a synthetic long-context extractive QA dataset in which a model must extract a specific fact from a pool of documents. Because the articles concern fictitious subjects, answers cannot be recalled from pre-training data, which isolates the effects of retrieval and attention mechanics.

Significant findings include:

  • Model Performance Degradation: Adding distractor documents predictably reduces accuracy across both open-source and proprietary models.
  • Attention Bias Patterns: Models consistently attend more to documents placed later in the context, a marked recency bias (a probe sketch follows this list).
  • Effectiveness of Attention Sorting: Applying attention sorting yields substantial accuracy gains, especially in long-context settings, counteracting the biased attention allocation.
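
One hedged way to visualize that bias, reusing `doc_attention_scores` from the sketch above, is to slide a single relevant document through every slot among fixed distractors and record the attention it receives at each position. This mirrors the spirit of the paper's positional analysis, not its exact protocol:

```python
def position_attention_profile(question, relevant_doc, distractors):
    """Hypothetical probe: attention received by the same relevant document
    when inserted at each position among fixed distractor documents."""
    profile = []
    for pos in range(len(distractors) + 1):
        docs = distractors[:pos] + [relevant_doc] + distractors[pos:]
        profile.append(doc_attention_scores(question, docs)[pos])
    # Under recency bias, values in `profile` grow toward later positions.
    return profile
```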

Implications

The results offer a practical, inference-time mitigation for positional biases that current LLMs acquire during pre-training. Such biases hurt tasks that require comprehension of long contexts, such as multi-document QA and summarization.

Theoretical Impact and Future Directions

From a theoretical perspective, the paper underscores the importance of understanding the attention patterns LLMs learn and adapting them to tasks whose context structure differs from the pre-training distribution. It motivates future work on aligning LLM training objectives more closely with real-world RAG workloads, for example by jointly fine-tuning retrieval and generation.

In conclusion, "Attention Sorting Combats Recency Bias In Long Context LLMs" presents a compelling argument for refining context manipulation strategies in LLMs. As AI applications continue to evolve, understanding and optimizing how models incorporate vast contexts will remain integral to their success. The insights from this paper provide a foundational step towards achieving better context management in these complex systems.

Authors (2)
  1. Alexander Peysakhovich (22 papers)
  2. Adam Lerer (30 papers)
Citations (31)