Can't Remember Details in Long Documents? You Need Some R&R (2403.05004v1)

Published 8 Mar 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Long-context LLMs hold promise for tasks such as question-answering (QA) over long documents, but they tend to miss important information in the middle of context documents (arXiv:2307.03172v3). Here, we introduce $\textit{R&R}$ -- a combination of two novel prompt-based methods called $\textit{reprompting}$ and $\textit{in-context retrieval}$ (ICR) -- to alleviate this effect in document-based QA. In reprompting, we repeat the prompt instructions periodically throughout the context document to remind the LLM of its original task. In ICR, rather than instructing the LLM to answer the question directly, we instruct it to retrieve the top $k$ passage numbers most relevant to the given question, which are then used as an abbreviated context in a second QA prompt. We test R&R with GPT-4 Turbo and Claude-2.1 on documents up to 80k tokens in length and observe a 16-point boost in QA accuracy on average. Our further analysis suggests that R&R improves performance on long document-based QA because it reduces the distance between relevant context and the instructions. Finally, we show that compared to short-context chunkwise methods, R&R enables the use of larger chunks that cost fewer LLM calls and output tokens, while minimizing the drop in accuracy.

Prompt-Based Methods for Enhancing Long-Context QA in LLMs

The paper "Can't Remember Details in Long Documents? You Need Some R&R" addresses a key challenge in the field of NLP: the declining efficacy of LLMs in handling long-context question answering (QA). The authors propose a novel approach, R&R, a synthesis of two prompt-based methods, reprompting and in-context retrieval (ICR), to mitigate the "lost in the middle" effect identified by prior research.

Overview of R&R

The R&R method introduces two main strategies:

  1. Reprompting: This technique involves the strategic repetition of instructions throughout a document, specifically to remind the LLM of the task at hand. The hypothesis is that decreasing the positional distance between relevant content and task instructions can mitigate the positional bias of LLMs, which typically favor information located at the beginning or end of input prompts.
  2. In-Context Retrieval (ICR): This method is inspired by retrieval-augmented generation techniques. Rather than being asked the question directly, the LLM is first tasked with identifying the most relevant passages within the document. These passages form an abbreviated context used in a subsequent round of QA, effectively simplifying the LLM's task by focusing on potentially relevant information (a minimal sketch of both steps follows this list).
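
Below is one way the two steps could be composed in code; the passage numbering scheme, instruction wording, repetition interval, and the `call_llm` stub are illustrative assumptions rather than the paper's exact prompts.

```python
# Minimal sketch of reprompting + in-context retrieval (ICR).
# `call_llm(prompt) -> str` is a placeholder for any chat-completion client.

def build_reprompted_context(passages, instructions, every_n=10):
    """Number the passages and repeat the instructions after every `every_n` of them."""
    parts = []
    for i, passage in enumerate(passages, start=1):
        parts.append(f"[Passage {i}] {passage}")
        if i % every_n == 0:
            parts.append(f"Reminder of your task: {instructions}")
    return "\n\n".join(parts)


def answer_with_icr(passages, question, call_llm, k=5):
    """Two-stage ICR: retrieve the top-k passage numbers, then answer over them only."""
    retrieval_instr = (
        f"Identify the {k} passage numbers most relevant to this question: {question}"
    )
    context = build_reprompted_context(passages, retrieval_instr)
    reply = call_llm(f"{retrieval_instr}\n\n{context}")

    # Naive parse of the returned passage numbers.
    ids = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    top = [i for i in ids if 1 <= i <= len(passages)][:k]

    # Second, much shorter QA prompt over the abbreviated context only.
    abbreviated = "\n\n".join(f"[Passage {i}] {passages[i - 1]}" for i in top)
    return call_llm(
        f"Answer the question using only the passages below.\n\n"
        f"Question: {question}\n\n{abbreviated}"
    )
```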

The authors evaluate R&R using GPT-4 Turbo and Claude-2.1 on datasets including NaturalQuestions-Open (NQ), SQuAD, HotpotQA, and a synthetic PubMed-based dataset, spanning document lengths up to 80,000 tokens. The results show that R&R provides a substantial improvement in QA accuracy, with an average increase of 16 percentage points.

Numerical Results and Analysis

The results presented offer compelling evidence of the efficacy of R&R. Specifically, when applied to long-context tasks, reprompting and ICR not only boost performance but also enable the use of larger text chunks, reducing the number of LLM calls and the associated computational cost. By applying uniformly spaced reprompting together with ICR, the authors observe a marked improvement on QA tasks across the datasets, most notably in scenarios where conventional prompting falters due to the "lost in the middle" problem.
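
As a rough illustration of the cost argument (the token counts below are hypothetical, not figures from the paper), the number of chunkwise LLM calls falls as chunk size grows:

```python
import math

# Hypothetical illustration of the chunk-size / call-count trade-off:
# a chunkwise method over an 80k-token document needs fewer calls with larger chunks.
doc_tokens = 80_000
for chunk_tokens in (4_000, 8_000, 16_000):
    calls = math.ceil(doc_tokens / chunk_tokens)
    print(f"{chunk_tokens:>6}-token chunks -> {calls:>2} LLM calls")
# Output:
#   4000-token chunks -> 20 LLM calls
#   8000-token chunks -> 10 LLM calls
#  16000-token chunks ->  5 LLM calls
```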

The paper also conducts an analysis to discern the mechanism underlying the observed improvements. The proximity of relevant context to repeated instructions seems to play a crucial role, contributing to the mitigation of task degradation in lengthy documents.

Implications and Future Directions

The implications of R&R are significant both theoretically and practically. By demonstrating a method to extend the practical context length of LLMs, the paper contributes to the ongoing discourse on overcoming the limitations imposed by LLM architecture, particularly the quadratic scaling of self-attention with context length. Practically, the approach applies immediately to black-box LLMs, which are often proprietary and closed to internal modification.

For future research, several avenues are suggested. Combining R&R with other prompt optimizations might further enhance the utility of long-context LLM applications. Moreover, adapting R&R to tasks beyond document-based QA, such as summarization and other long-context understanding tasks, could broaden its applicability. Investigating attention patterns in open-access LLMs could further elucidate why reprompting is effective.

Conclusion

The R&R method provides a notable advancement in handling long-context QA with LLMs, presenting a pragmatic solution to a previously difficult problem. While the approach primarily applies to long documents, its principles may inspire broader strategies in NLP for dealing with complex input dependencies. As the landscape of LLM applications expands, the methods and insights presented in this paper will likely prove instrumental in driving further innovations.

References (20)
  1. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  2. LongLoRA: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.
  3. The curious case of neural text degeneration. In International Conference on Learning Representations.
  4. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  5. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
  6. Pretrained transformers for text ranking: BERT and beyond. Springer Nature.
  7. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  8. Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300.
  9. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  10. Parallel context windows for large language models. In ACL.
  11. On position bias in summarization with large language models. arXiv preprint arXiv:2310.10570.
  12. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
  13. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  14. Found in the middle: Permutation self-consistency improves listwise ranking in large language models. arXiv preprint arXiv:2310.07712.
  15. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170.
  16. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
  17. Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829.
  18. Retrieval meets long context large language models. In The Twelfth International Conference on Learning Representations.
  19. Re-reading improves reasoning in language models. In International Conference on Learning Representations.
  20. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Authors (3)
  1. Devanshu Agrawal (7 papers)
  2. Shang Gao (74 papers)
  3. Martin Gajek (5 papers)
Citations (5)