Can't Remember Details in Long Documents? You Need Some R&R (2403.05004v1)

Published 8 Mar 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Long-context LLMs hold promise for tasks such as question-answering (QA) over long documents, but they tend to miss important information in the middle of context documents (arXiv:2307.03172v3). Here, we introduce $\textit{R&R}$ -- a combination of two novel prompt-based methods called $\textit{reprompting}$ and $\textit{in-context retrieval}$ (ICR) -- to alleviate this effect in document-based QA. In reprompting, we repeat the prompt instructions periodically throughout the context document to remind the LLM of its original task. In ICR, rather than instructing the LLM to answer the question directly, we instruct it to retrieve the top $k$ passage numbers most relevant to the given question, which are then used as an abbreviated context in a second QA prompt. We test R&R with GPT-4 Turbo and Claude-2.1 on documents up to 80k tokens in length and observe a 16-point boost in QA accuracy on average. Our further analysis suggests that R&R improves performance on long document-based QA because it reduces the distance between relevant context and the instructions. Finally, we show that compared to short-context chunkwise methods, R&R enables the use of larger chunks that cost fewer LLM calls and output tokens, while minimizing the drop in accuracy.

Prompt-Based Methods for Enhancing Long-Context QA in LLMs

The paper "Can't Remember Details in Long Documents? You Need Some R&R" addresses a key challenge in the field of NLP: the declining efficacy of LLMs in handling long-context question answering (QA). The authors propose a novel approach, R&R, a synthesis of two prompt-based methods, reprompting and in-context retrieval (ICR), to mitigate the "lost in the middle" effect identified by prior research.

Overview of R&R

The R&R method introduces two main strategies:

  1. Reprompting: This technique involves the strategic repetition of instructions throughout a document, specifically to remind the LLM of the task at hand. The hypothesis is that decreasing the positional distance between relevant content and task instructions can mitigate the positional bias of LLMs, which typically favor information located at the beginning or end of input prompts.
  2. In-Context Retrieval (ICR): This method is inspired by retrieval-augmented generation techniques. Rather than being asked the question directly, the LLM is first tasked with identifying the most relevant passages within the document. These passages form an abbreviated context used in a subsequent round of QA, effectively simplifying the LLM's task by focusing on potentially relevant information (a minimal sketch of both steps follows this list).
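
Below is one way the two steps could be composed in code; the passage numbering scheme, instruction wording, repetition interval, and the `call_llm` stub are illustrative assumptions rather than the paper's exact prompts.

```python
# Minimal sketch of reprompting + in-context retrieval (ICR).
# `call_llm(prompt) -> str` is a placeholder for any chat-completion client.

def build_reprompted_context(passages, instructions, every_n=10):
    """Number the passages and repeat the instructions after every `every_n` of them."""
    parts = []
    for i, passage in enumerate(passages, start=1):
        parts.append(f"[Passage {i}] {passage}")
        if i % every_n == 0:
            parts.append(f"Reminder of your task: {instructions}")
    return "\n\n".join(parts)


def answer_with_icr(passages, question, call_llm, k=5):
    """Two-stage ICR: retrieve the top-k passage numbers, then answer over them only."""
    retrieval_instr = (
        f"Identify the {k} passage numbers most relevant to this question: {question}"
    )
    context = build_reprompted_context(passages, retrieval_instr)
    reply = call_llm(f"{retrieval_instr}\n\n{context}")

    # Naive parse of the returned passage numbers.
    ids = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    top = [i for i in ids if 1 <= i <= len(passages)][:k]

    # Second, much shorter QA prompt over the abbreviated context only.
    abbreviated = "\n\n".join(f"[Passage {i}] {passages[i - 1]}" for i in top)
    return call_llm(
        f"Answer the question using only the passages below.\n\n"
        f"Question: {question}\n\n{abbreviated}"
    )
```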

The authors evaluate R&R using GPT-4 Turbo and Claude-2.1 on datasets including NaturalQuestions-Open (NQ), SQuAD, HotpotQA, and a synthetic PubMed-based dataset, spanning document lengths up to 80,000 tokens. The results show that R&R provides a substantial improvement in QA accuracy, with an average increase of 16 percentage points.

Numerical Results and Analysis

The results presented offer compelling evidence of the efficacy of R&R. Specifically, when applied to long-context tasks, reprompting and ICR not only boost performance but also enable the use of larger text chunks, reducing the number of LLM calls and the associated computational cost. By applying uniformly spaced reprompting together with ICR, the authors observe a marked improvement on QA tasks across the datasets, most notably in scenarios where conventional prompting falters due to the "lost in the middle" problem.
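
As a rough illustration of the cost argument (the token counts below are hypothetical, not figures from the paper), the number of chunkwise LLM calls falls as chunk size grows:

```python
import math

# Hypothetical illustration of the chunk-size / call-count trade-off:
# a chunkwise method over an 80k-token document needs fewer calls with larger chunks.
doc_tokens = 80_000
for chunk_tokens in (4_000, 8_000, 16_000):
    calls = math.ceil(doc_tokens / chunk_tokens)
    print(f"{chunk_tokens:>6}-token chunks -> {calls:>2} LLM calls")
# Output:
#   4000-token chunks -> 20 LLM calls
#   8000-token chunks -> 10 LLM calls
#  16000-token chunks ->  5 LLM calls
```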

The paper also conducts an analysis to discern the mechanism underlying the observed improvements. The proximity of relevant context to repeated instructions seems to play a crucial role, contributing to the mitigation of task degradation in lengthy documents.

Implications and Future Directions

The implications of R&R are significant both theoretically and practically. By demonstrating a method to extend the practical context length of LLMs, the paper contributes to the ongoing discourse on overcoming the limitations imposed by LLM architecture, particularly the quadratic scaling of self-attention with context length. Practically, the approach applies immediately to black-box LLMs, which are often proprietary and closed to internal modification.

For future research, several avenues are suggested. Combining R&R with other prompt optimizations might further enhance the utility of long-context LLM applications. Moreover, adapting R&R to tasks beyond document-based QA, such as summarization and other long-context understanding tasks, could broaden its applicability. Investigating attention patterns in open-access LLMs could further elucidate why reprompting is effective.

Conclusion

The R&R method provides a notable advancement in handling long-context QA with LLMs, presenting a pragmatic solution to a previously difficult problem. While the approach primarily applies to long documents, its principles may inspire broader strategies in NLP for dealing with complex input dependencies. As the landscape of LLM applications expands, the methods and insights presented in this paper will likely prove instrumental in driving further innovations.

References (20)
  1. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  2. LongLoRA: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.
  3. The curious case of neural text degeneration. In International Conference on Learning Representations.
  4. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  5. Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
  6. Pretrained transformers for text ranking: BERT and beyond. Springer Nature.
  7. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  8. Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300.
  9. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
  10. Parallel context windows for large language models. In ACL.
  11. On position bias in summarization with large language models. arXiv preprint arXiv:2310.10570.
  12. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
  13. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  14. Found in the middle: Permutation self-consistency improves listwise ranking in large language models. arXiv preprint arXiv:2310.07712.
  15. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170.
  16. Neural text generation with unlikelihood training. In International Conference on Learning Representations.
  17. Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829.
  18. Retrieval meets long context large language models. In The Twelfth International Conference on Learning Representations.
  19. Re-reading improves reasoning in language models. In International Conference on Learning Representations.
  20. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.
Authors (3)
  1. Devanshu Agrawal (7 papers)
  2. Shang Gao (74 papers)
  3. Martin Gajek (5 papers)
Citations (5)