Retrieval meets Long Context Large Language Models (2310.03025v2)

Published 4 Oct 2023 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Extending the context window of LLMs is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by studying both solutions using two state-of-the-art pretrained LLMs, i.e., a proprietary 43B GPT and Llama2-70B. Perhaps surprisingly, we find that LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window via positional interpolation on long context tasks, while taking much less computation. More importantly, we demonstrate that retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes. Our best model, retrieval-augmented Llama2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 in terms of average score on nine long context tasks including question answering, query-based summarization, and in-context few-shot learning tasks. It also outperforms its non-retrieval Llama2-70B-32k baseline by a margin, while being much faster at generation. Our study provides general insights on the choice of retrieval-augmentation versus long context extension of LLM for practitioners.

Retrieval Meets Long Context LLMs

In this paper, the authors investigate the merits of long context windows in LLMs against retrieval-augmentation mechanisms and explore the synergistic potential of combining both approaches. The paper provides an empirical analysis based on two state-of-the-art LLMs: a proprietary 43B GPT model and Llama2-70B. With the increasing demand in both industry and academia for extending the context window of LLMs, the paper's findings are instrumental in guiding practical decisions regarding the efficiency and effectiveness of long context extensions versus retrieval enhancements.

Study Overview

The paper pivots around two fundamental questions: Which method, between retrieval-augmentation and extended context windows, offers superior performance on downstream tasks? Additionally, can these methods be integrated for optimal results? The authors conduct a comparative evaluation across nine diverse long context tasks including question answering, query-based summarization, and in-context few-shot learning.

Key Findings

  • Retrieval vs. Extended Contexts: The authors find that a 4K-context LLM with simple retrieval-augmentation at generation performs comparably to the same model fine-tuned to a 16K context window via positional interpolation, while requiring far less computation.
  • Performance Improvement with Retrieval: Across varying context window sizes, retrieval consistently enhances LLM performance. The paper highlights a retrieval-augmented Llama2-70B with a 32K context window significantly surpassing the performance of well-known models like GPT-3.5-turbo-16k and Davinci-003. The retrieval-augmented model achieves an average score of 43.6, outperforming its non-retrieval counterpart (score of 40.9), while also being computationally faster in generation tasks.
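The retrieval-augmentation setup evaluated here is deliberately simple: the long input is split into fixed-size chunks, a retriever ranks the chunks against the query, and the top-ranked chunks are prepended to the prompt of the short-context model. The sketch below illustrates this pipeline using plain term-frequency cosine similarity as a stand-in for the dense retrievers the paper actually uses; the chunk size, `k`, and all function names are illustrative assumptions, not the paper's exact configuration.

```python
import math
from collections import Counter

def chunk_text(text, chunk_size=300):
    """Split a long document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def cosine_sim(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors
    (a toy stand-in for a learned dense retriever's embedding similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query, chunks, k=5):
    """Rank chunks by similarity to the query and keep the top k."""
    return sorted(chunks, key=lambda c: cosine_sim(query, c), reverse=True)[:k]

def build_prompt(query, chunks, k=5):
    """Prepend the retrieved chunks to the question, as in
    retrieval-augmented generation with a short-context model."""
    context = "\n\n".join(retrieve_top_k(query, chunks, k))
    return f"{context}\n\nQuestion: {query}\nAnswer:"
```

The point of the comparison is that this prompt, built from a handful of retrieved chunks, fits comfortably in a 4K window yet recovers most of the benefit of feeding the model the entire long document.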

Implications and Future Directions

This paper's results imply that practitioners can treat simple retrieval-augmentation as a viable alternative, or complement, to extending LLM context windows, achieving comparable effectiveness at lower computational cost. The work further suggests that hybrid models combining extended context with retrieval mechanisms can optimize LLM performance on tasks that demand substantial contextual understanding.
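Positional interpolation, the context-extension method used as the baseline throughout, linearly rescales the position indices of a longer sequence into the position range seen during pretraining, rather than extrapolating beyond it; the model is then fine-tuned on the rescaled positions. A minimal sketch of the rescaling step, assuming a 4K pretraining window (the function name and defaults are illustrative):

```python
def interpolated_positions(target_len, trained_len=4096):
    """Positional interpolation: compress the position indices 0..target_len-1
    of a long sequence into [0, trained_len) by a constant scale factor,
    so the extended-context model never sees out-of-range positions."""
    scale = trained_len / target_len  # e.g. 4096/16384 = 0.25 for a 4x extension
    return [i * scale for i in range(target_len)]
```

For a 16K input with a 4K-trained model, every position index is multiplied by 0.25, so neighboring tokens sit a quarter of a "trained position" apart but all indices stay within the trained range.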

Recommendations for future research include further exploration of how retrieval mechanisms can best be aligned with long-context architectures, particularly at model scales beyond those tested here. Addressing the "lost in the middle" phenomenon, in which LLMs underuse information placed in the middle of long inputs, is a promising avenue for improving retrieval-enhanced models. Additionally, methods that integrate memory or hierarchical attention strategies might further align retrieval with long-context capabilities, potentially yielding robust efficiency gains.

In sum, the paper contributes meaningful insights into the efficient application of large-scale LLMs for complex tasks demanding extensive contexts, and the findings serve as a significant resource for the continued development and operationalization of sophisticated LLM frameworks in both academic and industry settings.

Authors (10)
  1. Peng Xu (357 papers)
  2. Wei Ping (51 papers)
  3. Xianchao Wu (16 papers)
  4. Lawrence McAfee (6 papers)
  5. Chen Zhu (103 papers)
  6. Zihan Liu (102 papers)
  7. Sandeep Subramanian (24 papers)
  8. Evelina Bakhturina (21 papers)
  9. Mohammad Shoeybi (60 papers)
  10. Bryan Catanzaro (123 papers)
Citations (59)