Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? (2406.13121v1)

Published 19 Jun 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Long-context LLMs (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.

Can Long-Context LLMs Subsume Retrieval, RAG, SQL, and More?

This paper explores an intriguing frontier in NLP: the potential of Long-Context LLMs (LCLMs) to perform a variety of tasks that traditionally require specialized tools and pipelines, including retrieval, Retrieval-Augmented Generation (RAG), SQL-like querying, and many-shot In-Context Learning (ICL). The paper introduces the "Long-Context Frontiers" (LOFT) benchmark to evaluate LCLMs on tasks involving context lengths up to one million tokens.

Key Contributions

  1. Introduction of the LOFT Benchmark: LOFT comprises six tasks spanning 35 datasets across text, visual, and audio modalities. It supports automatic scaling of context lengths up to one million tokens, providing a rigorous testing ground for evaluating LCLMs.
  2. Corpus-in-Context (CiC) Prompting: The authors propose a novel prompting strategy termed Corpus-in-Context (CiC), which leverages LCLMs' ability to process an entire corpus directly within the context window. The approach combines task-specific instructions, few-shot examples, and chain-of-thought reasoning to optimize LCLM performance; a prompt-assembly sketch follows this list.
  3. Comparative Analysis: The paper compares three state-of-the-art LCLMs—Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus—against specialized models fine-tuned for specific tasks. Metrics include Recall@1 for retrieval, subspan exact match (EM) for RAG, and accuracy for SQL and ICL.
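
To make the CiC recipe concrete, here is a minimal sketch of how such a prompt might be assembled. The `build_cic_prompt` helper, the section layout, and the document-ID scheme are illustrative assumptions; the paper's actual prompt format may differ.

```python
def build_cic_prompt(instruction, corpus, few_shot_examples, query):
    """Assemble a CiC-style prompt: task instruction, the full corpus with
    document IDs, few-shot examples with chain-of-thought, then the query."""
    parts = [instruction, "", "Corpus:"]
    for doc_id, text in corpus:
        parts.append(f"[{doc_id}] {text}")
    parts.append("")
    for ex_query, ex_reasoning, ex_answer in few_shot_examples:
        parts += [
            f"Query: {ex_query}",
            f"Reasoning: {ex_reasoning}",  # chain-of-thought demonstration
            f"Answer: {ex_answer}",
            "",
        ]
    parts += [f"Query: {query}", "Answer:"]
    return "\n".join(parts)


corpus = [
    ("doc_001", "The Eiffel Tower is located in Paris."),
    ("doc_002", "The Colosseum is located in Rome."),
]
examples = [
    ("Which document mentions Rome?",
     "doc_002 states that the Colosseum is in Rome.",
     "doc_002"),
]
print(build_cic_prompt(
    "Answer each query with the ID of the supporting document.",
    corpus, examples, "Which document mentions Paris?"))
```

In this sketch the few-shot examples double as format demonstrations, showing the model to answer with a document ID rather than free text.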

Numerical Results

  1. Text Retrieval: LCLMs were competitive with specialized retrieval models such as Gecko, particularly at 128k-token context lengths. For example, Gemini 1.5 Pro achieved a Recall@1 score of 0.98 on the FEVER dataset, compared to 0.96 for the specialized model. (Simple implementations of the headline metrics are sketched after this list.)
  2. Visual and Audio Retrieval: In visual retrieval, Gemini 1.5 Pro outperformed GPT-4o and, in certain cases, even the specialized model CLIP, for example achieving a Recall@1 score of 0.84 on Flickr30k. In audio retrieval, Gemini 1.5 Pro outperformed PaLM 2 DE, attaining perfect scores across multiple languages in the FLEURS dataset.
  3. RAG: LCLMs matched or exceeded the performance of traditional RAG pipelines on datasets requiring complex multi-hop reasoning. For instance, Gemini 1.5 Pro achieved a subspan EM score of 0.75 on HotpotQA, outpacing the traditional RAG baseline.
  4. SQL: The paper revealed that LCLMs lag behind specialized SQL pipelines in tasks requiring complex compositional reasoning. For example, Gemini 1.5 Pro attained an accuracy of 0.40 on Spider, compared to 0.74 by specialized models.
  5. Many-Shot ICL: In ICL tasks, LCLMs showed varying degrees of success. For simpler tasks such as BBH-date, performance scaled positively with the number of in-context examples; for more complex reasoning tasks, however, performance stagnated or even degraded as the number of examples grew.
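
For reference, the headline metrics used above can be computed roughly as follows. This is a hedged sketch: the paper's exact answer normalization and matching rules may differ, and the `normalize` helper here simply mimics common QA-style normalization.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_at_1(ranked_predictions, gold_id_sets):
    """Fraction of queries whose top-ranked document ID is a gold ID."""
    hits = sum(preds[0] in gold
               for preds, gold in zip(ranked_predictions, gold_id_sets))
    return hits / len(ranked_predictions)


def subspan_em(prediction, gold_answers):
    """1.0 if any normalized gold answer is a subspan of the normalized
    prediction (or vice versa), else 0.0."""
    pred = normalize(prediction)
    return float(any(
        normalize(g) in pred or pred in normalize(g) for g in gold_answers))


print(recall_at_1([["doc_001", "doc_002"]], [{"doc_001"}]))  # 1.0
print(subspan_em("It is in Paris, France.", ["Paris"]))      # 1.0
```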

Implications and Future Directions

  1. Practical Implications: The ability of LCLMs to handle tasks traditionally managed by specialized models implies significant potential to simplify and unify NLP pipelines. This reduction in complexity could make AI systems easier to use and deploy, without intricate, task-specific architectures.
  2. Theoretical Implications: The paper underscores the importance of continued research on scaling context lengths. While LCLMs show promise, performance degradation at higher context lengths (e.g., one million tokens) suggests that current architectures might need further optimization in handling long contexts robustly.
  3. Future Developments: Anticipated directions include strengthening long-context reasoning and addressing the efficiency of encoding massive contexts. Potential advances include better use of prefix caching (sketched after this list) and applying LCLMs to even larger context windows, potentially scaling to billions of tokens.
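
To illustrate why prefix caching matters for CiC prompting: the large corpus forms a fixed prompt prefix shared across queries, so its encoding cost can be paid once and the resulting key/value cache reused, leaving only a short query suffix to process per request. The `Model` class below is a hypothetical placeholder, not a real inference API.

```python
class Model:
    """Hypothetical stand-in for an LCLM inference API, not a real library."""

    def prefill(self, text):
        """Encode `text` once and return a reusable KV cache (placeholder)."""
        return {"kv": f"<cache over {len(text)} chars>"}

    def decode(self, kv_cache, suffix):
        """Continue generation from the cached prefix (placeholder)."""
        return f"answer to {suffix!r} using {kv_cache['kv']}"


model = Model()
corpus_prefix = "Corpus:\n[doc_001] ...\n[doc_002] ...\n"
cache = model.prefill(corpus_prefix)  # corpus encoding cost is paid once

for query in ["Which document mentions Paris?",
              "Which document mentions Rome?"]:
    # Only the short query suffix is processed per request.
    print(model.decode(cache, f"Query: {query}\nAnswer:"))
```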

Conclusion

Using LCLMs for retrieval, RAG, SQL-like querying, and many-shot ICL collapses several complex pipelines into a single, simpler approach. Through the LOFT benchmark, the paper shows that state-of-the-art LCLMs can rival or even surpass specialized models on multiple tasks, while also highlighting areas ripe for future research. The potential of LCLMs to supplant existing paradigms represents a transformative shift in the NLP landscape, promising new applications as model capabilities scale. The journey of LCLMs is just beginning, with significant room for improvement and fascinating developments on the horizon.

Authors (19)
  1. Jinhyuk Lee
  2. Anthony Chen
  3. Zhuyun Dai
  4. Dheeru Dua
  5. Devendra Singh Sachan
  6. Michael Boratko
  7. Yi Luan
  8. Sébastien M. R. Arnold
  9. Vincent Perot
  10. Siddharth Dalmia
  11. Hexiang Hu
  12. Xudong Lin
  13. Panupong Pasupat
  14. Aida Amini
  15. Jeremy R. Cole
  16. Sebastian Riedel
  17. Iftekhar Naim
  18. Ming-Wei Chang
  19. Kelvin Guu