Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
The paper "Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems" presents a novel benchmark task, SummHay, designed to evaluate the capabilities of long-context LLMs and retrieval-augmented generation (RAG) systems. The authors Philippe Laban, Alexander R. Fabbri, Caiming Xiong, and Chien-Sheng Wu at Salesforce AI Research aim to address the current limitations in evaluating LLMs' performance on long-context tasks, proposing summarization as a more complex and relevant testbed.
Background and Motivation
Recent advances in LLMs such as Claude-3 and Gemini-1.5-pro have extended feasible context lengths to hundreds of thousands or even millions of tokens. In parallel, RAG systems manage context length efficiently by dynamically selecting relevant passages from a large corpus. Traditional evaluations such as the "Needle-in-a-Haystack" task test long-context recall, but they no longer suffice to differentiate the capabilities of cutting-edge models. The SummHay task is introduced to fill this gap by requiring models to generate long-form answers that summarize the insights relevant to a query across a large corpus of documents and cite their sources accurately.
Task and Dataset
The SummHay task requires a system to process a large "Haystack" of documents, identify the insights relevant to a specific query, summarize them, and cite their source documents accurately. Haystacks are synthesized in two domains: conversations and news articles. Each Haystack contains 100 documents, totaling around 100k tokens, organized around clearly defined subtopics and insights; insights deliberately recur across documents so that generated summaries can be evaluated precisely against known ground truth. In total, the SummHay benchmark comprises 10 Haystacks and 92 query-based summarization tasks.
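To make the setup concrete, below is a minimal sketch of how a Haystack and its insight-to-document mapping might be represented. The class and field names are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class Insight:
    """A single insight that recurs across several documents (hypothetical schema)."""
    insight_id: str
    subtopic: str
    text: str


@dataclass
class Document:
    """One Haystack document and the insights it was synthesized to contain."""
    doc_id: str
    text: str
    insight_ids: list[str] = field(default_factory=list)


@dataclass
class Haystack:
    """A synthetic corpus of ~100 documents (~100k tokens) plus its ground-truth insights."""
    domain: str                      # "conversation" or "news"
    documents: list[Document]
    insights: dict[str, Insight]

    def supporting_docs(self, insight_id: str) -> list[str]:
        """Return the doc_ids containing a given insight (the gold citation set)."""
        return [d.doc_id for d in self.documents if insight_id in d.insight_ids]
```

Because the insight-to-document mapping is known by construction, a summary can be checked against exact ground truth rather than against a single reference summary.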
Evaluation Protocol
Evaluation of the SummHay task centers on two aspects: coverage and citation. The Coverage Score measures the extent to which a system-generated summary covers the reference insights, while the Citation Score measures the accuracy and completeness of the citations attached to those insights. The two are combined into a Joint Score as a holistic measure of summarization quality. The protocol is validated for reproducibility with human annotators and shown to be cost-effective and reliable when automated with an LLM judge such as GPT-4o.
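As a rough illustration of how these metrics fit together, the sketch below assumes that each reference insight receives a coverage judgment (not, partially, or fully covered), a citation F1 computed against the gold set of supporting documents, and a Joint Score that combines the two per insight. The exact scoring rules are defined in the paper; the helper names and the combination step here are simplifying assumptions.

```python
def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 between the documents a summary cites for an insight and the gold citation set."""
    if not cited or not gold:
        return 0.0
    precision = len(cited & gold) / len(cited)
    recall = len(cited & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def summhay_scores(judgments: list[dict]) -> dict[str, float]:
    """Aggregate per-insight judgments into Coverage, Citation, and Joint scores.

    Each judgment is assumed to look like:
        {"coverage": 0.0 | 0.5 | 1.0,        # not / partially / fully covered
         "cited": {"doc3", "doc7"},          # documents cited for this insight
         "gold": {"doc3", "doc7", "doc9"}}   # documents that actually contain it
    """
    if not judgments:
        return {"coverage": 0.0, "citation": 0.0, "joint": 0.0}
    coverage = [j["coverage"] for j in judgments]
    citation = [citation_f1(j["cited"], j["gold"]) for j in judgments]
    joint = [c * f for c, f in zip(coverage, citation)]  # assumed combination rule
    n = len(judgments)
    return {
        "coverage": 100 * sum(coverage) / n,
        "citation": 100 * sum(citation) / n,
        "joint": 100 * sum(joint) / n,
    }
```

In practice, the paper obtains the per-insight judgments either from human annotators or from an LLM judge; the aggregation above only shows how the three scores relate.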
Experimental Results
The paper reports extensive experiments covering 10 LLMs and 50 RAG systems. All current systems fall well short of estimated human performance on the SummHay task: even with optimal document retrieval, systems lag humans by more than 10 points in Joint Score, and long-context LLMs run without any retriever score below 20%. RAG systems, particularly those using advanced retrievers such as Cohere's Rerank3, outperform random retrieval but still leave considerable room for improvement.
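For context, the RAG systems evaluated follow a retrieve-then-summarize pattern along the lines of the sketch below: score every document against the query, keep only the top-scoring subset within a context budget, and prompt the LLM to summarize that subset with citations. The scorer, summarizer, and budget values are hypothetical placeholders, not the paper's exact configuration.

```python
from typing import Callable


def retrieve_then_summarize(
    query: str,
    documents: dict[str, str],                        # doc_id -> text
    score: Callable[[str, str], float],               # hypothetical relevance scorer (e.g., a reranker)
    summarize: Callable[[str, dict[str, str]], str],  # hypothetical LLM call producing a cited summary
    top_k: int = 15,
    budget_tokens: int = 15_000,
) -> str:
    """Keep only the highest-scoring documents, within a token budget, then summarize them."""
    ranked = sorted(documents.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    selected: dict[str, str] = {}
    used = 0
    for doc_id, text in ranked[:top_k]:
        cost = len(text) // 4  # crude token estimate (assumption)
        if used + cost > budget_tokens:
            break
        selected[doc_id] = text
        used += cost
    # The LLM would be prompted to cite doc_ids inline, e.g. "... [doc12, doc47]".
    return summarize(query, selected)
```

Better retrievers raise the chance that the documents actually containing an insight survive this filtering step, which is why retriever quality shows up directly in the Joint Score.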
Implications and Future Directions
The findings underscore how challenging long-context summarization with citation remains. The SummHay benchmark provides a robust framework for evaluating LLMs and RAG systems, exposing their current limitations and guiding future research. Systems aiming to excel on SummHay will need to balance accurate insight coverage with precise and complete citation. The task also opens avenues for studying enterprise RAG systems, positional biases in long-context models, and refinements to the evaluation metrics. Future work is expected to close the gap to, and eventually surpass, human performance on complex summarization tasks.
Conclusion
"Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems" offers a significant advancement in benchmarking long-context models. By introducing the SummHay task, the paper provides a nuanced and stringent evaluation method that reveals existing shortfalls and sets a high bar for future improvements in summarization and citation accuracy. The open-sourced dataset and evaluation methodology are expected to catalyze further research, pushing the boundaries of what long-context LLMs and RAG systems can achieve.