Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
The paper "Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems" presents a novel benchmark task, SummHay, designed to evaluate the capabilities of long-context LLMs and retrieval-augmented generation (RAG) systems. The authors Philippe Laban, Alexander R. Fabbri, Caiming Xiong, and Chien-Sheng Wu at Salesforce AI Research aim to address the current limitations in evaluating LLMs' performance on long-context tasks, proposing summarization as a more complex and relevant testbed.
Background and Motivation
Recent advances in LLMs such as Claude-3 and Gemini-1.5-pro have extended feasible context lengths to hundreds of thousands or even millions of tokens. In parallel, RAG systems manage context length efficiently by dynamically selecting relevant passages from a large corpus. Traditional evaluations such as the "Needle-in-a-Haystack" task test long-context recall, but they no longer suffice to differentiate the capabilities of cutting-edge models. The SummHay task is introduced to fill this gap by requiring models to generate long-form answers that summarize the insights relevant to a query across a large corpus of documents and cite their sources accurately.
Task and Dataset
The SummHay task requires a system to process a large "Haystack" of documents, identify the insights relevant to a specific query, summarize them, and cite their source documents accurately. Haystacks are synthesized in two domains: conversations and news articles. Each Haystack contains 100 documents, totaling around 100k tokens, organized around clearly defined subtopics and insights; insights deliberately recur across documents so that generated summaries can be evaluated precisely against known ground truth. In total, the SummHay benchmark comprises 10 Haystacks and 92 query-based summarization tasks.
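To make the setup concrete, below is a minimal sketch of how a Haystack and its insight-to-document mapping might be represented. The class and field names are illustrative assumptions, not the released dataset schema.

```python
from dataclasses import dataclass, field


@dataclass
class Insight:
    """A single insight that recurs across several documents (hypothetical schema)."""
    insight_id: str
    subtopic: str
    text: str


@dataclass
class Document:
    """One Haystack document and the insights it was synthesized to contain."""
    doc_id: str
    text: str
    insight_ids: list[str] = field(default_factory=list)


@dataclass
class Haystack:
    """A synthetic corpus of ~100 documents (~100k tokens) plus its ground-truth insights."""
    domain: str                      # "conversation" or "news"
    documents: list[Document]
    insights: dict[str, Insight]

    def supporting_docs(self, insight_id: str) -> list[str]:
        """Return the doc_ids containing a given insight (the gold citation set)."""
        return [d.doc_id for d in self.documents if insight_id in d.insight_ids]
```

Because the insight-to-document mapping is known by construction, a summary can be checked against exact ground truth rather than against a single reference summary.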
Evaluation Protocol
Evaluation of the SummHay task centers on two aspects: coverage and citation. The Coverage Score measures the extent to which a system-generated summary covers the reference insights, while the Citation Score measures the accuracy and completeness of the citations attached to those insights. The two are combined into a Joint Score as a holistic measure of summarization quality. The protocol is validated for reproducibility with human annotators and shown to be cost-effective and reliable when automated with an LLM judge such as GPT-4o.
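As a rough illustration of how these metrics fit together, the sketch below assumes that each reference insight receives a coverage judgment (not, partially, or fully covered), a citation F1 computed against the gold set of supporting documents, and a Joint Score that combines the two per insight. The exact scoring rules are defined in the paper; the helper names and the combination step here are simplifying assumptions.

```python
def citation_f1(cited: set[str], gold: set[str]) -> float:
    """F1 between the documents a summary cites for an insight and the gold citation set."""
    if not cited or not gold:
        return 0.0
    precision = len(cited & gold) / len(cited)
    recall = len(cited & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def summhay_scores(judgments: list[dict]) -> dict[str, float]:
    """Aggregate per-insight judgments into Coverage, Citation, and Joint scores.

    Each judgment is assumed to look like:
        {"coverage": 0.0 | 0.5 | 1.0,        # not / partially / fully covered
         "cited": {"doc3", "doc7"},          # documents cited for this insight
         "gold": {"doc3", "doc7", "doc9"}}   # documents that actually contain it
    """
    if not judgments:
        return {"coverage": 0.0, "citation": 0.0, "joint": 0.0}
    coverage = [j["coverage"] for j in judgments]
    citation = [citation_f1(j["cited"], j["gold"]) for j in judgments]
    joint = [c * f for c, f in zip(coverage, citation)]  # assumed combination rule
    n = len(judgments)
    return {
        "coverage": 100 * sum(coverage) / n,
        "citation": 100 * sum(citation) / n,
        "joint": 100 * sum(joint) / n,
    }
```

In practice, the paper obtains the per-insight judgments either from human annotators or from an LLM judge; the aggregation above only shows how the three scores relate.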
Experimental Results
The paper reports extensive experiments covering 10 LLMs and 50 RAG systems. All current systems fall well short of estimated human performance on the SummHay task: even with optimal document retrieval, systems lag humans by more than 10 points in Joint Score, and long-context LLMs run without any retriever score below 20%. RAG systems, particularly those using advanced retrievers such as Cohere's Rerank3, outperform random retrieval but still leave considerable room for improvement.
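For context, the RAG systems evaluated follow a retrieve-then-summarize pattern along the lines of the sketch below: score every document against the query, keep only the top-scoring subset within a context budget, and prompt the LLM to summarize that subset with citations. The scorer, summarizer, and budget values are hypothetical placeholders, not the paper's exact configuration.

```python
from typing import Callable


def retrieve_then_summarize(
    query: str,
    documents: dict[str, str],                        # doc_id -> text
    score: Callable[[str, str], float],               # hypothetical relevance scorer (e.g., a reranker)
    summarize: Callable[[str, dict[str, str]], str],  # hypothetical LLM call producing a cited summary
    top_k: int = 15,
    budget_tokens: int = 15_000,
) -> str:
    """Keep only the highest-scoring documents, within a token budget, then summarize them."""
    ranked = sorted(documents.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    selected: dict[str, str] = {}
    used = 0
    for doc_id, text in ranked[:top_k]:
        cost = len(text) // 4  # crude token estimate (assumption)
        if used + cost > budget_tokens:
            break
        selected[doc_id] = text
        used += cost
    # The LLM would be prompted to cite doc_ids inline, e.g. "... [doc12, doc47]".
    return summarize(query, selected)
```

Better retrievers raise the chance that the documents actually containing an insight survive this filtering step, which is why retriever quality shows up directly in the Joint Score.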
Implications and Future Directions
The findings underscore how challenging long-context summarization with citation remains. The SummHay benchmark provides a robust framework for evaluating LLMs and RAG systems, exposing their current limitations and guiding future research. Systems aiming to excel on SummHay will need to balance accurate insight coverage with precise and complete citation. The task also opens avenues for studying enterprise RAG systems, positional biases in long-context models, and refinements to the evaluation metrics. Future work is expected to close the gap to, and eventually surpass, human performance on complex summarization tasks.
Conclusion
"Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems" offers a significant advancement in benchmarking long-context models. By introducing the SummHay task, the paper provides a nuanced and stringent evaluation method that reveals existing shortfalls and sets a high bar for future improvements in summarization and citation accuracy. The open-sourced dataset and evaluation methodology are expected to catalyze further research, pushing the boundaries of what long-context LLMs and RAG systems can achieve.