
Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall (2410.23000v3)

Published 30 Oct 2024 in cs.CL

Abstract: Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in LLMs. However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval due to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.


Summary

  • The paper introduces the Long²RAG benchmark with the innovative Key Point Recall metric to assess how well LLMs incorporate essential information from extensive documents.
  • Empirical results reveal that performance declines as retrieved documents grow longer and that preserving full document context is crucial for maintaining key point recall.
  • The study highlights marked differences between closed-source and open-source models, urging the development of new architectures to better handle long-context data.

Evaluation of Long-Context RAG in LLMs Using Long²RAG and Key Point Recall

The paper presents a novel benchmark, Long²RAG, accompanied by a new metric, Key Point Recall (KPR), to improve the evaluation of retrieval-augmented generation (RAG) systems on long-context, long-form responses. The focus is on addressing fundamental limitations of existing benchmarks, which insufficiently assess LLMs' capabilities in handling extensive retrieved information and in generating comprehensive responses from it.

Benchmark and Metric Design

The Long²RAG benchmark comprises 280 questions spanning 10 diverse domains and 8 distinct question categories. Each question is anchored to 5 retrieved documents averaging 2,444 words, requiring models to exploit a large amount of contextual data effectively. Concurrently, the KPR metric assesses the extent to which generated responses incorporate key points from the retrieved documents, providing nuanced insight into how effectively LLMs leverage retrieved information.
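
To make the metric concrete, one plausible formalization consistent with the description above (not necessarily the paper's exact scoring rule) is

$$\mathrm{KPR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{\left|\{\, k \in K_q : r_q \text{ covers } k \,\}\right|}{|K_q|},$$

where $K_q$ denotes the key points extracted from the documents retrieved for question $q$ and $r_q$ is the model's response. The sketch below shows how such a score could be computed with a generic LLM judge; the judge prompt, the yes/no coverage criterion, and the data layout are illustrative assumptions rather than the authors' released implementation.

```python
# Illustrative sketch of a Key Point Recall (KPR)-style evaluation.
# Assumptions: key points are already extracted per question, and `judge`
# is any callable that maps a prompt string to a model completion string.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str
    key_points: List[str]  # key points extracted from the retrieved documents
    response: str          # the model's long-form answer


def is_covered(key_point: str, response: str, judge: Callable[[str], str]) -> bool:
    """Ask an LLM judge whether the response covers the given key point."""
    prompt = (
        f"Key point:\n{key_point}\n\n"
        f"Response:\n{response}\n\n"
        "Does the response cover this key point? Answer 'yes' or 'no'."
    )
    return judge(prompt).strip().lower().startswith("yes")


def key_point_recall(examples: List[Example], judge: Callable[[str], str]) -> float:
    """Macro-average the fraction of key points covered per question."""
    per_question = []
    for ex in examples:
        if not ex.key_points:
            continue  # skip questions with no extracted key points
        covered = sum(is_covered(kp, ex.response, judge) for kp in ex.key_points)
        per_question.append(covered / len(ex.key_points))
    return sum(per_question) / len(per_question) if per_question else 0.0
```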

Empirical Evaluation

Evaluation of 9 state-of-the-art LLMs using Long²RAG reveals several key insights:

  1. Closed vs. Open Source Performance: GPT-4o, representing closed-source systems, consistently outperforms leading open-source models such as Qwen2, Mistral, and Mixtral. Notably, among open-source models, Phi-3-mini exhibits competitive performance despite its smaller parameter count, challenging the assumption that larger models inherently offer superior results.
  2. Effect of Document Length: A critical finding is that performance degrades as document length increases. Models show a declining ability to recall key information from longer documents, reflecting the challenge of managing extensive contextual detail.
  3. Retrieval Strategy Impact: Across varied truncation and summarization strategies, the paper observes that shortening the retrieved documents typically reduces performance, underscoring the importance of preserving full document context in RAG settings (a minimal sketch of such a context-reduction step follows this list).
  4. Evaluator Robustness: The KPR metric proves robust across different evaluators: GPT-4o and Llama3-70B yield consistent model rankings, albeit with score variations, reinforcing the metric's reliability.
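
To make the context-reduction comparison in point 3 concrete, the following is a minimal sketch of a head-truncation step applied to retrieved documents before prompt construction; the function names, the word-based budget, and the prompt template are illustrative assumptions, not the paper's exact ablation setup.

```python
# Minimal sketch of one context-reduction strategy: truncate each retrieved
# document to a word budget (or keep it whole) before building the RAG prompt.
from typing import List, Optional


def truncate_words(text: str, max_words: int) -> str:
    """Keep only the first `max_words` words of a document (head truncation)."""
    return " ".join(text.split()[:max_words])


def build_prompt(question: str, documents: List[str],
                 max_words_per_doc: Optional[int] = None) -> str:
    """Concatenate retrieved documents (optionally truncated) with the question."""
    if max_words_per_doc is not None:
        documents = [truncate_words(d, max_words_per_doc) for d in documents]
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(documents))
    return f"{context}\n\nQuestion: {question}\nAnswer using the documents above."
```

For example, comparing model outputs for `build_prompt(q, docs)` against `build_prompt(q, docs, max_words_per_doc=500)` reproduces, in spirit, the full-context versus truncated-context comparison discussed above.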

Implications and Future Directions

The presented work significantly contributes to the landscape of RAG evaluation within LLMs by advancing both the methodological and practical approaches to assessment. Practically, the Long²RAG benchmark and KPR metric set a new standard for measuring the integration of retrieved information into coherent, long-form responses.

Theoretically, this research paves the way for further exploration of LLMs' capacity to manage long-context retrievals, emphasizing the need for new architectures or enhanced training paradigms capable of addressing challenges in processing extensive contextual input.

Future research could expand the Long²RAG dataset, incorporate multilingual assessments to extend its applicability beyond English, and refine the KPR metric to reduce its dependency on extracted key points, potentially integrating it with human preference models.

In summary, the paper provides a comprehensive framework for dissecting and evaluating the intricacies of RAG in LLMs, laying a foundation for future work on enhancing LLMs' ability to handle long-context information effectively.
