Long$^2$RAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall (2410.23000v3)
Abstract: Retrieval-augmented generation (RAG) is a promising approach to address the limitations of fixed knowledge in LLMs. However, current benchmarks for evaluating RAG systems suffer from two key deficiencies: (1) they fail to adequately measure LLMs' capability in handling long-context retrieval, owing to a lack of datasets that reflect the characteristics of retrieved documents, and (2) they lack a comprehensive evaluation method for assessing LLMs' ability to generate long-form responses that effectively exploit retrieved information. To address these shortcomings, we introduce the Long$^2$RAG benchmark and the Key Point Recall (KPR) metric. Long$^2$RAG comprises 280 questions spanning 10 domains and 8 question categories, each associated with 5 retrieved documents with an average length of 2,444 words. KPR evaluates the extent to which LLMs incorporate key points extracted from the retrieved documents into their generated responses, providing a more nuanced assessment of their ability to exploit retrieved information.
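The abstract characterizes KPR as a recall over key points extracted from the retrieved documents. A minimal formalization consistent with that description (the symbols and the coverage judgment below are illustrative assumptions; the paper's exact scoring procedure, e.g. how coverage is decided or whether any length normalization is applied, is not stated here) might look as follows, where $Q$ is the question set, $K_q$ the key points extracted from the documents retrieved for question $q$, and $r_q$ the model's response:

$$\mathrm{KPR} \;=\; \frac{1}{|Q|} \sum_{q \in Q} \frac{\bigl|\{\,k \in K_q : \mathrm{covered}(k, r_q)\,\}\bigr|}{|K_q|}$$

Under this reading, a response scores higher the larger the fraction of document-derived key points it actually incorporates, rather than merely overlapping lexically with a reference answer.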