
CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation (2410.23090v1)

Published 30 Oct 2024 in cs.IR and cs.CL

Abstract: Retrieval-Augmented Generation (RAG) has become a powerful paradigm for enhancing LLMs through external knowledge retrieval. Despite its widespread attention, existing academic research predominantly focuses on single-turn RAG, leaving a significant gap in addressing the complexities of multi-turn conversations found in real-world applications. To bridge this gap, we introduce CORAL, a large-scale benchmark designed to assess RAG systems in realistic multi-turn conversational settings. CORAL includes diverse information-seeking conversations automatically derived from Wikipedia and tackles key challenges such as open-domain coverage, knowledge intensity, free-form responses, and topic shifts. It supports three core tasks of conversational RAG: passage retrieval, response generation, and citation labeling. We propose a unified framework to standardize various conversational RAG methods and conduct a comprehensive evaluation of these methods on CORAL, demonstrating substantial opportunities for improving existing approaches.


Summary

  • The paper introduces CORAL, a benchmark that evaluates multi-turn conversational RAG systems using realistic, Wikipedia-derived dialogues.
  • It outlines a structured methodology with sampling strategies like LDS and DTRW to simulate complex conversational shifts.
  • Experiments reveal that response quality plateaus beyond a certain model size, while citation accuracy continues to improve with scale.

An Examination of "CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation"

Research on Retrieval-Augmented Generation (RAG) systems has advanced considerably in recent years, particularly through integration with LLMs to improve response quality in question-answering tasks. Academic evaluation, however, has largely emphasized single-turn interactions, neglecting the complexities of the multi-turn conversations prevalent in realistic settings. The paper "CORAL: Benchmarking Multi-turn Conversational Retrieval-Augmentation Generation" introduces CORAL, a benchmark explicitly designed for evaluating RAG systems in multi-turn conversational contexts. This benchmark represents a significant step towards closing the gap between laboratory conditions and real-world applications in conversational AI.

CORAL derives its dataset from information-seeking dialogues automatically generated from Wikipedia, ensuring broad coverage across dimensions critical for robust RAG evaluation. Its key features are open-domain coverage, knowledge-intensive inquiries, free-form response generation, topic shifts, and citation labeling. This combination of properties sets it apart from conventional datasets and addresses the multifaceted challenges of multi-turn conversation.

The paper details a systematic methodology for converting raw Wikipedia content into a structured format suitable for evaluating conversational RAG systems. By leveraging the hierarchical structure of Wikipedia pages, the authors create informational flows that mimic genuine conversational shifts and dependencies. The benchmark comprises 8,000 conversations, sampled and categorized using strategies such as Linear Descent Sampling (LDS) and Dual-Tree Random Walk (DTRW). These strategies vary the depth, breadth, and topical diversity of conversations, yielding a realistic dataset for RAG evaluation.
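The descent-style sampling idea can be illustrated with a small sketch. This is a hypothetical reconstruction, not the paper's code: `linear_descent_sample`, the toy heading tree, and all names here are assumptions made for exposition. In the spirit of LDS, each simulated turn descends one level deeper into a single page's section hierarchy, producing a conversation that drills into one topic.

```python
import random

def linear_descent_sample(tree, root, max_turns=8, seed=None):
    """Return a root-to-leaf path of section titles simulating one conversation.

    `tree` maps a section title to the list of its subsection titles;
    the walk stops at a leaf section or after `max_turns` turns.
    """
    rng = random.Random(seed)
    path, node = [root], root
    while len(path) < max_turns and tree.get(node):
        node = rng.choice(tree[node])  # descend into one randomly chosen subsection
        path.append(node)
    return path

# Toy heading tree standing in for a parsed Wikipedia page.
toy_tree = {
    "Photosynthesis": ["Light reactions", "Calvin cycle"],
    "Light reactions": ["Photosystem II", "Photosystem I"],
    "Calvin cycle": ["Carbon fixation"],
}

turns = linear_descent_sample(toy_tree, "Photosynthesis", seed=0)
```

A dual-tree variant in the DTRW spirit would interleave walks over two such trees, introducing the topic shifts the benchmark emphasizes.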

CORAL supports three essential tasks: Conversational Passage Retrieval, Response Generation, and Citation Labeling, which collectively cover the primary functionalities required for optimal RAG system performance in real-world settings. The proposed unified framework standardizes the assessment of various conversational RAG approaches, thus facilitating a cohesive comparison across different methods.
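The three tasks compose naturally into a single pipeline. The sketch below is a minimal illustration of that structure, not the authors' framework: `ConvRAGPipeline`, the toy retriever and generator, and the lexical citation heuristic are all assumptions chosen so the example runs without a model.

```python
class ToyRetriever:
    """Word-overlap retriever standing in for a conversational dense retriever."""
    def __init__(self, corpus):
        self.corpus = corpus

    def search(self, history, query):
        q = set(query.lower().split())
        # Rank passages by lexical overlap with the current query.
        return sorted(self.corpus,
                      key=lambda p: -len(q & set(p.lower().split())))

class ToyGenerator:
    """Extractive stand-in for an LLM generator: echo the best passage."""
    def generate(self, query, passages):
        return passages[0] if passages else "No answer found."

class ConvRAGPipeline:
    """One pass through the three CORAL tasks for a single user turn."""
    def __init__(self, retriever, generator, top_k=3):
        self.retriever, self.generator, self.top_k = retriever, generator, top_k

    def answer(self, history, query):
        # Task 1: conversational passage retrieval given history + current query.
        passages = self.retriever.search(history, query)[: self.top_k]
        # Task 2: free-form response generation grounded in the passages.
        response = self.generator.generate(query, passages)
        # Task 3: citation labeling (naive substring heuristic for illustration).
        citations = [i for i, p in enumerate(passages) if p in response]
        return response, citations

corpus = [
    "CORAL is a benchmark for multi-turn conversational RAG.",
    "Wikipedia pages have a hierarchical section structure.",
    "Dense retrieval maps queries and passages to vectors.",
]
pipeline = ConvRAGPipeline(ToyRetriever(corpus), ToyGenerator())
response, citations = pipeline.answer([], "What is CORAL a benchmark for?")
```

Swapping in a real dense retriever and an LLM generator preserves the same interface, which is the point of the unified framework: methods differ in components, not in task decomposition.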

The paper's experiments provide insight into the current efficacy and limitations of RAG systems in multi-turn interactions. Evaluations using both open-source and commercial LLMs reveal opportunities for refinement, especially in citation accuracy and response quality. The deployment of conversation compression strategies, such as LLM-based summarization of conversation history, offers a practical way to mitigate the long-context problem that arises with extended dialogue histories.
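The compression idea can be sketched as follows, assuming a `keep_last` window of verbatim recent turns and a summarizer for everything older. The paper summarizes with an LLM; the keyword-counting `summarize` stub here is a hypothetical stand-in so the example is self-contained.

```python
from collections import Counter

def summarize(turns, max_words=12):
    """Stand-in for an LLM summarizer: keep the most frequent content words."""
    words = [w.lower() for t in turns for w in t.split() if len(w) > 3]
    common = [w for w, _ in Counter(words).most_common(max_words)]
    return "Earlier topics: " + ", ".join(common)

def compress_history(turns, keep_last=2):
    """Keep the last `keep_last` turns verbatim; collapse older turns to a summary."""
    if len(turns) <= keep_last:
        return list(turns)
    return [summarize(turns[:-keep_last])] + list(turns[-keep_last:])

history = [
    "Who discovered penicillin?",
    "Alexander Fleming discovered penicillin in 1928.",
    "Where did he work at the time?",
    "He worked at St Mary's Hospital in London.",
    "What award did he receive for it?",
]
compressed = compress_history(history, keep_last=2)
```

The compressed history then replaces the raw turns in the retrieval and generation prompts, trading some context fidelity for a bounded prompt length.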

This research also offers a valuable discussion of model scaling. Analyzing parameter counts from 500 million to 7 billion, the authors show that while response quality tends to plateau beyond a certain model size, citation accuracy continues to benefit from larger models, suggesting that different facets of conversational RAG systems optimize at different scales.

In conclusion, CORAL fills a critical need for comprehensive evaluation in conversational RAG, providing a versatile benchmark for advancing multi-turn dialogue systems towards practical use. The authors' contributions lay a foundation for future research, notably in refining context handling and response generation within dynamic, information-rich conversations. Future work will likely integrate more sophisticated retrieval and generation architectures and further refine the simulation of complex conversational nuances, contributing to the practical deployment of AI in interactive systems.
