Emergent Mind


Conversational information seeking (CIS) is a prominent area in information retrieval (IR) that focuses on developing interactive knowledge assistants. These systems must comprehend the user's information need within the conversational context and retrieve the relevant information. To this end, existing approaches model the user's information need with a single query, called the rewritten query, and use this query for passage retrieval. In this paper, we propose three different methods for generating multiple queries to enhance retrieval. These methods leverage the ability of large language models (LLMs) to understand the user's information need and generate an appropriate response in order to generate multiple queries. We implement and evaluate the proposed models using various LLMs, including GPT-4 and Llama-2 chat, in zero-shot and few-shot settings. In addition, we propose a new benchmark for TREC iKAT based on GPT-3.5 judgments. Our experiments reveal the effectiveness of our proposed models on the TREC iKAT dataset.
The system breaks down a complex user request into queries, searches for passages, and generates an informed answer.


  • The paper proposes methods to enhance conversational response retrieval using LLMs, aiming to address the limitations of existing systems that fail to capture complex information needs.

  • Three novel approaches are introduced: Answer-driven Query Generation (AD), Query Generation (QD), and Answer and Query Generation (AQD), with an additional variant AQDAnswer that re-ranks results for better passage retrieval.

  • Experiments on the TREC iKAT dataset show that the AQD and AD methods, especially with GPT-4, significantly outperform baselines, indicating the benefit of multiple queries and re-ranking strategies.

  • The study suggests the importance of leveraging LLMs not just for response generation but within the retrieval process itself, opening avenues for future work on optimizing query generation and integrating user feedback.

Generate then Retrieve: Enhancing Conversational Response Retrieval with LLMs

Methods Overview

The paper introduces novel approaches to improve conversational response retrieval by leveraging LLMs. It identifies the main limitation of existing retrieval systems, which typically employ a single rewritten query for passage retrieval, failing to address complex information needs that require reasoning over multiple facts. To overcome this, the authors propose three methods:

  1. Answer-driven Query Generation (AD): Using the LLM's generated answer as a single long query for retrieval.
  2. Query Generation (QD): Prompting the LLM to directly generate multiple queries from the conversational context.
  3. Answer and Query Generation (AQD): A two-step method where the LLM first generates an answer and then produces multiple queries to refine this answer.
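The three strategies above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `llm` and `search` callables, the prompt wording, and the use of reciprocal rank fusion to merge the per-query rankings are all assumptions introduced here, since the summary does not say how the result lists are combined.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions, not from the paper):
LLM = Callable[[str], str]                # prompt -> generated text
Search = Callable[[str, int], List[str]]  # (query, k) -> ranked passage ids


def generate_queries(llm: LLM, prompt: str) -> List[str]:
    """Ask the LLM for one query per line and drop empty lines."""
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def fuse(rankings: List[List[str]], k: int, rrf_k: int = 60) -> List[str]:
    """Merge ranked lists with reciprocal rank fusion (an assumed choice)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] += 1.0 / (rrf_k + rank)
    # Sort by descending score; break ties by passage id for determinism.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k]


def ad(llm: LLM, search: Search, context: str, k: int = 10) -> List[str]:
    """AD: use the generated answer itself as one long query."""
    answer = llm(f"Answer the last user turn.\n{context}")
    return search(answer, k)


def qd(llm: LLM, search: Search, context: str, k: int = 10) -> List[str]:
    """QD: generate several queries directly from the conversation."""
    queries = generate_queries(llm, f"List search queries, one per line.\n{context}")
    return fuse([search(q, k) for q in queries], k)


def aqd(llm: LLM, search: Search, context: str, k: int = 10) -> List[str]:
    """AQD: generate an answer first, then queries conditioned on that answer."""
    answer = llm(f"Answer the last user turn.\n{context}")
    prompt = f"List search queries, one per line.\n{context}\nAnswer: {answer}"
    queries = generate_queries(llm, prompt)
    return fuse([search(q, k) for q in queries], k)
```

Any retriever that returns ranked passage ids (for example a BM25 index) could stand in for `search`; the fusion step is only needed for QD and AQD, where several queries are issued.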

An additional variant, AQDAnswer, re-ranks results based on predicted relevance to the LLM's generated response, aiming to improve the quality of retrieved passages. The paper compares these methods against standard approaches and evaluates them using LLMs including GPT-4 and Llama-2 in different settings.
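The re-ranking idea behind AQDAnswer can be illustrated with a small sketch. The summary does not specify the relevance model, so the token-overlap scorer below is only a stand-in assumption; in practice a learned relevance model would likely take its place. The function names are hypothetical.

```python
from typing import Callable, Dict, List


def token_overlap(answer: str, passage: str) -> float:
    """Jaccard overlap between word sets; a toy stand-in for a relevance model."""
    a = set(answer.lower().split())
    p = set(passage.lower().split())
    union = a | p
    return len(a & p) / len(union) if union else 0.0


def rerank_by_answer(
    answer: str,
    passages: Dict[str, str],
    scorer: Callable[[str, str], float] = token_overlap,
) -> List[str]:
    """Order retrieved passage ids by their score against the generated answer."""
    return sorted(passages, key=lambda pid: (-scorer(answer, passages[pid]), pid))
```

Here `passages` maps the ids returned by the retrieval step to their text, and the LLM's initial answer serves as the reference against which each passage is scored.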

Experimental Setup and Results

The experiments are conducted on the TREC Interactive Knowledge Assistance Track (iKAT) dataset, which reflects the complexity of conversational information seeking tasks. The proposed methods are evaluated against baselines that follow either the generate-then-retrieve or the retrieve-then-generate paradigm, using a variety of LLMs.

Results indicate that AQD and AD methods, particularly when utilizing GPT-4, significantly outperform the baselines. AQD shows superior performance over single-query rewriting approaches (QR) and even outpaces human-rewritten queries in certain metrics. Notably, AQDAnswer's re-ranking strategy based on the initial generated answer leads to further improvements, showcasing the potential of LLMs in enhancing retrieval through a nuanced understanding of the conversational context and the user's information need.

Implications and Future Work

This study presents a significant shift towards utilizing the generative capabilities of LLMs for improving information retrieval in conversational systems. By demonstrating that multiple queries generated from LLMs' responses can lead to better retrieval outcomes, it opens up new avenues for research in conversational search systems. It also highlights the importance of leveraging LLMs not just for generating responses but as integral components of the information retrieval process.

One promising direction for future work is exploring the optimal number of queries to generate and the impact of query quality on retrieval effectiveness. Additionally, integrating user feedback into the generative process could further personalize and refine the retrieval outcomes, making the conversational system more responsive to the user's specific needs.

Ethical Considerations and Limitations

The reliance on LLMs introduces potential biases and errors inherent in these models, which can affect the quality of generated responses and queries. Moreover, the effectiveness of the proposed methods is contingent upon the quality of the LLM's initial response, highlighting a dependency that could be problematic if the LLM fails to understand the user's request accurately. Future research should address these challenges, ensuring that conversational systems remain reliable, unbiased, and user-centric in their approach to information retrieval.

