
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Published 31 Mar 2024 in cs.CL (arXiv:2404.00610v1)

Abstract: LLMs exhibit remarkable capabilities but are prone to generating inaccurate or hallucinatory responses. This limitation stems from their reliance on vast pretraining datasets, making them susceptible to errors in unseen scenarios. Retrieval-Augmented Generation (RAG) addresses this by incorporating external, relevant documents into the response generation process, thus leveraging non-parametric knowledge alongside LLMs' in-context learning abilities. However, existing RAG implementations primarily focus on the initial input for context retrieval, overlooking ambiguous or complex queries that necessitate further clarification or decomposition for accurate responses. To this end, we propose learning to Refine Query for Retrieval Augmented Generation (RQ-RAG), equipping the model with capabilities for explicit rewriting, decomposition, and disambiguation. Our experimental results indicate that our method, when applied to a 7B Llama2 model, surpasses the previous state-of-the-art (SOTA) by an average of 1.9% across three single-hop QA datasets, and also demonstrates enhanced performance on complex, multi-hop QA datasets. Our code is available at https://github.com/chanchimin/RQ-RAG.

References (40)
  1. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
  2. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  3. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  4. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060.
  5. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  6. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
  7. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
  8. OpenAssistant Conversations: Democratizing large language model alignment. Advances in Neural Information Processing Systems, 36.
  9. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  10. Quark: Controllable text generation with reinforced unlearning. Advances in Neural Information Processing Systems, 35:27591–27609.
  11. SAIL: Search augmented instruction learning. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  12. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.
  13. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.
  14. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789.
  15. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.
  16. OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  17. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  18. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  19. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
  20. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.
  21. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.
  22. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567.
  23. ASQA: Factoid questions meet long-form answers. arXiv preprint arXiv:2204.06092.
  24. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.
  25. Stanford Alpaca: An instruction-following LLaMA model.
  26. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  27. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
  28. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. The 61st Annual Meeting of the Association for Computational Linguistics.
  29. FreshLLMs: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214.
  30. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  31. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  32. Align on the fly: Adapting chatbot behavior to established norms. arXiv preprint arXiv:2312.15907.
  33. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  34. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation. arXiv preprint arXiv:2310.04408.
  35. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
  36. Tree of Thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
  37. Making retrieval-augmented language models robust to irrelevant context. arXiv preprint arXiv:2310.01558.
  38. Chain-of-Note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210.
  39. LIMA: Less is more for alignment. Advances in Neural Information Processing Systems, 36.
  40. DocPrompting: Generating code by retrieving the docs. arXiv preprint arXiv:2207.05987.

Summary

  • The paper presents an innovative query refinement approach that enhances retrieval-augmented generation accuracy by rephrasing and disambiguating queries.
  • It integrates a 7B Llama2 model with iterative techniques and a robust dataset constructed using ChatGPT for diverse query scenarios.
  • Experimental evaluations show significant improvements in accuracy and robustness across single-hop and multi-hop QA tasks using ensemble selection strategies.

RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation

Introduction

This paper addresses inherent limitations of LLMs by integrating retrieval into response generation. While LLMs excel at a variety of tasks, they are static after training and prone to "hallucinations" when their pretraining data is outdated or incomplete. Retrieval-Augmented Generation (RAG) mitigates these problems by combining parametric knowledge with non-parametric, retrieved evidence. However, traditional RAG pipelines seldom refine or clarify the input query. The RQ-RAG framework targets exactly this gap: it explicitly refines queries to improve the accuracy and relevance of retrieved context and, in turn, of the generated responses (Figure 1).

Figure 1: The model learns to search on demand, rewriting, decomposing, and disambiguating a query when needed.

Methodology

RQ-RAG Framework

The RQ-RAG framework is built upon a 7B Llama2 model, augmented to refine queries intelligently through mechanisms like rewriting, decomposing, and disambiguating. This approach is inspired by prior works such as Self-RAG and SAIL, but introduces unique techniques in dataset preparation and query refinement, relying on ChatGPT to improve search queries iteratively across different scenarios.
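The three refinement actions described above can be pictured as a dispatch step ahead of retrieval. The following is a minimal sketch, not the paper's implementation: the special-token names and the heuristic rules are illustrative assumptions standing in for decisions the trained model would make.

```python
# Hypothetical control tokens for RQ-RAG-style query refinement.
# The actual token vocabulary in the paper may differ.
REWRITE, DECOMPOSE, DISAMBIGUATE, NONE = (
    "[S_REWRITE]", "[S_DECOMPOSE]", "[S_DISAMBIGUATE]", "[S_NONE]"
)

def refine(query: str) -> tuple[str, list[str]]:
    """Decide which refinement a query needs and emit sub-queries.

    A trained 7B model predicts the action in RQ-RAG; simple
    heuristics stand in for it here.
    """
    if " and " in query:
        # Complex, multi-part question -> decompose into sub-queries.
        parts = [p.strip() for p in query.split(" and ")]
        return DECOMPOSE, parts
    words = query.lower().split()
    if any(w in ("it", "they", "this") for w in words):
        # Ambiguous referent -> needs disambiguation before retrieval.
        return DISAMBIGUATE, [query]
    if len(words) > 20:
        # Verbose query -> rewrite into a sharper search query.
        return REWRITE, [query]
    return NONE, [query]  # already retrieval-ready
```

Each emitted sub-query is then sent to the retriever, and the retrieved evidence is appended to the generation context before the next decoding step.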

Dataset Construction

The dataset construction transforms existing query-answer pairs into an enriched format that includes refined queries and contextually generated responses. This is achieved through a series of automated steps in which ChatGPT creates scenarios mimicking real-life query complexities, yielding a robust training dataset that captures diverse query reconstruction scenarios (Figure 2).

Figure 2: Dataset construction pipeline.
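The pipeline in Figure 2 can be sketched as follows. This is a simplified sketch under stated assumptions: `call_chatgpt` and `search` are hypothetical stand-ins for the ChatGPT API and the retriever, and the prompt strings are illustrative, not the paper's actual prompts.

```python
def build_training_sample(query, answer, call_chatgpt, search, max_steps=3):
    """Expand one (query, answer) pair into a refinement trajectory.

    Each step refines the query, retrieves evidence for the refined
    form, and records both; the final answer is regenerated from the
    retrieved context rather than copied from the original dataset.
    """
    trajectory = []
    for _ in range(max_steps):
        refined = call_chatgpt(f"Refine this query for retrieval: {query}")
        docs = search(refined)
        trajectory.append({"refined_query": refined, "evidence": docs})
        query = refined
    # Context-grounded regeneration: answer from evidence, not from
    # the original gold string (falls back to it if generation fails).
    grounded = call_chatgpt(
        f"Answer using only this evidence: {trajectory} | question: {query}"
    )
    return {"trajectory": trajectory, "answer": grounded or answer}
```

The resulting samples interleave refined queries, evidence, and a grounded answer, which is the format the generator is then trained on.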

Generator Training and Sampling Strategies

The model is trained to maximize the probability of generating correct responses given refined queries and their corresponding retrieved documents, formulated through an auto-regressive process. During inference, multiple strategies are employed to determine the most appropriate trajectory for generating a response. These include perplexity-based, confidence-based, and ensemble-based strategies, which are carefully designed to select the optimal path without relying on external LLMs.
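The three selection strategies above can be sketched concretely. This is a minimal illustration, assuming each candidate trajectory carries its token log-probabilities and the log-probabilities of its final answer span; the exact scoring used in the paper may differ.

```python
import math
from collections import Counter

def perplexity(logprobs):
    """Perplexity of a trajectory from its token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def select_trajectory(candidates):
    """Pick a final response from sampled trajectories three ways.

    candidates: list of dicts with keys 'answer' (final answer string),
    'logprobs' (token log-probs of the whole trajectory), and
    'answer_logprobs' (log-probs of the answer span only).
    """
    # Perplexity-based: the most fluent full trajectory wins.
    ppl_pick = min(candidates, key=lambda c: perplexity(c["logprobs"]))
    # Confidence-based: the most probable final answer span wins.
    conf_pick = max(candidates, key=lambda c: sum(c["answer_logprobs"]))
    # Ensemble-based: majority vote over final answers.
    ens_answer, _ = Counter(c["answer"] for c in candidates).most_common(1)[0]
    return ppl_pick, conf_pick, ens_answer
```

All three criteria are computed from the model's own scores, which is why no external LLM is needed at selection time.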

Experimental Evaluation

Evaluation Tasks

The RQ-RAG framework was evaluated on both single-hop and multi-hop QA tasks, demonstrating significant improvements in accuracy and relevance over baseline methods. The datasets for evaluation included ARC-Challenge, PopQA, and OpenbookQA for single-hop tasks, and HotpotQA, 2WikiMultiHopQA, and MuSiQue for multi-hop tasks.

Results

The results indicate that RQ-RAG outperforms state-of-the-art methods on single-hop QA tasks with a notable increase in accuracy. It also shows superior performance on multi-hop QA tasks compared to much larger models such as ChatGPT, affirming its efficiency and effectiveness. In particular, the ensemble-based selection strategy enhances performance, especially in multi-hop scenarios (Figure 3).

Figure 3: Performance of different sampling strategies on six tasks.

High Upper Bound Potential

The system demonstrates a high performance upper bound: exploring multiple retrieval paths substantially increases the likelihood that at least one trajectory yields the correct answer. This underscores the importance of query refinement and retrieval variability in enhancing model performance.
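The "upper bound" here is an oracle metric: a question counts as solved if any sampled trajectory produces the gold answer. A minimal sketch of how such a bound would be computed (the function name and exact-match criterion are illustrative assumptions):

```python
def oracle_accuracy(per_question_answers, golds):
    """Oracle (upper-bound) accuracy over sampled trajectories.

    per_question_answers[i]: the final answers produced by all sampled
    trajectories for question i. A question is 'solved' if any one of
    them matches the gold answer exactly.
    """
    hits = sum(
        1 for answers, gold in zip(per_question_answers, golds)
        if gold in answers
    )
    return hits / len(golds)
```

The gap between this oracle score and the score of any single selection strategy indicates how much headroom better trajectory selection could still recover.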

Discussion

Regenerating answers from the retrieval-crafted contexts, rather than reusing the original dataset answers, significantly improves performance, particularly on multi-hop QA tasks. This ensures answers stay aligned with the retrieved context, enhancing accuracy and coherence (Figure 4).

Figure 4: The impact of retention ratio on HotpotQA.

Resilience to Data Source Variability

The RQ-RAG framework demonstrated resilience across different data retrieval sources, maintaining robust performance regardless of the source. This contrasts with earlier models that showed performance dependency on specific data sources, highlighting RQ-RAG's robustness and applicability to a broader range of applications.

Conclusion

The RQ-RAG framework marks a substantial advance in the refinement of queries for retrieval-augmented LLMs. By crafting a dataset that incorporates query rewriting, decomposition, and disambiguation, RQ-RAG significantly enhances retrieval effectiveness and model performance. Future work may explore even more sophisticated methods of evaluating and selecting retrieval pathways, bolstering the promise of this approach for complex, evolving information environments.
