
RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs (2407.02485v1)

Published 2 Jul 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: LLMs typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well by adding a small fraction of ranking data into the training blend, and outperform existing expert ranking models, including the same LLM exclusively fine-tuned on a large amount of ranking data. For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-sourced model with the state-of-the-art performance on RAG benchmarks. Specifically, our Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. In addition, it also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, demonstrating its superb capability for generalization to new domains.

Introduction

The paper "RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs" addresses a central challenge in retrieval-augmented generation (RAG) with LLMs. Traditional RAG pipelines rely on a retriever to fetch the top-k contexts for question answering, where k is typically small for reasons of efficiency and accuracy. This design has two limitations. First, LLMs do not process large numbers of chunked contexts efficiently, which is why k must stay small. Second, existing retrievers have limited ability to learn effective local alignments across a large embedding space, so the most relevant contexts may not appear among the few that are kept. RankRAG aims to overcome both issues by instruction fine-tuning a single LLM for both context ranking and answer generation.
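
To make the baseline concrete, the top-k retrieval step described above can be sketched as follows. This is a minimal illustration, not the retriever used in the paper: the bag-of-words `embed` function is a toy stand-in for a learned dense encoder, and the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a dense encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(question: str, passages: list[str], k: int) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


passages = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
    "France borders Spain and Italy.",
]
top = retrieve_top_k("What is the capital of France?", passages, k=2)
print(top)
```

In this toy run the second slot goes to the Nile passage rather than the one about France, because the score only counts shared words. A learned retriever is far better than this, but the example shows the kind of miss that a small k cannot recover from.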

Key Contributions

The paper makes three main contributions:

  • Unified Instruction-Tuning Framework: The core innovation of RankRAG is the unified instruction-tuning framework that enables a single LLM to perform both context ranking and answer generation. This is achieved by incorporating a small fraction of ranking data into the instruction-tuning blend, significantly enhancing the LLM's capability to identify relevant contexts and generate accurate answers.
  • Effective Data Integration: RankRAG integrates context-rich question-answer datasets, retrieval-augmented QA, and ranking datasets. This enhances the LLM's ability to filter out irrelevant contexts during both the retrieval and generation phases of RAG.
  • Empirical Superiority: Llama3-RankRAG outperforms strong baselines, including Llama3-ChatQA-1.5, GPT-4-0613, and GPT-4-turbo-2024-0409, on nine knowledge-intensive benchmarks. It also performs comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data, indicating that it generalizes to new domains.
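
The idea of adding a small fraction of ranking data to the instruction-tuning blend can be sketched as below. This is an illustration under stated assumptions, not the paper's data pipeline: the `build_blend` helper, the 10% fraction, the field names, and the True/False relevance format are all hypothetical.

```python
import random


def build_blend(qa_examples, ranking_examples, ranking_fraction, seed=0):
    """Mix ranking examples into a QA set so that they make up roughly
    `ranking_fraction` of the final blend (hypothetical helper)."""
    if not 0 <= ranking_fraction < 1:
        raise ValueError("ranking_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_rank / (n_qa + n_rank) = ranking_fraction for n_rank.
    n_rank = round(len(qa_examples) * ranking_fraction / (1 - ranking_fraction))
    n_rank = min(n_rank, len(ranking_examples))
    blend = list(qa_examples) + rng.sample(ranking_examples, n_rank)
    rng.shuffle(blend)
    return blend


# Both tasks share one instruction-style format, so one model can learn both.
qa = [
    {"task": "generate", "prompt": f"Context: ... Question {i}?", "target": "answer"}
    for i in range(90)
]
ranking = [
    {"task": "rank", "prompt": f"Is passage {i} relevant to the question?", "target": "True"}
    for i in range(50)
]

blend = build_blend(qa, ranking, ranking_fraction=0.1)
n_rank = sum(ex["task"] == "rank" for ex in blend)
print(len(blend), n_rank)
```

Casting ranking as a text-in, text-out task is what lets it sit in the same blend as the QA data; the actual proportions and prompt wording are given in the paper, not here.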

Experimental Evaluation

Setup

The evaluation covers nine knowledge-intensive benchmarks across three task types:

  1. Open-domain QA: NQ, TriviaQA, PopQA, HotpotQA, 2WikimQA
  2. Fact Verification: FEVER
  3. Conversational QA: Doc2Dial, TopiOCQA, INSCIT
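
This summary does not state how answers are scored on these benchmarks. One common choice for open-domain QA, assumed here purely for illustration, is exact match after light normalization. The `normalize` and `exact_match` helpers below are hypothetical and are not taken from the paper's evaluation code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("Paris", ["London", "Berlin"]))
```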

Results and Analysis

Performance on General-Domain Tasks: RankRAG consistently surpassed strong baselines across various QA tasks. For example, Llama3-RankRAG-8B significantly outperformed Llama3-ChatQA-1.5-8B and GPT-4 models on datasets like NQ and TriviaQA. This demonstrates the effectiveness of integrating context ranking within the instruction-tuning process.

Zero-Shot Generalization: RankRAG performed comparably to GPT-4 on five RAG benchmarks in the biomedical domain without instruction fine-tuning on biomedical data. This suggests that the ranking and generation abilities it learns transfer to domains not covered by the instruction-tuning data.

Implications and Future Directions

The work has implications for both the practical deployment and the theoretical understanding of RAG systems:

  • Enhanced Practical Utility: By unifying context ranking with answer generation, RankRAG eliminates the need for separate ranking models, simplifying the deployment pipeline and potentially reducing latency.
  • Scalability and Efficiency: The demonstrated data efficiency in achieving superior performance with fewer ranking samples suggests that RankRAG can be scaled effectively for various large-scale real-world applications.
  • Theoretical Insights: This paper underscores the mutual enhancement between context ranking and answer generation within an LLM. Further exploration into this synergy might offer deeper theoretical insights into optimizing multi-task instruction tuning.
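
The simplified pipeline, in which one model both ranks the retrieved contexts and generates the answer, can be sketched as follows. This is a structural illustration only: the `UnifiedModel` interface and the `ToyModel` stand-in are hypothetical, and a real system would issue two kinds of prompts to the same instruction-tuned LLM instead of using word overlap.

```python
from typing import Protocol


class UnifiedModel(Protocol):
    """Hypothetical interface: one model exposes both capabilities."""

    def relevance(self, question: str, passage: str) -> float: ...

    def answer(self, question: str, contexts: list[str]) -> str: ...


def rank_then_generate(model: UnifiedModel, question: str,
                       candidates: list[str], k: int) -> tuple[str, list[str]]:
    """Rerank retrieved candidates with the model, keep the top k,
    then generate an answer from those k contexts with the same model."""
    reranked = sorted(candidates,
                      key=lambda p: model.relevance(question, p),
                      reverse=True)
    contexts = reranked[:k]
    return model.answer(question, contexts), contexts


class ToyModel:
    """Stand-in so the sketch runs without an LLM (assumption)."""

    def relevance(self, question: str, passage: str) -> float:
        q = set(question.lower().split())
        p = set(passage.lower().split())
        return len(q & p) / len(q) if q else 0.0

    def answer(self, question: str, contexts: list[str]) -> str:
        # A real model would generate text; the toy returns the best context.
        return contexts[0] if contexts else ""


candidates = [
    "the nile is in africa",
    "paris is the capital of france",
    "france borders spain",
]
answer, contexts = rank_then_generate(ToyModel(), "capital of france", candidates, k=2)
print(answer)
print(contexts)
```

Because both calls go through one object, there is no separate ranking model to train, host, or version. The reranking pass is still an extra step at inference time.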

Conclusion

RankRAG unifies context ranking with answer generation by instruction fine-tuning a single LLM, and in doing so addresses limitations of RAG pipelines that depend on a small top-k from a retriever. The empirical results support its effectiveness on both general-domain and biomedical tasks. Future work could explore finer-grained instruction-tuning strategies and further improve the efficiency and scalability of the framework.

Authors
  1. Yue Yu
  2. Wei Ping
  3. Zihan Liu
  4. Boxin Wang
  5. Jiaxuan You
  6. Chao Zhang
  7. Mohammad Shoeybi
  8. Bryan Catanzaro