
Benchmarking Retrieval-Augmented Generation for Medicine (2402.13178v2)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: While LLMs have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall, MedRAG improves the accuracy of six different LLMs by up to 18% over chain-of-thought prompting, elevating the performance of GPT-3.5 and Mixtral to GPT-4-level. Our results show that the combination of various medical corpora and retrievers achieves the best performance. In addition, we discovered a log-linear scaling property and the "lost-in-the-middle" effects in medical RAG. We believe our comprehensive evaluations can serve as practical guidelines for implementing RAG systems for medicine.

Benchmarking Retrieval-Augmented Generation for Medical Question Answering

Introduction to Retrieval-Augmented Generation (RAG) in Medicine

Recent advancements in LLMs have significantly enhanced medical question answering (QA) systems. However, challenges such as the generation of inaccurate information ("hallucinations") and reliance on outdated knowledge persist, raising particular concern in high-stakes fields like healthcare. Retrieval-Augmented Generation (RAG) has emerged as a promising approach to mitigate these issues by grounding LLM responses in relevant documents retrieved from trustworthy sources. Because RAG systems are modular, comprising retrievers, corpora, and LLM backbones, they admit many possible configurations, which calls for comprehensive evaluation to establish best practices for their implementation in medical contexts.

The MIRAGE Benchmark and MedRAG Toolkit

To address this need for systematic evaluation, the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE) benchmark was introduced. Comprising 7,663 questions from five widely used medical QA datasets, MIRAGE supports examining the zero-shot capabilities of RAG systems across varied medical question types. Alongside MIRAGE, the authors propose the MedRAG toolkit, which offers an accessible way to configure and test combinations of RAG components drawn from five corpora, four retrieval algorithms, and six LLMs. The toolkit supports both the practical application of RAG systems in medicine and large-scale analyses that relate system configurations to benchmark performance.
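The configuration space described above can be pictured as a simple retrieve-then-read loop. The sketch below is an illustrative Python outline of that idea, not the MedRAG toolkit's actual API: `retrieve`, `build_prompt`, `ask_llm`, and their signatures are hypothetical placeholders, and the default corpus and retriever names merely echo options discussed in the paper.

```python
# Illustrative retrieve-then-read pipeline in the spirit of MedRAG.
# All names and signatures here are hypothetical placeholders, not the
# actual MedRAG toolkit API.

from dataclasses import dataclass


@dataclass
class Snippet:
    source: str   # e.g. "PubMed", "Textbooks", "StatPearls"
    text: str
    score: float


def retrieve(question: str, corpus: str, retriever: str, k: int = 32) -> list[Snippet]:
    """Return the top-k snippets for `question` from `corpus` using `retriever`
    (e.g. BM25, Contriever, SPECTER, or MedCPT). Placeholder stub."""
    raise NotImplementedError


def ask_llm(llm: str, prompt: str) -> str:
    """Placeholder stub for a call to the chosen backbone LLM."""
    raise NotImplementedError


def build_prompt(question: str, options: dict[str, str], snippets: list[Snippet]) -> str:
    """Pack the retrieved snippets and a multiple-choice question into one prompt."""
    context = "\n".join(f"[{i + 1}] ({s.source}) {s.text}" for i, s in enumerate(snippets))
    choices = "\n".join(f"{key}. {text}" for key, text in options.items())
    return (
        "Answer the medical question using the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}\nOptions:\n{choices}\n"
        "Answer with a single option letter."
    )


def answer(question: str, options: dict[str, str],
           corpus: str = "MedCorp", retriever: str = "MedCPT",
           llm: str = "gpt-3.5-turbo", k: int = 32) -> str:
    """Retrieve supporting snippets, then ask the backbone LLM to answer."""
    snippets = retrieve(question, corpus, retriever, k)
    prompt = build_prompt(question, options, snippets)
    return ask_llm(llm, prompt)
```

Sweeping the corpus, retriever, and backbone LLM arguments over the available options yields the kind of combination grid evaluated on MIRAGE.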

Insights from the Evaluation

The evaluation of RAG systems using MIRAGE surfaced several key findings:

  • A significant enhancement in LLM performance, by up to 18%, was observed when employing RAG over traditional chain-of-thought prompting. Remarkably, certain configurations enabled GPT-3.5 and Mixtral models to rival the performance of their more advanced counterpart, GPT-4.
  • Preference for retrieval corpora varied with the task, highlighting the importance of corpus selection in RAG system configuration. The comprehensive MedCorp corpus, which amalgamates multiple sources, emerged as a robust option across tasks, suggesting the value of cross-source retrieval.
  • Among retrievers, domain-specific options such as MedCPT showed superior performance in medical contexts. Fusion methods such as Reciprocal Rank Fusion further improved retrieval outcomes by aggregating results from multiple retrievers (a minimal sketch follows this list).
  • The paper unveiled scaling properties indicating a log-linear relationship between model performance and the number of retrieved snippets. A "lost-in-the-middle" effect was identified, underscoring the nuanced impact of snippet positioning on answer accuracy.
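For the fusion step mentioned in the findings above, here is a minimal sketch of Reciprocal Rank Fusion, assuming each retriever contributes one ranked list of document IDs; the constant k = 60 is the value commonly used for RRF, and the example document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives a score of sum over rankers of 1 / (k + rank),
    where rank is 1-based; documents are returned sorted by fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse the outputs of two hypothetical retrievers.
bm25_ranking = ["doc3", "doc1", "doc7"]
medcpt_ranking = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([bm25_ranking, medcpt_ranking])[:3])
# doc1 and doc3 rise to the top because both retrievers rank them highly.
```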

Future Directions and Recommendations

The extensive analysis provided by the MIRAGE benchmark and MedRAG toolkit lays the groundwork for future research and the refinement of medical RAG systems. Based on the results, the authors propose several practical recommendations, including selecting comprehensive corpora such as MedCorp and employing domain-specific retrievers, especially for tasks where relevant literature is paramount.

Moreover, the observed performance scaling and snippet-positioning effects invite further exploration into optimizing retrieval depth and ordering. Additionally, incorporating newer RAG architectures and other potentially beneficial resources into MedRAG presents promising avenues for enhancing the toolkit's utility and reliability in medical QA.
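To illustrate what the log-linear scaling claim means in practice, the snippet below fits accuracy against the logarithm of the number of retrieved snippets with NumPy; the accuracy values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

# Hypothetical accuracies measured at different numbers of retrieved snippets;
# these are placeholder values, not results from the paper.
num_snippets = np.array([1, 2, 4, 8, 16, 32])
accuracy = np.array([0.58, 0.61, 0.64, 0.66, 0.68, 0.69])

# Log-linear model: accuracy ≈ a + b * ln(num_snippets).
b, a = np.polyfit(np.log(num_snippets), accuracy, deg=1)
print(f"accuracy ≈ {a:.3f} + {b:.3f} * ln(k)")
```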

Conclusion

In conclusion, the introduction of MIRAGE and MedRAG represents a significant stride toward optimizing RAG systems for medical question answering. Through systematic benchmarking, this work shows how RAG configurations can be tailored to maximize accuracy and reliability in medical QA, marking an essential contribution to computational healthcare.

Authors (4)
  1. Guangzhi Xiong (18 papers)
  2. Qiao Jin (74 papers)
  3. Zhiyong Lu (113 papers)
  4. Aidong Zhang (49 papers)
Citations (85)