
Rationale-Guided Retrieval Augmented Generation for Medical Question Answering (2411.00300v1)

Published 1 Nov 2024 in cs.CL

Abstract: Large language models (LLMs) hold significant potential for applications in biomedicine, but they struggle with hallucinations and outdated knowledge. While retrieval-augmented generation (RAG) is generally employed to address these issues, it also has its own set of challenges: (1) LLMs are vulnerable to irrelevant or incorrect context, (2) medical queries are often not well-targeted for helpful information, and (3) retrievers are prone to bias toward the specific source corpus they were trained on. In this study, we present RAG$^2$ (RAtionale-Guided RAG), a new framework for enhancing the reliability of RAG in biomedical contexts. RAG$^2$ incorporates three key innovations: a small filtering model trained on perplexity-based labels of rationales, which selectively augments informative snippets of documents while filtering out distractors; LLM-generated rationales used as queries to improve the utility of retrieved snippets; and a structure designed to retrieve snippets evenly from a comprehensive set of four biomedical corpora, effectively mitigating retriever bias. Our experiments demonstrate that RAG$^2$ improves state-of-the-art LLMs of varying sizes, with improvements of up to 6.1%, and it outperforms the previous best medical RAG model by up to 5.6% across three medical question-answering benchmarks. Our code is available at https://github.com/dmis-lab/RAG2.

Insights into Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

The paper "Rationale-Guided Retrieval Augmented Generation for Medical Question Answering" presents RAG$^2$, a refined framework that aims to enhance the reliability of retrieval-augmented generation (RAG) models within biomedical contexts. It directly addresses challenges inherent to LLMs in domains that demand high accuracy, such as medicine, where hallucinations and outdated information can compromise outcomes.

Core Innovations in RAG$^2$

RAG$^2$ advances the current landscape of RAG through three distinctive innovations aimed at improving both reliability and contextual relevance within medical question-answering systems:

  1. Rationale-Guided Filtering: The framework introduces a small filtering model trained on perplexity-based labels, which gauge how informative each retrieved snippet is for producing a correct rationale. Scoring snippets this way lets the filter retain informative context while discarding irrelevant or distracting passages. This is critical in medical applications, where the correctness of supplied context directly affects the actionable insights derived by healthcare professionals.
  2. Rationale-Based Queries: RAG$^2$ transitions from using original medical queries to LLM-generated rationales as inputs for retrieval tasks. This approach aids in expanding overly brief queries and narrowing down verbose ones, thereby enhancing the retrieval of pertinent information. This results in more focused searches that align query formulation closer to the domain-specific language of medical texts.
  3. Balanced Retrieval Strategy: To mitigate corpus bias, RAG$^2$ equally sources snippets from four key biomedical corpora, including PubMed, PMC, textbooks, and clinical guidelines. This method ensures a more inclusive utilization of information resources, offsetting biases that might arise from a lopsided emphasis on larger corpora.
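The three-step flow above can be sketched end to end. The following is a minimal, runnable illustration under stated assumptions, not the authors' implementation: `generate_rationale` is a stand-in for the LLM that drafts a rationale, and simple word overlap stands in for both the retriever's relevance score and the perplexity-based filtering model; all names and the toy corpora are hypothetical.

```python
def generate_rationale(question: str) -> str:
    """Stand-in for an LLM-drafted rationale; a real system would prompt an LLM."""
    return "metformin is the first-line therapy recommended for type 2 diabetes"

def overlap(query: str, snippet: str) -> int:
    """Toy lexical relevance score (shared lowercase words)."""
    return len(set(query.lower().split()) & set(snippet.lower().split()))

def retrieve_balanced(query: str, corpora: dict, k_per_corpus: int) -> list:
    """Take the same number of top snippets from every corpus,
    mitigating bias toward any single (e.g. larger) source."""
    hits = []
    for name, snippets in corpora.items():
        ranked = sorted(snippets, key=lambda s: overlap(query, s), reverse=True)
        hits.extend((name, s) for s in ranked[:k_per_corpus])
    return hits

def filter_snippets(rationale: str, hits: list, threshold: int = 2) -> list:
    """Stand-in for the perplexity-based filter: keep snippets judged
    informative for the rationale; here 'informative' is approximated
    by a lexical-overlap threshold rather than a trained model."""
    return [(c, s) for c, s in hits if overlap(rationale, s) >= threshold]

# Toy stand-ins for the four biomedical corpora named in the paper.
CORPORA = {
    "pubmed": ["metformin lowers blood glucose in type 2 diabetes",
               "unrelated astronomy text"],
    "pmc": ["glucose metabolism and insulin resistance", "cooking recipes"],
    "textbooks": ["first-line therapy for type 2 diabetes is metformin"],
    "guidelines": ["guideline: start metformin for type 2 diabetes"],
}

question = "What is first-line therapy for type 2 diabetes?"
rationale = generate_rationale(question)          # step 2: rationale as query
hits = retrieve_balanced(rationale, CORPORA, 1)   # step 3: one hit per corpus
kept = filter_snippets(rationale, hits)           # step 1: filter distractors
print(len(hits), "retrieved;", len(kept), "kept after filtering")
```

In this sketch the balanced retriever returns exactly one snippet per corpus regardless of corpus size, and the filter then drops the off-topic hit, mirroring how RAG$^2$ separates even sourcing from informativeness filtering.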

Performance and Implications

RAG$^2$ demonstrates notable gains across several benchmarks: it improves state-of-the-art LLMs by up to 6.1% and outperforms the previous best medical RAG model by up to 5.6% on medical question-answering benchmarks such as MedQA, MedMCQA, and MMLU-Med. This marked improvement underscores the potential of RAG$^2$ to raise the reliability of automated medical response systems in high-stakes settings.

The paper's contributions extend beyond immediate improvements in medical QA to suggest a pathway for reducing the dependency on model retraining, thereby handling outdated knowledge more efficiently. By refining how LLMs retrieve and process information, particularly through the use of rationale-driven methodologies, RAG$^2$ posits a scalable solution that could adapt to evolving medical data without frequent and intensive retraining.

Theoretical and Practical Implications

Theoretically, this research enriches the understanding of retrieval dynamics in LLMs and highlights the importance of tailored input transformations and filtering mechanisms. Practically, for stakeholders in biomedical AI, the innovations in RAG$^2$ suggest advancements towards integrating complex AI systems more effectively within clinical workflows, where accuracy and trustworthiness are paramount.

Future Directions

While RAG$^2$ presents substantial advancements, the exploration of its adaptability and translational potential to other domains remains a promising area for future research. Given the nuanced challenges across different specialized fields, further studies could refine and tailor the retrieval and filtering strategies developed here to fit the unique requirements of other complex knowledge domains.

In conclusion, this paper lays a robust foundation for improving the application of LLMs in biomedicine through strategic innovation in retrieval-augmented generation methodologies. It addresses both procedural and theoretical challenges, enabling more reliable and contextually accurate AI systems that have significant implications for the future of medical information technology.

Authors (8)
  1. Jiwoong Sohn
  2. Yein Park
  3. Chanwoong Yoon
  4. Sihyeon Park
  5. Hyeon Hwang
  6. Mujeen Sung
  7. Hyunjae Kim
  8. Jaewoo Kang