
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models (2401.15269v3)

Published 27 Jan 2024 in cs.CL, cs.AI, and cs.IR

Abstract: Recent proprietary LLMs, such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed by searching documents from the knowledge corpus and appending them unconditionally or selectively to the input of LLMs for generation. However, when applying existing methods to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework reliable for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG that can assess its generated explanations with customized reflective tokens. Our work proves that domain-specific components, such as a retriever, domain-related document corpus, and instruction sets are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results of Self-BioRAG demonstrate significant performance gains by achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, we analyze that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.

Introduction

LLMs such as GPT-4 have made significant contributions to fields such as biomedical text analysis. However, when tackling domain-specific problems, general retrieval-augmented generation (RAG) methods often generalize poorly, retrieving incorrect documents or making inaccurate judgments. In response, the authors introduce Self-BioRAG, a framework tailored to the biomedical domain that aims to improve medical reasoning through retrieval and self-reflection.

Background

Self-BioRAG's performance is benchmarked against well-known proprietary and open-source LLMs. While proprietary LLMs have advanced through instruction tuning, their restricted access poses a challenge for researchers in the biomedical area. Open-source alternatives such as the LLaMA family have consequently garnered interest, albeit with room for domain-specific improvement. Furthermore, Self-RAG, a model designed for cost-efficient retrieval that uses reflective tokens to assess its own outputs, was previously introduced but found inadequate for biomedical queries.
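
To make the reflective-token idea concrete, below is a rough Python sketch of how a critic model's reflective tokens might be turned into a single score for a candidate answer. The token names, weights, and scoring formula are illustrative assumptions, not the exact vocabulary or weighting used by Self-RAG or Self-BioRAG.

```python
# Minimal sketch of reflective-token scoring; token names and weights are
# illustrative assumptions, not the paper's exact implementation.
from dataclasses import dataclass

# Hypothetical reflective-token vocabularies: relevance of retrieved evidence,
# support (groundedness) of the answer, and overall utility of the answer.
RELEVANCE = {"[Relevant]": 1.0, "[Irrelevant]": 0.0}
SUPPORT = {"[Fully supported]": 1.0, "[Partially supported]": 0.5, "[No support]": 0.0}
UTILITY = {"[Utility:5]": 1.0, "[Utility:4]": 0.75, "[Utility:3]": 0.5,
           "[Utility:2]": 0.25, "[Utility:1]": 0.0}

@dataclass
class Candidate:
    answer_logprob: float   # sequence log-probability of the generated answer
    relevance_probs: dict   # model probabilities over RELEVANCE tokens
    support_probs: dict     # model probabilities over SUPPORT tokens
    utility_probs: dict     # model probabilities over UTILITY tokens

def critique_score(c: Candidate, w_rel=1.0, w_sup=1.0, w_use=0.5) -> float:
    """Combine answer likelihood with expected values of the reflective tokens."""
    rel = sum(RELEVANCE[t] * p for t, p in c.relevance_probs.items())
    sup = sum(SUPPORT[t] * p for t, p in c.support_probs.items())
    use = sum(UTILITY[t] * p for t, p in c.utility_probs.items())
    return c.answer_logprob + w_rel * rel + w_sup * sup + w_use * use
```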

Self-BioRAG Framework

Self-BioRAG comprises four components: instruction sets for biomedical and clinical text, a domain-specific retriever, a self-reflection LLM that critiques generated outputs, and a domain-specific, instruction-tuned generator LLM. The instruction sets were crafted from existing biomedical tasks, and a biomedical corpus was compiled from medical textbooks, PubMed abstracts, clinical guidelines, and PMC full-text articles. Training of each LLM follows that of Self-RAG, with an additional refinement step that uses data from the domain-specific components to improve performance.
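
The inference flow these components imply can be sketched as follows. This is a simplified, hypothetical rendering of the retrieve-then-critique loop described above; `generator`, `critic`, `retriever`, and their methods are placeholder names, not the released implementation.

```python
# Simplified sketch of an adaptive retrieve-generate-critique loop.
# All object interfaces below are assumed placeholders.
from typing import List

def answer_question(question: str, generator, critic, retriever, k: int = 5) -> str:
    # 1. Let the generator decide whether external evidence is needed
    #    (hypothetical API standing in for predicting a retrieval token).
    needs_evidence = generator.predict_retrieval_token(question)

    if not needs_evidence:
        # Answer directly from the model's encoded (parametric) knowledge.
        return generator.generate(question)

    # 2. Retrieve top-k passages from the biomedical corpus
    #    (textbooks, PubMed abstracts, clinical guidelines, PMC articles).
    passages: List[str] = retriever.search(question, top_k=k)

    # 3. Generate one candidate answer per passage, then self-reflect:
    #    the critic scores relevance, support, and utility via reflective tokens.
    candidates = [generator.generate(question, evidence=p) for p in passages]
    scores = [critic.score(question, p, a) for p, a in zip(passages, candidates)]

    # 4. Return the best-scoring candidate.
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]
```

In the framework itself, both the retrieval decision and the candidate scoring come from reflective tokens produced by the trained models rather than from an external heuristic.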

Experimental Results

Experiments on three major medical question-answering benchmark datasets demonstrate Self-BioRAG's significant performance gains, outperforming the state-of-the-art open-foundation model by 7.2% absolute on average among models with 7 billion parameters or fewer. Testing the framework under various configurations indicates that retrieving factual content is most effective when done adaptively, depending on the question. Furthermore, the retriever's bias toward medical textbooks suggests strong relevance to USMLE-style questions, reflecting a nuanced match between the corpus and the knowledge these benchmarks demand. The domain-specific components underpin the model's improved performance, and the authors release their code and data to further capabilities in this domain.
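
One simple way to observe such a retrieval bias is to tally the source corpus of the top-ranked passages over a benchmark's questions. The snippet below is an illustrative sketch under assumed data structures (a `retriever.search` call and a per-passage `"source"` field), not the authors' evaluation code.

```python
# Illustrative check of which corpus the retriever favors across a benchmark.
from collections import Counter

def source_distribution(questions, retriever, top_k=5):
    counts = Counter()
    for q in questions:
        for passage in retriever.search(q, top_k=top_k):
            # Each passage is assumed to carry its origin, e.g. "textbook",
            # "pubmed", "clinical_guideline", or "pmc".
            counts[passage["source"]] += 1
    total = sum(counts.values())
    return {src: n / total for src, n in counts.items()}
```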

Conclusion

Self-BioRAG represents a significant stride toward enhancing the effectiveness of LLMs in the biomedical field. Its domain-specific components prove fundamental in enabling the model to autonomously appraise its generated responses and explanations, closely mimicking how a medical expert reasons. Limitations, such as performance drops on closed-domain datasets where excessive noise hampers the model, point to areas for future refinement. Addressing these challenges could lead to LLMs that not only generate but also critique and improve their knowledge-grounded outputs in highly specialized fields such as medicine.

References (47)
  1. Asai, A. et al. (2023). Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.
  2. Bajaj, P. et al. (2016). Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  3. Cao, M. et al. (2022). Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
  4. Chen, Z. et al. (2023). Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079.
  5. Christiano, P. F. et al. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems.
  6. Chung, H. W. et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  7. Dao, T. et al. (2022). Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems.
  8. Fang, Y. et al. (2023). Mol-instructions: A large-scale biomolecular instruction dataset for large language models. arXiv preprint arXiv:2306.08018.
  9. Guo, G. et al. (2003). Knn model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3-7, 2003. Proceedings.
  10. Guu, K. et al. (2020). Retrieval augmented language model pre-training. In International conference on machine learning.
  11. Hendrycks, D. et al. (2020). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  12. Izacard, G. et al. (2022a). Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
  13. Izacard, G. et al. (2022b). Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
  14. Ji, Z. et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys.
  15. Jiang, Z. et al. (2023). Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.
  16. Jin, D. et al. (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences.
  17. Jin, Q. et al. (2019). Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  18. Jin, Q. et al. (2023). Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval. Bioinformatics (Oxford, England).
  19. Kang, M. et al. (2023). Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. arXiv preprint arXiv:2305.18395.
  20. Karpukhin, V. et al. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  21. Kitaev, N. et al. (2018). Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  22. Kitaev, N. et al. (2019). Multilingual constituency parsing with self-attention and pre-training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  23. Kwon, W. et al. (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles.
  24. Lewis, P. et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.
  25. Li, B. et al. (2023). Meddm: Llm-executable clinical guidance tree for clinical decision-making. arXiv preprint arXiv:2312.02441.
  26. Mao, Y. et al. (2021). Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  27. Nori, H. et al. (2023). Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375.
  28. OpenAI (2023a). Chatgpt.
  29. OpenAI (2023b). Openai gpt-4 technical report.
  30. Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  31. Pal, A. et al. (2022). Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning.
  32. Paszke, A. et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
  33. Rajbhandari, S. et al. (2020). Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE.
  34. Schulman, J. et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  35. Shao, Z. et al. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
  36. Singhal, K. et al. (2022). Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
  37. Taori, R. et al. (2023). Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.
  38. Taylor, R. et al. (2022). Galactica: A large language model for science. arXiv preprint arXiv:2211.09085.
  39. Google (2023). Bard.
  40. Touvron, H. et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  41. Wang, Y. et al. (2022). Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
  42. Wang, Y. et al. (2023). Augmenting black-box llms with medical textbooks for clinical question answering. arXiv preprint arXiv:2309.02233.
  43. Wang, Z. et al. (2019). Multi-passage bert: A globally normalized bert model for open-domain question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  44. Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
  45. Wolf, T. et al. (2019). Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  46. Wu, Z. et al. (2023). Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693.
  47. Zhang, X. et al. (2023). Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558.
Authors (4)
  1. Minbyul Jeong (18 papers)
  2. Jiwoong Sohn (6 papers)
  3. Mujeen Sung (20 papers)
  4. Jaewoo Kang (83 papers)
Citations (18)