Self-MedRAG: a Self-Reflective Hybrid Retrieval-Augmented Generation Framework for Reliable Medical Question Answering

Published 8 Jan 2026 in cs.IR and cs.AI | (2601.04531v1)

Abstract: LLMs have demonstrated significant potential in medical Question Answering (QA), yet they remain prone to hallucinations and ungrounded reasoning, limiting their reliability in high-stakes clinical scenarios. While Retrieval-Augmented Generation (RAG) mitigates these issues by incorporating external knowledge, conventional single-shot retrieval often fails to resolve complex biomedical queries requiring multi-step inference. To address this, we propose Self-MedRAG, a self-reflective hybrid framework designed to mimic the iterative hypothesis-verification process of clinical reasoning. Self-MedRAG integrates a hybrid retrieval strategy, combining sparse (BM25) and dense (Contriever) retrievers via Reciprocal Rank Fusion (RRF) to maximize evidence coverage. It employs a generator to produce answers with supporting rationales, which are then assessed by a lightweight self-reflection module using Natural Language Inference (NLI) or LLM-based verification. If the rationale lacks sufficient evidentiary support, the system autonomously reformulates the query and iterates to refine the context. We evaluated Self-MedRAG on the MedQA and PubMedQA benchmarks. The results demonstrate that our hybrid retrieval approach significantly outperforms single-retriever baselines. Furthermore, the inclusion of the self-reflective loop yielded substantial gains, increasing accuracy on MedQA from 80.00% to 83.33% and on PubMedQA from 69.10% to 79.82%. These findings confirm that integrating hybrid retrieval with iterative, evidence-based self-reflection effectively reduces unsupported claims and enhances the clinical reliability of LLM-based systems.

Summary

  • The paper presents an iterative hybrid retrieval-augmented generation framework that improves evidence alignment for reliable medical QA.
  • It combines BM25 and Contriever using Reciprocal Rank Fusion and employs a lightweight self-reflection critic for iterative rationale verification.
  • Empirical results on MedQA and PubMedQA benchmarks demonstrate significant accuracy gains, highlighting its potential for clinical decision support.

Self-MedRAG: An Iterative Hybrid RAG Architecture for Reliable Medical QA

Motivation and Problem Formulation

Self-MedRAG addresses a defining challenge in automated medical question answering: robust factual grounding in dynamic, high-stakes biomedical contexts. Conventional LLMs, despite their proficiency in synthesizing complex knowledge, persistently exhibit hallucinations and lack adaptive evidence alignment. Single-pass retrieval-augmented generation (RAG) frameworks improve grounding but do not support the iterative, multi-hop reasoning that characterizes clinical diagnosis, leaving complex queries answered unreliably or without adequate evidentiary support.

System Architecture

Self-MedRAG is an integrated iterative pipeline operationalized through four primary modules: hybrid retrieval, answer generation, self-reflection (critic), and query refinement. The retrieval backbone employs Reciprocal Rank Fusion (RRF) to combine BM25 (lexical) and Contriever (semantic) retrievers, leveraging their complementary strengths for maximal coverage of relevant biomedical evidence. The generator uses a DeepSeek LLM, receiving structured prompts composed of the query, the retrieved evidence, and, optionally, the multi-step reasoning history.
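
To make the fusion step concrete, the following is a minimal sketch of RRF over the two retrievers' ranked lists. The function name, the toy document IDs, and the k=60 smoothing constant (the value from the original RRF paper) are illustrative assumptions rather than details reported for Self-MedRAG.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over every list in which it
    appears; k=60 is the conventional constant from the original RRF paper
    (an assumption here, not a value confirmed by Self-MedRAG).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy rankings standing in for BM25 (sparse) and Contriever (dense) output.
bm25_ids = ["d3", "d1", "d7", "d2"]
contriever_ids = ["d1", "d9", "d3", "d5"]
print(reciprocal_rank_fusion([bm25_ids, contriever_ids], top_n=3))
```

Because RRF operates on ranks rather than raw scores, it needs no score normalization across the lexical and semantic retrievers, which is what makes it a convenient fusion choice here.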

Each cycle produces a rationale-augmented answer, which is then subjected to rationale-level verification via a lightweight critic module. Two configurations are compared: RoBERTa-large-MNLI (NLI-based entailment) and Llama 3.1-8B (LLM-based verification). If the rationale support score—fraction of rationale statements entailed by retrieved evidence—falls below a calibrated threshold, unsupported rationales are isolated and used to reformulate the query, triggering a new retrieval/generation/verification cycle. The process halts when rationale sufficiency is achieved.
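
The loop below is a minimal sketch of this generate-verify-refine cycle under stated assumptions: the NLI critic mirrors the paper's RoBERTa-large-MNLI configuration, but the 0.8 support threshold, the per-statement entailment cutoff, and the retrieve/generate/reformulate hooks are hypothetical stand-ins for components the summary does not specify at this level of detail.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# NLI critic checkpoint named in the paper's critic comparison.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def entailed(evidence: str, statement: str, p_min: float = 0.5) -> bool:
    """True if the evidence passage entails the rationale statement."""
    inputs = tok(evidence, statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    # roberta-large-mnli maps index 2 to ENTAILMENT; check id2label
    # before reusing this index with another checkpoint.
    return probs[2].item() >= p_min

def self_medrag_loop(query, retrieve, generate, reformulate,
                     threshold=0.8, max_iters=2):
    """Retrieve -> generate -> verify -> refine until the rationale holds.

    retrieve/generate/reformulate are caller-supplied callables standing in
    for the hybrid RRF retriever, the DeepSeek generator, and the query
    rewriter. The 0.8 support threshold is an illustrative assumption; the
    two-iteration cap echoes the paper's saturation observation.
    """
    answer = None
    for _ in range(max_iters + 1):
        docs = retrieve(query)
        answer, statements = generate(query, docs)
        # Rationale support score: fraction of rationale statements
        # entailed by at least one retrieved document.
        unsupported = [s for s in statements
                       if not any(entailed(d, s) for d in docs)]
        score = 1 - len(unsupported) / max(len(statements), 1)
        if score >= threshold:
            return answer
        # Isolate unsupported statements and fold them into a new query.
        query = reformulate(query, unsupported)
    return answer  # best effort once the iteration budget is exhausted
```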

Empirical Results

Self-MedRAG demonstrates strong quantitative improvements when evaluated on two prominent benchmarks: MedQA (clinical diagnosis multiple-choice) and PubMedQA (research abstract-based inference). Key results include:

  • Hybrid retrieval via RRF nearly doubles MedQA accuracy over individual retrievers (BM25: 41.74%, Contriever: 43.30%, RRF: 80.00%) and yields meaningful gains on PubMedQA (BM25: 66.80%, Contriever: 67.90%, RRF: 69.10%).
  • Iterative self-reflection with the NLI critic lifts accuracy over the RRF baseline to 79.82% on PubMedQA and 83.33% on MedQA, absolute gains of roughly 10.7 and 3.3 points respectively.
  • The LLM-based critic performs comparably (PubMedQA: 78.76%, MedQA: 82.90%), with the majority of gains attributable to the iterative refinement process rather than critic architecture.
  • Performance saturates after two iterations, indicating diminishing returns with further cycles—initial rounds deliver the principal benefit in rationale verification and unsupported claim resolution.

Analysis of Retrieval and Verification Components

The pronounced improvement from hybrid fusion underscores the necessity of both precise keyword matching and semantically aligned passage retrieval in biomedical QA. The critique of single-retriever performance (especially MedCPT) points to limitations such as embedding collapse and insufficient sensitivity to fine-grained biomedical distinctions, corroborating the importance of diverse retrieval signals. The iterative design mitigates the unsupported generation typical of LLMs, not only by enforcing rationale-document entailment but also by exposing ungrounded assumptions for explicit resolution in subsequent iterations.

RoBERTa-large-MNLI's marginal advantage over the LLM critic is consistent with its task-specific entailment training. Most of the improvement, however, stems from the iterative verification loop itself rather than from the choice of critic, which points to rationale-support thresholds and feedback-loop design as the more promising targets for future optimization.
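
As a sketch of that direction, the rationale-support threshold could be calibrated by a simple grid search over a held-out validation split; the hooks and candidate grid below are hypothetical illustrations, not a procedure from the paper.

```python
def calibrate_threshold(validation_items, run_pipeline,
                        candidates=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Grid-search the rationale-support threshold on a validation split.

    validation_items is a list of (question, gold_answer) pairs and
    run_pipeline(question, threshold) runs the full Self-MedRAG loop;
    both are hypothetical hooks, as is the candidate grid.
    """
    def accuracy(tau):
        correct = sum(run_pipeline(q, tau) == gold
                      for q, gold in validation_items)
        return correct / len(validation_items)
    return max(candidates, key=accuracy)
```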

Implications and Prospects

Practically, Self-MedRAG advances the state-of-the-art in reliable AI-powered clinical decision support and medical research QA, making it possible to adapt to evolving medical evidence without model retraining. Its rationale-level verification pipeline provides transparent and traceable reasoning chains, critical for clinical auditability and regulatory compliance.

Theoretically, the work foregrounds the utility of explicit self-reflection for autonomous QA systems, opening the path toward agentic RAG architectures with autonomous planning and multi-tool verification capabilities. The demonstrated iterative evidence alignment sets a precedent for integrating lightweight critics in domain-specific QA workflows, reducing computational overhead relative to full-blown agentic frameworks.

Further extensions could involve domain-adapted factuality critics, the introduction of knowledge graph-based entity linking, and dynamic calibration of rationale verification thresholds per task and evidence quality. The platform is amenable to expansion with external tools (e.g., biomedical databases), and its modular design could accommodate deeper clinical reasoning models (e.g., iterative differential diagnosis).

Conclusion

Self-MedRAG operationalizes a hybrid, rationale-verifying iterative RAG approach, substantially improving the accuracy and clinical reliability of automated medical question answering. The fusion of sparse and dense retrieval, coupled with rationale-level self-reflection, delivers robust factual grounding, particularly for multi-step, evidence-synthesis tasks. These findings establish iterative, self-reflective hybrid RAG as a foundational methodology for reliable, transparent biomedical QA, recommending further investigation into critic design, agentic workflows, and adaptive evidence integration for next-generation clinical decision-support systems.
