- The paper demonstrates that dense retrieval combined with query reformulation significantly boosts zero-shot QA accuracy on the MedQA benchmark.
- The systematic analysis reveals that domain-specialized LLMs outperform general models across various pipeline configurations, emphasizing the value of biomedical pretraining.
- The study quantifies computational tradeoffs, showing that while full pipeline features increase inference time, dense retrieval alone achieves strong performance with higher throughput.
Systematic Evaluation of Retrieval Pipeline Design for Medical RAG QA
Introduction
The paper "A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering" (2604.07274) provides a rigorous empirical analysis of retrieval-augmented generation (RAG) architectures for medical question answering (QA) using the MedQA USMLE benchmark and structured textbook corpora. Through exhaustive experimentation, the study dissects the contributions of key retrieval pipeline components—including retrieval strategies, embedding and reranking models, and query reformulation mechanisms—on system performance, resource utilization, and domain adaptation.
Methodological Overview
A comprehensive experimental framework was employed, consisting of 40 pipeline configurations spanning combinations of two instruction-tuned LLMs (the medically specialized LLaMA3-Med42-8B and the general-purpose Gemma3) and two embedding models (the domain-specific MedEmbed and the general-purpose BGE). The pipelines evaluated dense and hybrid (dense plus BM25-based sparse) retrieval, query reformulation, cross-encoder reranking, and both zero-shot and chain-of-thought (CoT) prompting regimes.
The curated knowledge base comprises medical textbooks, preprocessed through a structure-preserving chunking pipeline that maintains semantic and hierarchical integrity at the passage level, ensuring clinical context coherence for retrieval. Vector indexing used FAISS for dense representations, and Reciprocal Rank Fusion was applied in hybrid retrieval scenarios. All configurations were exhaustively evaluated on a single consumer-grade GPU, measuring both exact-match accuracy and throughput.
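The hybrid configurations merge dense and sparse result lists with Reciprocal Rank Fusion. As a minimal sketch of the standard RRF scoring rule (the constant k=60 is the conventional default; the study's exact fusion parameters are not stated):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of passage ids via Reciprocal Rank Fusion.

    rankings: a list of ranked lists, each ordered best-first
    (e.g. one from dense retrieval, one from BM25).
    Each passage's fused score is the sum of 1 / (k + rank)
    over every list in which it appears.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Illustrative toy example (passage ids are hypothetical):
dense = ["a", "b", "c"]   # dense retriever's ranking
sparse = ["b", "c", "a"]  # BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF operates only on ranks, not raw scores, it needs no score normalization between the dense and sparse retrievers, which is why it is a common default for hybrid fusion.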
Empirical Findings
The experimental analysis delivers several strong results and technical insights:
- Retrieval augmentation yields statistically significant improvement in zero-shot QA accuracy—dense retrieval with query reformulation and reranking attains 60.49% accuracy on MedQA, representing a +4.95 point gain over the LLaMA-Med42 zero-shot baseline (McNemar's p=0.00027).
- Domain-specialized LLMs (LLaMA-Med42) consistently outperform general models (Gemma3) across retrieval settings, confirming the value of biomedical pretraining for evidence utilization and answer generation.
- Query reformulation and cross-encoder reranking are the most effective retrieval pipeline additions, driving the largest marginal gains in accuracy (+0.91 and +1.35 percentage points, respectively), albeit with moderate increases in computational demand.
- Hybrid dense-sparse retrieval degrades accuracy and dramatically increases runtime on this structured corpus, suggesting minimal marginal utility for sparse lexical matching where semantic embedding coverage and passage structure are already high.
- CoT prompting notably improves baseline accuracy but interacts minimally with retrieval augmentation: although CoT on LLaMA-Med42 (without retrieval) improves accuracy to approximately 59.7%, adding retrieval under CoT has a limited further effect (+0.63 percentage points, not significant).
- Computational tradeoffs are sharply quantified: enabling the full feature set (dense retrieval, query reformulation, reranking) approximately quadruples inference time relative to the no-RAG baseline, but a runtime-efficient variant with dense retrieval alone maintains strong accuracy at much higher throughput.
- Domain-specific embedding models (MedEmbed) marginally outperformed general ones (BGE) in matched settings, particularly when combined with reranking, but the effect size was modest and sensitive to the overall pipeline configuration.
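The significance figure above (McNemar's p=0.00027) comes from McNemar's test, which compares two systems on the same questions using only the discordant pairs (questions one system gets right and the other gets wrong). A hedged sketch of the exact two-sided test, using only the standard library; the counts in the example are illustrative, not the paper's:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on discordant pair counts.

    b: questions the baseline answers correctly but the RAG system misses.
    c: questions the RAG system answers correctly but the baseline misses.
    Under the null hypothesis the discordant outcomes are Binomial(n, 0.5),
    so the p-value is the doubled binomial tail at min(b, c).
    """
    n = b + c
    k = min(b, c)
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(p, 1.0)  # doubling can exceed 1 when b == c


# Illustrative counts: 9 questions flip to correct, 1 flips to wrong.
p_value = mcnemar_exact(1, 9)
```

Because both systems are scored on the identical question set, this paired test is more sensitive than comparing the two accuracy percentages directly.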
Implications and Theoretical Impact
This work provides substantial new clarity regarding optimal RAG system design for medical QA. Notably, the experimental results robustly challenge assumptions about the universal utility of hybrid retrieval—showing its inefficiency in tasks grounded in hierarchically structured, concept-dense corpora. Theoretically, the study's results reinforce the necessity of domain specialization in both LLMs and embedding models and empirically validate the importance of aligning query formulation with the ontology and terminology of the knowledge base.
From a systems perspective, the findings support a hierarchical, pragmatically selective approach to retrieval augmentation: dense semantic retrieval, augmented by LLM-driven query reformulation and reranker-based filtering, achieves nearly all of the possible accuracy improvements, with diminishing returns for additional pipeline complexity. The results also demonstrate that careful passage chunking, context window limitation, and evidence selection are more impactful than simply maximizing context length or retrieval set size, supporting recent theoretical discussions on evidence grounding and information overload in LLM inference.
These results have immediate practical implications: robust retrieval-augmented medical QA can be developed and meaningfully benchmarked using affordable hardware, lowering the barrier to entry for academic and clinical research groups not possessing high-end compute clusters.
Future Directions
Several areas are open for future technical investigation:
- Extension of pipeline designs to less-structured corpora (e.g., biomedical literature, clinical notes) to reassess the value of hybrid or adaptive retrieval under higher knowledge heterogeneity.
- Integration with more advanced adaptive retrieval mechanisms, potentially incorporating feedback loops, dynamic evidence weighting, or selective reranking.
- Evaluation on free-form generative QA and real-world information-seeking dialogs to move beyond multiple-choice benchmarks.
- Exploration of scalable, distributed evaluation frameworks to support even broader systematic benchmarking.
Conclusion
This systematic study establishes robust empirical guidelines for the design of retrieval-augmented medical QA systems. Dense retrieval with domain-specialized components—augmented by targeted query reformulation and reranking—delivers the highest factual accuracy and resource efficiency for structured clinical QA. The analysis clarifies the computational and theoretical tradeoffs of retrieval component choices and provides a reproducible foundation for further research on clinical RAG architectures (2604.07274).