- The paper introduces a reproducibility-centric evaluation pipeline that quantifies both accuracy and output stability for small open LLMs in medical question answering.
- It rigorously benchmarks models using metrics such as token-level F1, BERTScore, and self-agreement on a fixed MedQuAD subset to assess performance trade-offs.
- Findings reveal significant trade-offs between lexical fidelity, reproducibility, and throughput, highlighting the need for post-hoc stabilization in clinical settings.
Evaluation of Small Open LLMs for Medical Question Answering: A Reproducibility-Centric Framework
Introduction
Integrating LLMs into medical QA, especially within online health communities, imposes requirements that go well beyond mean accuracy. The variability and stability of model-generated outputs are critical for clinical safety and end-user trust. This work presents a systematic, open-source evaluation pipeline that quantifies and benchmarks both the accuracy and the reproducibility of small, open-weight LLMs in medical QA scenarios. The framework treats reproducibility as a primary performance axis, operationalized through self-agreement and response uniqueness, alongside conventional lexical and semantic accuracy measures.
Methodology
Dataset and Experimental Setup
The evaluation uses a fixed 50-question subset of MedQuAD, chosen to be representative of clinically relevant consumer medical queries. The subset is sampled with a fixed seed so that it can be regenerated exactly. Each model is prompted ten times per question, which allows run-to-run variance to be measured over 1,500 total responses (50 questions × 10 runs × 3 models). All models run locally via the Ollama runtime, which removes confounds from external APIs and gives full control over inference settings.
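The sampling and repetition scheme can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the seed value, the placeholder question IDs, the model labels, and the helper names `select_subset` and `build_schedule` are all assumptions introduced here.

```python
import random

def select_subset(question_ids, k=50, seed=42):
    """Return a reproducible k-item subset (the seed value is an assumption)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    # Sort first so the result does not depend on the input order.
    return rng.sample(sorted(question_ids), k)

def build_schedule(subset, models, runs_per_question=10):
    """Enumerate every (model, question_id, run_index) generation job."""
    return [(m, q, r)
            for m in models
            for q in subset
            for r in range(runs_per_question)]

# Placeholder IDs and model labels, not real MedQuAD keys or model tags.
all_ids = [f"q{i:04d}" for i in range(1000)]
subset = select_subset(all_ids)
schedule = build_schedule(subset, ["model-a", "model-b", "model-c"])
print(len(subset), len(schedule))
```

Using a private `random.Random` instance rather than the module-level functions keeps the subset stable even if other code draws random numbers first.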
Three open-weight models were selected: Llama 3.1 8B, Gemma 3 12B, and MedGemma 1.5 4B.
A uniform, clinically constrained prompt minimizes prompt-induced differences in behavior. Generation hyperparameters are fixed (temperature T = 0.2, top-p = 1.0, max tokens = 512).
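One way to pin these settings when calling a local Ollama server is sketched below. It assumes Ollama's `/api/generate` endpoint on the default local port, with `num_predict` as the option that caps output length. The model tag and the prompt text are illustrative, and the helper names are hypothetical rather than taken from the paper.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
GEN_OPTIONS = {"temperature": 0.2, "top_p": 1.0, "num_predict": 512}

def build_request(model_tag, prompt, options=GEN_OPTIONS):
    """Build a non-streaming request body with the fixed hyperparameters."""
    return {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        "options": dict(options),  # copy so callers cannot mutate the defaults
    }

def generate(model_tag, prompt, url=OLLAMA_URL):
    """Send one request; needs a running Ollama server (not called here)."""
    data = json.dumps(build_request(model_tag, prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Illustrative tag and prompt only; the paper's prompt is not reproduced here.
payload = build_request("llama3.1:8b", "What are common symptoms of anemia?")
body = json.dumps(payload)
```

Keeping the options in a single constant makes it harder for one model to be run with different settings from the others.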
Metrics
The framework evaluates models on 8 quality metrics and 2 reproducibility metrics:
- Quality: Exact match, token-level F1, string similarity, BLEU, ROUGE-L, BERTScore, and a single-pass LLM-judge score using a locally-hosted 20B model.
- Reproducibility: Self-agreement (modal output frequency) and response uniqueness (fraction of unique outputs), both normalized for formatting invariance.
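The two reproducibility metrics can be sketched for a single question as follows. The normalization rules here (lowercasing, stripping punctuation, collapsing whitespace) are an assumption; the source says only that outputs are normalized for formatting invariance.

```python
import re
from collections import Counter

def normalize(text):
    """Assumed normalization: lowercase, strip punctuation, collapse spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def self_agreement(responses):
    """Fraction of runs that produce the most frequent (modal) output."""
    counts = Counter(normalize(r) for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

def uniqueness(responses):
    """Fraction of distinct outputs among all runs."""
    return len({normalize(r) for r in responses}) / len(responses)

# Four hypothetical runs for one question.
runs = ["Take with food.", "take with food",
        "Take it with food.", "Avoid alcohol."]
print(self_agreement(runs), uniqueness(runs))  # 0.5 0.75
```

A model-level figure would then be the mean of these per-question values, assuming the paper aggregates per question.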
Efficiency metrics (throughput, latency) are also logged.
The architecture separates generation and scoring, ensuring that post-hoc metric investigations do not reintroduce sampling variance or engineering overhead, and supports auditability through rigorous result logging.
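The separation of generation from scoring can be illustrated with an append-only JSONL log that scoring code reads without calling any model. The record fields, the file name, and the function names are hypothetical; the paper's actual log schema is not given in this summary.

```python
import json
import os
import tempfile

def log_generations(path, records):
    """Append one JSON object per line, so each response is sampled once."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

def load_generations(path):
    """Read the logged responses back; no model is invoked."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def score(records, metric):
    """Apply any metric(pred, gold) to the stored responses."""
    return [{**rec, "score": metric(rec["response"], rec["gold"])}
            for rec in records]

log_path = os.path.join(tempfile.mkdtemp(), "generations.jsonl")
log_generations(log_path, [
    {"model": "model-a", "qid": "q1", "run": 0,
     "response": "rest and fluids", "gold": "rest and fluids"},
    {"model": "model-a", "qid": "q1", "run": 1,
     "response": "see a doctor", "gold": "rest and fluids"},
])
exact = score(load_generations(log_path),
              lambda pred, gold: float(pred == gold))
```

Because a new metric runs against the same stored text, adding it later cannot change the sampled responses.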
Results
Model Accuracy
Semantic overlap, as captured by BERTScore (0.847–0.852), is tightly clustered across all models. Lexical metrics (Token F1) and the LLM-judge score differentiate the models:
- Llama 3.1 8B leads in token-level F1 and BERTScore, reflecting stronger lexical alignment.
- Gemma 3 12B attains the highest LLM-judge score, suggesting more holistic clinical appropriateness according to the rubric.
- MedGemma 1.5 4B, although clinically fine-tuned, underperforms the larger baselines on both quality and reproducibility metrics. The two- to threefold difference in parameter count (4B versus 8B and 12B) confounds the attribution of this deficit to domain-adaptive fine-tuning.
No model achieves an exact match with any gold answer, indicating that all outputs are paraphrases and suggesting a low risk of verbatim dataset leakage.
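For reference, exact match and token-level F1 can be sketched as below. This uses the common bag-of-tokens F1 formulation with whitespace tokenization. Both choices are assumptions, since the summary does not state the paper's tokenizer or normalization.

```python
from collections import Counter

def tokens(text):
    """Assumed tokenization: lowercase, split on whitespace."""
    return text.lower().split()

def exact_match(pred, gold):
    """1.0 only when the token sequences are identical."""
    return float(tokens(pred) == tokens(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall (multiset overlap)."""
    p, g = tokens(pred), tokens(gold)
    if not p or not g:
        return float(p == g)  # both empty counts as a match
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction and gold answer sharing 2 of 4 tokens each.
pred, gold = "drink plenty of fluids", "drink fluids and rest"
print(exact_match(pred, gold), token_f1(pred, gold))  # 0.0 0.5
```

The example shows how a paraphrase can earn partial F1 credit while scoring zero on exact match, which is the pattern reported above.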
Output Stability
A critical and striking finding is the extremely low reproducibility of outputs:
- Self-agreement rates range from 0.12 (Llama 3.1 8B) to 0.20 (Gemma 3 12B), meaning the modal output appears, on average, in only about 1–2 out of 10 runs per question.
- Output uniqueness is nearly maximal: 87–97% of responses per model are unique across repeated runs, even with aggressive normalization and low-temperature sampling.
Pairwise response overlap is zero across models, highlighting divergent output spaces.
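To put the self-agreement figures in context, it helps to work out the lowest value the metric can take. The derivation below assumes both reproducibility metrics are computed per question over n = 10 runs and then averaged. The symbols are introduced here for illustration and are not the paper's notation.

```latex
% c_q(a): number of runs on question q that yield normalized answer a
% d_q:    number of distinct normalized answers on question q
\begin{align}
  s_q &= \frac{\max_a c_q(a)}{n}, &
  u_q &= \frac{d_q}{n}, & n &= 10 \\
  \max_a c_q(a) \ge 1
    &\;\Rightarrow\; s_q \ge \tfrac{1}{n} = 0.10
    &&\Rightarrow\; \bar{s} \ge 0.10 \\
  d_q \le n - \max_a c_q(a) + 1
    &\;\Rightarrow\; u_q \le 1 - s_q + \tfrac{1}{n}
    &&\Rightarrow\; \bar{u} \le 1.10 - \bar{s}
\end{align}
```

Under these assumptions, a mean self-agreement of 0.12 sits only 0.02 above the floor of 0.10, the value obtained when all ten runs differ. The same bound caps mean uniqueness at 0.98 when mean self-agreement is 0.12 and at 0.90 when it is 0.20, which is consistent with the reported 87–97% range.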
Efficiency
Llama 3.1 8B achieves the fastest throughput (~43 tokens/sec), while MedGemma and Gemma are notably slower.
Quality–Reproducibility Trade-off
A partial trade-off emerges: Llama 3.1 8B is fastest and most lexically faithful but least stable; Gemma 3 12B is most reproducible and achieves the highest judge score but at lower throughput; MedGemma trails on all axes.
Implications and Future Work
Clinical and Theoretical Implications
- Output stability is not assured by low temperature alone, and single-pass benchmarks significantly overstate trustworthiness for safety-critical tasks. QA deployment in clinical contexts must incorporate post-hoc stabilization mechanisms (ensemble voting, confidence gating, or mandatory review) to mitigate stochasticity-induced risk.
- Under the tested conditions, the larger general-purpose models outperform the smaller clinically fine-tuned one: domain-adaptive training did not compensate for the parameter deficit. Because model size and fine-tuning are confounded here, empirical head-to-head baseline evaluation is necessary before domain deployment, and comparative studies must control for model size.
- Use of an LLM judge, while scalable, introduces its own randomness. Holistic, rubric-based evaluation should incorporate repeated passes or deterministic alternates for high-stakes selections.
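As an illustration of the stabilization mechanisms mentioned above, the sketch below combines majority voting over repeated runs with a confidence gate that defers to review when agreement is low. The threshold, the exact-string voting rule, and the function names are assumptions and not a method evaluated in the paper.

```python
from collections import Counter

def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def vote_with_gate(responses, min_agreement=0.5):
    """Return (answer, agreement), or (None, agreement) if below threshold."""
    counts = Counter(normalize(r) for r in responses)
    answer, freq = counts.most_common(1)[0]
    agreement = freq / len(responses)
    if agreement < min_agreement:
        return None, agreement  # defer to human review
    return answer, agreement

# Hypothetical run sets: one mostly consistent, one fully divergent.
stable = ["Yes", "yes", "yes ", "No"]
unstable = ["rest", "fluids", "see a doctor", "ibuprofen"]
print(vote_with_gate(stable))    # ('yes', 0.75)
print(vote_with_gate(unstable))  # (None, 0.25)
```

Given the self-agreement rates reported above, exact-string voting on free-text answers would defer almost every question, so a practical version would likely need to group answers by meaning before voting.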
Prospects for Future Research
Prominent avenues for extension include:
- Matched-scale comparisons (e.g., MedGemma 1.5 4B vs. Gemma 3 4B) to decouple domain specialization from raw capacity.
- An expanded model set (including Phi-3, Mistral-Medical, Qwen-Med) and benchmarks (USMLE, MedQA, MIMIC-derived questions).
- Systematic studies on ensemble stabilization strategies and their cost–benefit trade-offs.
- External calibration of LLM-judge scores against expert clinician judgments.
Conclusion
This study operationalizes reproducibility as a critical and quantifiable axis in medical QA evaluation for small, open LLMs. The comprehensive pipeline and empirical findings establish that:
- Low-temperature decoding does not ensure answer reproducibility, thus necessitating robust post-processing for any clinical deployment.
- At these parameter budgets, clinical fine-tuning does not offset model scale with respect to either quality or reproducibility.
- Trade-offs between quality, stability, and efficiency must be weighed against deployment requirements.
The open-source nature of the pipeline invites adoption and iterative refinement, with the intention that reproducibility metrics become a de facto standard for LLM benchmarking in high-stakes domains such as clinical NLP.