Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework

Published 12 Apr 2026 in cs.IR and cs.CL | (2604.10535v1)

Abstract: Incorporating LLMs in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.


Authors (1)

Avi-ad Avraam Buskila

Summary

The paper introduces a reproducibility-centric evaluation pipeline that quantifies both accuracy and output stability for small open LLMs in medical question answering.
It rigorously benchmarks models using metrics such as token-level F1, BERTScore, and self-agreement on a fixed MedQuAD subset to assess performance trade-offs.
Findings reveal significant trade-offs between lexical fidelity, reproducibility, and throughput, highlighting the need for post-hoc stabilization in clinical settings.

Evaluation of Small Open LLMs for Medical Question Answering: A Reproducibility-Centric Framework

Introduction

The integration of LLMs into medical QA, especially within online health communities, introduces requirements that go well beyond mean accuracy. The variability and stability of model-generated outputs are critical for clinical safety and end-user trust. This work presents a systematic, open-source evaluation pipeline that quantifies and benchmarks both the accuracy and the reproducibility of small, open-weight LLMs in medical QA scenarios. The framework treats reproducibility as a primary performance axis, operationalized through self-agreement and response uniqueness, alongside conventional lexical and semantic accuracy measures.

Methodology

Dataset and Experimental Setup

The evaluation uses a fixed 50-question subset of MedQuAD, chosen to represent clinically relevant consumer medical queries. The subset is drawn with a fixed random seed so that the same questions can be reselected on replication. Each model is prompted ten times per question, which exposes run-to-run variance and yields 1,500 responses in total (3 models x 50 questions x 10 runs). All models run locally via the Ollama runtime, eliminating confounds from external APIs and giving full control over inference settings.
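The sampling and repetition scheme described above can be sketched as follows. This is a minimal illustration, not the authors' code: the seed value 42, the placeholder question IDs, and the model labels are assumptions made for the example.

```python
import random

def sample_questions(questions, k=50, seed=42):
    """Return a reproducible k-item subset; the same seed gives the same subset.

    The seed value is an assumption; the source only says a fixed seed is used.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    return rng.sample(questions, k)

def build_run_plan(question_ids, models, n_runs=10):
    """Enumerate every (model, question, run) triple that must be generated."""
    return [(m, q, r) for m in models for q in question_ids for r in range(n_runs)]

# Placeholder IDs standing in for the MedQuAD question pool (hypothetical).
pool = [f"q{i:04d}" for i in range(1000)]
subset = sample_questions(pool)
plan = build_run_plan(subset, ["llama3.1-8b", "gemma3-12b", "medgemma1.5-4b"])
print(len(subset), len(plan))  # 50 1500
```

Using a local `random.Random` instance rather than the module-level functions keeps the subset stable even if other code draws random numbers first.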

Three open-weight models were selected:

Llama 3.1 8B: A canonical 8B parameter instruction-tuned model.
Gemma 3 12B: A 12B parameter general-purpose instruction-tuned model.
MedGemma 1.5 4B: A Gemma-derived, 4B parameter model with clinical fine-tuning.

A uniform, clinically constrained prompt minimizes prompt-based behavior differences. Generation hyperparameters are fixed ($T=0.2$, top-$p=1.0$, max tokens $=512$).
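For readers who want to reproduce these settings, the sketch below shows one way to pass them to a local Ollama server through its `/api/generate` endpoint. It assumes the server listens on the default port 11434 and that the 512-token cap corresponds to Ollama's `num_predict` option; the helper names are hypothetical and the authors' actual client code may differ.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt):
    """Request body with the fixed hyperparameters reported in the text."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {
            "temperature": 0.2,
            "top_p": 1.0,
            "num_predict": 512,  # assumed mapping of "max tokens = 512"
        },
    }

def generate(model, prompt, timeout=300):
    """Send one generation request and return the model's text response."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate(tag, prompt)` requires a running Ollama server and a model tag that has already been pulled; the exact tags used for the three models are not given in this summary.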

Metrics

The framework evaluates models on 8 quality metrics and 2 reproducibility metrics:

Quality: eight metrics, including exact match, token-level F1, string similarity, BLEU, ROUGE-L, BERTScore, and a single-pass LLM-judge score produced by a locally hosted 20B model.
Reproducibility: Self-agreement (modal output frequency) and response uniqueness (fraction of unique outputs), both normalized for formatting invariance.
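The two reproducibility metrics can be illustrated with a short sketch computed over the repeated runs for a single question. The normalization shown here (lowercasing and whitespace collapsing) is an assumption; the source states only that outputs are normalized for formatting invariance.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())

def self_agreement(responses):
    """Fraction of runs that produced the most common (modal) output."""
    counts = Counter(normalize(r) for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

def uniqueness(responses):
    """Fraction of runs whose normalized output is distinct."""
    return len({normalize(r) for r in responses}) / len(responses)

# Five toy runs for one question; the first two differ only in formatting.
runs = [
    "Take with food.",
    "take with  food.",
    "Consult a doctor.",
    "Rest and fluids.",
    "See a physician.",
]
print(self_agreement(runs))  # 0.4
print(uniqueness(runs))      # 0.8
```

With ten runs per question, a self-agreement of 0.1 means every run produced a different output, so the metric has a floor of 1/N rather than zero.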

Efficiency metrics (throughput, latency) are also logged.

The architecture separates generation and scoring, ensuring that post-hoc metric investigations do not reintroduce sampling variance or engineering overhead, and supports auditability through rigorous result logging.
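One simple way to realize this separation is to append every generation to a log file once and have all metrics read from that file afterwards. The sketch below assumes a JSON Lines log and hypothetical field names; the actual file layout of the published pipeline is not described in this summary.

```python
import json
import tempfile
from pathlib import Path

def log_generation(path, record):
    """Append one generation record as a JSON line (generation stage)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_generations(path):
    """Read back all logged records (scoring stage); no model calls needed."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def score(records, metric):
    """Apply any metric(response, reference) to the stored outputs."""
    return [metric(r["response"], r["reference"]) for r in records]

log_path = Path(tempfile.mkdtemp()) / "generations.jsonl"
log_generation(log_path, {
    "model": "demo", "question_id": "q1", "run": 0,
    "response": "Rest.", "reference": "Rest and fluids.",
})
records = load_generations(log_path)
print(score(records, lambda out, ref: float(out == ref)))  # [0.0]
```

Because scoring only reads the log, a new metric can be added later without regenerating responses, which would otherwise introduce fresh sampling variance.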

Results

Model Accuracy

Semantic overlap, as captured by BERTScore (0.847–0.852), is tightly clustered across all models. Lexical metrics (Token F1) and the LLM-judge score differentiate the models:

Llama 3.1 8B leads in token-level F1 and BERTScore, reflecting stronger lexical alignment.
Gemma 3 12B attains the highest LLM-judge score, suggesting more holistic clinical appropriateness according to the rubric.
MedGemma 1.5 4B, although clinically fine-tuned, underperforms the larger baselines on both quality and reproducibility metrics. Because it has 2x to 3x fewer parameters than the baselines, this deficit cannot be attributed to domain-adaptive fine-tuning alone.

No model ever achieves exact match with the gold answers, indicating all outputs are paraphrased and suggesting minimal risk of dataset leakage.
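To make the gap between exact match and token-level F1 concrete, the sketch below implements both with whitespace tokenization. The tokenizer and the toy sentences are assumptions for illustration; the paper's implementation may tokenize or normalize differently.

```python
from collections import Counter

def tokens(text):
    """Lowercased whitespace tokens (a simplifying assumption)."""
    return text.lower().split()

def exact_match(pred, ref):
    """1.0 only if the token sequences are identical."""
    return float(tokens(pred) == tokens(ref))

def token_f1(pred, ref):
    """Harmonic mean of token precision and recall, counting duplicates."""
    p, r = tokens(pred), tokens(ref)
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)

ref = "aspirin reduces fever and pain"
pred = "aspirin lowers fever"
print(exact_match(pred, ref))         # 0.0
print(round(token_f1(pred, ref), 3))  # 0.5
```

A paraphrased answer scores zero on exact match yet still earns partial credit on token F1, which is why exact match alone cannot separate the models here.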

Output Stability

A striking finding is the very low reproducibility of outputs:

Self-agreement rates range from 0.12 (Llama 3.1 8B) to 0.20 (Gemma 3 12B), meaning the modal output appears, on average, only about 1–2 out of 10 runs per question.
Output uniqueness is nearly maximal: 87–97% of responses per model are unique across repeated runs, even with aggressive normalization and low-temperature sampling.

Pairwise response overlap is zero across models, highlighting divergent output spaces.
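The summary does not define pairwise response overlap precisely. One plausible reading, sketched below as an assumption, is the Jaccard overlap between the sets of normalized responses that two models produce for the same question; a value of zero means no response text is shared.

```python
from itertools import combinations

def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def pairwise_overlap(responses_by_model):
    """Jaccard overlap of normalized response sets for every model pair."""
    sets = {m: {normalize(r) for r in rs} for m, rs in responses_by_model.items()}
    result = {}
    for a, b in combinations(sorted(sets), 2):
        union = sets[a] | sets[b]
        result[(a, b)] = len(sets[a] & sets[b]) / len(union) if union else 0.0
    return result

# Toy responses for one question (hypothetical model names and texts).
demo = {
    "model_a": ["Rest and fluids.", "See a doctor."],
    "model_b": ["see a  doctor.", "Take ibuprofen."],
    "model_c": ["Apply ice."],
}
overlap = pairwise_overlap(demo)
print(overlap[("model_a", "model_c")])  # 0.0
```

In the toy data only `model_a` and `model_b` share a response after normalization, giving them an overlap of one third.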

Efficiency

Llama 3.1 8B achieves the fastest throughput (~43 tokens/sec), while MedGemma and Gemma are notably slower.
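Throughput of this kind can be derived from the timing fields that Ollama returns with each non-streamed response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below assumes those fields are used; how the authors actually measured throughput is not stated here, and the demo numbers are invented for illustration.

```python
def tokens_per_second(response):
    """Generated tokens divided by generation time, from Ollama timing fields."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0  # missing or zero duration: avoid division by zero
    return response.get("eval_count", 0) / (duration_ns / 1e9)

# Illustrative values only, not measurements from the paper.
demo = {"eval_count": 256, "eval_duration": 8_000_000_000}
print(tokens_per_second(demo))  # 32.0
```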

Quality–Reproducibility Tradeoff

A partial trade-off emerges: Llama 3.1 8B is fastest and most lexically faithful but least stable; Gemma 3 12B is most reproducible and achieves the highest judge score but at lower throughput; MedGemma trails on all axes.

Implications and Future Work

Clinical and Theoretical Implications

Output stability is not assured by low temperature alone, and single-pass benchmarks significantly overstate trustworthiness for safety-critical tasks. QA deployment in clinical contexts must incorporate post-hoc stabilization mechanisms (ensemble voting, confidence gating, or mandatory review) to mitigate stochasticity-induced risk.
Under the tested conditions, the larger general-purpose models outperform the smaller clinically fine-tuned model, so domain-adaptive training did not compensate for the parameter deficit here. Because scale and fine-tuning are confounded in this comparison, empirical head-to-head baseline evaluation is needed before domain deployment, and comparative studies must control for model size.
Use of an LLM judge, while scalable, introduces its own randomness. Holistic, rubric-based evaluation should incorporate repeated passes or deterministic alternatives for high-stakes selections.
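As an illustration of the stabilization mechanisms mentioned above, the sketch below combines voting over repeated runs with a confidence gate that routes low-agreement cases to review. The 0.5 threshold and the function name are assumptions, and the paper does not evaluate this or any other stabilization strategy.

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def stabilized_answer(responses, min_agreement=0.5):
    """Return the modal answer if enough runs agree, else flag for review."""
    counts = Counter(normalize(r) for r in responses)
    modal, freq = counts.most_common(1)[0]
    agreement = freq / len(responses)
    if agreement >= min_agreement:
        # Return the first raw response whose normalized form is the mode.
        answer = next(r for r in responses if normalize(r) == modal)
        return {"answer": answer, "agreement": agreement, "needs_review": False}
    return {"answer": None, "agreement": agreement, "needs_review": True}

stable = stabilized_answer(["Yes.", "yes.", "Yes.", "No."])
unstable = stabilized_answer(["A.", "B.", "C.", "D."])
print(stable["answer"], stable["agreement"])  # Yes. 0.75
print(unstable["needs_review"])               # True
```

Given the self-agreement levels reported above (at most 0.20), exact-string voting on free-text answers would send nearly every question to review, so a practical variant would likely need to vote on semantic similarity rather than identical text.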

Prospects for Future Research

Prominent avenues for extension include:

Matched-scale comparisons (e.g., MedGemma 1.5 4B vs. Gemma 3 4B) to decouple domain specialization from raw capacity.
An expanded model set (including Phi-3, Mistral-Medical, Qwen-Med) and benchmarks (USMLE, MedQA, MIMIC-derived questions).
Systematic studies on ensemble stabilization strategies and their cost–benefit trade-offs.
External calibration of LLM-judge scores against expert clinician judgments.

Conclusion

This study operationalizes reproducibility as a critical and quantifiable axis in medical QA evaluation for small, open LLMs. The comprehensive pipeline and empirical findings establish that:

Low-temperature decoding does not ensure answer reproducibility, thus necessitating robust post-processing for any clinical deployment.
At these parameter budgets, clinical fine-tuning does not offset model scale with respect to either quality or reproducibility.
Trade-offs between quality, stability, and efficiency must be weighed against deployment requirements.

The open-source nature of the pipeline invites adoption and iterative refinement, with the intention that reproducibility metrics become a de facto standard for LLM benchmarking in high-stakes domains such as clinical NLP.
