- The paper demonstrates that prompt engineering can drive competitive performance without supervised fine-tuning in clinical QA tasks.
- The study employs ensemble and task decomposition strategies to improve evidence extraction and patient-friendly answer generation.
- Comparative evaluations reveal that open-source LLMs approach the performance of proprietary models, supporting GDPR-compliant deployments in healthcare.
Evaluating LLMs for Low-Resource Clinical Question Answering: BIT.UA-AAUBS at ArchEHR-QA 2026
Task Definition and Motivation
This paper addresses the challenge of clinical QA and evidence grounding directly from EHRs in an extreme low-resource context, as specified in the ArchEHR-QA 2026 shared task (2605.03618). The task is divided into four sequential subtasks: transforming verbose patient inquiries into concise clinical queries (Subtask 1), extracting relevant evidence from EHR notes (Subtask 2), generating patient-friendly answers with explicit grounding (Subtask 3), and aligning generated answers with supporting evidence (Subtask 4). The core methodological constraint is the absence of annotated training data, driven by strict privacy considerations under GDPR and HIPAA, precluding supervised fine-tuning or synthetic data generation. The authors therefore focus on prompt engineering strategies and model selection across both proprietary and open-source LLMs, evaluating their ability to bridge semantic gaps in highly specialized clinical QA tasks.
Methodological Framework
The methodology centers on prompt engineering as a surrogate for supervised learning. Three major categories of prompts are used: constraint-based instructions that enforce structure and terminology, extraction-focused steps that isolate key information, and rephrasing strategies. The techniques include zero-shot prompting, few-shot in-context learning (ICL), lexical constraints, task decomposition, and chain-of-thought (CoT) pipelines. Ensemble aggregation, via majority voting and LLM-as-a-judge frameworks, is used to improve robustness and precision. All LLM components are run with fixed decoding settings chosen for determinism (temperature 0.0, top-p 0.95).
A systematic comparative evaluation is conducted between state-of-the-art proprietary models (e.g., Gemini, Claude, GPT-4 variants) and open-source architectures (e.g., Llama, Qwen, MedGemma), with prompt configurations specifically tailored for each subtask. Task decomposition consistently yields higher performance and clinical accuracy for Subtask 1, and ensemble methods provide stability in evidence retrieval (Subtask 2) and evidence-alignment (Subtask 4).
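The majority-voting aggregation described in this section can be illustrated with a short sketch. It is not the authors' implementation: the function name `majority_vote`, the representation of each model's output as a set of sentence IDs, and the strict-majority default threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def majority_vote(predictions: Iterable[Set[str]], min_votes: Optional[int] = None) -> Set[str]:
    """Return the sentence IDs selected by at least `min_votes` models."""
    runs = [set(p) for p in predictions]
    if not runs:
        return set()
    # Assumed default: a strict majority (more than half of the models).
    threshold = min_votes if min_votes is not None else len(runs) // 2 + 1
    # Each model votes at most once per sentence ID because runs are sets.
    votes = Counter(sid for run in runs for sid in run)
    return {sid for sid, n in votes.items() if n >= threshold}


# Illustrative outputs from three hypothetical models for one case.
model_outputs = [{"1", "2"}, {"2", "3"}, {"1", "2", "4"}]
print(sorted(majority_vote(model_outputs)))
```

In this toy case only sentences 1 and 2 reach the two-vote threshold, so the single-model selections 3 and 4 are dropped. Raising `min_votes` trades recall for precision, which is one way an ensemble can stabilize evidence selection.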
Results and Quantitative Evaluation
Subtask 1: Task decomposition prompts (extract-then-generate) substantially outperform direct generation approaches in internal validation. On the official leaderboard, the best-performing proprietary configuration (Claude Sonnet 4.5) achieves a score of 19.0 and ranks 13th, while open-source MedGemma models are competitive but not dominant. A notable discrepancy emerges between internal validation and leaderboard results, likely attributable to distribution shift or overfitting to the very small 20-case development set.
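A minimal sketch of the extract-then-generate pattern for Subtask 1 is given below. The prompt wording, the `LLM` callable interface, and the function name `extract_then_generate` are hypothetical and are not taken from the paper. The sketch only shows the two-step control flow: a first call isolates the clinically relevant facts, and a second call writes the concise query from those facts alone.

```python
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]

# Illustrative prompt templates (assumed wording, not the authors' prompts).
EXTRACT_PROMPT = (
    "List the clinically relevant facts and the core information need "
    "in the patient message below, one per line.\n\n"
    "Patient message:\n{question}"
)
GENERATE_PROMPT = (
    "Using only the facts below, write one concise clinical question "
    "in standard medical terminology.\n\n"
    "Facts:\n{facts}"
)


def extract_then_generate(patient_question: str, llm: LLM) -> str:
    """Two-step decomposition: extract key facts, then generate the query."""
    # Step 1: isolate the key information from the verbose patient message.
    facts = llm(EXTRACT_PROMPT.format(question=patient_question)).strip()
    # Step 2: generate the concise query from the extracted facts only.
    return llm(GENERATE_PROMPT.format(facts=facts)).strip()
```

Because the second call never sees the original message, anything missed in the extraction step cannot be recovered later, which is one plausible way decomposition can also hurt.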
Subtask 2: Proprietary models dominate evidence extraction, with ensemble majority voting achieving a Strict Micro F1 of 58.8 (11th place), 1.0 point below the median (59.8) and 4.9 points below the best competitor (63.7). Prompts that enforce strict minimality lower recall, and open-source models are highly sensitive to prompt variation, resulting in frequent parse failures.
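For readers unfamiliar with micro-averaged F1 over evidence sets, the sketch below shows how such a score can be computed. It is an illustration under assumptions rather than the official scorer: the function name `micro_f1` and the dictionary layout (case ID mapped to a set of sentence IDs) are hypothetical. The gold sets are taken as given, so whatever makes the metric "strict" is assumed to be applied already when the gold labels are built.

```python
from typing import Dict, Set


def micro_f1(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all cases, then compute F1 once."""
    tp = fp = fn = 0
    for case_id in set(gold) | set(pred):
        gold_ids = gold.get(case_id, set())
        pred_ids = pred.get(case_id, set())
        tp += len(gold_ids & pred_ids)   # correctly selected sentences
        fp += len(pred_ids - gold_ids)   # over-selection
        fn += len(gold_ids - pred_ids)   # missed evidence
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with two cases (illustrative data only).
gold = {"case1": {"1", "2"}, "case2": {"3"}}
pred = {"case1": {"1", "4"}, "case2": {"3"}}
print(round(micro_f1(gold, pred), 3))
```

Here the pooled counts are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3. Because counts are pooled before averaging, cases with many evidence sentences weigh more than short ones.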
Subtask 3: Patient-friendly answer generation is relatively robust to prompt variation; model selection is the primary driver of performance. The LLM-as-a-judge ensemble, composed of Gemini 2.5 Flash, Claude Sonnet 4.5, and Claude Opus 4.6, achieves a competitive score of 35.6 (3rd place). Open-source MedGemma-27B and a domain-fine-tuned Qwen3-8B narrow the gap considerably.
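One way an LLM-as-a-judge ensemble can select among candidate answers is sketched below. The summary does not specify how the judges' outputs are combined, so everything here is an assumption: the `Judge` callable interface, numeric scores, mean aggregation, and the function name `select_by_judges` are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical interface: a judge maps (question, candidate answer) to a score.
Judge = Callable[[str, str], float]


def select_by_judges(question: str, candidates: Sequence[str], judges: Sequence[Judge]) -> str:
    """Return the candidate answer with the highest mean judge score."""
    if not candidates or not judges:
        raise ValueError("need at least one candidate and one judge")

    def avg_score(answer: str) -> float:
        # Assumed aggregation: unweighted mean over all judges.
        return mean(judge(question, answer) for judge in judges)

    # On ties, max() keeps the first candidate in the original order.
    return max(candidates, key=avg_score)
```

In practice each judge would wrap a call to a different model with a scoring prompt. Every candidate is scored by every judge, so the number of model calls grows with the product of candidates and judges.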
Subtask 4: Evidence alignment exhibits minimal performance variance between ensemble, few-shot, and zero-shot configurations in proprietary models. The three-model ensemble achieves Micro F1 of 81.5 (1st place), with a small margin separating top submissions. Open-source MedGemma-27B achieves high scores (average 81.0), signaling practical viability for privacy-conscious clinical deployments.
Error Analysis
Error analysis reveals that task decomposition and constraint stacking occasionally induce negative interference, especially in Subtask 1 (question interpretation), leading to informal, poorly structured queries. In Subtask 2, over-selection (high false positives) and missed evidence (failure to capture treatments) remain unsolved, suggesting limitations in semantic reasoning. Subtask 3 suffers from hallucinations and context misinterpretation, highlighting the need for stricter grounding. Subtask 4 demonstrates difficulties in multi-evidence reasoning, with incomplete citation due to terminology mismatches.
Practical and Theoretical Implications
This study establishes that prompt engineering, when carefully applied, enables off-the-shelf LLMs to perform competitively (and, in Subtask 4, to obtain the top leaderboard score) in complex clinical QA subtasks without any weight updates or supervised fine-tuning. Proprietary models exhibit high robustness to prompt variations, with diminishing returns beyond baseline prompt optimization. Open-source, domain-adapted models like MedGemma-27B achieve near parity with closed-source alternatives, an essential development for real-world deployments under privacy constraints.
Ensembling techniques (majority voting, LLM-as-a-judge) deliver critical marginal gains for leaderboard competitiveness, but their operational costs and latency render them inefficient for scalable health informatics systems. Model selection remains the most effective lever, followed by careful prompt engineering for strict structural constraints.
The evidence supports three major conclusions:
- Domain-adapted open-source LLMs increasingly achieve performance near that of proprietary models, making them viable for GDPR-compliant clinical deployments.
- Prompt engineering is a critical lever in the absence of fine-tuning but has diminishing value in robust proprietary LLMs.
- Ensemble strategies are effective for competition, but impractical for real-time healthcare applications.
Limitations
The study is constrained by the extreme paucity of development samples (20 cases), increasing the risk of prompt and ensemble overfitting. Proprietary LLM evaluations are not exhaustive due to API cost constraints, and reproducibility is challenged by silent provider weight updates. The focused English-language evaluation limits generalizability to multilingual or non-English EHRs.
Ethical Considerations
Reliance on proprietary LLMs (requiring transmission of clinical data to external endpoints) is incompatible with privacy frameworks such as GDPR and HIPAA. Open-source models offer a pathway for local, secure deployment but still lag in some tasks. Generative LLMs in clinical settings must be strictly deployed as human-in-the-loop assistive tools due to persistent hallucination risks.
Conclusion
This paper demonstrates that prompt engineering and model selection enable LLMs to achieve highly competitive performance in low-resource clinical QA settings, securing top leaderboard positions in evidence citation alignment and patient-friendly answer generation. The narrowing gap between proprietary and open-source LLMs has significant implications for privacy-compliant healthcare AI, although evidence extraction and query formulation remain challenging for generative-only pipelines. Future work should address optimizing prompt strategies for open-source models, improving semantic reasoning and multi-evidence aggregation, and extending evaluation to multilingual clinical notes.