MuSeR: Enhancing Medical Context in LLMs
- MuSeR is a data-driven framework that refines LLM responses using facet-specific self-evaluation to address context gaps in medical queries.
- It integrates synthetic query generation, a three-stage Gen–Eval–Refine pipeline, and knowledge distillation to enhance decision-making and communication.
- Empirical results on HealthBench show significant gains in context-awareness and safety, outperforming baseline models in realistic clinical settings.
Multifaceted Self-Refinement (MuSeR) is a data-driven framework designed to address the limitations of LLMs in real-world medical scenarios, particularly their underperformance in context-awareness relative to human clinicians. By synthesizing diverse user contexts, leveraging facet-specific self-evaluation, and employing a structured answer refinement process, MuSeR systematically enhances an LLM’s ability to recognize and fill context gaps, tailor communication, and improve overall clinical safety (Zhou et al., 13 Nov 2025).
1. Motivation and Problem Scope
Standard medical QA benchmarks typically present well-formed questions containing all relevant contextual details, such as age, medical history, and comorbidities. This fails to capture the complexity and “messiness” of actual clinical interactions, which often omit critical facts or introduce ambiguity. Off-the-shelf LLMs address each prompt in isolation and, as a result, struggle with:
- Recognizing missing or ambiguous information required for safe recommendations,
- Adapting language and detail to different users (e.g., layperson, clinician, or caregiver),
- Identifying and explicitly flagging medical and ethical risks.
MuSeR builds on the insight that context sensitivity can be bootstrapped through self-refinement. By simulating realistic, noisy queries, having the model critique its own answers along multiple axes, and incorporating these structured critiques into subsequent supervised fine-tuning, an LLM can internalize the habits needed for human-level medical context-awareness (Zhou et al., 13 Nov 2025).
2. Faceted Context-Awareness: Definitions and Sub-Metrics
MuSeR decomposes “context-awareness” into three orthogonal but complementary facets, each mapped to implicit sub-metrics guiding the self-evaluation stage:
| Facet | Objective | Implicit Sub-Metrics |
|---|---|---|
| Decision-Making (f₁) | Identify missing or ambiguous details essential to safety | Completeness of info, follow-up question generation |
| Communication (f₂) | Adapt terminology, tone, and detail to user identity and preferences | Register, brevity vs. depth, coherence with user style |
| Safety (f₃) | Flag potential clinical risks or ethical boundaries | Harm avoidance, uncertainty signaling, standard-of-care (SoC) reference |
Each facet is mapped to axes in the HealthBench rubric but is repurposed in MuSeR to drive model-internal critique and revision, not just final scoring (Zhou et al., 13 Nov 2025).
3. Synthetic Attribute-Conditioned Query Generation
Real-world medical queries are modeled as samples from an unknown latent distribution $p^{*}(q)$. MuSeR constructs an approximate generator parameterized by a vector of discrete attributes $a$, formalizing:

$$p(q) = \sum_{a} p(a)\, p(q \mid a)$$

where $p(a)$ is a manually defined prior over attributes (e.g., Role: Patient 0.7, Caregiver 0.2, Doctor 0.1) and $p(q \mid a)$ is the attribute-conditioned query generator. An open-source LLM (DeepSeek-V3) is prompted with templates enumerating attribute values to generate realistic queries. This process is repeated at scale to yield a broad synthetic dataset reflecting realistic variations in medical query formulation (Zhou et al., 13 Nov 2025).
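The attribute-conditioned sampling step can be illustrated with a minimal sketch. Only the Role weights come from the text; the `detail` attribute, the prompt template, the independence of attributes, and all function names are assumptions made for illustration. The resulting prompt would be sent to the query-generating LLM, which is not called here.

```python
import random

# Attribute priors. Only the "role" weights are taken from the text;
# the "detail" attribute and its weights are hypothetical.
ATTRIBUTE_PRIOR = {
    "role": {"patient": 0.7, "caregiver": 0.2, "doctor": 0.1},
    "detail": {"complete": 0.5, "missing_history": 0.5},
}

# Hypothetical template that enumerates the sampled attribute values.
PROMPT_TEMPLATE = (
    "Write a realistic medical question asked by a {role}. "
    "Level of background detail: {detail}."
)


def sample_attributes(rng):
    """Draw one value per attribute from its prior (attributes assumed independent)."""
    attrs = {}
    for name, dist in ATTRIBUTE_PRIOR.items():
        values = list(dist.keys())
        weights = list(dist.values())
        attrs[name] = rng.choices(values, weights=weights, k=1)[0]
    return attrs


def build_generation_prompt(attrs):
    """Fill the template; the result is what the query-generating LLM would receive."""
    return PROMPT_TEMPLATE.format(**attrs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
prompts = [build_generation_prompt(sample_attributes(rng)) for _ in range(1000)]
```

Sampling attributes first and generating text second keeps the mix of user types under explicit control, which is the purpose of the manually defined prior.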
4. Three-Stage Gen–Eval–Refine Pipeline
Given a synthetic query $q$, the backbone LLM executes a three-stage procedure:
- Initial Draft Generation: the model produces a pair $(r, y_0)$, where $r$ is the chain-of-thought (reasoning trace) and $y_0$ is the initial response.
- Facet-Wise Self-Evaluation: for each facet $f_i$ ($i = 1, 2, 3$), the model generates a critique $c_i$ of its draft. For example, for $f_1$: "We lack the patient’s medication history."
- Direct Answer Refinement: the model produces a refined answer $\hat{y}$ from the initial answer and the aggregated critiques. Explicit, facet-driven refinement outperforms naive continuation of the chain of thought (+2.9% overall, +6.3% on hard cases on HealthBench).
No explicit self-refinement loss is minimized, but the refinement process can be viewed as aligning with a hypothetical oracle via KL minimization, constrained by the three facet evaluations (Zhou et al., 13 Nov 2025).
5. Supervised Fine-Tuning and Knowledge Distillation
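After assembling tuples $(q, \tilde{r}, \hat{y})$, where $\tilde{r}$ is the concatenation of reasoning and facet critiques and $\hat{y}$ is the refined answer, MuSeR applies standard next-token SFT with cross-entropy loss:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\left(z_t \mid q, z_{<t}\right), \qquad z = [\tilde{r};\, \hat{y}]$$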
Training is conducted for 6 epochs using AdamW (batch size 16, cosine scheduler, 10% warm-up).
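A minimal pure-Python sketch of the two training ingredients named above: next-token cross-entropy over the target tokens, and a cosine learning-rate schedule with 10% warm-up. Several choices are assumptions not stated in the text: masking the query tokens out of the loss, averaging rather than summing over positions, a linear warm-up shape, and the `peak_lr` argument, which is a placeholder for a value not given here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sft_loss(step_logits, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    step_logits[t] holds the logits used to predict target_ids[t].
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay towards min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Toy example: vocabulary of 3 tokens, 4 positions.
# The first two positions stand for query tokens (masked out); the last two
# stand for reasoning/critique/answer tokens that contribute to the loss.
step_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [1, 2, 0, 1]
loss_mask = [0, 0, 1, 1]
loss = sft_loss(step_logits, target_ids, loss_mask)
```

A real run would use a deep-learning framework's built-in loss and scheduler; the sketch only makes the arithmetic explicit.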
To further enhance medical domain knowledge, MuSeR employs an initial query-guided knowledge distillation stage in which answers from a stronger teacher model (GPT-OSS-120B) supervise the backbone (e.g., Qwen3-32B) using a temperature-scaled KL divergence. This procedure boosts baseline HealthBench performance from 46.1% to 56.6% prior to multifaceted self-refinement (Zhou et al., 13 Nov 2025).
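For readers unfamiliar with temperature-based distillation, the following sketch computes a KL divergence between temperature-softened teacher and student distributions at a single token position. It assumes token-level teacher logits are available, which the text does not confirm. The default `temperature=2.0` is a placeholder rather than the paper's setting, and the common extra scaling by the squared temperature is omitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (higher means a flatter distribution)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical logits over a 3-token vocabulary at one position.
teacher = [3.0, 1.0, 0.0]
student = [1.0, 1.0, 1.0]
divergence = kd_kl(teacher, student)
```

The divergence is zero when the student matches the teacher and grows as the two distributions move apart, which is what makes it usable as a training signal.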
6. Empirical Outcomes and Axial Analysis
Evaluation is conducted on the HealthBench dataset (5,000 physician-annotated dialogues, 7 themes, 5 axes—accuracy, completeness, context-awareness, communication, instruction-following; hard subset of 1,000 challenging cases):
| Model | HealthBench (%) | Hard Subset (%) |
|---|---|---|
| Qwen3-32B (base) | 46.1 | 12.0 |
| + Query-KD | 56.6 | 31.5 |
| + KD + MuSeR | 63.8 | 43.1 |
| Teacher (GPT-OSS-120B) | 57.6 | — |
Notably, MuSeR yields a +19.4% increase on the context-awareness axis, with theme-specific gains in “context seeking” (+7.6%), “global health” (+5.0%), and “hedging” (+4.0%). Ablations confirm that decision-making awareness ($f_1$) is the most impactful facet (removing it costs 2.7% in performance). Qwen3-32B+MuSeR establishes a new state of the art for open-source models, exceeding its own teacher on HealthBench (Zhou et al., 13 Nov 2025).
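The contribution of each stage can be read off the results table by subtraction. The short sketch below does that arithmetic in percentage points, using only the scores reported in the table; the variable and function names are illustrative.

```python
# (overall %, hard-subset %) as reported in the results table.
scores = {
    "Qwen3-32B (base)": (46.1, 12.0),
    "+ Query-KD": (56.6, 31.5),
    "+ KD + MuSeR": (63.8, 43.1),
}
teacher_overall = 57.6  # the teacher's hard-subset score is not reported


def gain(after, before):
    """Difference in percentage points for (overall, hard subset)."""
    return tuple(round(a - b, 1) for a, b in zip(scores[after], scores[before]))


kd_gain = gain("+ Query-KD", "Qwen3-32B (base)")    # (10.5, 19.5)
muser_gain = gain("+ KD + MuSeR", "+ Query-KD")     # (7.2, 11.6)
teacher_margin = round(scores["+ KD + MuSeR"][0] - teacher_overall, 1)  # 6.2
```

Distillation accounts for the larger share of the overall gain, while the self-refinement stage adds a further 7.2 points overall and 11.6 points on the hard subset, and it is this second stage that lifts the student above its teacher.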
7. Insights, Limitations, and Prospective Work
- Both the knowledge distillation and self-refinement stages are necessary: the former for raw medical proficiency, the latter for context-awareness.
- Direct, facet-driven answer refinement outperforms chain-of-thought continuation.
- The procedure’s cost-effectiveness and scalability stem from its reliance on synthetic queries and model-internal critique, without requiring private EHR or real patient data.
Current limitations include dependence on synthetic queries and on the foundation model’s pre-existing knowledge. A performance gap of approximately 3% remains relative to the top closed model (e.g., GPT-5). Future directions include extending the multifaceted refinement schema to other high-stakes domains (legal, financial), integrating RLHF or real-world dialogue signals for richer self-evaluation, and domain-specific pretraining to close residual gaps (Zhou et al., 13 Nov 2025).
MuSeR demonstrates that explicit, multi-criteria self-refinement—when scaled across synthetic yet realistic contexts—substantially enhances LLMs’ context-awareness for medical applications, and offers a general recipe for context-sensitive LLM adaptation across domains.