MuSeR: Enhancing Medical Context in LLMs

Updated 20 November 2025

MuSeR is a data-driven framework that refines LLM responses using facet-specific self-evaluation to address context gaps in medical queries.
It integrates synthetic query generation, a three-stage Gen–Eval–Refine pipeline, and knowledge distillation to enhance decision-making and communication.
Empirical results on HealthBench show significant gains in context-awareness and safety, outperforming baseline models in realistic clinical settings.

Multifaceted Self-Refinement (MuSeR) is a data-driven framework designed to address the limitations of LLMs in real-world medical scenarios, particularly their underperformance in context-awareness relative to human clinicians. By synthesizing diverse user contexts, leveraging facet-specific self-evaluation, and employing a structured answer refinement process, MuSeR systematically enhances an LLM’s ability to recognize and fill context gaps, tailor communication, and improve overall clinical safety (Zhou et al., 13 Nov 2025).

1. Motivation and Problem Scope

Standard medical QA benchmarks typically present well-formed questions containing all relevant contextual details, such as age, medical history, and comorbidities. This fails to capture the complexity and “messiness” of actual clinical interactions, which often omit critical facts or introduce ambiguity. Off-the-shelf LLMs address each prompt in isolation and, as a result, struggle with:

Recognizing missing or ambiguous information required for safe recommendations,
Adapting language and detail to different users (e.g., layperson, clinician, or caregiver),
Identifying and explicitly flagging medical and ethical risks.

MuSeR builds on the insight that context sensitivity can be bootstrapped through self-refinement. The framework simulates realistic, noisy queries, has the model critique its own answers along multiple axes, and feeds these structured critiques into subsequent supervised fine-tuning, so that the LLM internalizes the habits that human-level medical context-awareness requires (Zhou et al., 13 Nov 2025).

2. Faceted Context-Awareness: Definitions and Sub-Metrics

MuSeR decomposes “context-awareness” into three orthogonal but complementary facets, each mapped to implicit sub-metrics guiding the self-evaluation stage:

Facet	Objective	Implicit Sub-Metrics
Decision-Making (f₁)	Identify missing or ambiguous details essential to safety	Completeness of info, follow-up question generation
Communication (f₂)	Adapt terminology, tone, and detail to user identity and preferences	Register, brevity vs. depth, coherence with user style
Safety (f₃)	Flag potential clinical risks or ethical boundaries	Harm avoidance, uncertainty signaling, SoC reference

Each facet is mapped to axes in the HealthBench rubric but is repurposed in MuSeR to drive model-internal critique and revision, not just final scoring (Zhou et al., 13 Nov 2025).

3. Synthetic Attribute-Conditioned Query Generation

Real-world medical queries are modeled as samples from an unknown latent distribution $P^*(q)$. MuSeR constructs an approximate generator $G$ parameterized by a vector of discrete attributes $a = (\text{role},\ \text{region},\ \text{disease code},\ \text{intent},\ \text{vagueness},\ \text{completeness},\ \text{style})$, formalizing:

$P^*(q) \approx P_G(q) = \sum_a P_{\text{Attr}}(a) \cdot G(q|a)$

where $P_{\text{Attr}}(a)$ is a manually defined prior over attributes (e.g., Role: Patient 0.7, Caregiver 0.2, Doctor 0.1). An open-source LLM (DeepSeek-V3) is prompted with templates enumerating attribute values to generate realistic queries. This process is repeated at scale ($N = 100\,000$) to yield a broad synthetic dataset reflecting realistic variations in medical query formulation (Zhou et al., 13 Nov 2025).
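The mixture above can be sampled ancestrally: draw an attribute vector from the prior, then condition the generator on it. The following Python sketch illustrates that idea under stated assumptions. Only the role weights (0.7 / 0.2 / 0.1) come from the text; the other attribute values, their uniform weights, the independence between attributes, the template wording, and the function names are hypothetical. The LLM call itself is omitted, so the sketch stops at the prompt.

```python
import random

# Prior over attribute values. Only the "role" weights come from the text;
# every other value and weight below is a hypothetical placeholder.
ATTRIBUTE_PRIOR = {
    "role": {"patient": 0.7, "caregiver": 0.2, "doctor": 0.1},
    "vagueness": {"low": 0.5, "high": 0.5},
    "completeness": {"complete": 0.5, "partial": 0.5},
    "style": {"formal": 0.5, "casual": 0.5},
}

def sample_attributes(rng):
    """Draw one attribute vector a ~ P_Attr(a), assuming independent attributes."""
    return {
        name: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for name, dist in ATTRIBUTE_PRIOR.items()
    }

def build_prompt(attrs):
    """Render a generation prompt for G(q | a); the wording is an assumption."""
    lines = [f"- {name}: {value}" for name, value in attrs.items()]
    return (
        "Write one realistic medical question with these attributes:\n"
        + "\n".join(lines)
    )

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_attributes(rng) for _ in range(10_000)]
patient_share = sum(s["role"] == "patient" for s in samples) / len(samples)
example_prompt = build_prompt(samples[0])
```

With enough draws, `patient_share` should approach the 0.7 prior, which is a quick sanity check that the synthetic set follows the intended attribute mix.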

4. Three-Stage Gen–Eval–Refine Pipeline

Given a synthetic query $q$, the backbone LLM $M$ executes a three-stage procedure:

Initial Draft Generation:

$(t_0, r_0) \leftarrow \text{Gen}(M, q)$, where $t_0$ is the chain-of-thought (reasoning trace) and $r_0$ is the initial response.

Facet-Wise Self-Evaluation:

For each facet $f_i$ ($i = 1, 2, 3$), the model generates a critique $c_i$: $c_i \leftarrow \text{Eval}(M, q, t_0, r_0, f_i)$. For example, for $f_1$: "We lack the patient’s medication history."

Direct Answer Refinement:

$r_1 \leftarrow \text{Refine}(M, q, t_0, r_0, \{c_i\}_{i=1}^{3})$. The model refines the initial answer based on the aggregated critiques. Explicit, facet-driven refinement improves on naive continuation of the chain-of-thought (+2.9% overall, +6.3% on hard cases on HealthBench).

No explicit self-refinement loss is minimized, but the refinement process can be viewed as aligning the model’s answer distribution with that of a hypothetical oracle via KL minimization, constrained by the three facet evaluations (Zhou et al., 13 Nov 2025).
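The control flow of the three stages can be sketched in a few lines of Python. This is an illustrative outline, not the authors' implementation: the prompt wording, the `gen_eval_refine` and `toy_model` names, and the single `model` callable are assumptions, and the separate reasoning trace is folded into the draft for brevity. A stub stands in for the LLM so the sketch runs offline.

```python
from typing import Callable, Dict, Tuple

# The three facets from the table in Section 2; the question wording is assumed.
FACETS = {
    "f1": "decision-making: which essential details are missing or ambiguous?",
    "f2": "communication: does the wording fit the user's role and preferences?",
    "f3": "safety: which clinical or ethical risks should be flagged?",
}

def gen_eval_refine(
    model: Callable[[str], str], query: str
) -> Tuple[str, Dict[str, str], str]:
    """Run draft -> one critique per facet -> refined answer."""
    # Stage 1: initial draft.
    draft = model(f"Question: {query}\nAnswer:")
    # Stage 2: one self-critique per facet.
    critiques = {
        key: model(
            f"Question: {query}\nDraft: {draft}\nCritique the draft on {desc}"
        )
        for key, desc in FACETS.items()
    }
    # Stage 3: rewrite the answer using all critiques together.
    critique_text = "\n".join(f"[{k}] {c}" for k, c in critiques.items())
    refined = model(
        f"Question: {query}\nDraft: {draft}\n"
        f"Critiques:\n{critique_text}\nRewrite the answer:"
    )
    return draft, critiques, refined

def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    if "Rewrite the answer" in prompt:
        return "refined answer"
    if "Critique the draft" in prompt:
        return "critique"
    return "draft answer"

draft, critiques, refined = gen_eval_refine(toy_model, "Can I take ibuprofen?")
```

Swapping `toy_model` for a real inference call leaves the pipeline unchanged, which is the main point of keeping the model behind a plain callable.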

5. Supervised Fine-Tuning and Knowledge Distillation

After assembling tuples $(q, t_1, r_1)$, where $t_1$ is the concatenation of the reasoning trace and the facet critiques and $r_1$ is the refined answer, MuSeR applies standard next-token SFT with cross-entropy loss:

$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{k=1}^{|y|} \log P_\theta(y_k \mid q, y_{<k}), \quad y = [t_1; r_1]$

Training is conducted for 6 epochs using AdamW with a batch size of 16, a cosine learning-rate scheduler, and 10% warm-up.
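A cosine schedule with 10% warm-up can be written out directly. The sketch below is one common formulation, assuming linear warm-up and decay to zero; the peak learning rate and the total step count are placeholders rather than the paper's settings.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10):
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up linearly; the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

PEAK_LR = 1e-5      # hypothetical placeholder, not taken from the source
TOTAL_STEPS = 1000  # hypothetical placeholder
schedule = [lr_at_step(s, TOTAL_STEPS, PEAK_LR) for s in range(TOTAL_STEPS)]
```

The warm-up phase limits large early updates while the optimizer statistics stabilize, and the cosine tail lowers the rate smoothly over the remaining steps.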

To further enhance medical domain knowledge, MuSeR employs an initial query-guided knowledge distillation stage in which answers from a stronger teacher model (GPT-OSS-120B) supervise the backbone (e.g., Qwen3-32B) using a temperature-scaled KL divergence. This procedure boosts baseline HealthBench performance from 46.1% to 56.6% prior to multifaceted self-refinement (Zhou et al., 13 Nov 2025).
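For readers unfamiliar with temperature-scaled distillation, the sketch below computes a KL divergence between softened teacher and student distributions for a single token position. It is a generic textbook formulation, not a confirmed detail of MuSeR: the logit-level setup, the temperature value, and the function names are assumptions, and the optional $T^2$ rescaling used in some distillation recipes is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

TEMPERATURE = 2.0  # hypothetical placeholder, not taken from the source
teacher = [2.0, 0.5, -1.0]
loss_same = kd_kl_loss(teacher, teacher, TEMPERATURE)          # identical logits
loss_diff = kd_kl_loss(teacher, [0.0, 0.0, 0.0], TEMPERATURE)  # uniform student
```

A higher temperature flattens both distributions, so the student also receives signal from the teacher's lower-probability tokens.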

6. Empirical Outcomes and Axial Analysis

Evaluation is conducted on the HealthBench dataset (5,000 physician-annotated dialogues, 7 themes, 5 axes—accuracy, completeness, context-awareness, communication, instruction-following; hard subset of 1,000 challenging cases):

Model	HealthBench (%)	Hard Subset (%)
Qwen3-32B (base)	46.1	12.0
+ Query-KD	56.6	31.5
+ KD + MuSeR	63.8	43.1
Teacher (GPT-OSS-120B)	57.6	—

Notably, MuSeR yields a +19.4% increase on the context-awareness axis, with theme-specific gains in “context seeking” (+7.6%), “global health” (+5.0%), and “hedging” (+4.0%). Ablations confirm that decision-making awareness ($f_1$) is the most impactful facet (its removal costs 2.7% in performance). Qwen3-32B+MuSeR establishes a new state of the art for open-source models, exceeding its own teacher on HealthBench (Zhou et al., 13 Nov 2025).
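The stage-wise contributions can be read off the results table with simple arithmetic. The short Python sketch below uses only the table's numbers; the dictionary keys and variable names are illustrative.

```python
# (overall %, hard subset %) taken from the results table.
scores = {
    "base": (46.1, 12.0),
    "query_kd": (56.6, 31.5),
    "kd_muser": (63.8, 43.1),
}
TEACHER_OVERALL = 57.6  # the teacher's hard-subset score is not reported

def gain(later, earlier):
    """Percentage-point difference between two rows, as (overall, hard)."""
    return tuple(round(a - b, 1) for a, b in zip(scores[later], scores[earlier]))

kd_gain = gain("query_kd", "base")         # (10.5, 19.5)
muser_gain = gain("kd_muser", "query_kd")  # (7.2, 11.6)
total_gain = gain("kd_muser", "base")      # (17.7, 31.1)
margin_over_teacher = round(scores["kd_muser"][0] - TEACHER_OVERALL, 1)  # 6.2
```

Read this way, distillation accounts for the larger share of the overall gain, while self-refinement adds a further 7.2 points overall and 11.6 points on the hard subset.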

7. Insights, Limitations, and Prospective Work

Both the knowledge distillation and self-refinement stages are necessary: the former for raw medical proficiency, the latter for context-awareness.
Direct, facet-driven answer refinement outperforms chain-of-thought continuation.
The procedure’s cost-effectiveness and scalability stem from its reliance on synthetic queries and model-internal critique, without requiring private EHR or real patient data.

Current limitations include dependence on synthetic queries and on the foundation model’s pre-existing knowledge. A performance gap of approximately 3% remains relative to the top closed model (e.g., GPT-5). Future directions include extending the multifaceted refinement schema to other high-stakes domains (legal, financial), integrating RLHF or real-world dialogue signals for richer self-evaluation, and applying domain-specific pretraining to close residual gaps (Zhou et al., 13 Nov 2025).

MuSeR demonstrates that explicit, multi-criteria self-refinement—when scaled across synthetic yet realistic contexts—substantially enhances LLMs’ context-awareness for medical applications, and offers a general recipe for context-sensitive LLM adaptation across domains.

References (1)

Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning (2025)
