MuSeR: Enhancing Medical Context in LLMs

Updated 20 November 2025
  • MuSeR is a data-driven framework that refines LLM responses using facet-specific self-evaluation to address context gaps in medical queries.
  • It integrates synthetic query generation, a three-stage Gen–Eval–Refine pipeline, and knowledge distillation to enhance decision-making and communication.
  • Empirical results on HealthBench show significant gains in context-awareness and safety, outperforming baseline models in realistic clinical settings.

Multifaceted Self-Refinement (MuSeR) is a data-driven framework designed to address the limitations of LLMs in real-world medical scenarios, particularly their underperformance in context-awareness relative to human clinicians. By synthesizing diverse user contexts, leveraging facet-specific self-evaluation, and employing a structured answer refinement process, MuSeR systematically enhances an LLM’s ability to recognize and fill context gaps, tailor communication, and improve overall clinical safety (Zhou et al., 13 Nov 2025).

1. Motivation and Problem Scope

Standard medical QA benchmarks typically present well-formed questions containing all relevant contextual details, such as age, medical history, and comorbidities. This fails to capture the complexity and “messiness” of actual clinical interactions, which often omit critical facts or introduce ambiguity. Off-the-shelf LLMs address each prompt in isolation and, as a result, struggle with:

  • Recognizing missing or ambiguous information required for safe recommendations,
  • Adapting language and detail to different users (e.g., layperson, clinician, or caregiver),
  • Identifying and explicitly flagging medical and ethical risks.

MuSeR builds on the insight that context sensitivity can be bootstrapped through self-refinement. The framework simulates realistic, noisy queries, has the model critique its own answers along multiple axes, and incorporates these structured critiques into subsequent supervised fine-tuning, so that an LLM internalizes the habits needed for human-level medical context-awareness (Zhou et al., 13 Nov 2025).

2. Faceted Context-Awareness: Definitions and Sub-Metrics

MuSeR decomposes “context-awareness” into three orthogonal but complementary facets, each mapped to implicit sub-metrics guiding the self-evaluation stage:

| Facet | Objective | Implicit Sub-Metrics |
|---|---|---|
| Decision-Making (f₁) | Identify missing or ambiguous details essential to safety | Completeness of info, follow-up question generation |
| Communication (f₂) | Adapt terminology, tone, and detail to user identity and preferences | Register, brevity vs. depth, coherence with user style |
| Safety (f₃) | Flag potential clinical risks or ethical boundaries | Harm avoidance, uncertainty signaling, SoC reference |

Each facet is mapped to axes in the HealthBench rubric but is repurposed in MuSeR to drive model-internal critique and revision, not just final scoring (Zhou et al., 13 Nov 2025).

3. Synthetic Attribute-Conditioned Query Generation

Real-world medical queries are modeled as samples from an unknown latent distribution $P^*(q)$. MuSeR constructs an approximate generator $G$ parameterized by a vector of discrete attributes $a = (\text{role},\ \text{region},\ \text{disease code},\ \text{intent},\ \text{vagueness},\ \text{completeness},\ \text{style})$, formalizing:

$$P^*(q) \approx P_G(q) = \sum_a P_{\text{Attr}}(a) \cdot G(q \mid a)$$

where $P_{\text{Attr}}(a)$ is a manually defined prior over attributes (e.g., Role: Patient 0.7, Caregiver 0.2, Doctor 0.1). A smaller open-source LLM (DeepSeek-V3) is prompted with templates enumerating attribute values to generate realistic queries. This process is repeated at scale ($N = 100\,000$) to yield a broad synthetic dataset reflecting realistic variations in medical query formulation (Zhou et al., 13 Nov 2025).
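The sampling step can be sketched as follows. This is a minimal illustration, not the authors' generator: only the role weights come from the text, while every other attribute value and weight, the independence assumption across attributes, and the prompt wording are hypothetical placeholders. The actual LLM call $G(q \mid a)$ is left out.

```python
import random

# Prior over attribute values. Only the "role" weights come from the text;
# all other values and weights are hypothetical placeholders.
ATTRIBUTE_PRIOR = {
    "role": {"Patient": 0.7, "Caregiver": 0.2, "Doctor": 0.1},
    "region": {"North America": 0.5, "South Asia": 0.5},    # assumed
    "disease_code": {"J45": 0.5, "E11": 0.5},               # assumed, illustrative codes
    "intent": {"diagnosis": 0.5, "treatment": 0.5},         # assumed
    "vagueness": {"low": 0.5, "high": 0.5},                 # assumed
    "completeness": {"complete": 0.4, "partial": 0.6},      # assumed
    "style": {"formal": 0.3, "casual": 0.7},                # assumed
}


def sample_attributes(rng):
    """Draw one attribute vector a ~ P_Attr(a), assuming independent attributes."""
    return {
        name: rng.choices(list(dist.keys()), weights=list(dist.values()), k=1)[0]
        for name, dist in ATTRIBUTE_PRIOR.items()
    }


def build_prompt(attrs):
    """Render a hypothetical generator prompt for one attribute vector."""
    lines = [f"- {name}: {value}" for name, value in attrs.items()]
    return (
        "Write one realistic medical query with these attributes:\n"
        + "\n".join(lines)
    )


rng = random.Random(0)
attrs = sample_attributes(rng)
print(build_prompt(attrs))
```

Repeating `sample_attributes` and passing each prompt to a generator model would approximate drawing from $P_G(q)$; the mixture sum over $a$ is realized implicitly by the sampling.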

4. Three-Stage Gen–Eval–Refine Pipeline

Given a synthetic query $q$, the backbone LLM $M$ executes a three-stage procedure:

  1. Initial Draft Generation:

$(t_0, r_0) \leftarrow \text{Gen}(M, q)$, where $t_0$ is the chain-of-thought (reasoning trace) and $r_0$ is the initial response.

  2. Facet-Wise Self-Evaluation:

For each facet $f_i$, the model generates a critique $s_i$: $s_i \leftarrow \text{Eval}(M; q, r_0; f_i)$. For example, for $f_1$: "We lack the patient’s medication history."

  3. Direct Answer Refinement:

$r' \leftarrow \text{Refine}(M; q, r_0, \{s_i\})$. The model refines the initial answer based on the aggregated critiques. Explicit, facet-driven refinement leads to improved performance over naive continuation of chain-of-thought (+2.9% overall, +6.3% on hard cases on HealthBench).

No explicit self-refinement loss is minimized, but the refinement process can be viewed as aligning $P_R(\cdot \mid q)$ with a hypothetical oracle $P^*(\cdot \mid q)$ via KL minimization, constrained by the three facet evaluations (Zhou et al., 13 Nov 2025).
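The three stages above can be sketched as plain function composition. This is an illustrative outline under stated assumptions: the model is abstracted as any callable from prompt string to output string, and the prompt wording, facet instructions, and function names (`gen`, `evaluate`, `refine`, `run_pipeline`, `stub_model`) are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]  # assumed interface: prompt in, text out

# Hypothetical one-line instructions for the three facets f1, f2, f3.
FACETS: Dict[str, str] = {
    "decision_making": "List missing or ambiguous details needed for a safe answer.",
    "communication": "Assess whether terminology, tone, and detail fit the user.",
    "safety": "Flag potential clinical risks or ethical boundaries.",
}


def gen(model: Model, query: str) -> Tuple[str, str]:
    """Stage 1: produce a reasoning trace t0 and an initial response r0."""
    t0 = model(f"Think step by step about: {query}")
    r0 = model(f"Question: {query}\nReasoning: {t0}\nAnswer:")
    return t0, r0


def evaluate(model: Model, query: str, response: str, instruction: str) -> str:
    """Stage 2: produce one facet-specific critique s_i of the draft."""
    return model(f"Question: {query}\nDraft: {response}\nCritique task: {instruction}")


def refine(model: Model, query: str, response: str, critiques: Dict[str, str]) -> str:
    """Stage 3: rewrite the draft using all critiques at once."""
    joined = "\n".join(f"[{name}] {text}" for name, text in critiques.items())
    return model(
        f"Question: {query}\nDraft: {response}\nCritiques:\n{joined}\nRevised answer:"
    )


def run_pipeline(model: Model, query: str) -> Dict[str, str]:
    """Run Gen, Eval (once per facet), and Refine; return one training tuple."""
    t0, r0 = gen(model, query)
    critiques = {name: evaluate(model, query, r0, instr) for name, instr in FACETS.items()}
    r_prime = refine(model, query, r0, critiques)
    # t' concatenates the reasoning trace with the facet critiques.
    t_prime = "\n".join([t0, *critiques.values()])
    return {"q": query, "t_prime": t_prime, "r_prime": r_prime}


def stub_model(prompt: str) -> str:
    """Placeholder standing in for a real LLM call."""
    return f"<output for {len(prompt)}-char prompt>"


example = run_pipeline(stub_model, "I have a headache, what should I take?")
print(sorted(example.keys()))
```

With three facets, one pass costs six model calls per query (two for the draft, three critiques, one refinement), which is worth keeping in mind when scaling to a large synthetic query set.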

5. Supervised Fine-Tuning and Knowledge Distillation

After assembling tuples $(q, t', r')$, where $t'$ is the concatenation of reasoning and facet critiques and $r'$ is the refined answer, MuSeR applies standard next-token SFT with cross-entropy loss:

$$L_{\text{fine}}(\theta) = -\mathbb{E}_{(q, t', r')} \left[ \log P_{\theta}(r' \mid q, t') \right]$$
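To make the loss concrete, the sketch below computes the negative log-likelihood of one sequence from per-position logits, counting only positions marked as response tokens. It follows the formula as written, where $q$ and $t'$ act as conditioning context; the function names and the toy three-token vocabulary are assumptions, and real training code would typically use a framework's built-in cross-entropy and average over tokens or batches.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(step_logits, target_ids, loss_mask):
    """Sum of -log P(target) over positions where loss_mask is 1.

    Positions with mask 0 (context tokens from q and t') contribute nothing,
    so the result corresponds to -log P(r' | q, t') for a single example.
    """
    total = 0.0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
    return total


# Toy example: 4 positions, vocabulary of 3, uniform predictions everywhere.
step_logits = [[0.0, 0.0, 0.0]] * 4
target_ids = [0, 1, 2, 1]
loss_mask = [0, 0, 1, 1]  # the last two positions are the response r'

nll = sequence_nll(step_logits, target_ids, loss_mask)
print(nll)  # two masked-in tokens, each contributing ln(3)
```

Averaging `sequence_nll` over sampled tuples gives a Monte Carlo estimate of the expectation in $L_{\text{fine}}(\theta)$.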

Training is conducted for 6 epochs using AdamW (learning rate $5 \times 10^{-6}$, batch size 16, cosine scheduler, 10% warm-up).
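The learning-rate schedule can be illustrated with a small function. The text specifies only the peak rate, a cosine scheduler, and 10% warm-up, so the linear warm-up shape, the decay to zero, and the name `lr_at_step` are assumptions made for this sketch.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Assumes linear warm-up over the first warmup_frac of steps,
    followed by cosine decay from peak_lr down to min_lr.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 99, 100, 550, 1000):
    print(s, lr_at_step(s, total))
```

With 1,000 total steps, the rate ramps up during the first 100 steps, reaches the peak of $5 \times 10^{-6}$, and then falls along a half cosine.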

To further enhance medical domain knowledge, MuSeR employs an initial query-guided knowledge distillation stage in which answers from a stronger teacher model (GPT-OSS-120B) supervise the backbone (e.g., Qwen3-32B) using KL divergence at temperature $T = 0.6$. This procedure boosts baseline HealthBench performance from 46.1% to 56.6% prior to multifaceted self-refinement (Zhou et al., 13 Nov 2025).
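As a rough illustration of a temperature-scaled KL objective, the sketch below compares teacher and student distributions at a single token position. Several points are assumptions rather than details from the source: that the divergence is taken as KL(teacher || student), that it is computed per token over logits, and that both models share a vocabulary (which two different model families generally do not without extra alignment). The function names are hypothetical.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax; temperatures below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def kd_kl(teacher_logits, student_logits, temperature=0.6):
    """KL(teacher || student) at one position, both softened with the same T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
print(kd_kl(teacher, student))   # positive: the distributions differ
print(kd_kl(teacher, teacher))   # zero: identical distributions
```

If the temperature in the source instead refers to sampling teacher answers that are then used as ordinary SFT targets, the distillation step reduces to the cross-entropy loss above rather than a logit-level KL term.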

6. Empirical Outcomes and Axial Analysis

Evaluation is conducted on the HealthBench dataset (5,000 physician-annotated dialogues, 7 themes, 5 axes—accuracy, completeness, context-awareness, communication, instruction-following; hard subset of 1,000 challenging cases):

| Model | HealthBench (%) | Hard Subset (%) |
|---|---|---|
| Qwen3-32B (base) | 46.1 | 12.0 |
| + Query-KD | 56.6 | 31.5 |
| + KD + MuSeR | 63.8 | 43.1 |
| Teacher (OSS-120B) | 57.6 | |

Notably, MuSeR yields a +19.4% increase on the context-awareness axis, with theme-specific gains in “context seeking” (+7.6%), “global health” (+5.0%), and “hedging” (+4.0%). Ablations confirm that decision-making awareness ($f_1$) is the most impactful facet: removing it costs 2.7% in performance. Qwen3-32B+MuSeR establishes a new state of the art for open-source models, exceeding its own teacher on HealthBench (Zhou et al., 13 Nov 2025).
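The stage-by-stage contributions implied by the reported scores can be checked with simple arithmetic. The snippet uses only the numbers from the results table and treats the differences as percentage points; the variable and function names are illustrative.

```python
# (overall %, hard subset %) as reported in the results table.
scores = {
    "base": (46.1, 12.0),
    "kd": (56.6, 31.5),
    "kd_muser": (63.8, 43.1),
}
teacher_overall = 57.6


def gain(later, earlier):
    """Difference in percentage points, rounded to one decimal place."""
    return round(later - earlier, 1)


kd_gain_overall = gain(scores["kd"][0], scores["base"][0])
muser_gain_overall = gain(scores["kd_muser"][0], scores["kd"][0])
kd_gain_hard = gain(scores["kd"][1], scores["base"][1])
muser_gain_hard = gain(scores["kd_muser"][1], scores["kd"][1])
margin_over_teacher = gain(scores["kd_muser"][0], teacher_overall)

print(kd_gain_overall, muser_gain_overall)   # 10.5 7.2
print(kd_gain_hard, muser_gain_hard)         # 19.5 11.6
print(margin_over_teacher)                   # 6.2
```

On these figures, distillation accounts for the larger share of the overall gain, while self-refinement adds a further 7.2 points overall and 11.6 points on the hard subset, which is what moves the student past its teacher.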

7. Insights, Limitations, and Prospective Work

  • Both the knowledge distillation and self-refinement stages are necessary: the former for raw medical proficiency, the latter for context-awareness.
  • Direct, facet-driven answer refinement outperforms chain-of-thought continuation.
  • The procedure’s cost-effectiveness and scalability stem from its reliance on synthetic queries and model-internal critique, without requiring private EHR or real patient data.

Current limitations include dependence on synthetic queries and on the foundation model’s pre-existing knowledge. A gap of approximately 3% remains to the top closed model (e.g., GPT-5). Future directions include extending the multifaceted refinement schema to other high-stakes domains (legal, financial), integrating RLHF or real-world dialogue signals for richer self-evaluation, and domain-specific pretraining to close residual gaps (Zhou et al., 13 Nov 2025).

MuSeR demonstrates that explicit, multi-criteria self-refinement—when scaled across synthetic yet realistic contexts—substantially enhances LLMs’ context-awareness for medical applications, and offers a general recipe for context-sensitive LLM adaptation across domains.
