Instruction-Tuning Phase in Clinical Diagnosis
- The instruction-tuning phase is a supervised fine-tuning stage that adapts a pre-trained medical LLM to perform structured, multi-step diagnostic reasoning.
- It employs a dual-objective loss function to concurrently optimize deep reasoning and accurate medical knowledge recall during clinical workflows.
- Empirical results demonstrate improved diagnostic accuracy and reduced physician interventions, outperforming baselines by 7–12 percentage points.
The instruction-tuning phase is a supervised fine-tuning stage within the multi-stage training protocol of DxDirector-7B, an LLM designed to autonomously conduct full-process clinical diagnosis starting from an ambiguous chief complaint. This phase specifically adapts a pre-trained, domain-adapted base model (Llama-2-7B) to learn multi-step clinical reasoning, communication, and question generation in realistic diagnostic workflows, with an emphasis on "deep thinking" and accurate medical knowledge recall (Xu et al., 14 Aug 2025).
1. Role and Context in Model Training
Instruction-tuning for DxDirector-7B constitutes the second out of three core training stages:
- Continued pre-training on medical corpora for domain-specific language acquisition.
- Instruction-tuning for full-process diagnosis (supervised fine-tuning) using curated, stepwise clinical cases.
- Step-level strategy preference optimization via reward-based reinforcement learning.
The instruction-tuning phase is essential for embedding procedural clinical reasoning and task-aligned communication capabilities, bridging the gap between general language proficiency and the domain- and task-specific requirements of an autonomous clinical diagnostic agent.
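As an illustration of where this phase sits in the overall protocol, the sketch below summarizes the three stages as a simple schedule; the stage identifiers and field values paraphrase the text and are assumptions for exposition, not the released training configuration.

```python
# Hypothetical outline of the three-stage training protocol; stage names and
# field values paraphrase the text and are not the released configuration.
TRAINING_PROTOCOL = (
    {"stage": "continued_pretraining",
     "data": "medical corpora",
     "objective": "next-token prediction for domain language acquisition"},
    {"stage": "instruction_tuning",   # the phase described in this article
     "data": "10,178 curated stepwise clinical instruction-response pairs",
     "objective": "supervised fine-tuning with split reasoning/recall losses"},
    {"stage": "step_level_preference_optimization",
     "data": "step-level diagnostic strategy preferences",
     "objective": "reward-based reinforcement learning"},
)

for s in TRAINING_PROTOCOL:
    print(f"{s['stage']}: {s['data']} -> {s['objective']}")
```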
2. Construction of Instruction-Response Dataset
A custom instruction–response dataset of 10,178 high-quality, "step-by-step" clinical demonstration pairs was constructed from MedQA (Jin et al., 2021). The curation workflow involves:
- Extraction of complete clinical information per case, followed by rewriting into (i) a patient-style chief complaint and (ii) an open-ended diagnostic question.
- Use of GPT-4o with enhanced “deep thinking” prompting to generate multi-step reasoning chains formatted as:
[Deep Think][Question]<LLM> or [Question]<Physician>[Answer]
- The chain continues iteratively until a final diagnosis is reached.
- All outputs are reviewed and corrected by medical experts to ensure data fidelity and clinical plausibility (Xu et al., 14 Aug 2025).
This dataset serves to explicitly teach the model to orchestrate multi-turn diagnostic reasoning involving both automated inference and human-in-the-loop requests.
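A minimal sketch of how one such demonstration pair might be serialized into the tagged format described above is shown below; the tag placement, helper names, and the sample case are illustrative assumptions rather than the paper's exact schema.

```python
# Minimal sketch of serializing one curated reasoning chain into the tagged
# target format described above. Tag placement, field names, and the sample
# content are illustrative assumptions, not the paper's exact schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    deep_think: str   # internal "deep thinking" summary
    question: str     # actionable question derived from that summary
    route: str        # "LLM" (self-answered) or "Physician" (human input)
    answer: str       # resolution of the question

def render_chain(steps: List[Step], diagnosis: str) -> str:
    parts = []
    for s in steps:
        if s.route == "LLM":
            parts.append(f"[Deep Think]{s.deep_think}[Question]{s.question}<LLM>{s.answer}")
        else:
            parts.append(f"[Question]{s.question}<Physician>[Answer]{s.answer}")
    parts.append(f"[Final Diagnosis]{diagnosis}")
    return "\n".join(parts)

# Hypothetical pair derived from a rewritten, patient-style MedQA case.
instruction = ("I've had a fever and a painful, swollen right knee for three "
               "days. What could be going on?")
chain = [
    Step("Acute monoarthritis with fever suggests septic arthritis vs. gout.",
         "Is there a history of gout or recent joint trauma?",
         "Physician", "No prior gout; no trauma reported."),
]
print(instruction)
print(render_chain(chain, "Suspected septic arthritis; recommend arthrocentesis."))
```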
3. Instruction-Tuning Loss Functions and Objectives
To concurrently enforce robust clinical reasoning and factual medical recall, instruction-tuning employs a split loss function:
- Let $\mathcal{Y}$ denote all response tokens, and $\mathcal{Y}_{\mathrm{DT}} \subseteq \mathcal{Y}$ those within "Deep Think" and "Question" blocks.
- The reasoning loss focuses on "thinking" and the generation of diagnostic questions:

$$\mathcal{L}_{\text{reason}} = -\sum_{y_t \in \mathcal{Y}_{\mathrm{DT}}} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)$$

- The knowledge recall loss acts on the remaining tokens:

$$\mathcal{L}_{\text{recall}} = -\sum_{y_t \in \mathcal{Y} \setminus \mathcal{Y}_{\mathrm{DT}}} \log p_\theta\left(y_t \mid x,\, c,\, y_{<t}\right)$$

where $x$ is the instruction (chief complaint and question), and $c$ is auxiliary clinical data provided in response to physician requests.
This dual-objective formulation enforces stepwise reasoning proficiency while anchoring factual accuracy, particularly in segments where the model is expected to emulate clinical “deep thinking” (Xu et al., 14 Aug 2025).
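A minimal PyTorch sketch of this split objective is given below, assuming the data pipeline supplies token-level boolean masks marking the "Deep Think"/"Question" spans versus the remaining response tokens; the equal weighting of the two terms and the mask handling are assumptions, not the authors' reported implementation.

```python
import torch
import torch.nn.functional as F

def split_sft_loss(logits, labels, reason_mask, recall_mask,
                   alpha=1.0, beta=1.0):
    """Dual-objective SFT loss: a reasoning term over "Deep Think"/"Question"
    tokens plus a knowledge-recall term over the remaining response tokens.

    logits:      (B, T, V) model outputs, already aligned with labels
                 (i.e. shifted for next-token prediction).
    labels:      (B, T) target token ids; prompt/context positions set to -100.
    reason_mask: (B, T) bool, True on "Deep Think" and "Question" tokens.
    recall_mask: (B, T) bool, True on the remaining response tokens.
    alpha, beta: relative weights of the two terms (assumed hyperparameters).
    """
    # Per-token negative log-likelihood (no reduction); ignored positions get 0.
    nll = F.cross_entropy(logits.transpose(1, 2), labels,
                          ignore_index=-100, reduction="none")   # (B, T)
    reason_loss = (nll * reason_mask).sum() / reason_mask.sum().clamp(min=1)
    recall_loss = (nll * recall_mask).sum() / recall_mask.sum().clamp(min=1)
    return alpha * reason_loss + beta * recall_loss

# Toy usage with random tensors (batch of 1, sequence of 6, vocab of 10).
B, T, V = 1, 6, 10
logits = torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
reason_mask = torch.tensor([[True, True, True, False, False, False]])
recall_mask = ~reason_mask
print(split_sft_loss(logits, labels, reason_mask, recall_mask))
```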
4. Integration into the Full-Process Clinical Diagnosis Workflow
The instruction-tuned model forms the core of DxDirector-7B’s diagnostic loop:
```
I ← {chief_complaint}
history ← []
repeat
    Dt ← DeepThink(I, clinical_goal)
    Qt ← GenerateQuestion(Dt)
    if Qt.type = "LLM":
        At ← LLM_infer(Qt)
    else:
        request(Qt)
        At ← receive_physician_input()
    I ← I ∪ {At}
    history.append((Dt, Qt, At))
until TerminationCriterion(I)
FinalDiagnosis ← Summarize(history)
return FinalDiagnosis
```
At each iteration, the model:
- Integrates current information and clinical goals through a “DeepThink” summary.
- Generates an actionable question (either knowledge/inference or requiring real-world physician input).
- Incorporates new data or responses and repeats until clinical sufficiency is achieved.
- Outputs a structured, literature-referenced summary of each reasoning step (Xu et al., 14 Aug 2025).
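The loop above can be concretized as a short driver. The following Python sketch substitutes hypothetical stub functions (deep_think, generate_question, llm_infer, ask_physician, is_sufficient, summarize) for real model calls and a clinician interface; it is a schematic outline under those assumptions, not the released system.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the model and physician interfaces; in a real
# system these would wrap DxDirector-7B generations and a clinician UI.
@dataclass
class Question:
    text: str
    route: str   # "LLM" (answerable by inference) or "Physician" (needs human input)

def deep_think(info, goal):
    return f"Known so far: {'; '.join(info)}. Goal: {goal}."

def generate_question(dt):
    # Toy policy: ask the physician once, then answer internally.
    route = "Physician" if "physician answer" not in dt else "LLM"
    return Question("Any relevant medical history?", route)

def llm_infer(q):
    return "LLM answer: no further findings needed."

def ask_physician(q):
    return "physician answer: hypertension, on amlodipine."

def is_sufficient(info):
    return len(info) >= 3   # toy termination criterion

def summarize(history):
    return " | ".join(f"{dt} Q: {q.text} A: {a}" for dt, q, a in history)

def run_diagnosis(chief_complaint, goal="reach a final diagnosis"):
    info, history = [chief_complaint], []   # accumulated information I and step log
    while not is_sufficient(info):
        dt = deep_think(info, goal)          # integrate current evidence
        q = generate_question(dt)            # actionable next question
        ans = llm_infer(q) if q.route == "LLM" else ask_physician(q)
        info.append(ans)
        history.append((dt, q, ans))
    return summarize(history)                # structured summary of each step

print(run_diagnosis("Fever and painful swollen right knee for three days"))
```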
5. Empirical Impact and Benchmark Performance
Instruction-tuning enables significant performance gains in multi-benchmark evaluations:
- On the RareArena, NEJM, ClinicalBench, and USMLE benchmarks, DxDirector-7B consistently achieves 7–12 percentage point gains over the strongest baseline LLMs while using approximately 4% of their parameter count.
- On rare and complex NEJM cases, the model achieves 38.40% diagnostic accuracy, exceeding the 32.5% reported for human physicians (Brodeur et al., 2024).
- Stepwise instruction-tuning facilitates a drastic reduction in physician requests per case (e.g., 2.9–3.2 for DxDirector-7B vs. 7.8–9.7 for baselines), with over 97% of requests judged "helpful" in gold-standard case review (Xu et al., 14 Aug 2025).
6. Significance, Limitations, and Future Work
The instruction-tuning phase is central to the paradigm shift in clinical AI roles. Explicit “deep thinking” and structured query generation, as taught during this phase, enable a compact 7B-parameter model to outperform much larger medical and general-purpose LLMs. The approach offers scalable, cost-effective diagnostic automation with minimal physician involvement, high accountability, and traceable stepwise reasoning.
Limitations and avenues for future work include:
- Designing department-specific intervention heuristics for further workload reduction.
- Integration with vision-language and laboratory analysis models to support holistic diagnostic tasks.
- Extending the paradigm to outpatient triage, remote medicine, and longitudinal patient monitoring (Xu et al., 14 Aug 2025).
A plausible implication is that instruction-tuning tailored to the operational constraints and iterative nature of real-world diagnostic reasoning may generalize to other complex, multi-step professional domains.