DxDirector-7B: Clinical Diagnosis LLM
- DxDirector-7B is a 7B-parameter language model designed for full-process clinical diagnosis that integrates multi-step reasoning with autonomous decision making.
- It utilizes a three-phase training pipeline, including continued pre-training, instruction-tuning with stepwise clinical reasoning, and reinforcement learning to optimize diagnostic strategy.
- The model outperforms larger medical LLMs by achieving superior diagnostic accuracy and reducing reliance on physician intervention across key clinical benchmarks.
The instruction-tuning phase is a critical supervised fine-tuning stage in the development of LLMs for specialized applications requiring multi-step, domain-specific reasoning and action. In clinical diagnosis, this phase enables an LLM to autonomously direct the full diagnostic workflow from an ambiguous patient complaint to a final diagnosis, learning complex task decomposition and responsible delegation between autonomous inference and human assistance. The methodology and advances in instruction-tuning are exemplified in the construction of DxDirector-7B, a 7B-parameter LLM fine-tuned for full-process clinical diagnosis (Xu et al., 14 Aug 2025).
1. Position of the Instruction-Tuning Phase within Model Training
Instruction-tuning constitutes the second stage in the three-phase training pipeline of DxDirector-7B. The sequence is as follows:
- Continued Pre-training: Adapts the base Llama-2-7B model to medical text via next-token prediction on clinical guidelines, PubMed abstracts, full papers, and experience-replay from Wikipedia/ArXiv.
- Instruction-Tuning for Full-Process Diagnosis: Supervised fine-tuning on curated stepwise instruction–response trajectories representing the entire diagnostic workflow.
- Step-Level Strategy Preference Optimization: Reinforcement learning (reward-based, using Direct Preference Optimization) to select optimal diagnostic strategies with minimal physician assistance.
This placement ensures that instruction-tuning operates on top of a domain-specialized LLM, facilitating effective mapping from ambiguous instructions to a controlled, stepwise reasoning process.
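The staging above can be sketched as a simple sequential pipeline. The function names and bodies below are hypothetical placeholders (the paper does not publish training code); the sketch only illustrates the ordering constraint that instruction-tuning always operates on an already domain-adapted model.

```python
# Illustrative sketch of DxDirector-7B's three-phase training order.
# Stage names come from the text; the function bodies are placeholders.

def continued_pretraining(model):
    # Next-token prediction on clinical guidelines, PubMed text,
    # plus experience-replay from Wikipedia/ArXiv (placeholder).
    return model + ["pretrained"]

def instruction_tuning(model):
    # Supervised fine-tuning on stepwise instruction-response trajectories.
    return model + ["instruction_tuned"]

def strategy_preference_optimization(model):
    # DPO-style step-level preference optimization (placeholder).
    return model + ["dpo_optimized"]

def train_pipeline(base_model):
    # Each stage builds on the previous one, so instruction-tuning
    # always sees a domain-specialized model, never the raw base model.
    for stage in (continued_pretraining,
                  instruction_tuning,
                  strategy_preference_optimization):
        base_model = stage(base_model)
    return base_model

print(train_pipeline(["llama-2-7b"]))
```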
2. Construction of Instruction–Response Data
The instruction-tuning phase leverages 10,178 high-quality step-by-step instruction–response pairs derived from MedQA (Jin et al., 2021). Full clinical cases are paraphrased—first into vague, patient-style chief complaints, then into open-ended clinical questions. Multi-step reasoning chains are generated using GPT-4o (with the o1-preview model for the “deep thinking” segments), following a schema:
- [Deep Think]: A free-form explication of intermediate reasoning.
- [Question]: Labeled as <LLM> (can be directly answered by the LLM) or <Physician> (requires external action or information).
- [Answer]: The direct response, either inferred by the LLM or provided by a "simulated" physician.
Expert review and correction ensure medical and logical soundness, supplying high-signal, process-level supervision required for complex task chaining.
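A single trajectory under this schema can be represented as structured data. The example below is hypothetical (field names, wording, and clinical content are illustrative, not drawn from the released dataset); it shows how each step pairs a [Deep Think] segment with a delegated [Question] and its [Answer].

```python
# Hypothetical instruction-response trajectory following the
# [Deep Think] / [Question] / [Answer] schema described above.
trajectory = {
    "chief_complaint": "I've been feeling really tired and thirsty lately.",
    "steps": [
        {
            "deep_think": "Fatigue plus polydipsia suggests a metabolic "
                          "cause; blood glucose should be checked first.",
            "question": {"text": "What is the fasting blood glucose?",
                         "delegate": "<Physician>"},  # needs external data
            "answer": "Fasting glucose: 182 mg/dL.",  # simulated physician
        },
        {
            "deep_think": "A fasting glucose above 126 mg/dL is consistent "
                          "with diabetes mellitus.",
            "question": {"text": "Does this support type 2 diabetes?",
                         "delegate": "<LLM>"},        # answerable in-model
            "answer": "Yes; likely type 2 diabetes mellitus.",
        },
    ],
    "final_diagnosis": "Type 2 diabetes mellitus",
}

# Count how many steps required physician assistance.
n_physician = sum(s["question"]["delegate"] == "<Physician>"
                  for s in trajectory["steps"])
print(n_physician)  # 1
```

The `<LLM>` vs `<Physician>` label on each question is what later supervises the delegation decision, so it is stored per step rather than per case.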
3. Objective Function and Segmentation of Instructional Supervision
Instruction-tuning in DxDirector-7B employs a dual loss to explicitly model both deep reasoning and medical knowledge recall:
Let $Y = \{y_1, \dots, y_T\}$ denote all response tokens, let $\mathcal{T}_r \subseteq \{1, \dots, T\}$ index the tokens belonging to [Deep Think] and [Question] segments, and let $\mathcal{T}_a$ index the remaining [Answer] tokens.
- Reasoning Loss: $\mathcal{L}_{\text{reason}} = -\sum_{t \in \mathcal{T}_r} \log p_\theta(y_t \mid x, y_{<t})$
This targets the “deep thinking” and question-proposing tokens, learning stepwise cognitive elaboration.
- Knowledge Recall Loss: $\mathcal{L}_{\text{recall}} = -\sum_{t \in \mathcal{T}_a} \log p_\theta(y_t \mid x, c, y_{<t})$
Here, $x$ is the chief complaint and question, and $c$ is auxiliary clinical data (e.g., test results when physician assistance is invoked), focusing the model on factual accuracy or retrieval.
This segregation enforces granularity in supervision, distinguishing cognitive process modeling from factual retrieval, crucial for operational autonomy in full-process diagnosis (Xu et al., 14 Aug 2025).
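A minimal numeric sketch of this segregation, assuming made-up per-token probabilities in place of a real model's $p_\theta(y_t \mid x, y_{<t})$: token positions are partitioned by segment label, and each partition contributes its own negative log-likelihood term.

```python
import math

# Toy dual-loss computation: [Deep Think]/[Question] tokens feed the
# reasoning loss, [Answer] tokens feed the knowledge-recall loss.
# The probabilities are invented for illustration only.
token_probs = [0.9, 0.6, 0.8, 0.7, 0.95]   # model prob. of each gold token
segment     = ["think", "think", "question", "answer", "answer"]

def masked_nll(probs, segs, keep):
    """Negative log-likelihood summed over tokens whose segment is in `keep`."""
    return -sum(math.log(p) for p, s in zip(probs, segs) if s in keep)

loss_reason = masked_nll(token_probs, segment, {"think", "question"})
loss_recall = masked_nll(token_probs, segment, {"answer"})
total_loss  = loss_reason + loss_recall  # equal weighting; an assumption here

print(round(loss_reason, 4), round(loss_recall, 4))
```

In a real implementation the same effect is usually achieved by masking label tokens outside the target segment (e.g., setting them to an ignore index) rather than by filtering probabilities explicitly.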
4. Instruction-Tuning Schema for Full-Process Clinical Reasoning
The selected instruction–response pairs encode multi-hop, action-conditioned reasoning that proceeds as follows:
- Start from a chief complaint (often vague).
- Iteratively:
- Generate [Deep Think]: summarize current information and reasoning toward the next decision.
- Propose [Question]: decide if the LLM can answer from knowledge or must request new patient data or procedures (<LLM> vs <Physician>).
- Produce [Answer]: provide inference or request.
- Integrate any new data from physician assistance back into the process.
- Terminate: only when sufficient information is gathered for high-confidence diagnosis.
This explicit structure enables the LLM to orchestrate complex workflows while minimizing human involvement.
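The iteration above amounts to a simple control loop. The sketch below is hypothetical (the model call, confidence check, and physician interface are stubs standing in for the real components); it shows only the control flow: think, delegate the question, integrate the answer, and stop once confidence is sufficient.

```python
# Minimal sketch of the iterative diagnostic loop described above.
def run_diagnosis(chief_complaint, model_step, ask_physician,
                  confidence_threshold=0.9, max_steps=10):
    context = [chief_complaint]
    for _ in range(max_steps):
        step = model_step(context)              # one schema step per call
        if step["delegate"] == "<Physician>":
            # Request external data and fold it back into the context.
            context.append(ask_physician(step["question"]))
        else:
            context.append(step["answer"])
        if step["confidence"] >= confidence_threshold:
            return step["diagnosis"], len(context)
    return None, len(context)                   # no confident diagnosis

# Scripted stub standing in for the LLM: the first step needs a lab
# value from the physician, the second reaches a confident diagnosis.
script = iter([
    {"delegate": "<Physician>", "question": "fasting glucose?",
     "answer": None, "confidence": 0.4, "diagnosis": None},
    {"delegate": "<LLM>", "question": None,
     "answer": "likely type 2 diabetes", "confidence": 0.95,
     "diagnosis": "type 2 diabetes mellitus"},
])
diagnosis, _ = run_diagnosis("tired and thirsty",
                             lambda ctx: next(script),
                             lambda q: "glucose: 182 mg/dL")
print(diagnosis)  # type 2 diabetes mellitus
```

Keeping the physician call behind an explicit delegation branch is what makes the per-case assistance count (reported in the benchmarks below) directly measurable.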
5. Outcomes and Significance in Downstream Performance
Instruction-tuning demonstrably enables DxDirector-7B to surpass both general-purpose and larger medical LLMs—including models with 176B parameters—on multiple diagnostic benchmarks. The model achieves top diagnostic accuracy in RareArena (36.23%), NEJM cases (38.40%), ClinicalBench (63.46%), and USMLE (50.88%), representing 7–12% absolute gains above the best prior models while using ≈4% of their parameter count. Notably, DxDirector-7B limits physician requests to 2.9–3.2 per case (vs. 7.8–9.7 in baselines) with 97–98% of those requests deemed necessary (helpfulness) (Xu et al., 14 Aug 2025).
This suggests that instruction-tuning for full-process, stepwise clinical reasoning—not merely factual answering—enables efficient, autonomous diagnostic workflows that dramatically reduce requisite human labor.
6. Broader Implications and Future Directions
Instruction-tuning, structured around explicit multi-step workflows and dual loss segmentation, underpins scalable, safe, and auditable autonomy in specialized domains. Fine-grained accountability (e.g., LLM vs. physician attribution at each decision node) is made possible by the systematic structuring of the instruction-response schema. A plausible implication is that similar instruction-tuning methodologies could extend to other high-stakes, process-driven fields—including legal analysis, scientific protocol design, and industrial troubleshooting.
Continuing advancements are expected to explore department-specific policies, integration with vision-language and laboratory models, and extensions to outpatient triage, remote care, and continuous patient monitoring (Xu et al., 14 Aug 2025).