This paper introduces ClinicalGPT-R1, an LLM specifically designed to enhance reasoning capabilities for clinical disease diagnosis. It addresses the challenge that, while LLMs show promise in general reasoning tasks, applying them to the nuanced, multi-step reasoning required for medical diagnosis remains difficult, partly because clinical thought processes lack the verifiable intermediate steps found in domains such as math or coding.
Data Collection and Synthesis
To train ClinicalGPT-R1, the researchers utilized a combination of real-world and synthetic data:
- Real Data:
  - MedDX-FT dataset (Zheng, 15 Apr 2024), containing medical records and diagnostic reasoning examples.
  - Anonymized Electronic Health Records (EHRs).
- Synthetic Data Generation: A pipeline was developed to create high-quality synthetic reasoning data grounded in real medical records (a minimal sketch of such a loop follows this list).
  - Process: State-of-the-art LLMs (tested with GPT-4o-mini and Deepseek-v3-0324) were prompted with a long-chain reasoning strategy to generate diagnostic steps and conclusions.
  - Refinement: If the initial generation failed to reach the correct diagnosis, iterative strategies were applied: Exploring New Paths, Backtracking, Verification, and Corrections. If these still failed after multiple attempts, the correct answer and a Chain-of-Thought (CoT) path were provided to guide the final generation.
  - Formatting: The generated reasoning steps were reformatted into a natural-language "long CoT" incorporating transition words, and a formal "long response" summarizing the diagnosis was also generated.
  - Outcome: Synthetic data generated by GPT-4o-mini led to better downstream model performance than data generated by Deepseek-v3-0324.
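The paper describes this refinement loop only at a high level; the following Python sketch illustrates one way such a pipeline could be organized. All function names (`generate_long_cot`, `diagnosis_matches`, `reformat_as_long_cot`, `summarize_diagnosis`), the strategy order, and the retry budget are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the iterative synthesis loop described above.
# The helper functions stand in for prompt calls to a generator LLM and a
# correctness check; none of these names come from the paper.

REFINEMENT_STRATEGIES = ["explore_new_path", "backtrack", "verify", "correct"]
MAX_ATTEMPTS = 4  # retry budget is an assumption

def synthesize_example(record: dict, gold_diagnosis: str) -> dict | None:
    """Produce one {question, thinking, final_response} training example."""
    trace = generate_long_cot(record)  # initial long-chain reasoning attempt
    for attempt in range(MAX_ATTEMPTS):
        if diagnosis_matches(trace.final_answer, gold_diagnosis):
            break
        strategy = REFINEMENT_STRATEGIES[attempt % len(REFINEMENT_STRATEGIES)]
        trace = generate_long_cot(record, strategy=strategy, previous=trace)
    else:
        # Last resort: reveal the correct answer plus a CoT outline and regenerate.
        trace = generate_long_cot(record, hint=gold_diagnosis, previous=trace)
        if not diagnosis_matches(trace.final_answer, gold_diagnosis):
            return None  # discard unrecoverable cases

    return {
        "question": record["history"],            # patient history / records
        "thinking": reformat_as_long_cot(trace),  # natural-language long CoT
        "final_response": summarize_diagnosis(trace),  # formal "long response"
    }
```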
Training Methodology
ClinicalGPT-R1 was developed using a two-stage training approach, starting from Qwen2.5-7B-Instruct as the base model:
- Supervised Fine-Tuning (SFT):
  - Data: The synthetic dataset, structured as {question (patient history/records), thinking (long CoT), final response (diagnosis)}; a sketch of how such samples could be serialized appears after this list.
  - Goal: To teach the model the patterns of clinical reasoning and to "think before answering."
  - Implementation: Fine-tuned for 3 epochs with a cosine decay learning rate schedule (5e-6 start/end). Samples were packed into sequences, with masking used to keep examples isolated from one another.
- Reinforcement Learning (RL):
  - Goal: To further optimize the model's reasoning ability and decision-making, particularly for long CoT generation, beyond merely mimicking the SFT data.
  - Algorithm: Proximal Policy Optimization (PPO).
  - Reward Model: A result-based reward scheme was used (see the sketch after this list):
    - Reward = 1.0 for correct answers with clear reasoning.
    - Reward = 0.1 for incorrect answers.
    - Reward = 0 for responses lacking explicit reasoning (even if the final answer was correct).
    - A large-scale model served as a verifier to assess correctness.
  - Implementation: Trained with a learning rate of 5e-7, batch size 16, PPO beta 0.03, 3 epochs, discount factor 1.0, value coefficient 1.0, and clip range 0.2.
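To make the SFT data format concrete, here is a minimal sketch of how a {question, thinking, final response} sample might be serialized into a single training string, with labels masked so that only the reasoning and response tokens contribute to the loss. The tag names, prompt template, and Hugging Face-style tokenizer interface are assumptions for illustration; the paper does not specify its exact serialization.

```python
# Hypothetical serialization of one SFT sample; the template and <think> tags
# are illustrative, not taken from the paper.
IGNORE_INDEX = -100  # label value ignored by cross-entropy loss in most trainers

def build_sft_example(sample: dict, tokenizer) -> dict:
    """Turn {question, thinking, final_response} into input_ids/labels.

    Only the "thinking" and "final_response" tokens receive supervision;
    the prompt (patient history/records) is masked out of the loss.
    """
    prompt = f"Patient record:\n{sample['question']}\nDiagnose the patient.\n"
    target = f"<think>\n{sample['thinking']}\n</think>\n{sample['final_response']}"

    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    target_ids = tokenizer.encode(target, add_special_tokens=False) + [tokenizer.eos_token_id]

    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids  # supervise the response only
    return {"input_ids": input_ids, "labels": labels}
```

When several such examples are packed into one long sequence, an additional attention/label mask would keep them isolated from one another, as the paper notes.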
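The result-based reward described above maps naturally onto a small scoring function. The sketch below assumes a hypothetical `verifier_says_correct` call (standing in for the paper's large-scale verifier model) and a tag-based heuristic for detecting an explicit reasoning section; both are illustrative stand-ins, not the authors' code.

```python
def compute_reward(response: str, gold_diagnosis: str) -> float:
    """Result-based reward used during PPO, per the paper's description.

    1.0 -> correct answer with clear reasoning
    0.1 -> incorrect answer
    0.0 -> no explicit reasoning, even if the final answer is correct
    """
    # Heuristic reasoning check: the paper does not specify how "explicit
    # reasoning" is detected; a tag-based check is one plausible option.
    has_reasoning = "<think>" in response and "</think>" in response
    if not has_reasoning:
        return 0.0

    final_answer = response.split("</think>")[-1]
    # verifier_says_correct is a placeholder for querying the large-scale
    # verifier model that judges diagnostic correctness.
    return 1.0 if verifier_says_correct(final_answer, gold_diagnosis) else 0.1
```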
Evaluation Benchmark: MedBench-Hard
To rigorously evaluate diagnostic performance, the researchers created MedBench-Hard:
- Composition: 3,500 challenging clinical diagnostic cases (500 per department).
- Departments Covered: Respiratory, Gastroenterology, Urology, Cardiology, Immunology, Neurology, and Endocrinology.
- Sampling: Stratified sampling based on ICD-10 codes was used to ensure diverse disease representation within each department (including rare diseases) and to avoid duplication; a minimal sketch of such stratified sampling follows.
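The paper does not publish its sampling code; the sketch below shows one straightforward way to stratify cases by ICD-10 code within a department while avoiding duplicates. The field names (`department`, `icd10`, `case_id`) and the round-robin allocation rule are assumptions.

```python
import random
from collections import defaultdict

def sample_department_cases(cases: list[dict], department: str,
                            n_total: int = 500, seed: int = 0) -> list[dict]:
    """Stratified sampling by ICD-10 code within one department, no duplicates."""
    rng = random.Random(seed)

    # Group the department's cases by ICD-10 code (one stratum per code).
    strata = defaultdict(list)
    for case in cases:
        if case["department"] == department:
            strata[case["icd10"]].append(case)

    # Round-robin across codes so rare diseases are represented before
    # common codes fill the quota.
    for pool in strata.values():
        rng.shuffle(pool)
    selected, seen_ids = [], set()
    while len(selected) < n_total and any(strata.values()):
        for code in list(strata.keys()):
            if strata[code]:
                case = strata[code].pop()
                if case["case_id"] not in seen_ids:
                    seen_ids.add(case["case_id"])
                    selected.append(case)
                    if len(selected) == n_total:
                        break
    return selected
```

Round-robin allocation is only one choice; allocating proportionally to ICD-10 prevalence would be another reasonable strategy.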
Results and Key Findings
- Training Strategy: The combined SFT+RL approach significantly outperformed SFT alone, demonstrating the value of RL in refining reasoning pathways.
- Language Comparison: When using the Qwen base model, training and evaluating on Chinese data yielded better results than using English data.
- Performance vs. Baselines:
  - Chinese: ClinicalGPT-R1 significantly outperformed both its base model (Qwen2.5-7B-Instruct) and GPT-4o on the MedBench-Hard Chinese tasks.
  - English: ClinicalGPT-R1 performed comparably to GPT-4o and significantly better than its base model on the MedBench-Hard English tasks.
- Catastrophic Forgetting: Testing on MedQA indicated that ClinicalGPT-R1 retained general medical knowledge without significant forgetting after specialized reasoning training.
Practical Implications and Implementation Considerations
- Application: ClinicalGPT-R1 demonstrates potential as a tool for assisting clinicians in diagnosis, especially in complex cases requiring detailed reasoning. It could analyze patient records, suggest potential diagnoses, and outline the reasoning steps.
- Data: The quality of synthetic data heavily influences performance. Using powerful generator models (like GPT-4o variants) and robust generation/refinement pipelines is crucial. Access to diverse, anonymized real-world clinical data (EHRs, MedDX-FT) is essential for grounding the model.
- Training: Implementing the two-stage SFT+RL process requires significant computational resources. The RL phase, involving PPO and a separate verifier model, adds complexity compared to standard SFT. Hyperparameter tuning for both stages (learning rates, PPO parameters) is important.
- Deployment:
  - Resource Needs: Although based on a 7B-parameter model, inference may still require capable hardware, especially when generating long reasoning chains (see the inference sketch after this list).
  - Validation: Extensive validation by medical professionals in real-world clinical workflows is necessary before deployment. The model should be treated as an assistant, not a replacement for clinical judgment.
- Language: Performance differences between languages suggest that language-specific tuning or data collection might be needed for optimal results in different regions.
- Benchmarking: MedBench-Hard provides a valuable resource for evaluating diagnostic reasoning capabilities in LLMs, particularly for complex cases across multiple specialties.
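As a rough illustration of the resource point above, the following shows how a long-CoT diagnosis could be generated with the Hugging Face transformers API. The checkpoint path is a placeholder (the paper does not state a released model ID), and the prompt format, generation settings, and example record are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/clinicalgpt-r1-checkpoint"  # placeholder; no public ID is given

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,  # roughly 15 GB of weights for a 7B model in bf16
    device_map="auto",
)

record = "58-year-old male, 3 days of productive cough, fever 38.7C"  # illustrative only
messages = [
    {"role": "system", "content": "You are a clinical diagnostic assistant. Think step by step before answering."},
    {"role": "user", "content": f"Patient record:\n{record}\nWhat is the most likely diagnosis?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Long reasoning chains need a generous token budget, which dominates latency and memory.
output = model.generate(inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```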
In summary, ClinicalGPT-R1 showcases a practical approach combining curated real data, advanced synthetic data generation, and a two-stage SFT+RL training process to build an LLM with enhanced clinical diagnostic reasoning capabilities, validated on a new challenging benchmark.