Baichuan-M2: 32B Medical Reasoning Model

Updated 4 July 2026

Baichuan-M2 is a 32B-parameter medical augmented reasoning model designed for realistic clinical decision-making using a dynamic verifier with patient simulation and rubric-based rewards.
It employs a multi-stage training pipeline that includes medical domain mid-training, supervised fine-tuning on millions of samples, and multi-turn reinforcement learning for context-sensitive responses.
The model demonstrates competitive benchmark results on HealthBench with efficient deployment via advanced quantization, while retaining non-medical capabilities.

Baichuan-M2 is a 32B-parameter medical augmented reasoning model developed to improve LLM performance in realistic clinical decision-making rather than only on static medical examinations. It is presented as an open-source medical AI system trained with a dynamic verifier system and multi-stage reinforcement learning, and is built on Qwen2.5-32B-Base rather than a pre-aligned chat model (Team et al., 2 Sep 2025). The name should not be conflated with Baichuan 2, the earlier general-purpose multilingual model family of 7B and 13B decoder-only LLMs; the Baichuan 2 technical report does not mention Baichuan-M2 as a separate model, variant, or alias (Yang et al., 2023).

1. Identity within the Baichuan model family

Baichuan-M2 belongs to the Baichuan medical line rather than to the general multilingual Baichuan 2 family. The Baichuan 2 report describes open, multilingual, decoder-only LLMs trained from scratch on 2.6 trillion tokens, with released 7B and 13B base and chat variants, but it explicitly does not identify any model called Baichuan-M2 (Yang et al., 2023). By contrast, Baichuan-M2 is introduced as a 32B-parameter medical augmented reasoning model optimized for clinical interaction and verification-intensive reinforcement learning (Team et al., 2 Sep 2025).

Later work on Baichuan-M3 explicitly positions M3 as an advance over M2 and confirms several lineage facts: Baichuan-M2 introduced a patient simulator for doctor–patient interaction, introduced a rubric-based verifier paradigm, and mainly relies on rubric-based rewards to shape medical reasoning. The M3 report also confirms that M2 had at least a 32B variant through the baseline “Baichuan-M2-32B” (Team et al., 6 Feb 2026). This establishes Baichuan-M2 as a distinct intermediate generation in the Baichuan medical series, separate from both Baichuan 2 and the later Baichuan-M3.

A common source of confusion is the similarity between “Baichuan 2” and “Baichuan-M2.” The available reports support a strict distinction: Baichuan 2 is a general multilingual LLM family, whereas Baichuan-M2 is a medical model trained with a domain-specific dynamic verification framework.

2. Clinical problem formulation and design objective

The central claim behind Baichuan-M2 is that strong performance on static medical exams is insufficient for real clinical usefulness. The model is intended to bridge the gap between high scores on benchmarks such as USMLE-style tests and the demands of practice, which involve multi-turn interaction, incomplete information, diagnostic exploration, communication, empathy, and medical ethics (Team et al., 2 Sep 2025).

The motivating critique is directed at the verifier side of medical LLM training as much as at the model itself. In the paper’s framing, prior systems often depend on static QA datasets, exam-style benchmarks, answer matching or rule-based verifiers, and supervised fine-tuning on medical instructions. These tools are inadequate for clinical consultation because they do not capture partial observability, multi-turn exploration, dynamic judgment, or the multidimensional character of medical quality. Binary correctness is therefore treated as too narrow a target.

This leads to the paper’s broader thesis: medicine differs from domains such as math and coding because reward quality cannot be reduced to answer matching. Baichuan-M2 is accordingly designed not merely to emit correct diagnoses, but to behave more like a physician in an evolving consultation: eliciting missing information, updating judgment as new evidence appears, tailoring explanation to the patient, and respecting ethical and safety norms. A plausible implication is that the paper views verifier design as a primary bottleneck for medical RL, not just model scale or domain pretraining.

3. Dynamic verifier system

The core technical contribution is a dynamic interactive reinforcement learning framework built around two modules: a Patient Simulator and a Clinical Rubrics Generator (Team et al., 2 Sep 2025).

Module	Role	Notable details
Patient Simulator	Creates realistic multi-turn clinical encounters	Uses de-identified records; includes personality and sociocultural factors
Clinical Rubrics Generator	Produces context-dependent evaluation criteria	Rubric weights are integers in [-10, 10]; supports multi-dimensional scoring

The Patient Simulator is built from de-identified medical records, doctor-patient conversation records, curated clinical datasets, and cases spanning multiple specialties and population groups. Each patient script combines medical information—such as chief complaint, history of present illness, and past medical history—with psychological and sociocultural information. The paper explicitly cites personality traits inspired by the MBTI 16-type model, along with background factors such as financial constraints and education level. The purpose is to model patients as socially situated agents rather than static symptom containers.

Architecturally, the simulator uses three components: a Termination Gate, an Affective Unit, and a Fact Unit. The Termination Gate determines when the conversation should end; the Affective Unit generates responses aligned with personality and social profile; and the Fact Unit verifies consistency with the patient profile in real time and prevents factual contradictions, unsupported disclosures, and information leakage. The paper states that the Affective Unit and Fact Unit are implemented using LLMs and that a non-thinking model is used to quickly determine termination conditions and verify facts. This modularization is presented as a response to concrete simulator failure modes, including leakage, inconsistency, and poor termination control.
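The division of labor among the three components can be illustrated with a minimal sketch. Everything below is hypothetical: the class names, the callable signatures, the retry policy, and the fallback reply are assumptions made for illustration, and the rule-based stubs stand in for the LLM-backed units described in the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class PatientProfile:
    facts: Dict[str, str]  # medical facts the simulated patient may draw on
    persona: str           # personality and sociocultural background


@dataclass
class PatientSimulator:
    """Hypothetical wiring of Termination Gate, Affective Unit, and Fact Unit."""
    profile: PatientProfile
    affective_unit: Callable[[PatientProfile, List[Turn]], str]
    fact_unit: Callable[[PatientProfile, str], bool]
    termination_gate: Callable[[List[Turn]], bool]
    max_retries: int = 2  # assumption: regenerate a rejected reply a few times
    history: List[Turn] = field(default_factory=list)

    def step(self, doctor_utterance: str) -> Optional[str]:
        """Return the patient's reply, or None once the gate ends the dialogue."""
        self.history.append(("doctor", doctor_utterance))
        if self.termination_gate(self.history):
            return None
        for _ in range(self.max_retries + 1):
            reply = self.affective_unit(self.profile, self.history)
            if self.fact_unit(self.profile, reply):
                break  # reply is consistent with the profile
        else:
            reply = "I'm not sure."  # assumed fallback when every draft is rejected
        self.history.append(("patient", reply))
        return reply


# Rule-based stand-ins for the LLM-backed units (illustration only).
def demo_affective(profile: PatientProfile, history: List[Turn]) -> str:
    return f"My main problem is {profile.facts['chief_complaint']}."


def demo_fact(profile: PatientProfile, reply: str) -> bool:
    # Reject replies that leak the diagnosis the doctor is supposed to find.
    return profile.facts["hidden_diagnosis"] not in reply


def demo_gate(history: List[Turn]) -> bool:
    return len(history) >= 20 or "goodbye" in history[-1][1].lower()
```

In this sketch the gate is consulted before any reply is generated, and the fact check sits between generation and delivery, which is one simple way to keep leaked or contradictory statements out of the dialogue history.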

The Clinical Rubrics Generator provides the evaluation counterpart. Its stated design goals are Comprehensiveness, Reliability, and Adaptiveness. Prompt sources for rubric construction include medical record–driven prompts, knowledge base–driven prompts derived from textbooks, research papers, clinical guidelines, and pharmacopoeias, and synthetic scenario prompts covering tasks such as triage, note writing, physical exam interpretation, instruction following, and multi-turn coherence. Rubric construction proceeds through clinician-defined core dimensions, LLM-generated candidate rubrics, expert selection and customization, and weight annotation with integer scores in [-10, 10]. Listed rubric dimensions include diagnostic accuracy, inquiry logic, treatment rationality, communication and empathy, and medical ethics.
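A short sketch can make the weighted-rubric idea concrete. The integer weight range [-10, 10] comes from the text above; the aggregation rule (earned points divided by the maximum attainable positive points, clipped to [0, 1]) is an assumption chosen for illustration and is not confirmed as the paper's exact normalization. The `Rubric` class and `rubric_score` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rubric:
    dimension: str    # e.g. "diagnostic accuracy", "medical ethics"
    description: str  # what the response should (or should not) do
    weight: int       # integer in [-10, 10]; negative weights penalize

    def __post_init__(self) -> None:
        if not isinstance(self.weight, int) or not -10 <= self.weight <= 10:
            raise ValueError("rubric weight must be an integer in [-10, 10]")


def rubric_score(rubrics: List[Rubric], met: List[bool]) -> float:
    """Assumed aggregation: earned weight / total positive weight, clipped to [0, 1]."""
    if len(rubrics) != len(met):
        raise ValueError("one met/unmet judgment is required per rubric")
    max_positive = sum(r.weight for r in rubrics if r.weight > 0)
    if max_positive == 0:
        return 0.0
    earned = sum(r.weight for r, is_met in zip(rubrics, met) if is_met)
    return min(1.0, max(0.0, earned / max_positive))


rubrics = [
    Rubric("inquiry logic", "asks about symptom duration", 8),
    Rubric("communication and empathy", "acknowledges the patient's worry", 5),
    Rubric("medical ethics", "recommends a prescription drug without assessment", -6),
]
# The first and third criteria are triggered: 8 - 6 = 2 of 13 attainable points.
score = rubric_score(rubrics, [True, False, True])
```

Under this assumed rule, a triggered negative rubric can cancel most of the credit earned elsewhere, which is the behavior that multi-dimensional scoring with penalties is meant to provide.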

Rubric quality is not left entirely implicit. The paper reports that, on 100 cases evenly sampled across categories, the trained rubrics generator achieved 92.7% consistency rate with expert-annotated rubrics, counting dimension-level matches rather than exact wording. It also introduces separate prompting templates for positive rubrics and negative rubrics, with JSON-style outputs using "acceptable" versus "unacceptable", in order to reduce evaluator confusion. For serving efficiency, an affinity mechanism routes rubric prompts sharing the same dialogue prefix to the same instance to improve KV cache reuse.
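The affinity idea can be sketched as deterministic routing on the shared dialogue prefix. This is a minimal illustration under stated assumptions: the function name `route_by_prefix`, the use of SHA-256, and modulo placement are hypothetical choices, not the serving stack's actual mechanism.

```python
import hashlib


def route_by_prefix(dialogue_prefix: str, num_instances: int) -> int:
    """Map every request that shares a dialogue prefix to the same instance index."""
    if num_instances <= 0:
        raise ValueError("num_instances must be positive")
    digest = hashlib.sha256(dialogue_prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_instances


prefix = "patient: I have had a cough for three days.\ndoctor: Any fever?"
# Two rubric prompts built on the same dialogue land on the same instance,
# so the key/value cache computed for the prefix can be reused.
first = route_by_prefix(prefix, num_instances=8)
second = route_by_prefix(prefix, num_instances=8)
```

Hashing only the prefix, rather than the full rubric prompt, is what makes different rubric checks for one dialogue collide on purpose.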

Taken together, these components define what the paper calls a “virtual clinical world”: the model acts, the patient simulator responds, the rubrics generator evaluates, and reinforcement learning updates the policy. This suggests a shift from static answer verification to process-sensitive clinical supervision.

4. Training pipeline and optimization

Baichuan-M2 is trained in three broad phases: mid-training for medical domain adaptation, supervised fine-tuning (SFT), and multi-stage reinforcement learning (Team et al., 2 Sep 2025).

The model starts from Qwen2.5-32B-Base. The authors state that, in internal comparisons, this base model gave better training stability than Qwen3-32B and avoided degradation from pre-existing alignment. Mid-training uses a medical corpus consisting of public medical textbooks, clinical monographs, drug knowledge bases, current clinical diagnosis and treatment guidelines, and de-identified real medical record reports. Two explicit data-enhancement procedures are described. Structured Rephrasing rewrites medical texts for clarity and coherence under strict knowledge-fidelity constraints, while Explicit CoT Injection inserts “thinking notes” such as knowledge association, reflection, argument verification, and case deduction into knowledge-dense paragraphs and key conclusions. To preserve broader competence, the data mixture combines medical, general, and mathematical reasoning corpora in a 2:2:1 ratio.
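As a small worked example, the 2:2:1 ratio corresponds to 40% medical, 40% general, and 20% mathematical reasoning data. The helper name and the corpus labels below are hypothetical, and the source does not say whether the ratio is measured in tokens or in samples.

```python
from typing import Dict


def mixture_fractions(ratio: Dict[str, int]) -> Dict[str, float]:
    """Convert integer mixture parts into sampling fractions that sum to 1."""
    total = sum(ratio.values())
    return {name: part / total for name, part in ratio.items()}


fractions = mixture_fractions({"medical": 2, "general": 2, "math_reasoning": 1})
# medical and general each take 2/5 of the mixture; math reasoning takes 1/5.
```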

The SFT stage is intended to stabilize subsequent RL and improve exploration quality. A candidate pool of over 4 million samples is assembled from in-house Baichuan-M1 datasets and external open-source datasets. DeepSeek-R1 is used as the primary chain-of-thought generator for complex reasoning chains. The final SFT dataset contains 2 million samples, with ~20% medical-related data. Training uses Qwen2.5-32B-Base, a 32K context length, and 2 epochs.

Reinforcement learning is staged into Rule-based RL, Rubric-based RL, and Multi-turn RL. The first stage uses tasks with definitive answers—including mathematics, programming, general instruction-following, medical knowledge QA, and medical diagnosis—to strengthen reasoning while retaining general performance. The second stage moves to open-ended medical prompts such as initial consultations, case analysis, treatment explanations, medication education, and prognosis. The third stage places the model in dialogue with the patient simulator and scores context slices using dynamically generated rubrics.

The RL algorithm is an improved form of Group Relative Policy Optimization (GRPO). The paper describes four modifications: Removing KL divergence, Asymmetric clipping, Length-normalized loss, and Simplified advantage normalization. The group-relative advantage is defined as

$\hat{A}_{i,t} = R(q, o_i) - \text{mean}(\{R(q, o_1), \ldots, R(q, o_G)\}),$

which removes the need for a separate value model by comparing each sampled output against the group mean. For rubric-based RL, the reward is

$R(q, o_i) = R_{\text{rubric}}(q, o_i) + R_{\text{length}}(q, o_i),$

where $R_{\text{rubric}}$ is a normalized rubric score and $R_{\text{length}}$ is a conditional brevity bonus. The conceptual purpose of the length term is explicit: conciseness is rewarded only when quality is already high, reducing the tendency toward short but incomplete answers.
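The reward and advantage definitions can be sketched in a few lines. The mean-subtracted advantage follows the formula given in the text; the specific form of the length bonus (a quality threshold of 0.8, a 1,024-token budget, and a maximum bonus of 0.1) is an assumption made purely for illustration, since the text only states that brevity is rewarded when quality is already high.

```python
from statistics import mean
from typing import List


def length_bonus(rubric_score: float, num_tokens: int,
                 quality_threshold: float = 0.8,  # assumed value
                 token_budget: int = 1024,        # assumed value
                 max_bonus: float = 0.1) -> float:  # assumed value
    """Conditional brevity bonus: zero unless the rubric score is already high."""
    if rubric_score < quality_threshold:
        return 0.0
    saved_fraction = max(0, token_budget - num_tokens) / token_budget
    return max_bonus * saved_fraction


def total_reward(rubric_score: float, num_tokens: int) -> float:
    """R(q, o_i) = R_rubric(q, o_i) + R_length(q, o_i)."""
    return rubric_score + length_bonus(rubric_score, num_tokens)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """A_i = R(q, o_i) - mean of the rewards of all G outputs sampled for q."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Three sampled outputs for one prompt: (normalized rubric score, length in tokens).
samples = [(0.9, 512), (0.9, 1024), (0.5, 128)]
rewards = [total_reward(score, tokens) for score, tokens in samples]
advantages = group_relative_advantages(rewards)
```

In this sketch the short, low-quality third sample receives no brevity bonus, while the concise high-quality first sample is preferred over the equally accurate but longer second one. Because the baseline is the group mean, the advantages within a group sum to zero and no learned value model is involved.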

The most distinctive stage is Multi-turn RL. After each interaction, a slice of dialogue history is extracted, the rubrics generator creates context-specific criteria, and the next response is scored and reinforced. The paper states that the training signal is applied at the fragment level rather than over entire sessions because simulators can still produce repeated generations, overly long dialogues, and role inversion. Only semantically coherent and causally plausible dialogue fragments are retained. The authors explicitly note that they have not yet extended RL to full dialogue-session optimization.
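Fragment-level training can be illustrated by slicing a simulated dialogue into (context, response) pairs and discarding slices that show the failure modes named above. The filters below are crude, hypothetical proxies (a turn cap, strict role alternation, and verbatim-repeat detection); the paper does not specify how coherence and plausibility are actually checked.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, text), with role in {"doctor", "patient"}


def is_usable(fragment: List[Turn], max_turns: int = 20) -> bool:
    """Heuristic filter (assumed): reject long, non-alternating, or repetitive slices."""
    if len(fragment) > max_turns:
        return False  # overly long dialogue
    for (prev_role, _), (cur_role, _) in zip(fragment, fragment[1:]):
        if prev_role == cur_role:
            return False  # same speaker twice in a row, a rough proxy for role inversion
    seen = set()
    for role, text in fragment:
        key = (role, text.strip().lower())
        if key in seen:
            return False  # verbatim repeated generation
        seen.add(key)
    return True


def training_fragments(dialogue: List[Turn]) -> List[Tuple[List[Turn], str]]:
    """Return (context, doctor_response) pairs whose full slice passes the filter."""
    fragments = []
    for i, (role, text) in enumerate(dialogue):
        if role != "doctor" or i == 0:
            continue
        if is_usable(dialogue[: i + 1]):
            fragments.append((dialogue[:i], text))
    return fragments


dialogue = [
    ("patient", "I have a cough."),
    ("doctor", "How long has it lasted?"),
    ("patient", "Three days."),
    ("doctor", "Any fever?"),
    ("patient", "Three days."),  # the simulator repeats itself here
    ("doctor", "Any chest pain?"),
]
fragments = training_fragments(dialogue)
```

Each retained context would then be passed to the rubrics generator, and the paired doctor response scored against the resulting criteria; the last doctor turn is dropped here because its context contains a repeated simulator utterance.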

5. Empirical performance and benchmark profile

The paper’s primary evaluation centers on HealthBench, described as containing 5,000 realistic multi-turn conversations, 48,562 rubric criteria, and prompts written by 262 human doctors (Team et al., 2 Sep 2025). The headline quantitative result is 34.7 on HealthBench Hard, while GPT-5 is reported at 46.2. The paper further states that, when HealthBench Hard was released, no model scored above 32 and many leading models scored 0, and that Baichuan-M2 and GPT-5 are the only two models worldwide above 32.

The model is also reported to lead several HealthBench axes, all ranked 1st in the paper’s presentation: Emergency Referrals: 74.6, Context Awareness: 48.0, Context Seeking: 55.8, Communication: 68.6, Global Health: 57.1, and Completeness: 67.2. These dimensions are significant because they align closely with the paper’s stated objective of improving interactive consultation rather than only exam-style correctness.

A second evaluation setting uses 57 complex clinical cases from multidisciplinary treatment sessions in top-tier Chinese hospitals. These cases are described as authentic, complex, and averaging around 3,000 Chinese characters, with expert evaluation across Communication, Examination, Diagnosis, Treatment, and Safety. Against gpt-oss-120B, Baichuan-M2 receives 67% preference in Communication, 45% in Examination, 43% in Diagnosis, 37% in Treatment, and 34% in Safety. The paper attributes part of this advantage to alignment with the Chinese medical ecosystem and guidelines.

The model is also evaluated for retention of non-medical capability against Qwen3-32B (Thinking). Reported scores are 83.4 on AIME24 versus 81.4, 72.9 on AIME25 versus 72.9, 86.0 on IFEval versus 85.0, 77.6 on CF-Bench versus 75.7, 45.8 on Arena-Hard-V2.0 versus 44.5, 8.77 on AlignBench versus 8.72, and 8.56 on WritingBench versus 7.90. These results are used to support the claim that medical specialization did not erase general competence.
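The comparison can be tallied directly from the reported figures. The snippet below only restates the numbers quoted above; the dictionary layout and variable names are illustrative.

```python
# (Baichuan-M2, Qwen3-32B Thinking) scores as quoted in the text.
scores = {
    "AIME24": (83.4, 81.4),
    "AIME25": (72.9, 72.9),
    "IFEval": (86.0, 85.0),
    "CF-Bench": (77.6, 75.7),
    "Arena-Hard-V2.0": (45.8, 44.5),
    "AlignBench": (8.77, 8.72),
    "WritingBench": (8.56, 7.90),
}

wins = [name for name, (m2, qwen) in scores.items() if m2 > qwen]
ties = [name for name, (m2, qwen) in scores.items() if m2 == qwen]
losses = [name for name, (m2, qwen) in scores.items() if m2 < qwen]
```

On these figures Baichuan-M2 is ahead on six of the seven benchmarks and tied on AIME25, with the largest gap on WritingBench.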

The paper also provides a qualitative case on HealthBench Hard involving a 32-week pregnant patient with gestational diabetes whose fasting glucose is around 105 mg/dL on 16 units basal insulin, with ACOG recommending intensification above 95 mg/dL. In the reported comparison, Baichuan-M2 recommends a conservative adjustment, highlights avoiding hypoglycemia, includes fetal assessment, and notes collaboration with diabetes educators and dietary guidance, whereas gpt-oss-120B is described as less complete and less safe. The example is intended to illustrate how the verifier-trained model behaves in clinically consequential reasoning rather than only in answer retrieval.

6. Deployment characteristics, limitations, and subsequent development

Baichuan-M2 is explicitly presented as deployable at comparatively modest hardware scales for a 32B model. The paper reports quantization with W4A16 via AutoRound and W4A8 with Hadamard transform plus GPTQ, along with QQQ packing and FP8 E4M3 KV-cache quantization, and states compatibility with SGLang and vLLM (Team et al., 2 Sep 2025). On a single RTX 4090, the reported maximum sequence lengths are 9,982 for W4A16, 19,965 for W4A16-KV8, 10,566 for W4A8, and 21,133 for W4A8-KV8. For speculative decoding, the paper trains a lightweight draft model and uses Eagle-3 speculative sampling; on a single RTX 4090 with 4-bit quantization and a 4096-token prompt, it reports 73% draft-model prediction accuracy, 3.28 tokens average accepted length, 41.5 → 89.9 tokens/s throughput, and 2.17× speedup.
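A few back-of-the-envelope checks show how the reported deployment figures fit together. The calculation treats the model as exactly 32 billion parameters and ignores quantization scales, zero-points, and activation memory, so the weight footprint is a rough lower bound rather than a measured value. The 24 GB capacity of an RTX 4090 is general hardware knowledge, not a figure from the paper.

```python
NOMINAL_PARAMS = 32e9  # assumption: exactly 32 billion weights
RTX_4090_GB = 24       # card memory in GB (not stated in the source)

# 4-bit weights: 32e9 parameters * 4 bits / 8 bits per byte, in GB.
weight_gb = NOMINAL_PARAMS * 4 / 8 / 1e9
headroom_gb = RTX_4090_GB - weight_gb  # left for KV cache, activations, overhead

# 8-bit KV cache roughly doubles the reported maximum sequence length.
kv8_gain_w4a16 = 19965 / 9982
kv8_gain_w4a8 = 21133 / 10566

# Speculative decoding: reported throughput before and after, in tokens/s.
speedup = 89.9 / 41.5
```

The ratios are consistent with the text: halving KV-cache precision yields almost exactly twice the sequence length in both weight formats, and 89.9 / 41.5 rounds to the reported 2.17× speedup.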

The limitations are substantial and stated directly. The model may still hallucinate, show insufficient reasoning stability, and fail in edge cases. Performance is said to be far from saturated. The current version is not fully optimized for tool calling and external knowledge retrieval, which constrains its ability to access up-to-date guidelines or integrate external systems. Its multi-turn RL remains fragment-level rather than full-session. The paper emphasizes the use of de-identified or desensitized records, but broader issues of governance, consent, security, safety auditing, oversight, regulatory approval, and liability remain open.

Subsequent work clarifies both M2’s strengths and its limitations. The Baichuan-M3 report states that M2 mainly relies on rubric-based rewards to shape medical reasoning, and presents M3 as extending M2 with fact-aware verification, Dynamic Rubric Evolution, and a modified patient simulator for more stable long-horizon training (Team et al., 6 Feb 2026). In the M3 evaluation, Baichuan-M2-32B appears as a baseline with HealthBench score 60.1, Refuted Rate 5.73%, and Uncertain Rate 5.43%. The same report says M3 improves especially on context seeking and context awareness. This later comparison suggests that Baichuan-M2 established the verifier-centric medical RL framework, while also exposing the need for stronger hallucination control and more robust consultation workflows.

In that sense, Baichuan-M2 is best understood as a medical LLM whose main contribution lies less in its parameter count or benchmark score than in its operational claim that medical capability depends critically on the quality of the verifier system used during reinforcement learning.