Patient Simulator Systems
- Patient simulators are systems that replicate patient experiences, dialogue behaviors, and physiological state evolution for education and evaluation.
- They use diverse methodologies, including immersive mixed reality, LLM-driven standardized patients, and digital twins to enhance training and reduce patient anxiety.
- These systems enable scalable, repeatable, and controlled experimentation while addressing challenges in fidelity, multimodal realism, and standardized evaluation.
The term patient simulator denotes a family of systems that reproduce selected aspects of patient experience, behavior, or physiology for clinical preparation, medical education, model benchmarking, and treatment-policy experimentation. Recent arXiv literature uses the term for markedly different artifacts: immersive mixed-reality rehearsal for first-time CT patients; LLM-driven standardized patients for history taking, counseling, and OSCE-style practice; controllable, persona-grounded virtual patients for evaluating doctor models; and latent-state or multi-agent simulators that emulate physiological trajectories under treatment actions (Smith et al., 3 Oct 2025, Hicke et al., 1 Mar 2025, Kiani et al., 2019, Sabour et al., 12 Feb 2026).
1. Conceptual scope and major categories
Across current work, patient simulators are organized less by interface than by what is being simulated. Some systems simulate the patient’s experience of care, some the patient’s dialogue behavior, and some the patient’s state evolution under interventions. A recurrent design distinction is whether the simulator is intended for direct patient use, for training clinicians, or for evaluating other AI systems.
| Category | Core role | Representative systems |
|---|---|---|
| Preparatory patient simulator | Prepare real patients emotionally and physically for an upcoming procedure | MR CT preparation simulator (Smith et al., 3 Oct 2025) |
| Virtual standardized patient | Support communication-skills practice and formative feedback | MedSimAI (Hicke et al., 1 Mar 2025), CureFun (Li et al., 2024), CLiVR (Amithasagaran et al., 21 Oct 2025) |
| Persona-grounded virtual patient | Vary language, recall, confusion, or personality in controlled ways | PatientSim (Kyung et al., 23 May 2025), MSPRP / Ch-PatientSim (Jiang et al., 16 Jan 2026) |
| Evaluation testbed | Stress-test doctor LLMs, triage agents, or decision aids | SAPS/AIE (Liao et al., 2024), antidepressant-selection risk simulator (Shawon et al., 11 Feb 2026), triage Patient Simulator (Rashidian et al., 4 Jun 2025) |
| World model / digital twin | Simulate latent state transitions or organ-system trajectories over time | Sepsis World Model (Kiani et al., 2019), Organ-Agents (Chang et al., 20 Aug 2025) |
A common misconception is that patient simulators are merely chatbots that answer medical questions. The literature suggests a broader interpretation. Some systems emphasize role fidelity and selective disclosure; others emphasize physiological transition dynamics, environmental rehearsal, or population-level stress testing of healthcare agents. Another misconception is that patient simulators are always substitutes for human standardized patients. Several papers explicitly frame them as supplements rather than replacements, especially when nonverbal nuance, emotional labor, or bedside interaction remains central (Chu et al., 2024, Amithasagaran et al., 21 Oct 2025).
2. Patient-facing preparation and procedural rehearsal
A distinct branch of the literature uses the patient simulator not to train clinicians, but to prepare the actual patient for an imminent procedure. The mixed-reality CT preparation system in "Immersive Mixed Reality Simulator for CT Scan Preparation: Enhancing Patient Emotional and Physical Readiness" was designed for adult first-time CT patients, with the explicit aim of reducing anxiety and improving cooperation without relying on sedation. Its workflow includes baseline anxiety assessment, headset fitting, a guided CT room tour, relaxation training, breath-hold rehearsal, simulated scan with sound and motion cues, debrief, optional FAQs, and return to staff. The system runs on an Oculus Quest 2 standalone headset in Unity3D, uses gaze-based interaction only, stores scenario content in JSON, runs at about 72 fps, and delivers an experience of about 10 minutes. In a randomized pilot with 50 adult first-time CT patients, baseline STAI-State was similar (46.2 MR vs 45.5 control), but immediately before the scan the MR group scored 34.8 versus 41.6 for controls, a reduction from baseline of 11.4 points in the MR group versus 3.9 points in controls. The MR group also showed 22/25 = 88\% first-try breath-hold success versus 15/25 = 60\%, 80\% below the anxiety threshold of 40 versus 40\% in controls, no anxiolytic use versus 12\% low-dose lorazepam use in controls, and no report of simulator sickness (Smith et al., 3 Oct 2025).
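The pilot figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only numbers stated in the text; the variable and function names are illustrative and do not come from the paper.

```python
# Pilot figures as quoted in the text (Smith et al.); names are illustrative.
stai = {
    "MR": {"baseline": 46.2, "pre_scan": 34.8},
    "control": {"baseline": 45.5, "pre_scan": 41.6},
}
breath_hold_success = {"MR": (22, 25), "control": (15, 25)}


def stai_drop(group: str) -> float:
    """Point reduction in STAI-State from baseline to just before the scan."""
    scores = stai[group]
    return round(scores["baseline"] - scores["pre_scan"], 1)


def success_rate(group: str) -> float:
    """First-try breath-hold success as a percentage."""
    successes, total = breath_hold_success[group]
    return 100 * successes / total


for group in ("MR", "control"):
    print(f"{group}: STAI drop {stai_drop(group)} points, "
          f"breath-hold success {success_rate(group):.0f}%")
```

The sketch only restates the reported group means; it says nothing about statistical significance, which would require the per-patient data.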
The technical logic of this preparatory simulator is notable because it combines four mechanisms usually discussed separately in anxiety-reduction literature: information provision, guided relaxation, exposure/desensitization, and rehearsal of the required physical behavior. The patient sees a virtual CT suite with gantry, table, control-room window, lighting, and equipment; hears CT-like spatialized sounds such as gantry hum, table motor noise, scan initiation beeps, and breath-hold prompts; practices a 10-second breath-hold with a visible countdown timer; and receives nonjudgmental reassurance through a technologist avatar. The stated purpose is that the patient should feel like they have “already done” the procedure once.
This suggests an important conceptual expansion of the field. In this formulation, a patient simulator need not simulate pathology at all. It can simulate the care pathway itself, turning environmental familiarity, bodily compliance, and anticipatory affect into the primary targets of the intervention.
3. Standardized patients for communication, counseling, and immersive training
The best-developed branch of recent work treats the patient simulator as a virtual standardized patient for repeated communication practice. MedSimAI is an AI-powered platform for pre-clerkship medical learners preparing for formative OSCE-style assessments. Students choose a case from an AI-standardized patient library, receive a “door note” with chief complaint and vital signs, and then interview the AI-SP by chat or voice. Chat uses GPT-4o; voice uses OpenAI’s Realtime API with built-in transcription. After the encounter, the system evaluates the transcript using GPT-4o with rubric-based feedback centered on the Master Interview Rating Scale, described here as covering 28 communication competencies on a 5-point scale. In a pilot with 104 first-year medical students, all students completed at least one conversation; mean conversation length was about 19.9 minutes, with 38.7 dialogue turns on average, about 609 words from the student and about 998 words from the AI-SP. Among 28 survey respondents, 78\% most valued focused history-taking practice, 62\% question phrasing practice, and 53\% the automated feedback, while the paper also notes that the Learning Hub’s self-regulated learning tools were underused and usage rose mainly right before the OSCE (Hicke et al., 1 Mar 2025).
Other systems push beyond text into multimodal role-play. "Synthetic Patients: Simulating Difficult Conversations with Multimodal Generative AI for Medical Education" models telehealth-style encounters for palliative care, goals-of-care, and end-of-life discussions. It combines GPT-4 dialogue with generated patient imagery, cloned voices, and lip-synced video inside a custom Python-based web app. The interaction pipeline is explicitly audio record → WhisperAPI transcription → OpenAI inference API → ElevenLabs voice generation → lip-sync engine → synchronized audiovisual clip. The reported direct development cost is about \$150, with ongoing hosting costs of roughly \$500–\$2000/month, but the authors emphasize substantial labor cost and serious latency constraints: the best-quality lip-syncing option used in demonstration took 10–20 minutes, and the open-source real-time lip-sync approach still required 20–30 seconds per response (Chu et al., 2024).
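The sequential structure of that interaction pipeline can be sketched as a chain of stages. In this minimal illustration every function is a hypothetical placeholder that only tags its input; none of them calls Whisper, OpenAI, ElevenLabs, or a lip-sync engine, and the names are assumptions made for the example.

```python
from typing import Callable, List


# Each stage is a placeholder for an external service named in the text.
# No real API is called; the stages only wrap the payload in a label.
def transcribe(audio: str) -> str:
    """Stands in for speech-to-text transcription."""
    return f"text({audio})"


def generate_reply(text: str) -> str:
    """Stands in for the LLM inference call that produces the patient reply."""
    return f"reply({text})"


def synthesize_voice(reply: str) -> str:
    """Stands in for cloned-voice generation."""
    return f"speech({reply})"


def lip_sync(speech: str) -> str:
    """Stands in for the lip-sync engine that renders the final clip."""
    return f"clip({speech})"


PIPELINE: List[Callable[[str], str]] = [
    transcribe,
    generate_reply,
    synthesize_voice,
    lip_sync,
]


def run_turn(audio: str) -> str:
    """Pass one learner utterance through every stage in order."""
    payload = audio
    for stage in PIPELINE:
        payload = stage(payload)
    return payload
```

Because the stages run strictly one after another, per-turn latency is the sum of the stage latencies, which is consistent with the lip-sync step dominating the delays the authors report.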
VR-based conversational patients extend this direction into embodied interaction. CLiVR is a client-server system built in Unity and deployed on the Meta Quest 3, using Ready Player Me avatars, uLipSync for MFCC-based lip sync, OpenAI Whisper for transcription, an LLM such as Gemini 2.0 Flash for response generation, and Amazon Polly neural voices for playback. The reported mean round-trip delay is about 1.35 seconds per turn, broken down into 0.14 s speech recognition, 0.56 s LLM generation, 0.24 s TTS, and 0.30 s sentiment inference. In an IRB-approved mixed-methods faculty study, 18 volunteers consented, 15 completed sessions, and 13 completed the post-session survey; 12 of 13 agreed that integrating LLMs with VR would be beneficial for simulating patient-doctor interactions, while the item about replacing standardized patients had a mean of 2.38/5, reinforcing the paper’s supplement-not-substitute framing (Amithasagaran et al., 21 Oct 2025).
A further development is experience accumulation rather than one-shot prompting. EvoPatient frames simulated standardized patients as products of agent coevolution: doctor agents and a patient agent interact in multi-turn dialogues, validated trajectories are stored, and two libraries—an Attention Library and a Trajectories Library—are reused to improve future questions and answers without weight updates. The reported patient answer scores are Relevance 0.7589, Faithfulness 0.8786, Robustness 0.9412, and Ability 0.8597, with the paper claiming more than 10\% improvement in requirement alignment over existing reasoning methods and an effective balance of resource consumption after evolving over 200 cases for 10 hours (Du et al., 2024).
4. Persona grounding, controllability, and structured memory
A major methodological trend is the move from generic “act like a patient” prompting toward simulators with explicit personas, retrieval, memory, and control modules. CureFun exemplifies this shift. It converts standardized-patient scripts into a structured case graph containing entities, relations, attributes, and attribute values, then uses the four-step ERRG pipeline (Extract, Retrieve, Rewrite, Generate) to produce patient responses grounded in that graph. The system is described as model-agnostic and was tested with GPT-3.5-turbo, PaLM, ERNIE-4 / ERNIE-Bot-4, Mixtral-8x7B, Qwen-72B, and Llama-based checkpoints. In expert Chatbot Arena-style comparisons limited to 20 rounds, CureFun improved every backbone model; for example, GPT-3.5-turbo rose from 1403.54 to 1653.72 in B-ELO. Automated dialogue scoring over 80 records across 8 cases was compared against human evaluators using average Spearman and Pearson correlations (Li et al., 2024).
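The four ERRG steps can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not CureFun's implementation: the case graph contents are invented, entity matching is plain keyword lookup, and the Generate step uses a string template where a real system would call an LLM.

```python
from typing import Dict, List, Tuple

# Toy case graph: entity -> attribute -> value. Contents are invented.
CASE_GRAPH: Dict[str, Dict[str, str]] = {
    "cough": {"duration": "two weeks", "character": "dry"},
    "fever": {"onset": "three days ago", "peak": "38.5 C"},
}

Fact = Tuple[str, str, str]


def extract(question: str) -> List[str]:
    """Extract: find graph entities mentioned in the learner's question."""
    lowered = question.lower()
    return [entity for entity in CASE_GRAPH if entity in lowered]


def retrieve(entities: List[str]) -> List[Fact]:
    """Retrieve: collect (entity, attribute, value) triples for those entities."""
    return [(e, a, v) for e in entities for a, v in CASE_GRAPH[e].items()]


def rewrite(facts: List[Fact]) -> List[str]:
    """Rewrite: turn triples into short first-person statements."""
    return [f"my {e}'s {a} is {v}" for e, a, v in facts]


def generate(statements: List[str]) -> str:
    """Generate: compose the reply; a real system would prompt an LLM here."""
    if not statements:
        return "I'm not sure what you mean."
    return "Well, " + "; ".join(statements) + "."


def answer(question: str) -> str:
    """Run the four steps in order for one question."""
    return generate(rewrite(retrieve(extract(question))))
```

The fallback reply for unmatched questions reflects the grounding idea: the simulated patient only discloses facts that exist in the case graph.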
PatientSim pushes controllability further through a persona space defined on four axes: personality, language proficiency, medical history recall level, and cognitive confusion level. Clinical profiles are built from MIMIC-IV, MIMIC-IV-ED, and MIMIC-IV-Note, yielding 170 patient profiles with 24 items each. The personality axis includes neutral, distrustful, impatient, overanxious, overly positive, and verbose; language proficiency is mapped to CEFR levels A, B, and C; recall is high or low; confusion is normal or high. Although the naive cross-product is 6 × 3 × 2 × 2 = 72, the paper restricts the “high confusion” case to a single special configuration, giving 37 distinct personas (36 normal-confusion combinations plus one). The top model, Llama 3.3 70B, was validated by four clinicians, with an average overall quality score of 3.89 / 4, plausibility 3.91 / 4, and educational-use rating 3.75 / 4 (Kyung et al., 23 May 2025).
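The persona count can be reproduced by enumerating the four axes. In this minimal sketch the axis values come from the text, while the specific axis values assigned to the single high-confusion persona are an assumption made for illustration.

```python
from itertools import product

PERSONALITIES = ["neutral", "distrustful", "impatient", "overanxious",
                 "overly positive", "verbose"]
LANGUAGE_LEVELS = ["A", "B", "C"]
RECALL_LEVELS = ["high", "low"]
CONFUSION_LEVELS = ["normal", "high"]

# Naive cross-product over all four axes.
naive = list(product(PERSONALITIES, LANGUAGE_LEVELS, RECALL_LEVELS, CONFUSION_LEVELS))

# Keep every combination with normal confusion ...
personas = [p for p in naive if p[3] == "normal"]
# ... and add one special high-confusion configuration.
# Which values the other axes take here is an assumption of this sketch.
personas.append(("neutral", "C", "high", "high"))

print(len(naive), len(personas))  # prints: 72 37
```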
French OSCE simulation introduces another form of control by deriving dialogue structure from assessment criteria. The system uses up to four distinct LLM instances for retrieval, generation, control, and correction, grounded in a physician sheet, patient sheet, and evaluator sheet. The evaluation checklist is transformed into the OIAP phases—Opening and preparation, Information collection, Assessment, Plan and conclusion—and generation proceeds in batches of up to four criteria with up to three dialogue turns per batch. Responses with controller score above 8/10 are accepted; lower-scoring responses enter a correction loop. On the reported experiments, the reflection loop was triggered in only 3.6\% of patient turns, and when triggered, correction improved the controller score in 79.5\% of cases (Bonzi et al., 26 Jun 2026).
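The accept-or-correct logic described above can be sketched as a small control loop. The acceptance threshold and batch size are taken from the text; the retry limit `MAX_CORRECTIONS`, the function names, and the toy scoring and correction functions are assumptions standing in for the controller and corrector LLM instances.

```python
from typing import Callable, List

ACCEPT_THRESHOLD = 8   # responses scoring above 8/10 are accepted (from the text)
BATCH_SIZE = 4         # up to four criteria per generation batch (from the text)
MAX_CORRECTIONS = 2    # assumption: the text does not state a retry limit


def batches(criteria: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Split checklist criteria into consecutive batches of at most `size`."""
    return [criteria[i:i + size] for i in range(0, len(criteria), size)]


def controlled_reply(
    draft: str,
    score: Callable[[str], float],
    correct: Callable[[str], str],
) -> str:
    """Return the draft if the controller accepts it, else run the correction loop."""
    reply = draft
    for _ in range(MAX_CORRECTIONS):
        if score(reply) > ACCEPT_THRESHOLD:
            return reply
        reply = correct(reply)
    return reply  # best effort after the final correction


# Toy stand-ins for the controller and corrector.
def toy_score(reply: str) -> float:
    return 9.0 if "chest pain" in reply else 5.0


def toy_correct(reply: str) -> str:
    return reply + " I also have chest pain."
```

For example, `controlled_reply("I feel fine.", toy_score, toy_correct)` enters the correction loop once, whereas a draft that already mentions chest pain is accepted unchanged.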
MSPRP and AIPatient represent two complementary solutions to the same control problem. MSPRP decomposes patient role-play into three stages—Basic Information Generation, Communication Style Injection, and Expression Consistency Regulation—over a five-dimensional persona vector of Personality, Emotion, Medical History Recall, Medical Comprehension, and Language Fluency. On Ch-PatientSim, the full Stage 1 + Stage 2 + Stage 3 order improved baseline Qwen2.5-7B from 3.748 to 3.905 in Persona Consistency and from 3.824 to 3.942 in Naturalness, while Qwen2.5-72B + MSPRP reached 3.939 Persona Consistency and 3.970 Naturalness (Jiang et al., 16 Jan 2026). AIPatient instead uses an EHR-grounded Knowledge Graph built from 1,500 patient-admission records, 15,441 nodes, and 26,882 edges, together with six LLM-powered agents—Retrieval, KG Query Generation, Abstraction, Checker, Rewrite, Summarization—to answer as a patient with an overall QA accuracy of 94.15\%, knowledge-base validity F1 = 0.89, median Flesch Reading Ease 77.23, and median Flesch-Kincaid Grade 5.6 (Yu et al., 2024).
5. Patient simulators as evaluation environments and risk probes
Patient simulators are increasingly used not to train humans, but to evaluate other AI systems under controlled conversational variation. SAPS, the State-Aware Patient Simulator in the AIE framework, models 10 categories of states/actions: Initialization, effective/ineffective/ambiguous inquiry, effective/ineffective/ambiguous advice, Demand, Other Topics, and Conclusion. Its architecture combines a state tracker, memory bank, and response generator so that doctor messages are classified before the simulator retrieves long-term patient information, state-specific requirements, and dialogue history. The paper reports a test set based on 50 real hospital cases, 10 turns of dialogue each, and 4000 test questions, concluding that SAPS performs closer to humans than alternative patient simulators and enables clinically meaningful distinctions among doctor LLMs (Liao et al., 2024).
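The state-aware design can be sketched as a tracker that classifies each doctor message before a reply is produced from memory. This is a minimal illustration, not the SAPS implementation: the ten state names follow the text, but the keyword-based tracker, the `MemoryBank` fields, and the templated reply are assumptions that stand in for the paper's components.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    INITIALIZATION = "initialization"
    EFFECTIVE_INQUIRY = "effective inquiry"
    INEFFECTIVE_INQUIRY = "ineffective inquiry"
    AMBIGUOUS_INQUIRY = "ambiguous inquiry"
    EFFECTIVE_ADVICE = "effective advice"
    INEFFECTIVE_ADVICE = "ineffective advice"
    AMBIGUOUS_ADVICE = "ambiguous advice"
    DEMAND = "demand"
    OTHER_TOPICS = "other topics"
    CONCLUSION = "conclusion"


@dataclass
class MemoryBank:
    patient_info: Dict[str, str]                      # long-term patient information
    requirements: Dict[State, str]                    # state-specific response rules
    history: List[str] = field(default_factory=list)  # running dialogue history


def track_state(doctor_msg: str, memory: MemoryBank) -> State:
    """Toy keyword tracker covering a few states; a stand-in for the real tracker."""
    text = doctor_msg.lower()
    if not memory.history:
        return State.INITIALIZATION
    if "goodbye" in text:
        return State.CONCLUSION
    if text.endswith("?"):
        known = any(key in text for key in memory.patient_info)
        return State.EFFECTIVE_INQUIRY if known else State.INEFFECTIVE_INQUIRY
    return State.OTHER_TOPICS


def respond(doctor_msg: str, memory: MemoryBank) -> str:
    """Classify the message, look up facts and rules, then produce a reply."""
    state = track_state(doctor_msg, memory)
    rule = memory.requirements.get(state, "Reply briefly and stay in character.")
    facts = [v for k, v in memory.patient_info.items() if k in doctor_msg.lower()]
    memory.history.append(doctor_msg)
    # A real response generator would prompt an LLM with rule, facts, and history.
    return f"[{state.value}] {' '.join(facts) or rule}"
```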
The triage-focused Patient Simulator in "AI Agents for Conversational Patient Triage" uses deidentified HealthVerity records to derive 519 encounters from 21,779 deidentified records spanning May 1, 2021 to April 30, 2024, with balanced coverage over nine symptom categories after excluding Psychological cases. Two clinicians with a combined ~50 years of experience reviewed the simulations. The simulator was judged consistent with the vignettes in 97.7\% of cases, and the extracted case summary was reported as 99\% relevant in the abstract and 99.2\% in the results section. Clinicians also judged that the most likely diagnosis was among the top three proposed diagnoses in 95.4\% (495/519) and 94.8\% (492/519) of cases for the two reviewers, with Cohen’s κ values of 0.79, 0.74, and 0.72 for model-physician and physician-physician agreement (Rashidian et al., 4 Jun 2025).
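Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance, computed as (p_o - p_e) / (1 - p_e). The following minimal sketch implements that standard formula; the reviewer labels are invented for illustration and are not data from the paper, while the final line re-derives the two top-three percentages quoted above.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Invented labels: was the most likely diagnosis among the top three?
reviewer_1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
reviewer_2 = ["yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "yes"]
kappa = cohens_kappa(reviewer_1, reviewer_2)

# Cross-check of the quoted top-three rates out of 519 encounters.
top3_rates = [round(100 * hits / 519, 1) for hits in (495, 492)]
```

In the invented example the raters agree on 8 of 10 items, but because both say "yes" 70\% of the time, chance agreement is 0.58 and kappa is noticeably lower than the raw 0.8.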
Risk-oriented simulation introduces systematic patient variation as an auditing instrument. The antidepressant-selection simulator grounded in the NIST AI Risk Management Framework combines three orthogonal profile dimensions: medical profiles from the All of Us Research Program Registered Tier v8, linguistic profiles varying health literacy and condition-specific language, and behavioral profiles such as Structured & Cooperative, Distracted & Unfocused, and Adversarial & Combative. Across 500 conversations, human annotators assessed 1,787 medical concepts in 100 of them, obtaining F1 = 0.94, while the LLM judge also achieved F1 = 0.94, assessed with a paired bootstrap. The central result is a monotonic degradation in rank-one concept retrieval accuracy across the health literacy spectrum: 47.9\% for Limited, 69.1\% for Functional, and 81.6\% for Proficient (Shawon et al., 11 Feb 2026).
The same logic has been extended beyond doctor-patient interviews to caregiving interactions. The dementia ADL simulator uses gpt-5-mini to generate multi-turn behavior conditioned on dementia severity, care setting, time in setting, and ADL task, while experts rate each turn on a 1–5 realism scale and respond as caregivers through free text or one of four strategy-scaffolded suggestions: Recognition, Negotiation, Facilitation, and Validation. In an IRB-approved formative study with 14 dementia-care experts, 18 sessions, and 112 rated turns, custom responses accounted for 54.5\% of turns, and critique analysis yielded a six-category failure-mode taxonomy led by Task/ADL grounding error, which represented 9 of 20 commented turns, or 45\% (Gangaraju et al., 6 Mar 2026).
6. World models, multi-agent clinical ecosystems, and digital twins
A different meaning of patient simulator appears in reinforcement learning and physiological modeling, where the simulator must predict how patient state evolves after interventions. "Sepsis World Model" builds an OpenAI Gym-compatible simulator from MIMIC sepsis trajectories using a Variational Auto-Encoder and an MDN-RNN. The observed state is a 46-dimensional normalized feature vector, the latent size is 30 dimensions, and the action space comprises 25 possible treatment actions based on vasopressor and IV fluid dosage quantiles. The underlying objective is to model the stochastic transition from the current latent state and treatment action to the next latent state, rather than transitions in raw EHR space. The VAE was trained for 20 epochs and achieved final reconstruction loss 0.0791; the paper reports that MDN-based rollouts were closer to real trajectories than plain RNN rollouts and were usable as environments for Deep Q-Learning (Kiani et al., 2019).
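The environment interface implied by this design can be sketched without any learned model. In this minimal illustration the latent size and action count follow the text, while the 5 × 5 split of the 25 actions, the random toy dynamics that replace the MDN-RNN, and the placeholder reward are all assumptions; the class only mimics a Gym-style `reset`/`step` interface and does not depend on the Gym library.

```python
import random
from typing import List, Tuple

LATENT_DIM = 30     # latent size reported in the text
N_ACTIONS = 25      # treatment actions reported in the text
BINS_PER_DRUG = 5   # assumption: 25 = 5 IV-fluid bins x 5 vasopressor bins


def decode_action(action: int) -> Tuple[int, int]:
    """Map a flat action index to (iv_fluid_bin, vasopressor_bin)."""
    if not 0 <= action < N_ACTIONS:
        raise ValueError("action out of range")
    return divmod(action, BINS_PER_DRUG)


class ToyLatentSepsisEnv:
    """Gym-style reset/step interface over a latent patient state.

    The transition is random toy dynamics standing in for a learned MDN-RNN;
    the reward and termination rule are placeholders.
    """

    def __init__(self, horizon: int = 20, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0
        self.z: List[float] = [0.0] * LATENT_DIM

    def reset(self) -> List[float]:
        self.t = 0
        self.z = [self.rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        return list(self.z)

    def step(self, action: int) -> Tuple[List[float], float, bool, dict]:
        iv_bin, vaso_bin = decode_action(action)
        drift = 0.01 * (iv_bin + vaso_bin)
        # Stochastic transition: next latent depends on current latent and action.
        self.z = [0.95 * z + drift + self.rng.gauss(0.0, 0.1) for z in self.z]
        self.t += 1
        done = self.t >= self.horizon
        reward = 0.0  # placeholder; no reward definition is given in the text
        return list(self.z), reward, done, {"iv_bin": iv_bin, "vaso_bin": vaso_bin}
```

An agent such as a Deep Q-Learning policy would interact with an environment of this shape by calling `reset()` once and then `step(action)` until `done` is true.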
Organ-Agents generalizes this state-transition view into a multi-agent virtual physiology simulator. It decomposes human physiology into nine interacting organ/system agents—Respiratory, Blood, Coagulation, Immune, Nervous, Cardiovascular, Hepatic, Renal, and Metabolic/endocrine—supported by an Analyzer, Correlator, and Compensator. The system is trained on 7,134 sepsis patients and 7,895 matched controls, with external validation on 22,689 ICU patients from two hospitals. The abstract reports high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16, and the details section gives a system-wide average MSE = 0.12. Pathway simulation accuracy is reported as 0.86 for hypotension, 0.79 for hyperlactatemia, and 0.84 for hypoxemia, with mean trigger time deviation below 1.9 hours and clinician realism ratings around 3.8–4.2 on a 5-point Likert scale (Chang et al., 20 Aug 2025).
MedAgentSim sits between dialogue simulation and physiological environment modeling. It couples a Patient agent, Doctor agent, and Measurement agent so diagnosis emerges through multi-turn conversation and selective ordering of tests such as temperature, blood pressure, ECG, blood tests, X-rays, and MRI. The patient can operate in Generation Mode or Dataset Mode, the doctor begins with no full report, and the measurement agent only reveals results when requested. The framework also maintains a Medical Records Buffer and an Experience Records Buffer, uses KNN retrieval over CLIP embeddings, and applies multi-agent discussion, chain-of-thought reasoning, and majority-vote ensembling. This suggests a broader definition of patient simulation as part of a hospital-like interactive ecosystem rather than a standalone patient responder (Almansoori et al., 28 Mar 2025).
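Two of the mechanisms described above, on-request disclosure of test results and majority-vote ensembling, are simple enough to sketch directly. This is an illustrative sketch, not MedAgentSim's code: the class name mirrors the agent role in the text, but its methods, the example results, and the tie-breaking behavior are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


class MeasurementAgent:
    """Holds test results and reveals one only when the doctor requests it."""

    def __init__(self, results: Dict[str, str]) -> None:
        self._results = results
        self.requested: List[str] = []  # audit trail of ordered tests

    def request(self, test_name: str) -> Optional[str]:
        """Record the order and return the result, or None if unavailable."""
        self.requested.append(test_name)
        return self._results.get(test_name)


def majority_vote(diagnoses: List[str]) -> str:
    """Return the most common diagnosis; ties go to the earliest proposed one."""
    if not diagnoses:
        raise ValueError("need at least one diagnosis")
    return Counter(diagnoses).most_common(1)[0][0]


# Invented example values for illustration.
agent = MeasurementAgent({"ECG": "ST elevation", "temperature": "37.1 C"})
ecg = agent.request("ECG")
mri = agent.request("MRI")
final = majority_vote(["myocardial infarction", "angina", "myocardial infarction"])
```

Keeping a log of requested tests makes it possible to evaluate not only the final diagnosis but also how selectively the doctor agent ordered measurements.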
A plausible implication is that the field is converging on two partly separate technical traditions. One tradition models dialogue realism and pedagogical value; the other models state realism and counterfactual trajectory generation. Current papers rarely unify both at high fidelity.
7. Evaluation frameworks, recurrent limitations, and emerging directions
As the number of simulators has increased, evaluation itself has become a central research problem. PatientHub addresses fragmentation by standardizing the definition, composition, and deployment of simulated patients. It separates experiments into reusable abstractions for clients, generators, event graphs, and evaluators; supports binary, scalar, categorical, and extraction-based judgments; and is implemented in Python with Hydra, Burr, JSON characters, YAML/Jinja prompts, LiteLLM, and Pydantic. The paper reports implementations of 11 representative patient simulators, demonstrates synthetic persona generation from 20 ESC conversations, and benchmarks methods under a shared therapy session event with a maximum of 15 turns and a moderator warning at the 13th turn. The reported benchmark covers 20 synthetic profiles and 280 conversation sessions, but the paper also notes that all metrics rely on LLM-as-a-judge using GPT-4o and that human clinician validation remains necessary (Sabour et al., 12 Feb 2026).
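The shared session event with a turn cap and a late moderator warning can be sketched as a simple loop. This is a hypothetical illustration and does not use PatientHub's API: the two constants come from the text, while the function signature, the warning wording, and the choice to count one therapist-patient exchange as one turn are assumptions.

```python
from typing import Callable, List, Tuple

MAX_TURNS = 15   # maximum turns in the shared therapy-session event (from the text)
WARN_TURN = 13   # the moderator warning is issued at this turn (from the text)


def run_session(
    therapist: Callable[[str], str],
    patient: Callable[[str], str],
    opening: str = "Hello, what brings you in today?",
) -> List[Tuple[int, str, str]]:
    """Alternate therapist and patient messages up to the turn cap.

    One therapist-patient exchange counts as one turn (an assumption).
    """
    transcript: List[Tuple[int, str, str]] = []
    therapist_msg = opening
    for turn in range(1, MAX_TURNS + 1):
        if turn == WARN_TURN:
            transcript.append((turn, "moderator", "The session is about to end."))
        patient_msg = patient(therapist_msg)
        transcript.append((turn, "therapist", therapist_msg))
        transcript.append((turn, "patient", patient_msg))
        therapist_msg = therapist(patient_msg)
    return transcript
```

Fixing the event structure in this way is what lets different simulators be compared under identical conversational conditions.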
PSI-Bench makes a related point for depression simulators: realism is not a single score. Its framework evaluates turn-level response length, lexical diversity, and linguistic markers of depression; dialogue-level Narrative-Emotion Processes and emotion expression over time; and population-level variability. On simulated conversations benchmarked against the Eeyore real-patient dataset, the paper finds that simulators produce responses far longer than real patients (about 64 to 319 words/message versus about 18 words/message) and tend to be uniformly too lexically diverse, emotionally over-resolved, and insufficiently variable across the population. A human study with 20 mental health experts showed strong alignment with benchmark judgments for pairwise realism comparisons, NEP classification, and emotion classification (Hoang et al., 28 Apr 2026).
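Two of the turn-level measures, response length and lexical diversity, can be approximated with very little code. The sketch below is an assumption-laden illustration rather than PSI-Bench's implementation: it uses a simple regular-expression tokenizer and the type-token ratio as one common proxy for lexical diversity, which may differ from the metric the paper uses.

```python
import re
from statistics import mean
from typing import Dict, List


def tokenize(message: str) -> List[str]:
    """Lowercase word tokens; a deliberately simple tokenizer for the sketch."""
    return re.findall(r"[a-z']+", message.lower())


def turn_metrics(message: str) -> Dict[str, float]:
    """Response length in words and type-token ratio for one message."""
    tokens = tokenize(message)
    ttr = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {"words": float(len(tokens)), "type_token_ratio": ttr}


def mean_words(messages: List[str]) -> float:
    """Average words per message across a conversation."""
    return mean(turn_metrics(m)["words"] for m in messages)
```

Type-token ratio falls as messages get longer, so comparisons between simulated and real patients are most meaningful when message lengths are similar or the metric is length-normalized.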
Recurring limitations are strikingly consistent across the literature. Many studies are single-center pilots, rely on small expert or learner samples, or provide only short-horizon validation. Several systems remain text-only or lack robust nonverbal realism; multimodal systems still report latency, cloning artifacts, or lip-sync distortion; evaluation pipelines often depend heavily on LLM judges; and controllability or realism can degrade under distribution shift, adversarial questioning, or low-health-literacy language (Chu et al., 2024, Shawon et al., 11 Feb 2026, Chang et al., 20 Aug 2025). Patient-facing procedural simulators additionally face localization, workflow-integration, and cost-effectiveness questions, while physiological simulators still lack ground-truth optimal policies or prospective causal validation (Smith et al., 3 Oct 2025, Kiani et al., 2019).
The current literature therefore supports a restrained conclusion. Patient simulators have become a broad technical class spanning mixed reality, structured role-play, graph-grounded retrieval, multi-agent coevolution, risk auditing, and digital-twin-like physiology emulation. Their practical value lies in scalability, repeatability, controlled variation, and privacy-preserving experimentation. Their unresolved problem is fidelity: not only whether a simulator is fluent or accurate on average, but whether it behaves like the right kind of patient, in the right context, with the right variation, at the right time.