Synthetic Electronic Health Records (EHRs)

Updated 10 January 2026
  • Synthetic EHRs are algorithmically created patient records that replicate real-world statistical, temporal, and clinical features while safeguarding privacy.
  • Advanced generative methods—including GANs, VAEs, transformers, and diffusion models—balance data fidelity, utility, and privacy risks.
  • Evaluation frameworks use metrics like distributional similarity, predictive utility, and formal privacy measures to guide model selection and development.

Synthetic Electronic Health Records (EHRs) are algorithmically generated representations of patient-level healthcare data that closely mimic the statistical, temporal, and clinical properties of real-world EHRs without reproducing any individual’s actual record. Synthetic EHRs address data-access barriers stemming from privacy regulations, enabling method development, benchmarking, and data sharing for both machine learning and clinical research, provided they maintain high fidelity to real data while posing minimal privacy risk (Yan et al., 2022, Chen et al., 2024). Approaches to synthetic EHR generation have evolved from rule-based procedural simulators to state-of-the-art deep generative models, including GANs, VAEs, autoregressive Transformers, and (increasingly) diffusion models with discrete or continuous architectures. Modern evaluation frameworks emphasize not only data fidelity and downstream utility, but also formal privacy risk metrics and computational feasibility.

1. Generative Methodologies and Architectures

Synthetic EHR generation encompasses a spectrum of methodologies, each with specific strengths and limitations regarding fidelity, scalability, temporal modeling, and privacy risk:

  • Rule-Based Simulation: Early synthetic EHR generators, such as Synthea and Plasmode, encode expert-driven state machines or epidemiological models to generate patient histories with realistic demographic and disease-event distributions. These approaches excel in privacy protection (as no real data is used) but cannot reproduce the detailed multivariate dependencies or rare co-occurrence patterns present in actual EHRs (Chen et al., 2024).
  • Generative Adversarial Networks (GANs): MedGAN, EMR-WGAN, CorGAN, and related architectures learn to mimic binary or multi-hot representations of patient records via adversarial training of generator and discriminator models. Variants incorporating autoencoding (MedGAN) or correlation-matching (CorGAN) improve marginal and pairwise fidelity (Chen et al., 2024). GANs can be extended to support differential privacy (DP) via DP-SGD (as in DPGAN) (Chin-Cheong et al., 2020), or to federated settings for cross-institutional collaboration (Weldon et al., 2021). GAN methodologies are prevalent for static binary/categorical data but are susceptible to mode collapse and struggle with high-dimensional or sequential datasets (Sun et al., 2023, Yan et al., 2022).
  • Variational Autoencoders (VAEs) and Conditional VAEs: VAEs provide a probabilistic latent representation of EHRs, optimizing an evidence lower bound (ELBO) and supporting both unconditional (EVA) and conditional (disease-specific) generation (Biswal et al., 2020, Muller et al., 2022). VAEs typically outperform GANs in absolute uniqueness (low risk of record duplication) but may produce samples less tightly matching the high-probability regions of the data, especially for rare events.
  • Autoregressive Transformer Models: PromptEHR and CEHR-GPT frame longitudinal EHR synthesis as a next-token prediction task over tokenized sequences of clinical events, concepts, and time gaps, leveraging deep self-attention to capture long-range dependencies (Theodorou et al., 2023, Karami et al., 2024, Pang et al., 2024, Pang et al., 3 Sep 2025). Incorporation of time-token frameworks and auxiliary objectives (Time Decomposition, Time-To-Event) enables precise modeling of patient timelines, visit granularities, and complex cohort definitions (Pang et al., 3 Sep 2025). Transformer-based models can natively support multimodality and vocabulary adaptation.
  • Diffusion Models: Denoising diffusion probabilistic models (DDPMs) and their discrete (EHR-D3PM), tabular (TabDDPM), and predictive variants (EHRPD) have demonstrated exceptional fidelity, stable training, and robust privacy properties for both static and temporal EHR synthesis (Ceritli et al., 2023, Han et al., 2024, Tian et al., 2023, Zhong et al., 2024). These models combine Gaussian and multinomial diffusion processes, supporting both mixed-type and high-dimensional tabular data. Notably, conditional sampling, class-guidance, and accelerated samplers permit controllable and efficient generation. Diffusion approaches are increasingly state-of-the-art, particularly for time-series generation and imputation (Tian et al., 2023, Zhong et al., 2024).
  • Hybrid and Multi-Modal Models: Recent progress integrates structured/temporal EHR generation with unstructured text synthesis (narrative notes, reports) using multi-generator deliberation—such as MSIC, which models both latent health states and generates per-visit clinical narratives via transformer-based text modules (Sun et al., 2023, Lee, 2018). RawMed employs minimal preprocessing and residual quantization for multi-table, longitudinal EHRs, demonstrating strong performance across diverse data types (Cho et al., 9 Jul 2025).
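The discrete diffusion process underlying models such as EHR-D3PM can be illustrated with a minimal forward-noising step over a binary code vector: with probability ᾱ_t each bit keeps its value, otherwise it is resampled from a uniform prior, which a learned reverse model later denoises. The sketch below is a simplified illustration under a uniform transition kernel; all names are ours, not from any cited implementation.

```python
import math
import random

def forward_noise(x0, t, betas, rng):
    """Forward (noising) step of a uniform-transition discrete diffusion
    process over a binary code vector.

    With probability alpha_bar_t = prod_{s<=t}(1 - beta_s), each bit keeps
    its original value; otherwise it is resampled uniformly from {0, 1}.
    """
    alpha_bar = math.prod(1.0 - b for b in betas[: t + 1])
    noised = []
    for bit in x0:
        if rng.random() < alpha_bar:
            noised.append(bit)                 # keep the original code bit
        else:
            noised.append(rng.randint(0, 1))   # resample from uniform prior
    return noised
```

With all betas at zero the record passes through unchanged; as t grows, the vector converges to uniform noise, the starting point for reverse-process sampling.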

2. Data Representation and Preprocessing

The efficacy of synthetic EHR models depends critically on data representation, encoding strategy, and preprocessing discipline:

  • Tabular Binary/Categorical: Most GAN and VAE models operate on one-hot, multi-hot, or binary vectors aggregated over diagnosis/procedure/medication codes, demographic fields, and (optionally) time-binned events (Yan et al., 2022, Karami et al., 2024). High-dimensional sparsity is addressed via feature grouping or low-frequency code filtering.
  • Time Series and Sequential Encoding: For longitudinal and time-series EHR, patient histories are encoded as sequences of visits (collections of codes/events) with explicit time-gap tokens or age/interval features (Theodorou et al., 2023, Tian et al., 2023, Pang et al., 3 Sep 2025). Modern architectures include imputation and missingness encoding (Tian et al., 2023).
  • Tokenization: Transformer-based models employ explicit tokenization for discrete codes, numerical bins (for quantized labs/vitals), and irregular time gaps (as in SynEHRgy, CEHR-GPT) (Karami et al., 2024, Pang et al., 3 Sep 2025). This enables mixed-type, multi-modality, and scalable representations.
  • Quantization and Compression: Residual quantization (RawMed) compresses long textualized event sequences, facilitating training on multi-table, longitudinal formats with minimal preprocessing (Cho et al., 9 Jul 2025).
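The time-token encoding used by models like CEHR-GPT and SynEHRgy can be sketched as follows. The token vocabulary ([VISIT], [GAP:*] buckets) and bucket boundaries are hypothetical illustrations, not the exact scheme of either model.

```python
def gap_bucket(days):
    """Quantize an inter-visit gap (in days) into a coarse bucket label."""
    for upper, label in ((7, "W"), (30, "M"), (365, "Y")):
        if days <= upper:
            return label
    return "LT"  # long-term gap

def tokenize_history(visits):
    """Flatten a patient history into a token sequence.

    visits: list of (day_offset, [concept codes]) sorted by time.
    Inter-visit gaps become explicit [GAP:*] tokens so an autoregressive
    model can learn timeline structure alongside clinical content.
    """
    tokens = ["[BOS]"]
    prev_day = None
    for day, codes in visits:
        if prev_day is not None:
            tokens.append(f"[GAP:{gap_bucket(day - prev_day)}]")
        tokens.append("[VISIT]")
        tokens.extend(codes)
        prev_day = day
    tokens.append("[EOS]")
    return tokens
```

A next-token model trained on such sequences generates both clinical events and the time gaps between them, which is what enables timeline-faithful synthesis.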

3. Evaluation Metrics: Fidelity, Utility, and Privacy

Benchmarking frameworks for synthetic EHRs now employ a comprehensive suite of metrics to characterize model performance (Yan et al., 2022, Chen et al., 2024, Karami et al., 2024):

  • Fidelity (Distributional Similarity): Assessments span dimension-wise prevalence/error (RMSE, MMD, KS), pairwise and higher-order correlations (CFD, PCC), code co-occurrence (bigram/trigram), and joint event distributions. Maximum Mean Discrepancy (MMD) and Jensen-Shannon divergence are standard for both marginal and joint fidelity.
  • Downstream Utility: Synthetic datasets are evaluated by training classifiers (e.g., LightGBM, XGBoost, neural nets) for outcome prediction, phenotyping, or disease onset, and benchmarking test-set AUROC, AUPRC, and recall@k versus real data. Additional metrics include train-on-synthetic/test-on-real (TSTR) and synthetic data augmentation impact (Tian et al., 2023, Biswal et al., 2020).
  • Privacy Risk: Risk is assessed using membership inference attack F1, nearest neighbor adversarial accuracy (NNAA), attribute inference, and meaningful identity disclosure as formalized by El Emam et al. These metrics reflect the probability that a synthetic record is traceable to a real patient, or that sensitive attributes can be inferred given quasi-identifiers. High uniqueness (low duplication) and distance-to-closest-record (DCR) metrics are also used (Muller et al., 2022, Chen et al., 2024).
  • Computational Efficiency and Scalability: Training and sampling time, memory footprint, and hardware requirements are reported in benchmarking studies, with diffusion and large transformer models being the most computationally demanding (Cho et al., 9 Jul 2025, Karami et al., 2024).
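Two of the metrics above can be computed in a few lines. The sketch below (plain Python, helper names ours) computes dimension-wise prevalence RMSE for fidelity and a minimum distance-to-closest-record (DCR, Hamming distance on binary codes) as a quick privacy screen.

```python
import math

def prevalence_rmse(real, synth):
    """RMSE between per-code prevalences of real and synthetic cohorts.

    real, synth: lists of equal-length binary code vectors (rows = patients).
    """
    dims = len(real[0])
    err = 0.0
    for j in range(dims):
        p_real = sum(r[j] for r in real) / len(real)
        p_synth = sum(s[j] for s in synth) / len(synth)
        err += (p_real - p_synth) ** 2
    return math.sqrt(err / dims)

def min_dcr(real, synth):
    """Distance to closest record: for each synthetic row, the Hamming
    distance to its nearest real row; returns the minimum over rows.
    A DCR of 0 means a synthetic record exactly duplicates a real one."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(min(hamming(s, r) for r in real) for s in synth)
```

Low prevalence RMSE indicates marginal fidelity; a DCR of zero flags exact memorization and warrants a fuller membership-inference audit.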

4. Privacy–Utility Trade-offs and Model Selection

A fundamental trilemma governs synthetic EHR generation: maximizing fidelity and downstream utility inevitably increases statistical proximity to real records and hence privacy risk (Muller et al., 2022, Yan et al., 2022). Empirical benchmarks demonstrate that:

  • Rule-based simulators provide maximal privacy but inferior utility.
  • GAN- and diffusion-based models yield the highest fidelity and predictive utility, but require postprocessing or DP to mitigate membership risk (Chen et al., 2024, Ceritli et al., 2023).
  • VAE-based generators balance diversity/uniqueness and privacy but can lag in capturing rare events.
  • Transformer/diffusion hybrids excel in high-dimensional, temporal, and mixed-modality settings, often matching or exceeding GANs in both fidelity and privacy per established metrics (Karami et al., 2024, Han et al., 2024).
  • Application-specific model selection is best informed by contextual weighting of metrics (as implemented in SynthEHRella and the rank-aggregated framework of (Chen et al., 2024)), with decision trees guiding choices based on the presence of distributional shift, downstream utility demands, and privacy constraints.
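The metric-weighted selection guidance described above (e.g., SynthEHRella-style decision trees) might be sketched as a simple rule chain. The thresholds, branch order, and model-family labels below are illustrative assumptions mirroring the empirical findings in this section, not the published decision tree.

```python
def suggest_model_family(longitudinal, strict_privacy, high_dim):
    """Toy decision rules for choosing a generator family.

    Each branch mirrors the trade-offs summarized above: rule-based or
    DP-trained models for maximal privacy, transformers/diffusion for
    temporal or high-dimensional data, GANs/VAEs for static tabular codes.
    """
    if strict_privacy and not longitudinal:
        return "rule-based simulator or DP-GAN"
    if longitudinal:
        return "autoregressive transformer or temporal diffusion"
    if high_dim:
        return "diffusion (tabular DDPM)"
    return "GAN or VAE on multi-hot codes"
```

In practice such rules would be derived from rank-aggregated benchmark scores weighted by the deployment's privacy constraints rather than hand-coded branches.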

5. Limitations, Open Challenges, and Future Directions

Persistent limitations and open challenges include:

  • Difficulty in modeling extreme high-dimensionality (rare code combinations, >10,000 variables), patient-attention across long histories, and irregular, sparse time-series data (Theodorou et al., 2023, Karami et al., 2024).
  • Disconnect between statistical fidelity and true clinical safety; synthetic EHRs may retain implausible event sequences or erroneous medication recommendations, with bias and over-modernization arising from LLM-based approaches (Sharoff et al., 3 Jan 2026).
  • Scaling privacy defenses without eroding fidelity remains unresolved; differential privacy degrades downstream predictive performance, and current empirical audits may not capture all leakage channels (Chin-Cheong et al., 2020, Muller et al., 2022).
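The DP-SGD mechanism behind these trade-offs (Chin-Cheong et al., 2020) combines per-example gradient clipping with Gaussian noise. The sketch below is a schematic single update with illustrative parameter names; it omits the privacy accountant and noise calibration that a real deployment requires.

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_mult, lr, params, rng):
    """One DP-SGD update: clip each example's gradient to clip_norm,
    sum, add Gaussian noise with sigma = noise_mult * clip_norm, average,
    then take a plain gradient step. No privacy accounting included."""
    n = len(per_example_grads)
    dim = len(params)
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down any gradient whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(dim):
            summed[j] += g[j] * scale
    sigma = noise_mult * clip_norm
    noisy = [(summed[j] + rng.gauss(0.0, sigma)) / n for j in range(dim)]
    return [params[j] - lr * noisy[j] for j in range(dim)]
```

The clipping bound and noise multiplier jointly set the privacy budget; tightening either suppresses the rare-code signal that downstream fidelity depends on, which is the degradation noted above.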

Future directions comprise multimodal integration (EHR + imaging + narrative), scalable conditional modeling for rare diseases or subpopulations, accelerated samplers and hybrid architectures (autoregressive-diffusion, Transformer-score models), and rigorous, composable privacy analysis with domain-validated clinical plausibility layers (Cho et al., 9 Jul 2025, Chen et al., 2024, Zhong et al., 2024, Sun et al., 2023).

