MedSynth: Synthetic Medical Data
- MedSynth is a suite of methodologies that generates realistic synthetic data across medical imaging, clinical text, dialogues, and molecular synthesis.
- It integrates advanced techniques like diffusion models, LLMs, and transformer-based graph modeling to ensure anatomical plausibility, prompt-driven control, and synthetic feasibility.
- MedSynth supports scalable, privacy-preserving research by providing open datasets and benchmarks for data augmentation, model pre-training, and translational medicine.
MedSynth encompasses a family of synthetic data generation methodologies and datasets for medicine, spanning high-fidelity 3D medical images, medical text, clinical dialogues, and synthesizable chemical and molecular structures. Multiple independent systems and datasets published under the MedSynth name aim to advance data-driven research and downstream applications by providing scalable, privacy-compliant, and clinically realistic synthetic data, often with a focus on guarantees such as synthetic feasibility, anatomical plausibility, or fine-grained label controllability.
1. High-Fidelity 3D Medical Image Synthesis
MedSynth (Xu et al., 2023) targets the generation of high-resolution 3D volumetric chest CT images, jointly conditioned on free-text radiology reports and anatomical segmentation masks. The methodology addresses two key limitations of conventional GAN or diffusion-based models: memory constraints at clinical image resolution (256³ voxels) and the tendency to “hallucinate” fine anatomical detail unless explicitly constrained.
Hierarchical synthesis decomposes the generation pipeline into:
- A text-guided low-resolution (64³) DDPM, where input noise and radiology-report embeddings (from Medical BERT) are mapped through a 700M-parameter 3D UNet with cross-attention at the latent level. Joint output comprises 4 channels per voxel: CT intensity, lung lobes, airway, and vessel segmentation.
- A super-resolution upsampler (256³) diffuses upsampled noise and base-generated volumes through a lightweight 3D UNet, reconstructing the final 3D CT and mask channels.
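The memory motivation for the two-stage split can be made concrete with a rough size calculation. The sketch below only counts the bytes of one 4-channel output volume at each stage; float32 storage is an assumption, and activations inside the UNets (not modeled here) would dominate in practice.

```python
# Back-of-the-envelope tensor sizes for the two stages.
CHANNELS = 4          # CT intensity, lung lobes, airway, vessels
BYTES_PER_VALUE = 4   # assumption: float32 storage


def volume_bytes(side: int, channels: int = CHANNELS) -> int:
    """Bytes needed to hold one cubic, multi-channel volume."""
    return side ** 3 * channels * BYTES_PER_VALUE


low_res = volume_bytes(64)     # base DDPM output
high_res = volume_bytes(256)   # super-resolution output

print(f"64^3  x {CHANNELS} channels: {low_res / 2**20:.0f} MiB")
print(f"256^3 x {CHANNELS} channels: {high_res / 2**20:.0f} MiB")
print(f"ratio: {high_res // low_res}x")
```

Each full-resolution volume holds 64 times as many values as its low-resolution counterpart, which is why text conditioning and cross-attention are confined to the 64³ stage.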
Training uses 8,752 de-identified 3D CTs (1 mm³, HU normalized to [-1, 1]) with paired/unpaired radiology reports. No explicit cross-entropy/Dice loss is imposed; a single 4-channel loss regularizes all outputs. Anatomical priors are enforced by simultaneous reconstruction of segmentation channels precomputed with dedicated tools (lungmask, NaviAirway, TotalSegmentator).
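As a minimal sketch of the preprocessing and loss described above: the clipping window of [-1000, 1000] HU and the use of plain mean squared error are assumptions for illustration, since the source states only that intensities are normalized to [-1, 1] and that one loss covers all four channels.

```python
from typing import Sequence


def normalize_hu(hu: float, lo: float = -1000.0, hi: float = 1000.0) -> float:
    """Clip a Hounsfield value to [lo, hi] (assumed window) and rescale to [-1, 1]."""
    clipped = max(lo, min(hi, hu))
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0


def joint_mse(pred: Sequence[Sequence[float]],
              target: Sequence[Sequence[float]]) -> float:
    """One MSE over every channel and voxel, with no per-channel weighting."""
    total, count = 0.0, 0
    for pred_channel, target_channel in zip(pred, target):
        for p, t in zip(pred_channel, target_channel):
            total += (p - t) ** 2
            count += 1
    return total / count


# Two toy "channels" of two voxels each (flattened for brevity).
prediction = [[0.0, 0.0], [1.0, 1.0]]
reference = [[0.0, 2.0], [1.0, 1.0]]
print(normalize_hu(0.0), joint_mse(prediction, reference))
```

Because the CT and mask channels share one objective, an error in a segmentation channel is penalized exactly like an intensity error, which is the sense in which the masks act as an anatomical prior.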
MedSynth achieves state-of-the-art FID (0.009) and MMD (0.019) metrics versus prior GAN/diffusion approaches, with airway/lobe Dice scores of 0.75–0.77. The approach uniquely preserves fissures, fine airways, and vessels. Controlled anatomy-driven synthesis is supported by masking segmentation outputs at inference, yielding prompt-guided volumetric segmentation with competitive Dice (prompt: 0.75; vanilla: 0.70). Conditioning on pathology text (e.g., “large pleural effusion” vs. negative prompt) modulates anatomic outcomes in expected directions.
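For reference, the Dice score quoted above is the standard overlap measure 2|A∩B| / (|A| + |B|). The sketch below computes it on toy masks represented as sets of voxel coordinates; the masks are made up and far smaller than real segmentations.

```python
from typing import Set, Tuple

Voxel = Tuple[int, int, int]


def dice(pred: Set[Voxel], target: Set[Voxel]) -> float:
    """Dice coefficient between two binary masks given as voxel sets."""
    if not pred and not target:
        return 1.0  # convention: two empty masks agree perfectly
    return 2 * len(pred & target) / (len(pred) + len(target))


predicted_airway = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
reference_airway = {(0, 0, 0), (0, 0, 1), (1, 1, 1)}
print(round(dice(predicted_airway, reference_airway), 3))  # 0.667
```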
2. Synthetic Medical Text Generation Using LLMs
MedSyn (Kumichev et al., 4 Aug 2024) provides an LLM-based framework for generating synthetic clinical notes, particularly tailored to low-resource languages such as Russian. The pipeline sequentially:
- Constructs a Medical Knowledge Graph (MKG) from WikiMed, capturing disease (ICD-10), drug, and symptom associations.
- Samples symptoms as prior information for a target ICD code.
- Retrieves an in-domain example and assembles a prompt (Task, ICD code, symptoms, example).
- Generates synthetic notes using GPT-4 or a LoRA-fine-tuned LLaMA-7b model.
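The sampling and prompt-assembly steps above can be sketched as follows. Everything here is hypothetical: the toy knowledge graph, the function names, the prompt wording, and the placeholder example note are illustrations, not the authors' data or templates, and the LLM call itself is omitted.

```python
import random
from typing import Dict, List

# Hypothetical stand-in for the Medical Knowledge Graph: ICD-10 code -> symptoms.
TOY_MKG: Dict[str, List[str]] = {
    "J45": ["wheezing", "shortness of breath", "cough", "chest tightness"],
}


def sample_symptoms(icd_code: str, k: int, rng: random.Random) -> List[str]:
    """Draw k distinct symptoms linked to the target code in the toy graph."""
    return rng.sample(TOY_MKG[icd_code], k)


def build_prompt(icd_code: str, symptoms: List[str], example: str) -> str:
    """Assemble the four prompt parts: task, ICD code, symptoms, example."""
    return "\n".join([
        "Task: write a synthetic clinical note.",
        f"ICD code: {icd_code}",
        f"Symptoms: {', '.join(symptoms)}",
        f"Example note: {example}",
    ])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
symptoms = sample_symptoms("J45", 2, rng)
prompt = build_prompt("J45", symptoms, "<retrieved in-domain note>")
print(prompt)
```

Restricting the symptom list to entries linked to the target code in the graph is what keeps the generator from inventing unrelated findings.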
Fine-tuning is performed on 152,000 Russian instruction-following samples, including clinical notes, MKG-based QA, ChatGPT syntheses, and error-correction data. The system releases an open dataset of 41,185 synthetic notes covering 219 ICD-10 codes.
Evaluation on the RuMedTop3 dataset shows that upsampling rare/cold codes with synthetic data improves ICD code (second-level) prediction significantly: up to 17.8% hit@1 improvement for specific rare classes, with modest gains (0.7%–1.1%) in overall hit@1 depending on model. MKG-guided symptom sampling reduces hallucinations, though the lack of demographic and symptom co-occurrence conditioning is a noted limitation.
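The hit@k metric used in this evaluation counts a prediction as correct when the gold code appears among the top k ranked candidates. A minimal sketch on made-up rankings (the codes and rankings below are illustrative, not RuMedTop3 data):

```python
from typing import List, Sequence


def hit_at_k(ranked_predictions: Sequence[Sequence[str]],
             gold_labels: Sequence[str], k: int = 1) -> float:
    """Fraction of examples whose gold label is within the top-k predictions."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_predictions, gold_labels))
    return hits / len(gold_labels)


ranked: List[List[str]] = [
    ["J45", "J44", "I10"],
    ["I10", "E11", "J45"],
    ["E11", "I10", "J44"],
]
gold = ["J45", "E11", "J44"]

print(hit_at_k(ranked, gold, k=1))  # one of three top-1 guesses is right
print(hit_at_k(ranked, gold, k=3))  # every gold code appears in the top 3
```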
3. Synthetic Medical Dialogue–Note Pairs for Clinical Documentation
MedSynth (Mianroodi et al., 2 Aug 2025) introduces a large-scale, privacy-compliant dataset of 10,035 fully synthetic doctor–patient dialogues paired with clinical notes, systematically covering >2,000 ICD-10 codes. Data generation employs multi-agent LLM pipelines, leveraging GPT-4o agents for scenario creation, quality control, SOAP note writing, and dialogue generation.
Each pair contains:
- A dialogue (mean 932 tokens, 55 sentences) and
- A SOAP-formatted note (mean 621 tokens, 23 sentences).
The top 2,000 ICD-10 codes (by real-world claims frequency) are sampled, with five unique pairs per code. Agents enforce coverage across 13 clinical scenario variables and strict SOAP adherence.
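One way to picture a dialogue-note record and a simple SOAP adherence check is sketched below. The field names and the assumption that each section starts with a literal heading such as `Subjective:` are hypothetical; the released dataset may use a different schema, and the actual quality control is performed by LLM agents rather than string matching.

```python
from dataclasses import dataclass

SOAP_SECTIONS = ("Subjective", "Objective", "Assessment", "Plan")


@dataclass
class DialogueNotePair:
    icd10_code: str   # code the scenario was generated for
    dialogue: str     # doctor-patient conversation
    note: str         # SOAP-formatted clinical note


def follows_soap(note: str) -> bool:
    """True if all four SOAP headings appear in the canonical order."""
    position = 0
    for section in SOAP_SECTIONS:
        found = note.find(f"{section}:", position)
        if found == -1:
            return False
        position = found + len(section) + 1
    return True


pair = DialogueNotePair(
    icd10_code="J45",
    dialogue="Doctor: What brings you in today?\nPatient: I have been wheezing.",
    note="Subjective: wheezing\nObjective: ...\nAssessment: ...\nPlan: ...",
)
print(follows_soap(pair.note))  # True
```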
Evaluation via LLM-based jury (Prometheus, GPT-4o, Qwen2.5) demonstrates superior realism, medical correctness, and SOAP compliance when compared to prior synthetic dialogue datasets (e.g., NoteChat), with consistent improvements in both Dial-2-Note and Note-2-Dial tasks:
- In head-to-head tests, MedSynth-based models are preferred in 60–95% of cases (task and training set dependent).
- Scenario Judge agent and backbone choice (GPT-4o) are critical for output quality.
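A head-to-head preference rate of the kind reported above can be computed from per-comparison jury votes. The aggregation rule used here (simple majority across an odd number of judges) is an assumption for illustration; the source does not specify how the jury's verdicts are combined, and the votes below are made up.

```python
from collections import Counter
from typing import List, Sequence


def preference_rate(comparisons: Sequence[Sequence[str]], system: str) -> float:
    """Share of comparisons in which `system` wins the majority of judge votes."""
    wins = sum(Counter(votes).most_common(1)[0][0] == system
               for votes in comparisons)
    return wins / len(comparisons)


# Each inner list holds one vote per judge for a single head-to-head comparison.
jury_votes: List[List[str]] = [
    ["A", "A", "B"],
    ["A", "B", "B"],
    ["A", "A", "A"],
    ["B", "A", "A"],
]
print(preference_rate(jury_votes, "A"))  # 0.75
```

With three judges and two candidates a tie cannot occur; an even-sized jury would need an explicit tie-breaking rule.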
The dataset is open-access, privacy-compliant, and intended for pre-training, EHR automation, and telemedicine simulation. Coverage of rare clinical language and full diversity of real-world note formats remain open challenges.
4. Synthesis of Synthesizable Molecules and Synthetic Routes
The MedSynth label has been adopted for generative frameworks that address molecular synthesizability—a critical constraint for translational medicinal chemistry.
Projecting Molecules into Synthesizable Chemical Spaces (Luo et al., 7 Jun 2024): A transformer-based graph-to-sequence model maps arbitrary molecular graphs to postfix synthetic routes (building blocks and reaction templates), ensuring all generated structures are guaranteed synthesizable via stack machine simulation. Key outcomes:
- Success/valid parse rates up to 98.8% (held-out ChEMBL), with mean Morgan/Scaffold/Pharmacophore similarities exceeding 0.55 on analog generation tasks.
- Out-of-distribution “rescued” analog search: 70% of “unsynthesizable” molecules can be projected into accessible chemical space, matching target properties within 0.25–0.8 in primary objectives.
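The idea of validating a postfix synthetic route with a stack machine can be illustrated with a toy interpreter. This is a sketch under strong simplifications: building blocks and products are plain strings, reactions are named placeholders with a fixed number of reactants, and no chemistry is checked, whereas the actual model applies reaction templates to molecular structures.

```python
from typing import List, Sequence, Tuple


def run_route(tokens: Sequence[Tuple]) -> str:
    """Evaluate a postfix route; raise ValueError if it is malformed."""
    stack: List[str] = []
    for token in tokens:
        if token[0] == "BB":              # building block: push onto the stack
            stack.append(token[1])
        elif token[0] == "RXN":           # reaction: pop reactants, push product
            _, name, arity = token
            if arity < 1 or len(stack) < arity:
                raise ValueError(f"{name} needs {arity} reactants")
            reactants = stack[-arity:]
            del stack[-arity:]
            stack.append(f"{name}({', '.join(reactants)})")
        else:
            raise ValueError(f"unknown token type: {token[0]}")
    if len(stack) != 1:
        raise ValueError("route must end with exactly one product")
    return stack[0]


route = [
    ("BB", "amine"), ("BB", "acid"), ("RXN", "amide_coupling", 2),
    ("BB", "boronic_acid"), ("RXN", "suzuki", 2),
]
print(run_route(route))
```

A route that parses to a single product is, by construction, a sequence of known reactions over available building blocks, which is the sense in which synthesizability is guaranteed.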
SynLlama (Sun et al., 16 Mar 2025): Meta Llama3 LLMs fine-tuned on 100k–2M enumerated retrosynthetic routes predict full multi-step synthetic plans (SMARTS templates, reactants, BBs) for target molecules and analogs. SynLlama supports generalization to unseen building blocks, reconstructs 87–97% of ChEMBL/held-out BB targets, and delivers hit expansion with free-energy perturbation (FEP) errors kcal/mol. The model interface is prompt-based and suitable for “hit-to-lead” and post-processing workflows.
SynthFormer (Jocys et al., 3 Oct 2024): An EGNN–Transformer hybrid that provides 3D-pharmacophore–conditioned generation and constructs full synthetic route trees as alternating building-block and reaction tokens. Reported measures show lower fingerprint similarity but higher pharmacophore similarity (0.20–0.43) to seed ligands, and 11% of generated analogs improve upon the original docking scores.
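Fingerprint similarity between a generated analog and its seed ligand is commonly measured with the Tanimoto coefficient, |A∩B| / |A∪B|, over the sets of "on" fingerprint bits. The sketch below uses made-up bit sets; in practice the bits would come from a cheminformatics toolkit such as RDKit, and the source does not state which fingerprint or implementation was used.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 1.0  # convention: two empty fingerprints are identical
    return len(bits_a & bits_b) / len(union)


seed_bits = {1, 2, 3, 4}      # hypothetical fingerprint of the seed ligand
analog_bits = {3, 4, 5}       # hypothetical fingerprint of a generated analog
print(tanimoto(seed_bits, analog_bits))  # 0.4
```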
SynCoGen (Rekesh et al., 16 Jul 2025): A unified masked diffusion/flow framework for generating synthesizable 3D structures. It co-samples reaction graphs and atomic coordinates, enabling geometry-conditioned, fully retrosynthetically accessible proposals, and achieves 96.7% RDKit validity, a 72% retrosynthesis solve rate, and competitive performance in zero-shot fragment linking and analog expansion.
5. Applications and Limitations
MedSynth systems and datasets enable development and benchmarking in:
- Data augmentation for rare pathology or low-resource ML settings in imaging, clinical NLP, and signal processing (Xu et al., 2023, Kumichev et al., 4 Aug 2024, Mianroodi et al., 2 Aug 2025, Martin et al., 21 Mar 2025).
- Pre-training and fine-tuning of downstream models for ICD code prediction, medical note/summary generation, dialogue agents, and event sequence modeling (Kumichev et al., 4 Aug 2024, Mianroodi et al., 2 Aug 2025, Gao et al., 11 Sep 2024).
- Benchmarking and improving synthesizable molecule generation, analog expansion, and lead optimization in cheminformatics (Luo et al., 7 Jun 2024, Jocys et al., 3 Oct 2024, Sun et al., 16 Mar 2025, Rekesh et al., 16 Jul 2025).
- Privacy-preserving research by providing open, non-PII synthetic datasets; key for sharing, federated learning, and OOD detection (Xu et al., 2023, Kumichev et al., 4 Aug 2024, Mianroodi et al., 2 Aug 2025, Martin et al., 21 Mar 2025, Gao et al., 11 Sep 2024).
Noted limitations across domains include:
- Reliance on pre-computed segmentors or knowledge graphs, potentially propagating upstream errors.
- Difficulty capturing microscopic or rare features (imaging), non-SOAP clinical note types (NLP), and demographic context (text), along with incomplete simulation of clinical audio and disfluency artifacts (dialogue).
- Absence of formal differential privacy guarantees for some datasets (Gao et al., 11 Sep 2024); privacy is often empirically assessed.
- Model uncertainty, error propagation, and clinical validation for high-stakes deployment remain active areas of research.
6. Comparative Table of MedSynth Domains and Core Properties
| Domain | Core MedSynth Resource/Model | Synthetic Guarantee | Coverage/Scale | Open Access |
|---|---|---|---|---|
| 3D Imaging | Diffusion+UNet, Anatomy Masks (Xu et al., 2023) | Anatomical plausibility | 8,752 CT, 209k reports | None specified |
| Medical Text | MKG+LLM (Kumichev et al., 4 Aug 2024) | Prompt-driven factuality | 41,185 notes/219 ICD | Yes (HuggingFace) |
| Dialogues/Notes | LLM-Agent Multi-Step (Mianroodi et al., 2 Aug 2025) | SOAP compliance | 10,035 pairs/2001 ICD | Yes (HuggingFace) |
| Molecule Synthesis | StackTr+LLM/EGNN+Trans/Diffusion (Luo et al., 7 Jun 2024, Jocys et al., 3 Oct 2024, Sun et al., 16 Mar 2025, Rekesh et al., 16 Jul 2025) | Route-exact synthesis | 10³⁰ synthetic molecules (space) | Models/code open |
| Clinical Trajectories | VAE+Neural Hawkes (Gao et al., 11 Sep 2024) | Empirical privacy, high-fidelity | 7 oncology trials | Not stated |
7. Significance and Outlook
MedSynth, as a collection of techniques and datasets, enables comprehensive, privacy-compliant, and controllable generation of medical and chemical data across imaging, NLP, and cheminformatics. By harmonizing LLMs, specialized diffusion/generative frameworks, and structured domain knowledge (segmentations, pharmacophores, knowledge graphs), MedSynth systems address the principal challenges in modern computational medicine and drug design: fidelity, synthetic feasibility, and clinical usefulness. Continued developments target deeper integration of multimodal inputs, extension to rare disease/disorder simulation, and formal privacy guarantees, positioning MedSynth as a key resource for future “self-driving” biomedical AI research and translation (Xu et al., 2023, Kumichev et al., 4 Aug 2024, Mianroodi et al., 2 Aug 2025, Luo et al., 7 Jun 2024, Sun et al., 16 Mar 2025, Jocys et al., 3 Oct 2024, Rekesh et al., 16 Jul 2025, Gao et al., 11 Sep 2024, Martin et al., 21 Mar 2025).