A Scoping Review of Synthetic Data Generation for Biomedical Research and Applications (2506.16594v1)

Published 19 Jun 2025 in cs.CL

Abstract: Synthetic data generation--mitigating data scarcity, privacy concerns, and data quality challenges in biomedical fields--has been facilitated by rapid advances of LLMs. This scoping review follows PRISMA-ScR guidelines and synthesizes 59 studies, published between 2020 and 2025 and collected from PubMed, ACM, Web of Science, and Google Scholar. The review systematically examines biomedical research and application trends in synthetic data generation, emphasizing clinical applications, methodologies, and evaluations. Our analysis identifies data modalities of unstructured text (78.0%), tabular data (13.6%), and multimodal sources (8.4%); generation methods of prompting LLMs (72.9%), fine-tuning LLMs (22.0%), and specialized models (5.1%); and heterogeneous evaluations of intrinsic metrics (27.1%), human-in-the-loop assessments (55.9%), and LLM-based evaluations (13.6%). The analysis addresses current limitations in what, where, and how health professionals can leverage synthetic data generation for biomedical domains. Our review also highlights challenges in adaptation across clinical domains, resource and model accessibility, and evaluation standardization.

Summary

  • The paper systematically reviews 59 studies on LLM-driven synthetic data generation for biomedical research, outlining key methodologies and evaluation metrics.
  • It details prompt-based and fine-tuning approaches across text, tabular, and multimodal data, highlighting practical uses like clinical note synthesis and disease prediction.
  • Findings demonstrate synthetic data’s potential to enhance model performance, address privacy issues, and drive standardization in biomedical evaluations.

Synthetic Data Generation for Biomedical Research: A Scoping Review

This paper presents a systematic review of the recent advancements in synthetic data generation for biomedical research, with a focus on the utility, methodologies, and evaluation strategies of LLMs for this purpose. Covering 59 studies published between 2020 and 2025, the review provides a comprehensive synthesis of LLM-driven synthetic data generation trends, the breadth of biomedical applications, and emerging evaluation paradigms.

Scope and Methodology

The authors employed PRISMA-ScR guidelines to identify eligible peer-reviewed studies from major scientific databases. Their inclusion criteria required works to (1) address biomedical/clinical tasks, (2) use LLMs explicitly for synthetic data generation, and (3) report on model or data evaluation. Data extraction focused on data modality, model architecture, generation approach, scale, accessibility, target tasks, disease domains, and evaluation methods.

Modalities, Applications, and Accessibility

Synthetic biomedical data generated with LLMs cover a wide range of modalities:

  • Unstructured clinical text (78%): Clinical notes, discharge summaries, EHR narratives, and disease-related reports dominate synthetic data generation. This is expected given LLMs' strengths in NLP and the abundance of unstructured text within biomedical information systems.
  • Tabular data (13.6%): Patient demographics, clinical parameters, and longitudinal EHR records are synthesized to support statistical analysis and ML benchmarking.
  • Multimodal data (8.4%): Including images (e.g., medical imaging, colonoscopy, digital pathology), audio (dialogues, telemedicine), and sensor-generated data (activity, posture).

Biomedical synthetic data is utilized for multiple downstream tasks: disease diagnosis/classification, phenotype extraction, de-identification, clinical summarization, trial matching, social determinants extraction, clinical education, and decision support. Applications range from common diseases to highly specialized domains (e.g., mental health, cancer, anesthesiology), and pilot work demonstrates generalizability across multiple languages.

Accessibility of synthetic datasets is improving but remains inconsistent. About half of the reviewed studies provide open or partially open data resources, reflecting ongoing barriers around privacy and licensing even with synthetic data. This constrains reproducibility and external validation.

Generation Methodologies

Prompt-based generation emerges as the principal approach (72.9%), using LLMs such as GPT-3/4 and the Llama series. Two dominant prompting strategies are observed:

  • Zero-shot prompting: Utilized for structured or description-driven data synthesis (e.g., generating perioperative clinical tables).
  • Few-shot prompting: Used for more nuanced tasks, with demonstration examples embedded in prompts, particularly when encoding expert-level or complex domain knowledge.
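The few-shot pattern above amounts to assembling demonstration pairs into the prompt before the new request. A minimal sketch follows; the case descriptions, note fields, and function name are illustrative placeholders, not examples taken from the reviewed studies.

```python
def build_few_shot_messages(task_instruction, demonstrations, new_case):
    """Assemble a chat-style message list: a system instruction, then
    user/assistant pairs encoding expert demonstrations, then the new request."""
    messages = [{"role": "system", "content": task_instruction}]
    for case_description, example_note in demonstrations:
        messages.append({"role": "user", "content": case_description})
        messages.append({"role": "assistant", "content": example_note})
    messages.append({"role": "user", "content": new_case})
    return messages

# Hypothetical demonstration pair (fully synthetic, for illustration only).
demos = [(
    "58-year-old female, type 2 diabetes, admitted for hyperglycemia.",
    "Chief complaint: polyuria and fatigue. History: (synthetic). Plan: (synthetic).",
)]
messages = build_few_shot_messages(
    "You are a medical expert. Write plausible, fully synthetic clinical notes.",
    demos,
    "72-year-old male, COPD exacerbation, admitted from the emergency department.",
)
```

The resulting list can be passed directly as the `messages` argument of a chat-completion call.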

Fine-tuning of LLMs (22%) on domain-specific corpora improves output fidelity, privacy, and the handling of specialized tasks. Studies using open-source models demonstrate enhanced control and customization critical for clinical deployment. However, closed-source models (GPT-4, Gemini) are often preferred for their performance and ease of integration but are limited in transparency and reproducibility.

Specialized and hybrid frameworks, including multi-agent systems and multimodal transformers, are becoming increasingly relevant for generating rich, scenario-specific data (e.g., synthetic doctor-patient dialogues, medical imaging synthesis). Architectures such as hierarchical autoregressive models and topology-guided diffusion-based models address the structural and semantic complexity of biomedical information.

Representative Implementation Outline

from openai import OpenAI

prompt = """
You are a medical expert. Generate a synthetic discharge summary for a 65-year-old male with heart failure, including chief complaint, history, vital signs, labs, and treatments. Ensure all details are plausible but do NOT use any real or known patient data.
"""

# Requires openai>=1.0; reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "system", "content": prompt}],
    max_tokens=450,
)

synthetic_note = response.choices[0].message.content
For multimodal synthesis, LLMs can be paired with foundation vision models, using LLM-synthesized text as image generation prompts.
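This two-stage pairing can be sketched as follows; the helper name, prompt template, and finding text are hypothetical, and the actual image-model call is left as a comment since it depends on the chosen vision pipeline.

```python
def caption_to_image_prompt(finding, modality="chest X-ray", style="radiology"):
    """Turn an LLM-synthesized clinical finding into a text-to-image prompt.
    The template below is an illustrative convention, not a fixed standard."""
    return (f"{modality}, {style} style: {finding}. "
            "High detail, anonymized, no identifying marks or text overlays.")

# First stage: an LLM generates the finding text. Second stage: that text
# conditions a vision model (e.g., a latent diffusion pipeline).
finding = "mild cardiomegaly with bilateral pleural effusions"
image_prompt = caption_to_image_prompt(finding)
# image = diffusion_pipeline(image_prompt).images[0]  # hypothetical call
```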

Evaluation Paradigms

Evaluation of synthetic biomedical data employs a mix of intrinsic, extrinsic, human, and LLM-based strategies:

  • Intrinsic metrics (27.1%): Distributional similarity (e.g., KL divergence, perplexity), correlation matrices for tabular data, structural/statistical parity with real data.
  • Extrinsic metrics: Downstream performance in classification, information extraction, or predictive modeling tasks.
  • Privacy risk assessment: Membership inference attacks, PHI presence analysis, and adversarial testing are used to quantify privacy hazards.
  • Human-in-the-loop evaluation (55.9%): Clinicians and biomedical experts review clinical validity, factuality, and realism.
  • LLM-based evaluation: Leveraging strong LLMs as automated evaluators (e.g., factual consistency scoring, rubric-guided ratings).
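As an illustration of the intrinsic-metric category, the sketch below computes a smoothed KL divergence between the empirical distributions of a categorical column in a real versus a synthetic cohort. The admission-type values are invented for the example.

```python
import math
from collections import Counter

def kl_divergence(real_values, synthetic_values, smoothing=1e-6):
    """KL(P_real || Q_synthetic) over the union of observed categories,
    with additive smoothing so unseen categories do not yield infinities."""
    categories = set(real_values) | set(synthetic_values)
    real_counts = Counter(real_values)
    syn_counts = Counter(synthetic_values)
    real_total = len(real_values) + smoothing * len(categories)
    syn_total = len(synthetic_values) + smoothing * len(categories)
    kl = 0.0
    for c in categories:
        p = (real_counts[c] + smoothing) / real_total
        q = (syn_counts[c] + smoothing) / syn_total
        kl += p * math.log(p / q)
    return kl

# Invented example: admission-type distributions in real vs. synthetic cohorts.
real = ["emergency"] * 60 + ["elective"] * 30 + ["urgent"] * 10
synthetic = ["emergency"] * 55 + ["elective"] * 35 + ["urgent"] * 10
divergence = kl_divergence(real, synthetic)  # near 0 when distributions match
```

A divergence close to zero indicates the synthetic column mirrors the real marginal distribution; larger values flag distributional drift worth inspecting.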

Notably, there is substantial variability in evaluation methodology and reporting, hindering formal benchmarking. LLM-based evaluation is nascent and does not yet replace human clinical review, especially for high-stakes applications.

Practical and Theoretical Implications

Practical Utility:

LLM-generated synthetic data is increasingly validated for augmenting training sets, overcoming data scarcity, addressing privacy restrictions, and enhancing subpopulation representation. In certain scenarios, models trained on synthetic clinical notes or images reach task performance close to that of models trained on real (sensitive) data. Recent claims include:

  • Synthetic datasets based on LLMs improve downstream F1 scores by up to 21% in mortality prediction for under-represented subgroups and narrow fairness gaps.
  • Models trained on synthetic or synthetic-augmented data achieve comparable AUC/F1 to those trained on real-world datasets, facilitating the deployment of ML systems in resource-limited or highly regulated domains.

Theoretical Implications:

The dominance of prompting methods highlights the fundamental alignment between instruction-following LLMs and structured biomedical content. However, there is a persistent trade-off between ease of use (prompting, closed models) and domain customization (fine-tuning, open models). The complexity of multimodal and highly structured clinical data continues to challenge synthetic data fidelity, especially regarding rare events and structured dependencies.

The review identifies the absence of unified standards for synthetic data evaluation, domain adaptation, and privacy assessment as a critical bottleneck to broader adoption and cross-paper comparability.

Limitations and Open Challenges

  • Evaluation standardization: Heterogeneous evaluation metrics and reporting make systematic comparisons difficult.
  • Privacy and bias risk: Overfitting, memorization, and inadvertent leakage in synthetic data remain unresolved, particularly as LLMs sometimes memorize rare patterns from their training data.
  • Closed versus open models: Closed LLMs still outperform open-source models, but reproducibility and data governance favor open models, motivating future research on openly available, biomedical-specialized LLMs.
  • Data accessibility: Licensing and sharing of synthetic datasets must be clarified to enable wider benchmarking.

Future Directions

  • Advancement toward richer multimodal and multi-agent synthetic data pipelines, paired with knowledge infusion from ontologies, will broaden the scope of LLM-generated datasets.
  • Human-in-the-loop and LLM-based quality assurance frameworks will be essential for assuring clinical utility and safety.
  • The field calls for benchmark datasets, task-agnostic validation suites, and standardized privacy metrics.
  • Open-source clinical LLMs trained on high-quality, synthetic data are poised to lower barriers to innovation while addressing data sharing restrictions.

Concluding Remarks

Synthetic data generation using LLMs is a rapidly maturing solution for overcoming core challenges in biomedical AI, such as data scarcity, privacy, and domain adaptation. While unstructured text remains the dominant application, multimodal and structured data synthesis is gaining traction. Evaluation frameworks must evolve to match the complexity and requirements of real-world clinical deployment. Addressing current limitations in fidelity, utility, and standardized assessment will determine the impact of synthetic data approaches in future biomedical research and practice.