Generative Medical Event Models Improve with Scale (2508.12104v1)

Published 16 Aug 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Cosmos Medical Event Transformer (CoMET) models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study for medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Based on this, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, CoMET autoregressively generates the next medical event, simulating patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, CoMET generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. CoMET's predictive power consistently improves as the model and pretraining scale. Our results show that CoMET, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.

Summary

  • The paper introduces CoMET, a generative medical event model showing that scaling both model and data sizes enhances zero-shot clinical predictions across 78 tasks.
  • The work details a transformer-based architecture with pre-layer normalization, SwiGLU activations, and grouped-query attention trained on 115B medical events from 118M patients.
  • Simulation-based inference enables flexible predictions in diagnosis, prognosis, and operational forecasting, often matching or surpassing supervised models.

Generative Medical Event Models Improve with Scale: An Expert Analysis

Introduction

This paper presents CoMET (Cosmos Medical Event Transformer), a family of decoder-only transformer models trained on an unprecedented scale of structured medical event data from Epic Cosmos. The work systematically investigates scaling laws for generative medical event models and demonstrates that scaling both model and data size yields predictable improvements in downstream clinical prediction tasks. CoMET is evaluated across 78 real-world tasks, including diagnosis, prognosis, and operational forecasting, and is shown to match or outperform task-specific supervised models in most cases, without requiring task-specific fine-tuning.

Model Architecture and Training Pipeline

CoMET leverages the Qwen2 transformer architecture, incorporating pre-layer normalization, SwiGLU activations, rotary positional embeddings, and grouped-query attention. Three model sizes were trained—CoMET-S (62M params), CoMET-M (119M), and CoMET-L (1B)—using a context window of 8,192 tokens. The training dataset comprises 115B medical events (151B tokens) from 118M patients, filtered for longitudinal completeness and clinical relevance.

Medical events are tokenized using a fixed vocabulary of 7,105 tokens, representing demographics, encounters, diagnoses (ICD-10-CM), labs (LOINC quantiles), medications (ATC), procedures (CPT), and time intervals. Tokenization strategies are adapted from ETHOS, with modifications for scale and heterogeneity.
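
As a rough illustration of how a continuous lab result could become discrete tokens, the sketch below bins a value into a quantile and emits a two-token event. The token strings (`LAB//...`, `Q7`), the function name, and the cut points are assumptions made up for this example; the paper's actual vocabulary and quantile boundaries are not given in this summary.

```python
from bisect import bisect_right

def lab_quantile_token(loinc_code, value, cut_points):
    """Map a continuous lab result to a (code token, quantile token) pair.

    cut_points are sorted interior quantile boundaries, assumed to have been
    estimated from training data; they define len(cut_points) + 1 bins.
    """
    bin_index = bisect_right(cut_points, value)  # 0-based bin the value falls in
    return [f"LAB//{loinc_code}", f"Q{bin_index + 1}"]

# Hypothetical decile boundaries for hemoglobin A1c (LOINC 4548-4), in percent.
A1C_CUTS = [5.0, 5.3, 5.6, 5.9, 6.3, 6.8, 7.5, 8.5, 10.0]

tokens = lab_quantile_token("4548-4", 7.1, A1C_CUTS)
print(tokens)
```

Binning keeps the vocabulary fixed and small, at the cost of discarding resolution within each bin, which is the discretization limitation noted later in this analysis.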

Scaling law analysis was performed via isoFLOP grid search, fitting power-law relationships for compute-optimal model size and training tokens. The exponents (α = 0.520, β = 0.512) closely match those found in natural language domains, indicating that scaling model size and data volume in roughly equal proportion is compute-optimal.
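
To make the power-law fit concrete, here is a minimal sketch that recovers an exponent by least squares in log-log space. It assumes the usual convention that α describes how compute-optimal model size grows with compute and β how training tokens grow; the data points and the prefactor 0.3 are synthetic, chosen only to exercise the fit, and are not values from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = k * x**a by ordinary least squares on (log x, log y); returns (k, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly)) / sum(
        (a - mean_x) ** 2 for a in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic isoFLOP minima: compute budgets (FLOPs) and the best model size at each.
compute = [1e17, 1e18, 1e19, 1e20]
opt_params = [0.3 * c ** 0.520 for c in compute]  # made-up prefactor, exponent from the text

k, alpha = fit_power_law(compute, opt_params)
print(f"fitted exponent: {alpha:.3f}")
```

Under that assumed convention, the tokens-per-parameter ratio scales as compute to the power β − α = −0.008, so it stays nearly constant as the budget grows.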

Inference and Simulation-Based Prediction

At inference, CoMET is prompted with a patient's tokenized history and autoregressively generates future medical events, simulating possible health trajectories. Predictions for any target (diagnosis, lab, medication, procedure, encounter type) are derived from Monte Carlo aggregation over n generated timelines, enabling flexible, zero-shot prediction for arbitrary downstream tasks.

Figure 1: CoMET pretraining and inference pipeline, illustrating simulation-based prediction from patient event histories.
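
The Monte Carlo aggregation described above can be sketched as follows. The `simulate_timeline` function is a random stand-in for the generative model, and the token strings and function names are hypothetical; only the aggregation step (the fraction of simulated timelines containing the target event) reflects the technique in the text.

```python
import random

def simulate_timeline(history, rng, horizon=20):
    """Stand-in for the model (hypothetical): samples future tokens uniformly at random."""
    vocab = ["ENC//OUTPATIENT", "DX//E11.9", "DX//I50.9", "TIME//1M"]
    return [rng.choice(vocab) for _ in range(horizon)]

def estimate_event_probability(history, target, n, simulate, seed=0):
    """Estimate P(target occurs) as the share of n simulated timelines that contain it."""
    rng = random.Random(seed)
    hits = sum(target in simulate(history, rng) for _ in range(n))
    return hits / n

# Example: estimated risk of a heart failure code (ICD-10-CM I50.9) from 64 simulations.
p_chf = estimate_event_probability(["DX//E11.9"], "DX//I50.9", n=64, simulate=simulate_timeline)
print(p_chf)
```

Because the same simulated timelines can be queried for any target token, one set of generations can serve many prediction tasks without retraining.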

Evaluation: Realism and Plausibility

CoMET generations were evaluated for syntactic validity, event prevalence, and co-occurrence. Invalid multi-token event rates were <0.01% for CoMET-L, and prevalence/co-occurrence RMSLE values decreased with scale, indicating improved realism. Calibration plots for encounter frequency show good agreement between predicted and observed distributions, with expected calibration error (ECE) improving as model size increases.

Figure 2: CoMET evaluation performance across major categories, showing improvement with scale and competitive results versus task-specific supervised models.

Figure 3: Calibration of CoMET-L for predicting encounter frequency by type, demonstrating accurate probabilistic forecasting.

Figure 4: Precision-recall curves for single-encounter event generation, with CoMET outperforming lookback baselines and improving with scale.

Figure 5: Heatmaps of event prevalence and co-occurrence in CoMET-L generations versus ground truth, confirming plausibility of simulated timelines.
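
For readers unfamiliar with expected calibration error, the sketch below computes the standard equal-width-bin ECE for binary outcomes. This is the common textbook formulation; how the paper adapts calibration to encounter-frequency predictions is not specified in this summary, so treat the binning scheme and the function name as assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_p = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_p - observed)
    return ece

# Toy example: predictions of 0.1 and 0.9 that are each off by 0.1 from observed rates.
ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(round(ece, 3))
```

A lower value means predicted probabilities track observed frequencies more closely, which is the sense in which ECE is said to improve with model size.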

Disease Risk Prediction and Differential Diagnosis

CoMET was evaluated on disease-specific outcomes (T2DM, HLD, HTN), acute-on-chronic events, and incident disease risk. For T2DM, CoMET-L outperformed supervised models on most outcomes (e.g., CKD progression, neuropathy, retinopathy), with AUCROC improvements up to 0.04. For hyperlipidemia, CoMET-L achieved robust AUCROC (0.93 for CHF diagnosis) but did not surpass supervised models on all tasks, indicating areas for further investigation.

Figure 6: Percent increase in AUCROC for CoMET models versus supervised baselines on T2DM-specific outcomes.

Figure 7: Hyperlipidemia-specific outcome prediction, showing consistent improvement with scale but lower performance than supervised models on diagnosis tasks.

Figure 8: Hypertension-specific outcome prediction, with CoMET-L matching or exceeding supervised models on 6/10 tasks.

Acute-on-chronic event prediction (CHF, asthma, sickle cell crisis, alcohol withdrawal, COPD) showed CoMET-L outperforming baselines on 4/5 tasks. Incident disease risk prediction (COPD, CHF, dementia, asthma, alcohol use disorder, heart attack) revealed that CoMET-L achieved higher PR-AUC than baselines on all tasks, though AUCROC gains were limited by class imbalance and simulation count.

Figure 9: Acute-on-chronic outcome prediction, with CoMET-M/L outperforming baselines on most tasks.

Figure 10: Incident disease risk prediction, with CoMET-L showing PR-AUC gains but limited AUCROC improvement due to low prevalence.

CoMET also demonstrated the ability to generate ranked, quantitative differential diagnoses for hepatopancreatobiliary and rheumatic disease clusters, outperforming supervised models in sensitivity and AUCROC, and flagging correct diagnoses earlier in the patient timeline.

Figure 11: Hepatopancreatobiliary differential diagnosis, with CoMET-L flagging correct diagnoses earlier and with higher sensitivity.

Figure 12: Rheumatic differential diagnosis, showing CoMET-L's ability to distinguish among similar presentations.

Operational Forecasting: Utilization, Length of Stay, Readmission

CoMET models were evaluated for forecasting encounter counts, hospital length of stay (LOS), and 30-day readmission risk. CoMET-L achieved lower MAE than supervised regression models for encounter frequency and LOS prediction, and higher AUCROC (0.770 vs. 0.717) for readmission risk.

Figure 13: One-year encounter frequency forecasting, with CoMET models outperforming supervised regression baselines.

Figure 14: Hospital length of stay prediction, with CoMET models achieving lower MAE than supervised baselines.

Figure 15: ROC curves for 30-day readmission prediction, with CoMET-L outperforming the best supervised model.

Scaling law analysis confirmed that training loss decreases predictably with increased model size and data volume, following power-law relationships. Downstream evaluation metrics (AUCROC, PR-AUC, MAE) improved sigmoidally as training loss decreased, and most tasks had not yet plateaued, indicating substantial headroom for further scaling.

Figure 16: Scaling law analysis, showing compute-optimal model size and training tokens follow power laws.

Figure 17: Downstream performance versus train loss, demonstrating sigmoidal improvement across tasks.

Increasing test-time compute (the number of generated simulations) improved performance independently of model scale, especially for low-prevalence tasks, with diminishing returns beyond n ≈ 64.

Figure 18: Effect of test-time compute on performance, with increased simulations yielding higher metrics.
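
One plausible statistical reading of the diminishing returns, offered here as general background rather than the paper's own analysis, is that a probability estimated from n independent simulations is a binomial proportion whose standard error shrinks only as 1/√n. The sketch below tabulates that error; the function name and the example probability p = 0.05 are illustrative assumptions.

```python
import math

def mc_standard_error(p, n):
    """Standard error of a proportion estimated from n independent simulations."""
    return math.sqrt(p * (1 - p) / n)

# Quadrupling the number of simulations only halves the sampling error.
for n in (4, 16, 64, 256):
    print(n, round(mc_standard_error(0.05, n), 4))
```

An estimate built from n simulations can also take only n + 1 distinct values (0, 1/n, ..., 1), so for rare outcomes many patients receive tied scores at small n, which is consistent with low-prevalence tasks benefiting most from more simulations.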

Model Validity, Bias, and Fairness

Syntactic validity of generated events was high, with error rates <0.01% for CoMET-L. Subcohort analysis of AUCROC by demographic group showed no evidence of systematic bias, though further work is needed for prospective validation and fairness assessment.

Figure 19: Syntactic validity of generated events across CoMET models.

Figure 20: Subcohort analysis of bias and fairness, with CoMET-L showing consistent AUCROC across demographic groups.

Limitations and Future Directions

Key limitations include reliance on real-world EHR data subject to documentation errors and missingness, discretization of continuous values in tokenization, and context window constraints. Evaluation focused on aggregate metrics; subpopulation calibration and prospective validation remain open areas.

Future work should incorporate additional structured and multimodal data types, enable counterfactual reasoning, extend time-to-event analysis, and explore fine-tuning for task-specific improvements. Human factors research and governance frameworks are essential for responsible deployment.

Implications and Prospects

The results establish that generative medical event models scale predictably and deliver strong zero-shot performance across diverse clinical tasks. CoMET's simulation-based approach enables flexible, out-of-the-box prediction for diagnosis, prognosis, and operational forecasting, with competitive or superior performance to task-specific supervised models. The scaling law findings provide a principled methodology for resource allocation in future model development.

Theoretical implications include confirmation that scaling laws observed in NLP domains extend to structured medical event data, with similar exponents but higher token-to-parameter ratios. Practically, CoMET offers a general-purpose engine for real-world evidence generation, supporting clinical decision-making and health system operations.

Conclusion

CoMET demonstrates that large-scale generative medical event models trained on longitudinal EHR data can match or exceed the performance of task-specific supervised models across a wide range of clinical and operational tasks. Scaling both model and data size yields predictable improvements, and simulation-based inference enables flexible, zero-shot prediction. These findings position generative medical event models as a foundational technology for scalable, personalized medicine and real-world evidence generation.
