
MDS-ED Benchmark: Scalable Multimodal ED Evaluation

Updated 1 February 2026
  • MDS-ED Benchmark is an emerging framework that integrates heterogeneous static, temporal, and unstructured data modalities for scalable ED machine learning evaluation.
  • It defines diverse prediction tasks—including decompensation, disposition, revisit, diagnosis, and deterioration—using time-indexed vector inputs and rigorous evaluation metrics.
  • The benchmark supports advanced model architectures and robust evaluations (e.g., AUROC, AUPRC, calibration) while addressing fairness and robustness across demographic groups.

The MDS-ED Benchmark is an emerging framework for comprehensive, multimodal, dynamic, and scalable evaluation of machine learning and foundation models in emergency department (ED) medicine and in scientific domains that require rich, multi-source data integration. Developed in response to the need for rigorous benchmarking of clinical and scientific decision-support models that incorporate high-frequency, heterogeneous data, MDS-ED builds on prior standards such as MC-BEC and incorporates advances in multimodal data fusion, time-dependent prediction, robustness evaluation, scalability, and real-world applicability.

1. Multimodal Dataset Composition and Preprocessing

MDS-ED datasets are designed to support modeling approaches that integrate static and temporally modulated clinical, physiological, and unstructured data modalities at scale. In clinical emergency medicine, this includes patient demographics, triage information, vital signs (continuous and trend-based features), laboratory measurements, waveform signals (12-lead ECG sampled at high frequency), orders, medication administrations, free-text imaging reports, and derived indices such as heart-rate variability and perfusion metrics (Alcaraz et al., 2024, Chen et al., 2023). Preprocessing for MDS-ED benchmarks typically involves:

  • Aggregation and featurization of continuous signals (e.g., 1-minute window means, maxima/minima, and linear-trend extraction).
  • Feature engineering for waveforms using manual or deep neural techniques (e.g., S4 sequence models, transformer-based embeddings).
  • Dimensionality reduction and embedding of categorical or textual data via models such as ClinicalBERT, RadBERT, or domain-specific Word2Vec.
  • Concatenation of modality-specific embeddings to yield visit-level or timepoint-level vectors.
  • Outlier filtering (e.g., physiological limits), unit normalization, and explicit handling of missingness without imputation in baseline settings.
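The windowed aggregation step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function name `window_features` and the choice of per-window statistics (mean, max, min, least-squares slope) are assumptions consistent with the features listed.

```python
import numpy as np

def window_features(signal, fs_hz=1.0, window_s=60):
    """Aggregate a continuous vital-sign series into per-window features.

    `signal` is a 1-D array sampled at `fs_hz` Hz (hypothetical schema).
    Each non-overlapping window of `window_s` seconds yields
    [mean, max, min, linear-trend slope].
    """
    n = int(fs_hz * window_s)
    feats = []
    for start in range(0, len(signal) - n + 1, n):
        w = signal[start:start + n]
        t = np.arange(n)
        slope = np.polyfit(t, w, 1)[0]  # least-squares linear trend
        feats.append([w.mean(), w.max(), w.min(), slope])
    return np.asarray(feats)
```

A two-minute series sampled at 1 Hz thus yields two feature rows, one per 60-second window.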

The benchmark incorporates large-scale, de-identified datasets with patient-wise splits to ensure unique individuals are not mixed across train, validation, and test partitions.
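A patient-wise split can be implemented by partitioning patient identifiers, then assigning every visit to its patient's partition. The sketch below uses a hypothetical record schema (a `patient_id` key per visit) and fixed ratios; the real benchmark's split procedure may differ.

```python
import random

def patient_wise_split(visits, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition visit records so every visit from one patient lands in
    exactly one of train/val/test, preventing patient-level leakage."""
    patients = sorted({v["patient_id"] for v in visits})
    random.Random(seed).shuffle(patients)
    n = len(patients)
    cut1 = int(ratios[0] * n)
    cut2 = cut1 + int(ratios[1] * n)
    bucket = {p: "train" for p in patients[:cut1]}
    bucket.update({p: "val" for p in patients[cut1:cut2]})
    bucket.update({p: "test" for p in patients[cut2:]})
    split = {"train": [], "val": [], "test": []}
    for v in visits:
        split[bucket[v["patient_id"]]].append(v)
    return split
```

Splitting on patient IDs rather than visits is what guarantees that repeat visitors never straddle partitions.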

2. Prediction Tasks and Temporal Horizons

MDS-ED formalizes multiple clinical and scientific tasks across diverse time scales and outcome types. In the emergency care context, canonical tasks include (Alcaraz et al., 2024, Chen et al., 2023):

  • Decompensation: Early prediction (from initial assessment) of acute physiological events (tachycardia, hypotension, hypoxia) with binary labels indicating event onset within defined horizons (e.g., 60/90/120 min post-admission).
  • Disposition: End-of-visit classification (admission vs. discharge) using all available data accrued up to ED departure.
  • Revisit: Prediction of ED return within fixed intervals (3, 7, 14 days) post discharge.
  • Diagnosis: Multilabel classification of discharge diagnoses over extensive ICD-10 code sets.
  • Deterioration: Multitarget prediction of critical events (cardiac arrest, ICU admission, mechanical ventilation, mortality) at horizons ranging from 24 hours to 365 days.
  • Resource Utilization and Generation: Estimation of length-of-stay, imaging/lab resource usage, and sequence generation tasks such as ED course summarization.

Formal task definitions use visit- and time-indexed vector inputs x_{0:T}, binary or multilabel targets y, and prediction horizons h. Dynamic tasks often require sequence-modeling architectures capable of handling temporally irregular sampling and event-driven labeling.
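Horizon-based event labeling, as used for the decompensation task, can be sketched as below. The function name and the convention (event times in minutes post-admission) are illustrative assumptions, not the benchmark's canonical implementation.

```python
def horizon_label(event_times_min, horizon_min):
    """Binary label y: 1 if any adverse event (e.g., tachycardia onset)
    occurs within horizon h minutes after ED admission, else 0.
    `event_times_min` holds event onsets in minutes post-admission."""
    return int(any(0 <= t <= horizon_min for t in event_times_min))
```

For a single event at 75 minutes, the three canonical horizons would label as `{h: horizon_label([75], h) for h in (60, 90, 120)}`, i.e. negative at 60 minutes and positive at 90 and 120.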

3. Evaluation Frameworks, Metrics, and Robustness

Evaluation in MDS-ED benchmarks encompasses predictive discrimination, calibration, monotonicity, missingness robustness, and fairness. Standardized splits and metrics are defined as follows:

  • Metrics: Area under the precision-recall curve (AUPRC), area under the receiver operating characteristic curve (AUROC), accuracy, concordance index (C-index), mean absolute error (MAE, for regression), and calibration indices (e.g., Brier score, Expected Calibration Error) (Alcaraz et al., 2024, Chen et al., 2023).
  • Fairness/Bias: True positive rate (TPR) disparities at fixed sensitivity thresholds across demographic strata (age, gender, race, ethnicity) are quantified.
  • Robustness: Sensitivity of predictive performance to missing modality inputs (e.g., maximum ΔAUPRC loss when withholding one channel).
  • Generalization: Site- and year-based splits support analysis of temporal drift and cross-population transferability.
  • Composite evaluation: For molecular and model discovery domains, metrics include normalized mean squared error (NMSE), model complexity (symbolic tree length), equation residual norms R(f), and composite fitness functions (e.g., s(f|u) = 1/(1+NMSE) + λ exp(−l(f)/L)) (Bideh et al., 24 Sep 2025, Xiang et al., 14 May 2025).
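The fairness audit described above (TPR disparities across demographic strata at a fixed threshold) can be sketched as follows; the function name and return shape are illustrative assumptions.

```python
def tpr_gap(y_true, scores, groups, threshold):
    """Per-group true positive rates at a fixed decision threshold,
    plus the maximum pairwise TPR gap across strata (fairness audit)."""
    tprs = {}
    for g in set(groups):
        # restrict to positive-label members of stratum g
        hits = [s >= threshold
                for y, s, gg in zip(y_true, scores, groups)
                if y == 1 and gg == g]
        if hits:
            tprs[g] = sum(hits) / len(hits)
    vals = list(tprs.values())
    return max(vals) - min(vals), tprs
```

A reported gap of ≈0.11, as in the revisit task below, would mean the best- and worst-served strata differ by 11 percentage points in sensitivity at the chosen operating point.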

Evaluation cohorts are bootstrapped for statistical confidence intervals (e.g., 95% CI over 1000 resamples).
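A percentile-bootstrap confidence interval over an evaluation cohort can be sketched as below; the rank-based AUROC and the resampling loop are standard techniques, but the exact resampling protocol of the benchmark is not specified here and the function names are assumptions.

```python
import random

def auroc(y, s):
    """AUROC via the rank (Mann-Whitney) formulation: the probability
    that a random positive outscores a random negative (ties count 0.5)."""
    pos = [x for x, yy in zip(s, y) if yy == 1]
    neg = [x for x, yy in zip(s, y) if yy == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y, s, metric=auroc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the cohort with replacement,
    recompute the metric, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        samp = [rng.randrange(len(y)) for _ in range(len(y))]
        yy = [y[i] for i in samp]
        ss = [s[i] for i in samp]
        if len(set(yy)) < 2:   # resample must contain both classes
            continue
        stats.append(metric(yy, ss))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi
```

With n_boot = 1000 and alpha = 0.05 this matches the "95% CI over 1000 resamples" convention quoted above.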

4. Model Architectures and Data Fusion Strategies

MDS-ED benchmarks permit and encourage the use of advanced, multimodal fusion models, often leveraging architectures such as (Alcaraz et al., 2024, Chen et al., 2023):

  • Gradient-boosted trees (XGBoost, LightGBM, Random Forest): Provided as baseline, unimodal or feature-concatenated approaches.
  • Sequence models and transformers (S4 classifier, ECG-transformer, Perceiver IO): Used for waveform and cross-modal time series.
  • Tensor fusion layers: Outer product or flatten-and-concatenate schemes for combining tabular and complex embeddings.
  • Multitask and multitarget heads: Enabling flexible outputs for multiple clinical events.
  • End-to-end multimodal models: Explicit encouragement to employ architectures able to learn temporal and cross-modal dependencies, especially in settings with substantial missingness and asynchronous sampling.

Scalability, computational efficiency, and memory requirements are addressed; for example, raw waveform and fusion models demand GPU hardware, while tabular approaches scale efficiently on standard CPUs.
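The two fusion schemes named above (flatten-and-concatenate vs. outer-product tensor fusion) can be contrasted in a few lines. This is a dimensional sketch with assumed embedding inputs, not the benchmark's reference implementation.

```python
import numpy as np

def concat_fusion(tab_emb, ecg_emb):
    """Flatten-and-concatenate late fusion: stack modality embeddings
    into one visit-level vector of length d_tab + d_ecg."""
    return np.concatenate([tab_emb, ecg_emb])

def tensor_fusion(tab_emb, ecg_emb):
    """Outer-product tensor fusion: a constant 1 is appended to each
    modality so the flattened product retains each unimodal embedding
    alongside all pairwise cross-modal interaction terms."""
    a = np.append(tab_emb, 1.0)
    b = np.append(ecg_emb, 1.0)
    return np.outer(a, b).ravel()
```

Concatenation scales linearly in the embedding sizes, while the tensor product grows as (d_tab + 1)(d_ecg + 1), which is one reason the outer-product variant is typically applied to compact embeddings.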

5. Performance Baselines and Comparative Results

MDS-ED benchmark papers report detailed performance for single-task, multitask, unimodal, and multimodal models. Representative figures from the ED context include (Alcaraz et al., 2024, Chen et al., 2023):

| Task | Model | Macro AUROC | AUPRC | Challenge/Insight |
|---|---|---|---|---|
| Diagnosis | Tabular+ECG | 0.7873 | | Fusion outperforms unimodal; Δ = 0.02–0.03 |
| Deterioration | Tabular+ECG | 0.8815 | | 14/15 targets > 0.8 AUROC |
| Decompensation (60 min) | LightGBM-ST | | 0.33 | Robustness: ΔAUPRC up to 0.14 with a missing modality |
| Revisit (14 d) | LightGBM-ST | | 0.24 | TPR gap ≈ 0.11 between racial groups |

Statistically significant improvements (p < 0.01) are observed for multimodal over unimodal models in diagnosing complex cardiometabolic conditions (ICD codes) and predicting deterioration.

6. Scalability, Extensibility, and Best Practices

MDS-ED fosters scalable benchmarking through open-source codebases, dockerized data loaders, pretrained embedding libraries, and modular definitions permitting the addition of novel modalities and tasks. Recommendations and practices outlined include (Chen et al., 2023):

  • Inclusion of dynamic modalities (progress notes, imaging data, social determinants).
  • Explicit patient-wise data splits to prevent leakage.
  • Quantitative strategies for handling missingness (masking, imputation controls).
  • Comprehensive fairness audits across expanded subpopulations (socioeconomic, language, disability status).
  • Support for year-based or site-based splits for temporal generalization studies.
  • Provision of reproducible training/evaluation pipelines and reference results.
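One quantitative strategy for handling missingness without silent imputation, consistent with the masking recommendation above, is to pair zero-filled values with an explicit binary mask channel. This is a minimal sketch under assumed NaN-coded missingness; the function name is illustrative.

```python
import numpy as np

def mask_missing(x):
    """Return a feature vector [values, mask]: NaN entries are zero-filled
    and flagged in a binary mask channel, so the model observes which
    measurements were actually taken rather than an imputed guess."""
    mask = ~np.isnan(x)
    filled = np.where(mask, x, 0.0)
    return np.concatenate([filled, mask.astype(float)])
```

The mask channel also makes the robustness evaluation above straightforward: withholding a modality amounts to zeroing its values and its mask entries.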

A plausible implication is that adoption of such frameworks may improve generalization, transferability, and clinical relevance for AI systems deployed in ED settings.

7. Prospects, Limitations, and Future Directions

The MDS-ED paradigm is extensible to domains beyond emergency medicine, including molecular modeling (electron density prediction with EDBench (Xiang et al., 14 May 2025)) and dynamical system model discovery (MDBench (Bideh et al., 24 Sep 2025)). Key open challenges and areas for expansion include:

  • Integration of richer unstructured data (imaging, notes, audio).
  • Advanced temporal modeling (continuous monitoring, sequence-to-sequence prediction).
  • Robustness to missing modalities, asynchronous sampling, and outlier data.
  • Incorporation of causal inference, explainability, and cross-site validation for equity.
  • Extension to generative tasks and meta-evaluation in multimodal dialogue summarization (MDSEval (Liu et al., 2 Oct 2025)).
  • Optimization of thresholding for density sampling in molecular applications.

This suggests that MDS-ED will serve as a foundational resource for the next generation of interpretable clinical and scientific machine learning models, driving measurable progress on dynamic, multimodal, and scalable real-world benchmarks.
