DiReCT Benchmark (MIMIC-IV): Clinical AI Evaluation

Updated 3 August 2025
  • The paper introduces a standardized framework that integrates structured, time-series, imaging, and text data for clinical ML evaluation.
  • It employs a modular preprocessing pipeline and multimodal fusion to deliver robust, reproducible predictive insights using metrics like AUROC and AUPRC.
  • Key evaluations include in-hospital mortality and LOS predictions with detailed fairness, interpretability, and bias analyses across demographic subgroups.

The DiReCT Benchmark (MIMIC-IV) is a comprehensive evaluation framework and standardized pipeline for assessing clinical machine learning models—particularly foundation and multimodal models—on publicly available electronic health record (EHR) data. Developed with the explicit goal of enabling consistent, reproducible, and multidimensional evaluation, it harmonizes diverse MIMIC-IV data modalities (structured, time-series, imaging, and text) for downstream predictive, fairness, and interpretability analyses. The DiReCT Benchmark’s integration of rigorous preprocessing, advanced model evaluation, and fairness considerations represents an effort to standardize clinical AI research and facilitate progress in developing real-world trustworthy decision support systems (Yu et al., 20 Jul 2025).

1. Data Integration and Preprocessing Pipeline

The DiReCT Benchmark centers on MIMIC-IV (v2.2), a large public EHR resource encompassing structured hospital and ICU data, chest X-ray images, and clinical free-text notes. All modalities are consolidated at the ICU stay level via a master dataset, with key preprocessing components:

  • Structured features: demographics, baseline characteristics, comorbidities, and time-series vital signs (sampled during the initial 24 hours of ICU admission).
  • Imaging: CXR studies (e.g., DICOM images) harmonized for machine learning input.
  • Text: Clinical notes encoded for downstream natural language processing.

A flexible, modular preprocessing pipeline addresses variable selection, outlier handling, imputation (e.g., mean or mode replacement), chronological sorting of time-stamped records, and aggregation to analytic-ready formats. The pipeline’s transparent design and automatic documentation of parameter choices ensure that both model inputs and preprocessing steps are consistent and reproducible across studies (Yu et al., 20 Jul 2025).
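
The released pipeline should be consulted for the exact implementation; purely to illustrate the kinds of steps listed above, a minimal pandas sketch with hypothetical column names (stay_id, intime, charttime, heart_rate) might combine chronological sorting, first-24-hour windowing, mean imputation, and per-stay aggregation as follows:

```python
import pandas as pd

# Hypothetical vitals table: one row per charted measurement during an ICU stay.
# Column names (stay_id, intime, charttime, heart_rate) are illustrative only.
vitals = pd.DataFrame({
    "stay_id": [1, 1, 1, 2, 2],
    "intime": pd.to_datetime(["2180-01-01 00:30"] * 3 + ["2180-03-10 01:00"] * 2),
    "charttime": pd.to_datetime([
        "2180-01-01 05:00", "2180-01-01 01:00", "2180-01-02 03:00",
        "2180-03-10 02:00", "2180-03-10 10:00",
    ]),
    "heart_rate": [None, 88.0, 95.0, 110.0, 105.0],
})

# Chronological sorting of time-stamped records.
vitals = vitals.sort_values(["stay_id", "charttime"])

# Restrict to the first 24 hours of the ICU admission.
first_day = vitals[vitals["charttime"] <= vitals["intime"] + pd.Timedelta(hours=24)]

# Simple mean imputation (a global mean here, purely for illustration).
first_day = first_day.assign(
    heart_rate=first_day["heart_rate"].fillna(first_day["heart_rate"].mean())
)

# Aggregate to one analytic-ready row per ICU stay.
features = first_day.groupby("stay_id")["heart_rate"].agg(["mean", "min", "max"])
print(features)
```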

2. Benchmark Predictive Tasks and Model Evaluation

The DiReCT Benchmark supports two primary downstream tasks:

  • In-hospital mortality prediction: Estimating whether a patient will die during the current admission.
  • Length-of-stay (LOS) prediction: Predicting whether the ICU stay exceeds three days. Both labels are binary; a label-construction sketch follows this list.
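
As a minimal sketch of how these two binary labels could be derived from a per-stay cohort table (the column names hospital_expire_flag and los_days are illustrative, not necessarily the benchmark's schema):

```python
import pandas as pd

# Hypothetical per-stay cohort table; column names are illustrative, not the
# benchmark's actual schema.
cohort = pd.DataFrame({
    "stay_id": [1, 2, 3],
    "hospital_expire_flag": [0, 1, 0],   # died during the current admission?
    "los_days": [1.6, 5.2, 3.4],         # ICU length of stay in days
})

# Task 1: in-hospital mortality (already a binary outcome).
cohort["mortality_label"] = cohort["hospital_expire_flag"].astype(int)

# Task 2: length of stay, binarized as "ICU stay exceeds three days".
cohort["los_gt_3d_label"] = (cohort["los_days"] > 3).astype(int)

print(cohort[["stay_id", "mortality_label", "los_gt_3d_label"]])
```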

For unimodal benchmarking, specialized encodings are employed:

  • Time series: fixed time-interval aggregation, GRU, or Moment model representations (a minimal GRU encoder sketch follows this list).
  • Images: domain-specific (e.g., CXR-Foundation) versus general-purpose (e.g., Swin Transformer) feature extraction.
  • Text: embeddings from RadBERT and general-purpose text models.
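
As one concrete illustration of the encoders above, a minimal PyTorch GRU that maps a padded first-24-hour vitals sequence to a fixed-length stay embedding might look like the following; the feature count and hidden size are arbitrary choices, not the benchmark's settings:

```python
import torch
import torch.nn as nn

class GRUStayEncoder(nn.Module):
    """Encode a (batch, time, features) vitals sequence into one vector per stay."""

    def __init__(self, n_features: int = 12, hidden_size: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_size=n_features, hidden_size=hidden_size,
                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h_n has shape (num_layers, batch, hidden_size); the final hidden state
        # of the last layer is used as the stay-level embedding.
        _, h_n = self.gru(x)
        return h_n[-1]

# Example: 4 stays, 24 hourly steps, 12 vital-sign features (synthetic values).
embeddings = GRUStayEncoder()(torch.randn(4, 24, 12))
print(embeddings.shape)  # torch.Size([4, 128])
```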

Multimodal fusion is accomplished by concatenating the unimodal embeddings, which are then fed into a logistic regression classifier. The primary evaluation metrics are AUROC, AUPRC, and accuracy, computed with 1,000 bootstrap samples to yield 95% confidence intervals. The AUROC is formally defined as:

$$\mathrm{AUROC} = \int_0^1 \mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right)\, dx$$

where TPR and FPR denote the true positive rate and false positive rate, respectively.
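
A minimal sketch of this modular evaluation on synthetic data, assuming pre-computed unimodal embeddings of arbitrary width: embeddings are concatenated, a logistic regression classifier is fit, and a 95% confidence interval for AUROC is obtained from 1,000 bootstrap resamples of the test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical per-stay embeddings from three unimodal encoders (synthetic data).
ts_emb = rng.normal(size=(500, 128))    # time series
img_emb = rng.normal(size=(500, 64))    # imaging
txt_emb = rng.normal(size=(500, 64))    # text
y = rng.integers(0, 2, size=500)        # binary outcome label

# Modular fusion: concatenate embeddings, then fit a logistic regression classifier.
X = np.concatenate([ts_emb, img_emb, txt_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
scores, y_test = clf.predict_proba(X[400:])[:, 1], y[400:]

# 95% confidence interval for AUROC from 1,000 bootstrap resamples of the test set.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), size=len(y_test))
    if len(np.unique(y_test[idx])) < 2:   # AUROC needs both classes present
        continue
    boot.append(roc_auc_score(y_test[idx], scores[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUROC = {roc_auc_score(y_test, scores):.3f} (95% CI {lo:.3f}-{hi:.3f})")
```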

For large vision-language models (LVLMs), both domain-adapted and general-purpose variants are prompted with unified templates (integrating images with tabular and text data rendered as text) and evaluated on binary classification with metrics such as accuracy, precision, recall, specificity, and F1-score (Yu et al., 20 Jul 2025).
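
The exact prompt wording is specified in the benchmark's materials; as a hypothetical illustration of a unified template that renders tabular data and a clinical note as text (the accompanying chest X-ray would be passed to the LVLM as a separate visual input):

```python
def build_lvlm_prompt(tabular: dict, note: str) -> str:
    """Render structured features and a clinical note into one text prompt.

    Hypothetical template for illustration only; the chest X-ray itself would be
    passed to the LVLM separately as a visual input rather than as text.
    """
    tabular_text = "\n".join(f"- {k}: {v}" for k, v in tabular.items())
    return (
        "You are given a chest X-ray, structured ICU data, and a clinical note.\n"
        f"Structured data:\n{tabular_text}\n"
        f"Clinical note:\n{note}\n"
        "Question: Will this patient die during the current hospital admission? "
        "Answer 'yes' or 'no'."
    )

print(build_lvlm_prompt({"age": 67, "heart_rate_mean": 92}, "Bilateral opacities noted."))
```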

3. Multimodal Foundation Models and LVLMs

Eight foundation models encompassing both unimodal and multimodal approaches, as well as domain-adapted and general-purpose architectures, are systematically assessed. Examples include the GRU time-series encoder, CXR-Foundation and Swin Transformer for images, and RadBERT for text. For multimodal learning:

  • Modular (feature concatenation) fusion demonstrates consistent improvements in predictive performance over unimodal models.
  • LVLMs such as LLaVA-Med, GPT-4o mini, and LLaVA-v1.5-7b are benchmarked using integrated template prompts. While some LVLMs achieve in-hospital mortality performance comparable to the modular approach, they generally underperform for LOS prediction, highlighting current limitations in LVLM task generalizability and the minimal observed advantage for domain specialization (Yu et al., 20 Jul 2025).

4. Fairness and Interpretability Analyses

The DiReCT Benchmark evaluates group fairness, focusing on age, gender, and race subgroups, using:

  • Demographic parity: the minimum-to-maximum ratio of selection rates across groups.
  • Equalized odds: the smaller of the minimum-to-maximum ratios of TPRs and of FPRs across groups (a computation sketch follows this list).
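
A small sketch of these two group-fairness ratios on synthetic predictions, reading demographic parity as the minimum-to-maximum selection-rate ratio and equalized odds as the smaller of the corresponding TPR and FPR ratios; this is an illustrative reading of the definitions above, not the benchmark's code:

```python
import numpy as np

def min_max_ratio(values):
    values = np.asarray(values, dtype=float)
    return values.min() / values.max()

def group_fairness(y_true, y_pred, groups):
    """Demographic-parity and equalized-odds style ratios across subgroups (1 = parity).

    Assumes every group contains both positive and negative cases.
    """
    sel_rates, tprs, fprs = [], [], []
    for g in np.unique(groups):
        yt, yp = y_true[groups == g], y_pred[groups == g]
        sel_rates.append(yp.mean())       # selection rate
        tprs.append(yp[yt == 1].mean())   # true positive rate
        fprs.append(yp[yt == 0].mean())   # false positive rate
    return {
        "demographic_parity_ratio": min_max_ratio(sel_rates),
        "equalized_odds_ratio": min(min_max_ratio(tprs), min_max_ratio(fprs)),
    }

# Toy example with two demographic groups (synthetic labels and predictions).
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(group_fairness(y_true, y_pred, groups))
```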

Findings indicate that increasing the number of data modalities improves prediction performance without introducing additional bias; overall fairness metrics remain stable across subgroups. Interpretability is provided through:

  • SHAP (SHapley Additive exPlanations), which quantifies feature importance contributions.
  • Inspection of logistic regression coefficients in fused encodings (a coefficient-inspection sketch follows this list), which consistently shows that time-series modalities are most influential for both mortality and LOS prediction; when modalities are missing, imaging features become more prominent (Yu et al., 20 Jul 2025).
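
A minimal sketch of coefficient-based modality importance on synthetic data, assuming the fused feature vector is a concatenation of time-series, imaging, and text blocks of known (here arbitrarily chosen) width; the mean absolute coefficient per block serves as a crude modality-level importance score, which SHAP values would refine to per-feature contributions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic fused embedding: [time-series | imaging | text] blocks of known width
# (widths are arbitrary illustrative choices).
blocks = {"time_series": 128, "imaging": 64, "text": 64}
X = rng.normal(size=(600, sum(blocks.values())))
y = rng.integers(0, 2, size=600)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Crude modality-level importance: mean absolute coefficient over each block's columns.
start = 0
for name, width in blocks.items():
    coefs = clf.coef_[0, start:start + width]
    print(f"{name:>11}: mean |coef| = {np.abs(coefs).mean():.4f}")
    start += width
```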

5. Reproducibility and Standardization

The benchmark’s automated, auditable pipeline and modular evaluation protocol enable precise replication across research groups and facilitate direct, fair, and extensible model comparison. Key enablers of reproducibility include:

  • Complete documentation and open-source release of pipeline management code (https://github.com/nliulab/MIMIC-Multimodal).
  • Cohort definitions, feature selection, and preprocessing steps fully specified in code/configuration files.
  • Bootstrapped reporting of all evaluation metrics with confidence intervals.
  • Standardized prompting and template conversion for LVLM input (Yu et al., 20 Jul 2025).

6. Implications for Clinical AI Research and Deployment

The DiReCT Benchmark demonstrates that integrating structured, time-series, imaging, and text data improves predictive accuracy for core clinical outcomes without compromising fairness across demographic subgroups. Cost-effective unimodal model fine-tuning and transparent modular fusion outperform current multimodal LVLMs. The comprehensive design, with rigorous evaluation of both fairness and interpretability, supports the development of robust, trustworthy clinical decision support tools. The framework’s extensibility and uniform evaluation also accelerate method comparison, facilitating progress toward deployment-ready clinical AI.

A plausible implication is that effective harmonization of multimodal EHR data and transparent benchmarking of both domain-specific and general-purpose models are essential for building clinically reliable, explainable, and equitable decision support systems (Yu et al., 20 Jul 2025). The DiReCT Benchmark thus serves both as a reference for model performance and as a design pattern for reproducible clinical ML evaluation frameworks.
