SimSUM Dataset: Multimodal EHR Benchmark

Updated 28 November 2025

SimSUM is a synthetic, multimodal benchmark that integrates structured EHR data with generated clinical notes using an expert-informed Bayesian network.
It enables controlled evaluation of information extraction and multimodal fusion methods by providing full transparency of the data-generating process.
The dataset supports studies on virtual evidence, consistency node fusion, and interpretable risk modeling with reproducible simulation of patient records.

SimSUM is a synthetic, multimodal benchmark dataset purpose-built to support research on patient-level information extraction from electronic health records (EHRs). It integrates structured tabular data and precisely generated clinical notes through a fully specified, expert-informed Bayesian network (BN), providing a transparent environment for the development and rigorous evaluation of information extraction and multimodal fusion methods. SimSUM is distinct in that both the graphical data-generating process and conditional distributions are provided in full, addressing a persistent limitation in most existing EHR benchmarks where ground-truth semantics are either implicit or unknown (Rabaey et al., 21 Nov 2025).

1. Motivation and Design Rationale

The principal motivation behind SimSUM is the inherent multimodality of real-world EHRs: while some clinical attributes are routinely structured (e.g., laboratory codes, medication lists), a substantial proportion of critical information, such as patient-reported symptoms and nuanced clinical findings, remains embedded in unstructured narratives. Prior benchmarks have typically separated the modeling challenges by providing either only structured fields or only text corpora. This prevents the study of principled fusion of probabilistic and neural methods, and the absence of a known data-generating mechanism impedes evaluation.

SimSUM's design provides joint synthetic records in which each patient's structured and unstructured fields are sampled according to a known, expert-designed BN reflecting respiratory disease etiology. This enables controlled study of information fusion and probabilistic inference, with known ground truth for both causal structure and parameterization (Rabaey et al., 21 Nov 2025).

2. Data Generation and Simulation Process

The data-generating process in SimSUM is governed by an expert-informed BN encoding dependencies between patient background, clinical risk factors, symptoms, diagnoses, treatments, and outcomes:

BN Structure and Variables:
- Background: Asthma, Smoking, COPD, Hay fever (all binary); Season (winter/other)
- Diagnosis: Pneumonia, Common cold (both binary)
- Symptoms: Dyspnea, Cough, Pain, Nasal symptoms (binary); Fever (none/low/high)
- Treatment/Outcome: Antibiotics (binary); #Days at home (range 0–15)
Conditional Distributions:
- Background conditions, Season, diagnoses, and Fever are sampled from discrete conditional probability tables (CPTs).
- Symptoms employ a noisy-OR function over parental nodes:
$P(X=1 \mid P_1,\ldots,P_m) = 1 - \prod_{j:P_j=1} (1-\theta_j)$
- Antibiotics are modeled via logistic regression on the symptom vector; #Days at home is generated by discretized Poisson regression over symptoms and treatment.
Sampling and Note Synthesis:
- Sampling proceeds in BN topological order, respecting all directed dependencies.
- For each record, both standard-length ("normal") and condensed-format ("compact") clinical notes are produced via LLM prompt, constrained to mention exactly and only those symptoms present in the ground-truth vector, with occasional reference to background variables. Diagnoses are systematically omitted from the generated text. Sentence templates are used to preclude confounds from spurious mentions.
- 10,000 records are synthesized, each with paired notes (Rabaey et al., 21 Nov 2025).
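
The sampling steps above can be sketched in a few lines. This is a minimal illustration, not the SimSUM generator: it covers only a handful of the 14 variables, every numeric parameter is a made-up placeholder, the field names are hypothetical, and clipping the Poisson draw to the range 0–15 is only one possible reading of "discretized".

```python
import math
import random


def noisy_or(thetas, parents):
    """P(X=1 | parents) = 1 - product over active parents of (1 - theta_j)."""
    p_off = 1.0
    for theta, active in zip(thetas, parents):
        if active:
            p_off *= 1.0 - theta
    return 1.0 - p_off


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_record(rng):
    """Ancestral sampling: every node is drawn after its parents."""
    rec = {}
    # Root diagnoses (placeholder priors, not SimSUM's CPT values).
    rec["pneumonia"] = int(rng.random() < 0.12)
    rec["common_cold"] = int(rng.random() < 0.35)
    # Symptom: noisy-OR over its two parent diagnoses.
    p_cough = noisy_or([0.8, 0.6], [rec["pneumonia"], rec["common_cold"]])
    rec["cough"] = int(rng.random() < p_cough)
    # Treatment: logistic regression on the symptom (placeholder weights).
    rec["antibiotics"] = int(rng.random() < sigmoid(-3.0 + 2.0 * rec["cough"]))
    # Outcome: Poisson regression, clipped to 0..15 (an assumption).
    lam = math.exp(0.2 + 0.9 * rec["cough"] + 0.3 * rec["antibiotics"])
    rec["days_at_home"] = min(sample_poisson(lam, rng), 15)
    return rec


records = [sample_record(random.Random(seed)) for seed in range(5)]
```

Because the noisy-OR form shown here has no leak term, a symptom with no active parent is never sampled as present.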

3. Dataset Composition and Modalities

Each SimSUM patient record comprises:

Structured (Tabular) Feature Vector (9 variables):
- History: Asthma, Smoking, COPD, Hay fever $\in \{0,1\}$
- Season $\in \{\text{winter}, \text{other}\}$
- Diagnoses: Pneumonia, Common cold $\in \{0,1\}$
- Treatment: Antibiotics $\in \{0,1\}$
- Outcome: $\#$ Days at home $\in \{0,\ldots,15\}$
Target Symptoms (training use only):
- Dyspnea, Cough, Pain, Nasal symptoms $\in \{0,1\}$ ; Fever $\in \{\text{none}, \text{low}, \text{high}\}$
Narrative Note (string):
- Structured as "History" $\to$ "Physical Exam," consisting of 3–10 sentences and mentioning precisely the sampled present symptoms.

Component	Variable Types	Encoding/Values
Tabular	History, Season, Diagnoses, Treatment, Outcome	Binary/multiclass (as above)
Target Symptoms	Dyspnea, Cough, Pain, Nasal, Fever	0/1 or categorical (training only)
Clinical Note	Structured free-text	3–10 sentences, symptoms-only mentions
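
One way to hold such a record in code is a small validated container. The sketch below is an assumption about layout only: the class name, field names, and integer encoding are hypothetical and do not reflect the column names of the released files.

```python
from dataclasses import dataclass

BINARY_FIELDS = ("asthma", "smoking", "copd", "hay_fever", "pneumonia",
                 "common_cold", "antibiotics", "dyspnea", "cough", "pain", "nasal")
TABULAR_FIELDS = ("asthma", "smoking", "copd", "hay_fever", "season",
                  "pneumonia", "common_cold", "antibiotics", "days_at_home")
SYMPTOM_FIELDS = ("dyspnea", "cough", "pain", "nasal", "fever")


@dataclass(frozen=True)
class SimSumRecord:
    # Structured (tabular) features: 9 variables.
    asthma: int
    smoking: int
    copd: int
    hay_fever: int
    season: str          # "winter" or "other"
    pneumonia: int
    common_cold: int
    antibiotics: int
    days_at_home: int    # 0..15
    # Target symptoms: 5 variables.
    dyspnea: int
    cough: int
    pain: int
    nasal: int
    fever: str           # "none", "low", or "high"
    # Narrative note.
    note: str

    def __post_init__(self):
        # Reject values outside the domains listed in the text.
        for name in BINARY_FIELDS:
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")
        if self.season not in ("winter", "other"):
            raise ValueError("season must be 'winter' or 'other'")
        if self.fever not in ("none", "low", "high"):
            raise ValueError("fever must be 'none', 'low', or 'high'")
        if not 0 <= self.days_at_home <= 15:
            raise ValueError("days_at_home must lie in 0..15")

    def tabular(self):
        return {name: getattr(self, name) for name in TABULAR_FIELDS}

    def symptoms(self):
        return {name: getattr(self, name) for name in SYMPTOM_FIELDS}


example = SimSumRecord(
    asthma=0, smoking=1, copd=0, hay_fever=0, season="winter",
    pneumonia=0, common_cold=1, antibiotics=0, days_at_home=2,
    dyspnea=0, cough=1, pain=0, nasal=1, fever="low",
    note="History: cough and runny nose for three days. Physical Exam: mild fever.",
)
```

Keeping the symptom targets separate from the tabular features makes it harder to leak them into a model's input by accident.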

4. Probabilistic Framework

The joint probability over all variables is explicitly factorized as:

$\mathcal{P}_{tab}(V_1,\dots,V_{14}) = \prod_{i=1}^{14} \mathcal{P}(V_i | \mathrm{Pa}(V_i))$

where $V_i$ and $\mathrm{Pa}(V_i)$ denote the $i$-th variable and its BN parents, with specific ordering reflecting the underlying clinical logic (see Rabaey et al., 21 Nov 2025, Eq. 10).

For symptom nodes, the noisy-OR parameterization ensures interpretability and modularity of symptom expression with respect to multiple potential causes.
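
To make the factorization concrete, the sketch below evaluates the joint probability of a toy three-node network (two diagnoses and one symptom) as a product of local conditionals. The structure is a fragment chosen for illustration and all probabilities are hypothetical; the full SimSUM network has 14 variables.

```python
from itertools import product

# Parents of each node, listed in topological order (toy fragment).
PARENTS = {
    "pneumonia": (),
    "common_cold": (),
    "cough": ("pneumonia", "common_cold"),
}

# P(node = 1 | parent values); placeholder numbers, not SimSUM's CPTs.
P_ONE = {
    "pneumonia": {(): 0.10},
    "common_cold": {(): 0.30},
    "cough": {(0, 0): 0.05, (0, 1): 0.60, (1, 0): 0.80, (1, 1): 0.92},
}


def local_prob(node, assignment):
    """P(V_i = v_i | Pa(V_i)) for one node under a full assignment."""
    parent_values = tuple(assignment[p] for p in PARENTS[node])
    p_one = P_ONE[node][parent_values]
    return p_one if assignment[node] == 1 else 1.0 - p_one


def joint_prob(assignment):
    """Product of the local conditionals over all nodes."""
    prob = 1.0
    for node in PARENTS:
        prob *= local_prob(node, assignment)
    return prob


# Sanity check: the joint sums to one over all 2^3 assignments.
total = sum(
    joint_prob(dict(zip(PARENTS, values)))
    for values in product((0, 1), repeat=len(PARENTS))
)
```

For instance, the assignment pneumonia=1, common_cold=0, cough=1 evaluates to $0.10 \times 0.70 \times 0.80 = 0.056$ under these placeholder tables.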

SimSUM enables rigorous evaluation of multimodal fusion strategies. In particular:

Virtual evidence is modeled as an auxiliary child node $\tilde{s}$ with emission probabilities determined by a neural classifier's confidence:

$P(\tilde{s}=1|s=1) = p_{\mathrm{NN}}, \quad P(\tilde{s}=1|s=0) = 1 - p_{\mathrm{NN}}$
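
For a single binary symptom, observing $\tilde{s}=1$ with these emission probabilities amounts to a one-step Bayes update of the BN's belief in $s$. The sketch below assumes `prior` is the BN's probability of $s=1$ given the tabular evidence; the function name and arguments are illustrative only.

```python
def virtual_evidence_posterior(prior, p_nn):
    """P(s=1 | s_tilde=1) when P(s_tilde=1|s=1)=p_nn and P(s_tilde=1|s=0)=1-p_nn.

    prior: P(s=1) from the BN before seeing the note (assumed input).
    p_nn:  the neural classifier's confidence that the symptom is present.
    """
    numerator = prior * p_nn
    denominator = numerator + (1.0 - prior) * (1.0 - p_nn)
    if denominator == 0.0:
        raise ValueError("prior and classifier evidence are contradictory")
    return numerator / denominator


# An uninformative classifier (p_nn = 0.5) leaves the prior unchanged.
unchanged = virtual_evidence_posterior(0.2, 0.5)
# A confident classifier pulls the posterior toward the symptom being present.
updated = virtual_evidence_posterior(0.2, 0.9)
```

With a prior of 0.2 and $p_{\mathrm{NN}} = 0.9$, the posterior is $0.18 / (0.18 + 0.08) \approx 0.69$.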

Consistency node fusion incorporates a learned node $C$ combining BN and neural classifier outputs:

$P(B, T, C \mid \mathrm{tab}, \mathrm{note}) \propto P(B \mid \mathrm{tab})\, P(T \mid \mathrm{note})\, P(C\mid B,T)$

with $B$ and $T$ denoting BN and text classifier symptom probabilities, respectively, and $P(C=1|B=b, T=t)$ calibrated by empirical counts on the training set.
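
One way to read the product $P(B \mid \mathrm{tab})\, P(T \mid \mathrm{note})\, P(C \mid B, T)$ is to treat $B$ and $T$ as binary symptom values under each model, condition on $C=1$, and marginalize over $T$. The sketch below does exactly that; it is an interpretation under these assumptions, not necessarily the procedure used by the authors, and the consistency table is passed in as an input rather than estimated from counts.

```python
def fuse_with_consistency(p_b, p_t, cpt):
    """P(B=1 | tab, note, C=1) for one binary symptom.

    p_b: P(B=1 | tab), the BN's symptom probability.
    p_t: P(T=1 | note), the text classifier's symptom probability.
    cpt: dict mapping (b, t) to P(C=1 | B=b, T=t).
    """
    weights = {}
    for b in (0, 1):
        prob_b = p_b if b == 1 else 1.0 - p_b
        # Sum out T, weighting each (b, t) pair by the consistency table.
        weights[b] = sum(
            prob_b * (p_t if t == 1 else 1.0 - p_t) * cpt[(b, t)]
            for t in (0, 1)
        )
    total = weights[0] + weights[1]
    if total == 0.0:
        raise ValueError("C=1 has zero probability under these inputs")
    return weights[1] / total


# Hypothetical tables: strict agreement vs. a table that ignores B and T.
strict = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
flat = {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}

fused = fuse_with_consistency(0.3, 0.8, strict)
bn_only = fuse_with_consistency(0.3, 0.8, flat)
```

Under the strict-agreement table the result reduces to a normalized product of the two probabilities, while the flat table returns the BN probability unchanged; a table estimated from training counts would fall somewhere between these extremes.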

5. Dataset Statistics and Distributional Characteristics

Size: 10,000 synthetic patient records (each with both "normal" and "compact" notes; typically one is used per experiment)
Prevalence Estimates:
- $P(\text{Pneumonia}) \approx 10$–$15\%$
- $P(\text{Common cold}) \approx 30$–$40\%$
- Symptom prevalence: Dyspnea $\sim 20\%$, Cough $\sim 40\%$, Pain $\sim 15\%$, Nasal symptoms $\sim 30\%$, Fever (high/low) combined $\sim 25\%$
Partitioning: 80/20 train/test split; training data can be logarithmically subsampled (100 to 8000 patients) to assess sample efficiency and robustness.
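
The partitioning scheme can be sketched as follows. The seed, the number of subsample sizes, and the choice of nested subsets are assumptions made for illustration; the text specifies only the 80/20 split and the 100 to 8000 range.

```python
import random


def train_test_split(n_records, test_fraction=0.2, seed=0):
    """Shuffle record indices and split them into train and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]


def log_spaced_sizes(lo, hi, n_sizes):
    """Training-set sizes spaced evenly on a log scale from lo to hi."""
    if n_sizes < 2:
        raise ValueError("need at least two sizes")
    ratio = (hi / lo) ** (1.0 / (n_sizes - 1))
    return sorted({min(hi, round(lo * ratio ** k)) for k in range(n_sizes)})


train_idx, test_idx = train_test_split(10_000)
sizes = log_spaced_sizes(100, 8000, n_sizes=5)
# Nested subsets: each smaller training set is a prefix of the larger ones.
subsets = {n: train_idx[:n] for n in sizes}
```

Using nested prefixes means that differences between training sizes reflect added data rather than a different random draw.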

6. Research Applications and Evaluation Scenarios

SimSUM is constructed for a range of downstream tasks and methodological investigations:

Patient-level Information Extraction: Assign calibrated symptom probabilities to EHR records via integration of tabular and note content.
Multimodal Fusion Mechanism Study: Evaluate virtual evidence, consistency node, and black-box fusion (purely neural) strategies in a setting with known semantics.
Interpretable Risk Modeling: Enriched, post-extraction tabular features support transparent models (decision trees, logistic regression, BNs) for outcome or treatment prediction (e.g., Antibiotics use, #Days at home).
Robustness and Calibration Analysis: The simulation framework facilitates the introduction of controlled perturbations (e.g., masking text) to stress-test data fusion architectures.
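
As an illustration of a controlled text perturbation, the sketch below randomly drops sentences from a note. The sentence splitter and the drop probability are assumptions; the source mentions masking text only as an example and does not specify a procedure.

```python
import random
import re


def mask_sentences(note, drop_prob, rng):
    """Drop each sentence of the note independently with probability drop_prob."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", note.strip()) if s]
    kept = [s for s in sentences if rng.random() >= drop_prob]
    return " ".join(kept)


note = "Patient reports a cough. No dyspnea. Lungs clear on auscultation."
perturbed = mask_sentences(note, drop_prob=0.5, rng=random.Random(0))
```

Sweeping `drop_prob` from 0 to 1 gives a graded stress test: at 0 the note is unchanged, and at 1 the fusion model must rely on the tabular evidence alone.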

7. Limitations and Prospects for Extension

SimSUM's principal limitations stem from its simulated nature and scope:

The language of clinical notes is synthetic; it does not encompass the full diversity, informality, or noise typical of authentic clinical narratives.
BN structure and CPDs are fixed and known; real-world deployment typically necessitates structure or parameter learning, introducing potential for misspecification.
Restriction to respiratory conditions and a finite set of 14 variables constrains domain generalizability.

Ongoing and future directions for SimSUM include: expanding to additional clinical modalities (e.g., imaging, temporal lab trajectories), scaling to larger and hierarchically organized concept ontologies (such as full ICD-10 mapping), simulating annotation or LLM noise to approach genuine textual ambiguity, and blending expert-defined structural priors with automated structure learning approaches (Rabaey et al., 21 Nov 2025).

In conclusion, SimSUM offers a rigorously defined, reproducible, and multimodal test bed for research on information extraction, probabilistic modeling, and robust multimodal learning in the EHR setting. Its transparent BN framework and aligned clinical notes uniquely position it for benchmarking both interpretable and black-box methods in clinical informatics.

References (1)

Patient-level Information Extraction by Consistent Integration of Textual and Tabular Evidence with Bayesian Networks (2025)
