Augmented Early Warning Score (aEWS)
- Augmented Early Warning Score (aEWS) is a temporal deep learning framework that predicts acute illnesses like sepsis, AKI, and ALI in real time using 24 hours of EHR data.
- The framework integrates a Temporal Convolutional Network with Deep Taylor Decomposition to provide both high predictive accuracy and explainable, feature-level risk attributions.
- Performance evaluations indicate that the framework, operationalized as xAI-EWS, achieves higher AUROC and AUPRC than traditional scores such as SOFA and MEWS, while its design supports real-time clinical alerts and interpretability.
The Augmented Early Warning Score (aEWS), specifically operationalized as the xAI-EWS framework, is a temporal deep learning system designed for real-time prediction of acute critical illness from electronic health records (EHRs). By integrating a high-performing temporal convolutional network (TCN) with Deep Taylor Decomposition (DTD) explanations, xAI-EWS produces interpretable risk scores for sepsis, acute kidney injury (AKI), and acute lung injury (ALI). This architecture provides actionable predictions while surfacing explicit input feature attributions, advancing both predictive accuracy and explainability in clinical early warning systems (Lauritsen et al., 2019).
1. Formal Definition and Mathematical Structure
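xAI-EWS computes, for each patient $p$ and each outcome $k \in \{\text{sepsis}, \text{AKI}, \text{ALI}\}$, a risk score $\hat{y}_{p,k} \in [0, 1]$. The input $X_p \in \mathbb{R}^{T \times F}$ consists of $T = 24$ hourly time bins and $F = 34$ routinely recorded features over the preceding 24 hours. The TCN backbone produces a latent representation $h_p \in \mathbb{R}^{d}$, which is mapped to class probabilities via a softmax layer: $\hat{y}_p = \mathrm{softmax}(W h_p + b)$, where $W \in \mathbb{R}^{K \times d}$, $b \in \mathbb{R}^{K}$, with $K = 3$ classes.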
Explanations for each feature are generated by Deep Taylor Decomposition, which propagates relevance from the output neuron back through the network. For layers with ReLU activations: $R_i = \sum_j \frac{a_i w_{ij}^{+}}{\sum_{i'} a_{i'} w_{i'j}^{+}} R_j$, where $a_i$ is the activation of neuron $i$, and $w_{ij}^{+} = \max(0, w_{ij})$ denotes positive part. The total input-level relevance aggregates as $R_f = \sum_{t=1}^{T} R_{t,f}$. This yields per-feature attributions quantifying positive or negative support for the risk score (Lauritsen et al., 2019).
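The relevance-propagation step can be illustrated with a minimal NumPy sketch. It assumes the z+ rule that Deep Taylor Decomposition commonly uses for dense ReLU layers; the function name `zplus_backward`, the toy weights, and the small relevance map are illustrative and not taken from the study.

```python
import numpy as np

def zplus_backward(a, W, R_out, eps=1e-12):
    """One z+ relevance-propagation step through a dense ReLU layer.

    a:     [I] non-negative input activations
    W:     [I, J] layer weights
    R_out: [J] relevance assigned to the layer's outputs
    Returns [I] relevance redistributed onto the inputs.
    """
    W_pos = np.maximum(W, 0.0)      # w_ij^+ : positive part of each weight
    z = a @ W_pos + eps             # sum_i a_i * w_ij^+ for every output j
    s = R_out / z                   # relevance per unit of positive contribution
    return a * (W_pos @ s)          # R_i = a_i * sum_j w_ij^+ * s_j

# Toy layer (assumed values, for illustration only).
a = np.array([1.0, 2.0, 0.5])
W = np.array([[0.5, -1.0],
              [-0.2, 0.3],
              [1.0, 0.4]])
R_out = np.array([0.6, 0.4])
R_in = zplus_backward(a, W, R_out)  # approximately [0.3, 0.3, 0.4]

# A hypothetical [T, F] relevance map summed over time gives one score per feature.
R_map = np.array([[0.1, 0.0],
                  [0.2, 0.3]])
per_feature = R_map.sum(axis=0)     # approximately [0.3, 0.3]
```

In this toy case the redistributed relevance sums to the relevance that entered the layer, which is the conservation property that makes the per-feature scores comparable to the output risk.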
2. Input Feature Engineering and Representation
The input matrix $X \in \mathbb{R}^{T \times F}$ comprises $T = 24$ time bins, each spanning one hour, and $F = 34$ features. Feature aggregation follows these specifications:
- Laboratory parameters (28 variables): Arterial and venous blood gases (including pH and lactate); electrolytes; P-Albumin, P-Creatinine, P-Bilirubin, P-Prolactin, P-Glucose, P-CRP, P-LDH; blood counts (hemoglobin, leukocytes, neutrophils, platelets, ESR, HbA1c); estimated GFR.
- Vital signs (6 variables): Systolic and diastolic blood pressure, respiratory rate, pulse, SpO$_2$, temperature.
Missing values are imputed by carrying forward the last observation. Each hour’s feature value is calculated as the mean of all measurements for that interval (Lauritsen et al., 2019).
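The binning and imputation described above can be sketched in plain Python for a single feature. The function name `hourly_bins`, the `(hours_since_window_start, value)` input format, and the choice to leave bins before the first observation as `None` are assumptions for illustration; the source does not say how leading gaps are handled.

```python
import math
from collections import defaultdict
from statistics import mean

def hourly_bins(measurements, n_hours=24):
    """Aggregate one feature into hourly means with last-observation carry-forward.

    measurements: list of (hours_since_window_start, value) pairs
    Returns a list of n_hours values; bins before the first observation stay None.
    """
    by_hour = defaultdict(list)
    for t, value in measurements:
        hour = math.floor(t)
        if 0 <= hour < n_hours:
            by_hour[hour].append(value)

    series, last = [], None
    for hour in range(n_hours):
        if hour in by_hour:
            last = mean(by_hour[hour])   # mean of all measurements in this hour
        series.append(last)              # otherwise carry the previous bin forward
    return series

# Toy pulse readings: two in hour 0, one in hour 3 (illustrative values).
pulse = [(0.2, 80), (0.7, 84), (3.5, 90)]
bins = hourly_bins(pulse, n_hours=6)
```

Here hours 1 and 2 inherit the hour-0 mean and hours 4 and 5 inherit the hour-3 value, which shows how a single stale measurement can persist across several bins.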
3. Model Architecture, Training Regimen, and Validation
The predictive engine is a deep Temporal Convolutional Network constructed as follows:
- Input: $24 \times 34$ matrix (hourly time bins by features) of recent time-series data.
- Temporal Blocks: Three sequential blocks, each with 1D dilated causal convolutions (kernel size 2, dilation doubling at each block) and 64 filters, followed by ReLU activation, layer normalization, and 1D spatial dropout (drop-rate 0.1).
- Feature Aggregation: Global average pooling over time produces a fixed-size latent embedding.
- Output Layer: Linear dense transformation and softmax produce 3-way class scores for sepsis, AKI, and ALI.
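The data flow through these layers can be sketched as a NumPy forward pass. This is a simplified illustration, not the study's implementation: it assumes one convolution per temporal block with dilations 1, 2, and 4, uses random weights, and omits layer normalization and dropout. All function and variable names are hypothetical.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """x: [T, C_in], w: [kernel, C_in, C_out] -> [T, C_out].

    Left zero-padding ensures output t depends only on x[t], x[t - dilation], ...
    """
    kernel = w.shape[0]
    pad = (kernel - 1) * dilation
    x_pad = np.vstack([np.zeros((pad, x.shape[1])), x])
    T = x.shape[0]
    out = np.zeros((T, w.shape[2]))
    for k in range(kernel):
        start = k * dilation                 # tap k looks back (kernel-1-k)*dilation steps
        out += x_pad[start:start + T] @ w[k]
    return out

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def tcn_forward(x, conv_weights, W_out, b_out):
    h = x
    for block, w in enumerate(conv_weights):
        h = np.maximum(causal_dilated_conv(h, w, dilation=2 ** block), 0.0)  # ReLU
    latent = h.mean(axis=0)                  # global average pooling over time
    return softmax(latent @ W_out + b_out)   # 3-way class scores

rng = np.random.default_rng(0)
T, F, C, K = 24, 34, 64, 3                   # hours, features, filters, classes
conv_weights = [rng.normal(scale=0.1, size=(2, c_in, C)) for c_in in (F, C, C)]
W_out, b_out = rng.normal(scale=0.1, size=(C, K)), np.zeros(K)
probs = tcn_forward(rng.normal(size=(T, F)), conv_weights, W_out, b_out)

# Receptive field under the one-convolution-per-block assumption.
receptive_field = 1 + sum((2 - 1) * 2 ** b for b in range(3))
```

Under this reading each time position sees 8 hours of history, and the global average pooling then combines all 24 positions into the fixed-size embedding.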
Training details:
- Loss: Multiclass cross-entropy.
- Optimizer: Adam, learning rate 0.001, batch size 200.
- Weight Initialization: He normal.
- Hardware: NVIDIA Tesla V100 GPU, ~30 minutes to convergence.
- Evaluation: 5-fold cross-validation, with 80%/10%/10% splits for train/validation/test in each fold (Lauritsen et al., 2019).
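One way to obtain 80%/10%/10% partitions inside a 5-fold scheme is to hold out one fold per iteration and halve it into validation and test sets. The sketch below follows that reading, which is an assumption: the source does not describe how the splits were constructed, and splitting at the level of patient identifiers is likewise assumed. The function name `five_fold_splits` is hypothetical.

```python
import random

def five_fold_splits(patient_ids, n_folds=5, seed=0):
    """Yield (train, validation, test) ID lists, one triple per fold."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)             # reproducible shuffle
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        held_out = folds[k]                      # 20% of patients
        half = len(held_out) // 2
        train = [p for j, fold in enumerate(folds) if j != k for p in fold]
        yield train, held_out[:half], held_out[half:]

splits = list(five_fold_splits(range(100)))
```

Splitting by patient rather than by time window keeps all records from one patient in a single partition, which avoids leakage between training and test data.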
4. Explanation Methodologies
xAI-EWS applies Deep Taylor Decomposition (DTD), a variant of Layer-wise Relevance Propagation, to generate both local and global explanations:
- Local (individual) relevance: For each patient and time, the top-$n$ features ranked by the magnitude of their relevance scores are visualized as sized, colored dots along a timeline, indicating both the magnitude and direction of relevance and the underlying feature value percentile.
- Global (population) relevance: Relevance scores are aggregated across positive cases to compute mean per-feature importance and are visualized analogously to SHAP summary plots.
Implementation leverages the iNNvestigate framework for DTD and the SHAP library for global visualization. This two-tiered strategy allows clinicians to interpret both case-specific and cohort-level drivers of risk (Lauritsen et al., 2019).
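The population-level aggregation can be sketched with NumPy, assuming relevance scores are already available as an array of shape patients by hours by features. The function name `global_importance`, the array layout, and the toy values are illustrative; the sketch does not call iNNvestigate or SHAP.

```python
import numpy as np

def global_importance(R, labels, feature_names):
    """Mean per-feature relevance over positive cases, sorted by magnitude.

    R:      [N, T, F] relevance scores per patient, hour, and feature
    labels: [N] array with 1 for positive cases and 0 otherwise
    """
    per_patient = R[labels == 1].sum(axis=1)     # sum over time -> [N_pos, F]
    mean_rel = per_patient.mean(axis=0)          # average over positive cases
    order = np.argsort(-np.abs(mean_rel))        # most influential first
    return [(feature_names[i], float(mean_rel[i])) for i in order]

# Toy relevance tensor: 3 patients, 2 hours, 2 features (assumed values).
R = np.array([[[0.1, 0.0], [0.2, 0.1]],
              [[0.5, 0.5], [0.5, 0.5]],
              [[0.3, 0.0], [0.1, 0.2]]])
labels = np.array([1, 0, 1])
ranking = global_importance(R, labels, ["pulse", "lactate"])
```

With these values the negative case is ignored and the ranking places "pulse" (mean relevance about 0.35) ahead of "lactate" (about 0.15).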
5. Performance Assessment and Benchmarking
xAI-EWS was benchmarked against three baselines: MEWS (Danish TOKS), SOFA, and a gradient-boosting model (“GB-Vital”) using six vital signs. Evaluation used two metrics, AUROC and AUPRC, each reported as the mean with 95% CI over five folds. Results at event onset ($0$ h) are summarized as follows:
| Outcome | xAI-EWS AUROC | SOFA AUROC | MEWS AUROC | GB-Vital AUROC |
|---|---|---|---|---|
| Sepsis | 0.92 (0.90–0.95) | 0.83 | 0.80 | 0.84 |
| AKI | 0.88 (0.86–0.90) | 0.75 | 0.70 | 0.76 |
| ALI | 0.90 (0.89–0.92) | 0.80 | 0.85 | 0.65 |
AUPRC at $0$ h: sepsis 0.43 (0.36–0.51), versus SOFA 0.12 and MEWS 0.10; AKI 0.22 (0.19–0.24), versus SOFA 0.05 and MEWS 0.04; ALI 0.23 (0.21–0.26), versus SOFA 0.07 and MEWS 0.10. Performance degrades gradually with increasing prediction horizon, with AUROC 0.80 for sepsis at 24 h (Lauritsen et al., 2019).
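For readers unfamiliar with the two metrics, the sketch below computes AUROC by pairwise comparison and AUPRC as average precision, then summarizes fold-level values as a mean with a 95% interval. The interval uses a Student t critical value for five folds, which is an assumption: the source does not state how its confidence intervals were derived. All labels, scores, and fold values are toy numbers, not study results.

```python
from statistics import mean, stdev

def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean precision at the rank of each positive (ties broken arbitrarily)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

def mean_ci(values, t_crit=2.776):
    """Mean and 95% interval; t_crit assumes 5 folds (4 degrees of freedom)."""
    m = mean(values)
    half = t_crit * stdev(values) / len(values) ** 0.5
    return m, m - half, m + half

# Toy labels and scores (illustrative only).
y = [0, 0, 1, 0, 1]
s = [0.1, 0.4, 0.35, 0.2, 0.8]
roc = auroc(y, s)
ap = average_precision(y, s)

# Toy per-fold AUROC values (illustrative only).
fold_mean, fold_low, fold_high = mean_ci([0.70, 0.72, 0.68, 0.71, 0.69])
```

AUPRC is sensitive to outcome prevalence, which is consistent with the much lower absolute AUPRC values reported above compared with AUROC.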
6. Risk Thresholding, Calibration, and Clinical Integration
xAI-EWS produces unthresholded risk probabilities. No specific decision cut-point or calibration curves are given in the development study. Clinical implementation would require empirical calibration (e.g., Platt scaling) and threshold selection (potentially maximizing Youden’s $J$ statistic or constraining sensitivity) to align predicted risk with observed outcomes and application needs. The intended workflow involves:
- Hourly aggregation of EHR features;
- Continuous application of the model in a rolling manner;
- Real-time alerts triggered upon exceeding configurable risk thresholds;
- Explanation dashboards tailored for both bedside (local) and audit (population) contexts (Lauritsen et al., 2019).
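Threshold selection and alerting can be sketched as follows, assuming Youden's $J$ (sensitivity plus specificity minus one) is the selection criterion, which the text above names only as one possibility. The function name `youden_threshold`, the validation labels and scores, and the hourly risk trajectory are hypothetical.

```python
def youden_threshold(y_true, scores):
    """Return the score cut-point that maximizes Youden's J, and that J value."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):                # each observed score is a candidate
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg              # sensitivity - (1 - specificity)
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy validation data (illustrative only).
y_val = [0, 0, 1, 0, 1, 1]
s_val = [0.1, 0.3, 0.35, 0.4, 0.7, 0.9]
threshold, j_stat = youden_threshold(y_val, s_val)

# Rolling application: flag every hour whose risk reaches the chosen threshold.
hourly_risk = [0.05, 0.20, 0.36, 0.50]
alert_hours = [hour for hour, risk in enumerate(hourly_risk) if risk >= threshold]
```

In practice the threshold would be chosen on calibrated probabilities from a held-out set, and a sensitivity constraint might replace the $J$ criterion where missed cases are costlier than false alerts.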
7. Limitations and Deployment Considerations
Ground truth labels are drawn from established definitions: sepsis per Sepsis-3 within a 48 h window, AKI per KDIGO staging (serum creatinine only), and ALI approximated by first NIV/CPAP use. The model was developed and evaluated solely on a Danish multicenter cohort (2012–2017), without external validation. Only 34 preselected features are supported, omitting potentially informative variables such as medications or comorbidities. Handling of missing data via last-value carry-forward may bias early predictions. No formal calibration or operational thresholding was performed. Real-world use would require prospective trials to ascertain effects on patient-centered outcomes, and ongoing regulatory, audit, and data privacy compliance (GDPR, FDA, CE) with robust documentation and retraining protocols (Lauritsen et al., 2019).