Early Risk Detection Error (ERDE)
- ERDE is a time-aware metric that quantifies risk detection performance by measuring both the accuracy and the timeliness of decisions on sequential data.
- It integrates penalties for false positives, false negatives, and delayed true positives using customizable thresholds and cost functions.
- ERDE has become a standard in evaluating early risk detection models, influencing system design in domains like mental health and safety monitoring.
Early Risk Detection Error (ERDE) is a class of time-aware evaluation metrics explicitly designed to quantify both the accuracy and promptness of automated risk detection models when predicting adverse outcomes (e.g., mental health conditions) from temporal data streams such as social media posts or behavioral logs. ERDE emerged as the standard metric in the CLEF eRisk shared tasks and has since been widely adopted and generalized across sequential risk modeling domains, balancing penalties for late detection, incorrect detection, and missed detection in a single, parameterized framework (Burdisso et al., 2019, Trotzek et al., 2018, Bucur et al., 2021, Thompson et al., 2024, Thompson et al., 16 May 2025, Farooque et al., 22 May 2026).
1. Formal Definition and Variants of ERDE
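The canonical ERDE metric (written ERDE$_o$ here; the subscript letter varies with notation) assigns a per-user error defined by three components: (a) correctness of the detection (true/false positive/negative cases), (b) timeliness of positive detection relative to a deadline or grace parameter ($o$), and (c) application-specific unit costs for each case. The most common sigmoid-based form, as used in CLEF eRisk, is given by:
$$
\mathrm{ERDE}_o(d, k) =
\begin{cases}
c_{fp} & \text{if } d \text{ is a false positive} \\
c_{fn} & \text{if } d \text{ is a false negative} \\
lc_o(k) \cdot c_{tp} & \text{if } d \text{ is a true positive} \\
0 & \text{if } d \text{ is a true negative}
\end{cases}
$$
where
$$
lc_o(k) = 1 - \frac{1}{1 + e^{k - o}}
$$
with $k$ the index at which a decision is made and $o$ the deadline parameter, typically set to $5$ or $50$ (number of posts/chunks) (Burdisso et al., 2019, Trotzek et al., 2018, Bucur et al., 2021). Unit costs $c_{fp}$, $c_{fn}$, $c_{tp}$ are usually set to 1, but can be adjusted to reflect domain-specific trade-offs.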
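A minimal Python sketch of the sigmoid-based per-user error and its mean over a test set may make the four cases concrete. The function names, the `(is_positive, alarm_at)` tuple layout, and the toy data are assumptions made for illustration, not part of any eRisk tooling; the formula is the latency cost $lc_o(k) = 1 - 1/(1 + e^{k - o})$.

```python
import math
from typing import Iterable, Optional, Tuple


def latency_cost(k: int, o: int) -> float:
    """Sigmoid latency cost lc_o(k) = 1 - 1 / (1 + exp(k - o))."""
    x = k - o
    # Both branches equal 1 - 1/(1 + e^x); the split only avoids overflow in exp.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def erde_user(is_positive: bool, alarm_at: Optional[int], o: int,
              c_fp: float = 1.0, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Per-user error; alarm_at is the 1-based index of the positive
    decision, or None if the user was never flagged."""
    if alarm_at is None:
        return c_fn if is_positive else 0.0   # false negative / true negative
    if not is_positive:
        return c_fp                           # false positive
    return latency_cost(alarm_at, o) * c_tp   # true positive, scaled by delay


def erde(users: Iterable[Tuple[bool, Optional[int]]], o: int) -> float:
    """Mean per-user error over (is_positive, alarm_at) pairs, unit costs."""
    users = list(users)
    return sum(erde_user(pos, k, o) for pos, k in users) / len(users)


# Hypothetical toy test set: early TP, missed positive, TN, false alarm.
toy_users = [(True, 2), (True, None), (False, None), (False, 4)]
score_o5 = erde(toy_users, o=5)
```

With unit costs, the early true positive contributes only about 0.05 while the miss and the false alarm each contribute 1, so the toy score lands just above 0.5.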
Linear and piecewise-linear variants exist, notably in recent BERT-based and synthetic benchmark evaluations, which replace the sigmoid with a delay penalty that grows linearly in the detection time $k$, or with a linear ramp clipped at an integer cutoff (Thompson et al., 2024, Bucur et al., 2021, Farooque et al., 22 May 2026).
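ERDE$_o^{\%}$ replaces the absolute count $k$ with a percentage of the user’s total data available, addressing biases in users with heterogeneous verbosity:
$$
lc_o^{\%}(p) = 1 - \frac{1}{1 + e^{p - o}}
$$
where $p = 100 \cdot k / n$ is the percentage of the user's $n$ items read when the decision is made, and $o$ is expressed as a percent threshold (Trotzek et al., 2018).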
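The effect of the percentage-based variant can be sketched in a few lines of Python. This assumes the same sigmoid shape as the count-based latency cost, applied to the percentage of a user's data read at decision time; the function name and the two example users are hypothetical.

```python
import math


def latency_cost_percent(k: int, n: int, o_percent: float) -> float:
    """Sigmoid latency cost on the percentage of data read.

    k: items read when the decision is made; n: total items for the user;
    o_percent: deadline expressed as a percentage of the user's data.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    x = 100.0 * k / n - o_percent
    # Both branches equal 1 - 1/(1 + e^x); the split only avoids overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Two hypothetical users, both flagged after 20 posts, deadline at 20%.
terse_user = latency_cost_percent(k=20, n=40, o_percent=20.0)      # 50% read
verbose_user = latency_cost_percent(k=20, n=1000, o_percent=20.0)  # 2% read
```

The same absolute count of 20 posts is late for the user with 40 posts and early for the user with 1000, which is the verbosity bias the proportional form is meant to remove.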
2. Intuitive Interpretation and Motivation
ERDE integrates three operational objectives:
- Promptness: Early, correct risk detection ("true positive" before the deadline) yields minimal or zero penalty; late detection is penalized increasingly as delay grows past $o$.
- Specificity: Any false positive (flag on a negative case) incurs a maximal penalty.
- Missed Detection: Failing to ever raise a positive decision for a true case (false negative) also receives the maximal penalty.
The latency cost $lc_o(k)$ ensures that a correct prediction is not sufficient unless issued early; correctness is modulated by when the prediction is made. The deadline parameter $o$ encodes task-specific tolerance for evidence accumulation before full penalty is imposed: smaller $o$ enforces stricter earliness, while larger $o$ allows more leeway before delay costs are triggered (Burdisso et al., 2019, Thompson et al., 2024, Thompson et al., 16 May 2025).
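As a worked illustration, assume the sigmoid latency cost $lc_o(k) = 1 - 1/(1 + e^{k - o})$ with unit cost $c_{tp} = 1$; the decision indices below are chosen only for the example. A true positive issued at different points then costs:

$$
\begin{aligned}
lc_5(3) &= 1 - \frac{1}{1 + e^{-2}} \approx 0.119, \\
lc_5(5) &= 1 - \frac{1}{1 + e^{0}} = 0.5, \\
lc_5(8) &= 1 - \frac{1}{1 + e^{3}} \approx 0.953, \\
lc_{50}(8) &= 1 - \frac{1}{1 + e^{-42}} \approx 6 \times 10^{-19}.
\end{aligned}
$$

The same decision at $k = 8$ is nearly as costly as a miss when $o = 5$ but essentially free when $o = 50$, which is why the two settings can rank systems differently.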
3. Implementation Protocols and Evaluation Practice
In eRisk protocols and recent longitudinal evaluation frameworks (e.g., Cogniscope), a subject’s data is split into fixed-size temporal units (e.g., 10 "chunks" of posts per user). The system processes the units sequentially and must issue a binary decision (risk/no-risk) for each subject; once that decision is issued, no further data from that user is ingested. ERDE is computed as the average per-user error across the test set:
- For CLEF eRisk, ERDE$_5$ and ERDE$_{50}$ are computed by setting $o = 5$ or $o = 50$, with the final score being the mean across users (Burdisso et al., 2019, Bucur et al., 2021).
- In longitudinal benchmarks such as Cogniscope, for true positives, the penalty for late detection is linear with respect to the onset day and user-level grace window (Farooque et al., 22 May 2026); a form consistent with this description is:
$$
\text{penalty}(t_a) = \min\!\left(1, \frac{\max(0,\, t_a - t_0)}{W}\right)
$$
where $t_a$ is the first time the system alarmed, $t_0$ is ground-truth onset, and $W$ is the penalty window.
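The averaging protocol with a linear late-detection penalty can be sketched as follows. This is an illustration under stated assumptions, not the Cogniscope implementation: the clipping of the penalty to the range 0 to 1, the unit costs for misses and false alarms, the function names, and the toy data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def linear_delay_penalty(t_alarm: float, t_onset: float, window: float) -> float:
    """Delay past onset as a fraction of the penalty window, clipped to [0, 1].
    The clipping at both ends is an assumption of this sketch."""
    if window <= 0:
        raise ValueError("window must be positive")
    delay = max(0.0, t_alarm - t_onset)
    return min(1.0, delay / window)


def user_error(t_onset: Optional[float], t_alarm: Optional[float], window: float,
               c_fp: float = 1.0, c_fn: float = 1.0) -> float:
    """t_onset is None for a negative user; t_alarm is None if no alarm fired."""
    if t_onset is None:
        return 0.0 if t_alarm is None else c_fp   # true negative / false positive
    if t_alarm is None:
        return c_fn                               # missed positive
    return linear_delay_penalty(t_alarm, t_onset, window)


def mean_error(users: Sequence[Tuple[Optional[float], Optional[float]]],
               window: float) -> float:
    return sum(user_error(onset, alarm, window) for onset, alarm in users) / len(users)


# Hypothetical (t_onset, t_alarm) pairs in days, with a 14-day window.
toy = [(10, 17), (10, None), (None, None), (None, 3)]
score = mean_error(toy, window=14)
```

Here the positive user flagged 7 days after onset pays half the maximum penalty, so the four toy users average 0.625.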
4. Comparative Analysis and Empirical Results
Empirical analyses demonstrate that enhancements in temporal modeling and context representation yield improved ERDE scores:
- τ-SS3, a text classifier integrating dynamic n-grams, achieves lower ERDE$_o$ than bag-of-words baselines on early depression/anorexia detection (Burdisso et al., 2019). For example, the reported ERDE dropped from 8.12% (SS3) to 7.70% (τ-SS3) on eRisk 2017 depression, and τ-SS3 reached 6.17% on eRisk 2018 depression, setting state-of-the-art results.
- In benchmarks, transformer-based and time-aware models achieving earlier correct decisions consistently report lower ERDE than late-firing or conservatively thresholded models, even when raw F1 is similar (Thompson et al., 16 May 2025, Thompson et al., 2024, Bucur et al., 2021).
- Use of ERDE$_o^{\%}$ aligns system ranking more closely with intuitive early-detection behavior, especially when user post counts vary widely (Trotzek et al., 2018).
Notably, ERDE highlights the inherent trade-off: systems making aggressive early alarms risk high false positive penalty, while overly conservative systems incur steep delay or miss penalties.
5. Limitations, Modifications, and Ongoing Controversies
Critiques of ERDE focus on several systematic limitations:
- Deadline/Parameter Sensitivity: The deadline $o$ is task- and dataset-dependent, requiring external calibration; varying $o$ can substantially alter relative model ranking (Burdisso et al., 2019, Thompson et al., 2024).
- Discrete Chunks vs. Proportional Data: Original formulations penalize by count ($k$), leading to unfair assessments across users with heterogeneous data lengths. Proportional versions (ERDE$_o^{\%}$) address this (Trotzek et al., 2018).
- Unit Cost Uniformity: In most evaluations $c_{fp} = c_{fn} = c_{tp} = 1$, but a mismatch between these weights and real-world consequences is noted. Some literature suggests increasing $c_{fp}$ to bias away from false alarms (Thompson et al., 16 May 2025).
- Late Decision Penalty: In sigmoid-based ERDE, true positives made after $o$ incur penalties similar to false positives, sometimes under-rewarding models with moderate delay but high accuracy (Burdisso et al., 2019).
- Complexity for Downstream Use: The non-differentiable, piecewise nature of ERDE complicates its direct use as a training loss; however, recent work approximates ERDE with surrogate differentiable penalties in temporal fine-tuning of transformers (Thompson et al., 16 May 2025, Thompson et al., 2024).
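One way to picture a smooth stand-in for ERDE is to take its expectation under per-step alarm probabilities, since the expected cost is a polynomial in those probabilities. The sketch below is an assumption made for illustration and is not the surrogate used in the cited works; it is written in plain Python, and the same arithmetic could be expressed in an autodiff framework to obtain gradients.

```python
import math
from typing import Sequence


def latency_cost(k: int, o: int) -> float:
    """Sigmoid latency cost lc_o(k) = 1 - 1 / (1 + exp(k - o))."""
    x = k - o
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_erde(alarm_probs: Sequence[float], is_positive: bool, o: int,
                  c_fp: float = 1.0, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Expected per-user ERDE when the model alarms at step k (1-based)
    with probability alarm_probs[k-1], given no earlier alarm."""
    no_alarm_yet = 1.0        # probability that no alarm fired before step k
    expected_tp_cost = 0.0
    for k, p in enumerate(alarm_probs, start=1):
        first_alarm_here = no_alarm_yet * p
        expected_tp_cost += first_alarm_here * latency_cost(k, o) * c_tp
        no_alarm_yet *= 1.0 - p
    if is_positive:
        # Delayed-detection cost plus the chance of never alarming (a miss).
        return expected_tp_cost + no_alarm_yet * c_fn
    # For a negative user, any alarm at all is a false positive.
    return (1.0 - no_alarm_yet) * c_fp
```

When every probability is exactly 0 or 1 the expectation collapses to the ordinary hard-decision ERDE, so the quantity agrees with the metric at the extremes while varying smoothly in between.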
6. Extensions and Related Metrics
Several alternatives build on or generalize ERDE:
- Time-to-Detection (TTD): Average delay (in time units) between ground-truth onset and alarm, considering only detected positives, and disregarding false positives and missed cases (Farooque et al., 22 May 2026).
- F-latency: Harmonic mean of precision and detection speed, often tracked alongside ERDE for model selection (Thompson et al., 2024, Thompson et al., 16 May 2025).
- Ranking Metrics: Precision@k, NDCG@k, used as complementary criteria to ERDE for systems designed for prioritized screening.
- Sliding‐window Schemes and Delay Encodings: Incorporation of explicit delay tokens into input representations and objective functions, allowing end-to-end optimization for early risk detection readiness (Thompson et al., 16 May 2025, Thompson et al., 2024).
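The Time-to-Detection definition above can be sketched in Python to show what it leaves out. The `(t_onset, t_alarm)` tuple layout and the function name are hypothetical, and keeping alarms that fire before onset as negative delays is an assumption of this sketch.

```python
from typing import Optional, Sequence, Tuple


def time_to_detection(
    cases: Sequence[Tuple[Optional[float], Optional[float]]]
) -> Optional[float]:
    """Mean delay between onset and alarm over detected positives only.

    Each case is (t_onset, t_alarm): t_onset is None for a negative user,
    t_alarm is None when no alarm fired. Returns None if nothing qualifies.
    """
    delays = [
        alarm - onset
        for onset, alarm in cases
        if onset is not None and alarm is not None  # drop FPs, TNs, and misses
    ]
    return sum(delays) / len(delays) if delays else None


# Hypothetical cases: two detected positives, one miss, one false alarm.
cases = [(10, 13), (20, 27), (5, None), (None, 8)]
ttd = time_to_detection(cases)
```

The miss and the false alarm do not affect the result, so a system can report a low TTD while still scoring poorly on ERDE, which is why the two are read together.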
7. Significance and Best Practices in ERDE-Optimized System Development
ERDE operationalizes the core requirement of timely and accurate intervention in longitudinal screening and monitoring, particularly in social or behavioral risk detection contexts. Key practices emerging from recent research include:
- Parameterizing and validating $o$ on held-out sets reflecting real-world timeliness demands.
- Encoding temporal delay into the input space when training temporal models.
- Employing proportional or percentage-based ERDE when subject activity levels are highly variable.
- Using ERDE, possibly in conjunction with TTD and F-latency, as early-stopping and model selection criteria.
- Calibrating error costs ($c_{fp}$, $c_{fn}$, $c_{tp}$) to match deployment-specific risk tolerances and policy objectives (Burdisso et al., 2019, Trotzek et al., 2018, Thompson et al., 16 May 2025, Thompson et al., 2024).
ERDE and its extensions provide a rigorous, interpretable, and widely adopted standard for evaluating early risk detection systems under real-world constraints, and continue to shape the design and benchmarking of temporal models in health and safety contexts.