SzCORE Framework for EEG Seizure Evaluation
- The SzCORE framework is a standardized, open-source platform designed for evaluating EEG-based seizure detection algorithms using event-centric metrics.
- It employs unified BIDS-EEG and HED-SCORE standards to harmonize data formats, annotations, and performance metrics, enhancing reproducibility.
- The framework enables containerized algorithm evaluations via Docker, facilitating fair comparisons and continuous benchmarking in clinical research.
The SzCORE framework refers to the Seizure Community Open-source Research Evaluation architecture, a standardized, event-based benchmarking and evaluation environment for automated EEG-based seizure detection algorithms. Designed to address longstanding challenges in algorithm generalization, dataset inconsistencies, and performance reporting, SzCORE defines public protocols for data formatting, annotation, metrics, cross-validation, and algorithm evaluation. It serves as the foundation for open benchmarking initiatives and large-scale clinical algorithm assessment in epilepsy research (Dan et al., 19 May 2025, Dan et al., 2024).
1. Motivations and Conceptual Objectives
SzCORE originated as a response to the enduring heterogeneity of methodology in EEG seizure detection research. Inconsistent dataset formats, incompatible file organizations, diverse annotation schemas, and a lack of standardized performance metrics have historically frustrated direct comparison of algorithmic approaches. These limitations are compounded in the clinical context, where algorithms must generalize across subjects and settings to be realistically deployable (Dan et al., 19 May 2025).
Key objectives of SzCORE are:
- To specify a unified directory and file structure based on BIDS-EEG and HED-SCORE standards, ensuring all algorithms process identical forms of input data.
- To replace sample-level thresholds and per-dataset metrics with clinically meaningful, event-based criteria focused on seizure episode detection.
- To deliver reference scoring software (notably the timescoring library) for consistent and reproducible metric computation.

These foundational choices collectively enable fair, transparent, and extensible benchmarking of seizure detection systems, accelerating progress toward generalizable, clinically viable solutions (Dan et al., 2024).
2. Dataset Specifications and Annotation Protocols
SzCORE prescribes explicit requirements for datasets and annotations, targeting both clinical and research use cases. Recordings must be organized following BIDS-EEG standards:
- Each subject/session is contained in a separate directory, with EEG stored as EDF files and recording metadata in JSON sidecars.
- Seizure events are annotated in BIDS-EEG/HED-SCORE–compliant TSV files, where each row records onset, duration, event type (according to ILAE 2017 hierarchy), involved channels, and (optionally) confidence scores.
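As a sketch of what such an annotation file looks like, the snippet below writes a minimal events TSV. The column names and the "sz" event-type code are illustrative assumptions modeled on the description above, not a verbatim copy of the SzCORE/HED-SCORE specification.

```python
import csv
import io

# Illustrative only: column names and the "sz" event-type code are assumptions
# modeled on the description above, not the exact SzCORE/HED-SCORE schema.
events = [
    # onset and duration are in seconds from the start of the recording
    {"onset": 120.0, "duration": 45.0, "eventType": "sz",
     "channels": "F7-Avg,T3-Avg", "confidence": 0.92},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf,
    fieldnames=["onset", "duration", "eventType", "channels", "confidence"],
    delimiter="\t",
)
writer.writeheader()
writer.writerows(events)
tsv_text = buf.getvalue()
print(tsv_text)
```

In practice such a file sits alongside the EDF recording it annotates, one row per seizure event.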
The 2025 challenge utilized 4,360 hours of 19-channel scalp EEG from 65 EMU subjects, with all 398 electrographic seizures independently annotated by three certified neurophysiologists. Annotation discrepancies were resolved by consensus to construct a reliable ground-truth set (Dan et al., 19 May 2025).
Standardization extends to channel order, referencing scheme (common average), and signal resampling (256 Hz, mandatory), ensuring reproducibility and compatibility across evolving algorithms and cohorts. These constraints are rigorously enforced both in retrospective dataset harmonization (e.g., TUH, CHB-MIT, Siena, SeizeIT1) and in the open benchmarking pipeline (Dan et al., 2024).
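The resampling and re-referencing steps can be sketched in a few lines. This is a minimal illustration using NumPy and SciPy under stated assumptions, not the reference preprocessing code; it omits channel-order enforcement and edge-effect handling.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

def standardize(eeg, fs_in, fs_out=256):
    """Resample each channel to fs_out Hz and apply a common average
    reference, per the SzCORE constraints described above.
    eeg has shape (n_channels, n_samples); a minimal sketch only."""
    g = gcd(fs_in, fs_out)
    # Polyphase resampling from fs_in to fs_out
    out = resample_poly(eeg, fs_out // g, fs_in // g, axis=1)
    # Common average reference: subtract the per-sample mean across channels
    return out - out.mean(axis=0, keepdims=True)

fs_in = 512
x = np.random.randn(19, fs_in * 10)  # 10 s of 19-channel EEG at 512 Hz
y = standardize(x, fs_in)
print(y.shape)
```

After this step every channel is sampled at 256 Hz and the per-sample mean across channels is zero.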
3. Event-Based Scoring Methodology
SzCORE centralizes event-based detection as the clinically relevant metric paradigm. True positives (TP), false positives (FP), and false negatives (FN) are defined at the event level by temporal overlap between detected and ground-truth seizures. The key formulas are:
- Sensitivity: TP / (TP + FN)
- Precision: TP / (TP + FP)
- F₁-score: 2 · Precision · Sensitivity / (Precision + Sensitivity)
- False positives per day: FP · 24 / total recording duration (in hours)
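These event-based metrics follow directly from the TP/FP/FN counts and the total recording duration. A minimal Python sketch, with hypothetical counts; the function and argument names are illustrative and are not the timescoring library API:

```python
def event_metrics(tp, fp, fn, total_hours):
    """Compute event-based sensitivity, precision, F1-score, and false
    positives per day from event counts and total recording duration in
    hours. Names are illustrative, not the timescoring API."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    fp_per_day = fp * 24.0 / total_hours
    return sensitivity, precision, f1, fp_per_day

# Hypothetical counts over a 10-day (240 h) recording set
sens, prec, f1, fpd = event_metrics(tp=37, fp=45, fn=63, total_hours=240.0)
print(sens, prec, f1, fpd)
```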
A detection is credited as a TP if it temporally overlaps a reference seizure; default tolerances allow for a pre-ictal detection window (up to 30 seconds before onset) and a post-ictal window (up to 60 seconds after offset). Contiguous events separated by less than 90 seconds are merged, and events exceeding 5 minutes are split to avoid aggregation of prolonged activity (Dan et al., 19 May 2025, Dan et al., 2024).
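The merge and split rules can be expressed as a short post-processing pass over (start, end) intervals. This is a minimal sketch mirroring the stated defaults (90 s merge gap, 5 min maximum event length), not the reference implementation:

```python
def postprocess(events, merge_gap=90.0, max_len=300.0):
    """Merge detections separated by less than merge_gap seconds, then
    split any event longer than max_len seconds, per the SzCORE defaults.
    events is a list of (start, end) tuples in seconds; sketch only."""
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] < merge_gap:
            # Gap below threshold: extend the previous event
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    split = []
    for start, end in merged:
        while end - start > max_len:
            split.append((start, start + max_len))
            start += max_len
        split.append((start, end))
    return split

result = postprocess([(0, 50), (100, 160), (1000, 1700)])
print(result)
```

Here the first two detections (50 s gap) merge into one event, while the 700 s detection is split into 300 s chunks.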
This methodology abstracts away per-sample noise and promotes operational metrics aligned with real-world clinical expectations.
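The overlap test with tolerance windows can be sketched as follows. This is an illustrative re-implementation of the matching rule described above, assuming the default 30 s pre-ictal and 60 s post-ictal tolerances; the timescoring library is the actual reference implementation.

```python
def match_events(refs, hyps, pre=30.0, post=60.0):
    """Count event-level TP/FP/FN. A hypothesis counts as a true positive
    if it overlaps a reference seizure extended by the tolerances
    (pre seconds before onset, post seconds after offset). Events are
    (start, end) tuples in seconds; sketch only."""
    matched_refs = set()
    fp = 0
    for h_start, h_end in hyps:
        hit = False
        for i, (r_start, r_end) in enumerate(refs):
            # Overlap against the tolerance-extended reference interval
            if h_start < r_end + post and h_end > r_start - pre:
                matched_refs.add(i)
                hit = True
        if not hit:
            fp += 1
    tp = len(matched_refs)
    fn = len(refs) - tp
    return tp, fp, fn

refs = [(100.0, 160.0)]
hyps = [(80.0, 95.0), (500.0, 520.0)]  # first falls in the 30 s pre-ictal window
print(match_events(refs, hyps))
```

The first hypothesis ends before the reference onset but within the pre-ictal tolerance, so it is credited; the second overlaps nothing and counts as a false positive.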
4. Platform Architecture and Continuous Benchmarking
SzCORE evaluation infrastructure is built on robust principles of reproducibility and transparency. Algorithms are encapsulated as pre-trained Docker containers, ensuring uniform execution environments. Each container consumes standardized BIDS-EEG inputs and produces HED-SCORE–compliant output event files (Dan et al., 19 May 2025).
Evaluation is orchestrated on institutional GPU clusters, leveraging a CaaS (Container-as-a-Service) model. Automated pipelines (using the szcore-evaluation and timescoring libraries) compute all event-based metrics in bulk and record detailed resource utilization (GPU, CPU, RAM) per submission.
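The container contract can be pictured as a plain docker run invocation per submission. The image name and mount points below are hypothetical; this illustrates the CaaS pattern only and is not the platform's actual orchestration code.

```python
def build_docker_cmd(image, data_dir, out_dir):
    """Assemble a docker run invocation for one containerized submission.
    The mount points (/data, /output) and the image interface are
    hypothetical assumptions; the real SzCORE contract may differ."""
    return [
        "docker", "run", "--rm",
        "-v", f"{data_dir}:/data:ro",   # read-only BIDS-EEG input
        "-v", f"{out_dir}:/output",     # HED-SCORE event files written here
        image,
    ]

cmd = build_docker_cmd("team/sz-detector:latest", "/datasets/emu", "/results/team")
print(" ".join(cmd))
# To execute: subprocess.run(cmd, check=True)
```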
After live challenge phases, this entire system remains accessible for continuous benchmarking. New algorithms (or clinical datasets) can be submitted for evaluation under unchanged conditions, fostering long-term, extensible, and reproducible assessment across expanding clinical settings and patient cohorts.
5. Empirical Results: Performance and Generalization
The 2025 SzCORE-supported challenge engaged 30 submissions from 19 teams, resulting in 28 valid algorithm evaluations. The leading approach, "Sz Transformer," achieved event-based F₁=43%, sensitivity=37%, precision=45%, with 1.34 false positives per day. The highest sensitivities among the top five algorithms ranged from 37% to 58%, but generally at the cost of lower precision—values typical of models that maximize detection at the expense of false alarms (Dan et al., 19 May 2025).
A pronounced generalization gap was observed: nearly all participating algorithms reported self-evaluated F₁ scores on local test sets that exceeded their performance on the challenge data, often substantially. This discrepancy supports the assertion that model overfitting and dataset shift remain fundamental concerns and underscores the value of independent, standardized evaluation.
False positives exhibited idiosyncratic profiles: over 90% of all false positives were produced by a minority of algorithms, suggesting limited overlap in spurious detection mechanisms and highlighting the influence of model-specific error patterns rather than ubiquitous EEG artifact sources.
6. Recommendations and Field-Wide Implications
Based on the SzCORE framework implementation and empirical findings, the leading recommendations are:
- Mandatory adoption of BIDS-EEG/HED-SCORE data formats and event-based metrics in seizure detection benchmarking.
- Preferential use of continuous, multi-day EMU recordings with expert consensus annotations over short recordings or limited public datasets.
- Development of algorithms supporting flexible sensitivity–precision calibration to meet heterogeneous clinical requirements.
- Expansion of the SzCORE benchmarking platform for integration of datasets from multiple clinical centers, promoting diverse and representative assessment.
- Sustained community engagement in maintaining and extending Dockerized evaluation workflows to reinforce reproducibility and enable incremental progress toward deployment-grade seizure detection (Dan et al., 19 May 2025, Dan et al., 2024).
SzCORE provides both technical scaffolding (via data standards and reference implementations) and open infrastructure (challenge server, continuous submissions) that address the central limitations of reproducibility and generalization in automated EEG seizure detection research. The challenge results confirm the ongoing difficulty of achieving clinically sufficient performance and highlight the criticality of the standardized, event-centric evaluation paradigm anchored by SzCORE.