CHB-MIT Scalp EEG Dataset Overview
- The CHB-MIT Scalp EEG Dataset is a publicly available, annotated repository of long-term, multi-channel pediatric EEG recordings used for benchmarking seizure detection and prediction.
- It comprises over 844 hours of recordings from 23 patients, with detailed annotations covering pre-ictal, ictal, and inter-ictal states for rigorous analysis.
- The dataset has driven advances in signal processing and machine learning by providing standardized protocols and preprocessed derivative releases for epilepsy research.
The CHB-MIT Scalp EEG Dataset is a publicly available, extensively annotated resource of long-term, multi-channel scalp electroencephalogram (EEG) recordings from pediatric epilepsy patients, collected at Boston Children's Hospital and released in collaboration with the Massachusetts Institute of Technology. Designed as a reference for benchmarking seizure detection, prediction, and EEG decoding algorithms, it has become a de facto standard in EEG machine learning and signal processing research, supporting methodological advances across clinical neuroengineering, computational neuroscience, and bio-signal analytics.
1. Dataset Composition and Subject Cohort
The CHB-MIT Scalp EEG Dataset consists of continuous scalp EEG recordings from 23 pediatric patients (denoted CHB01–CHB23) with medically intractable epilepsy (Handa et al., 2023, Zhang et al., 18 Jan 2026). The age range spans 1.5 to 19 years (Handa et al., 2023), with each subject undergoing multi-day inpatient video-EEG monitoring as part of pre-surgical clinical evaluation. Aggregate recording duration totals approximately 844–900 hours, with each patient contributing between 9 and 42 EDF (European Data Format) files, typically 1 hour per file (Handa et al., 2023). The dataset includes around 160–200 annotated clinical seizures (median per subject ≈ 8, range 3–40) (Uppalapati et al., 19 Sep 2025). All subjects were on anti-seizure medication during data acquisition (Handa et al., 2023).
Recordings comprise 23 channels derived from electrodes placed according to the international 10–20 system and arranged in a bipolar montage. Certain files include additional channels or vagal-nerve stimulation markers, but the canonical setup is 23-channel bipolar (Handa et al., 2023, Zhang et al., 18 Jan 2026). All recordings share a sampling rate of 256 Hz and 16-bit resolution (Handa et al., 2023, Zhang et al., 18 Jan 2026, Benfenati et al., 2024).
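The acquisition parameters above fix the data volume of each file. The following minimal sketch works through that arithmetic; it assumes a file of exactly one hour (the "typical" case, not a guarantee) and ignores the EDF header, and the function names are illustrative.

```python
# Back-of-the-envelope sizing for one recording file, using the
# parameters quoted in the text: 256 Hz, 23 channels, 16-bit samples.
SAMPLING_RATE_HZ = 256
N_CHANNELS = 23
BYTES_PER_SAMPLE = 2          # 16-bit resolution
FILE_DURATION_S = 60 * 60     # assumption: a file of exactly 1 hour


def samples_per_channel(duration_s: int, fs: int = SAMPLING_RATE_HZ) -> int:
    """Number of samples one channel contributes over `duration_s` seconds."""
    return duration_s * fs


def raw_payload_bytes(duration_s: int, n_channels: int = N_CHANNELS) -> int:
    """Size of the signal payload in bytes, ignoring the file header."""
    return samples_per_channel(duration_s) * n_channels * BYTES_PER_SAMPLE


n_samples = samples_per_channel(FILE_DURATION_S)
size_mb = raw_payload_bytes(FILE_DURATION_S) / 1e6
print(f"{n_samples} samples per channel, about {size_mb:.1f} MB of signal data")
```

Under these assumptions, a one-hour file holds 921,600 samples per channel and roughly 42 MB of signal data, which is a useful sanity check when loading recordings.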
2. Seizure Annotation and Phenotyping
Expert epileptologists reviewed the continuous EEG to annotate seizure onset and offset times, which constitute the ground truth for all downstream analyses (Chen et al., 2014). Annotations are provided as text files alongside the EDFs, specifying the start and end times of each electrographic event (Handa et al., 2023). Because only seizures are marked explicitly, the pre-ictal and inter-ictal states are derived from these annotations rather than labeled directly. Pre-ictal epochs are frequently operationalized as fixed windows immediately preceding seizure onset (typical intervals: 10–30 min for prediction tasks, 10 s for onset localization), with inter-ictal epochs sampled from seizure-free periods well separated from ictal episodes (e.g., >4 h buffer) (Zhang et al., 18 Jan 2026, Chen et al., 2014). Statistical summaries, such as average seizure duration or time-of-day distributions, are not systematically included in the public release (Zhang et al., 18 Jan 2026).
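As an illustration of how such annotation text files might be turned into machine-readable seizure intervals, the sketch below parses a summary-style listing. The layout it expects (a `File Name:` line followed by `Seizure Start Time: N seconds` and `Seizure End Time: N seconds` lines, optionally numbered, with times as second offsets from the start of each file) is an assumption about the file format, and the sample text, file names, and `parse_summary` helper are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical summary text: file names and times are made up for illustration.
SAMPLE_SUMMARY = """\
File Name: example_01.edf
File Start Time: 10:00:00
File End Time: 11:00:00
Number of Seizures in File: 1
Seizure Start Time: 1200 seconds
Seizure End Time: 1245 seconds

File Name: example_02.edf
File Start Time: 11:00:05
File End Time: 12:00:05
Number of Seizures in File: 0

File Name: example_03.edf
File Start Time: 12:00:10
File End Time: 13:00:10
Number of Seizures in File: 2
Seizure 1 Start Time: 300 seconds
Seizure 1 End Time: 330 seconds
Seizure 2 Start Time: 2000 seconds
Seizure 2 End Time: 2090 seconds
"""

FILE_RE = re.compile(r"File Name:\s*(\S+)")
# The optional "(?: \d+)?" accepts both numbered and unnumbered seizure lines.
START_RE = re.compile(r"Seizure(?: \d+)? Start Time:\s*(\d+)\s*seconds")
END_RE = re.compile(r"Seizure(?: \d+)? End Time:\s*(\d+)\s*seconds")


def parse_summary(text: str) -> Dict[str, List[Tuple[int, int]]]:
    """Map each file name to a list of (onset_s, offset_s) seizure intervals."""
    seizures: Dict[str, List[Tuple[int, int]]] = {}
    current_file = None
    pending_start = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        file_match = FILE_RE.match(line)
        if file_match:
            current_file = file_match.group(1)
            seizures[current_file] = []
            pending_start = None
            continue
        start_match = START_RE.match(line)
        if start_match and current_file is not None:
            pending_start = int(start_match.group(1))
            continue
        end_match = END_RE.match(line)
        if end_match and current_file is not None and pending_start is not None:
            seizures[current_file].append((pending_start, int(end_match.group(1))))
            pending_start = None
    return seizures


intervals = parse_summary(SAMPLE_SUMMARY)
print(intervals)
```

Files without seizures are kept with an empty list, so that seizure-free recordings remain available for inter-ictal sampling.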
3. Data Preprocessing Pipelines
Preprocessing workflows are highly task-dependent, but all start from the original EDF files. Common steps include:
- Band-pass filtering: Pass bands range from 0.5–50 Hz (Benfenati et al., 2024) through 0.5–70 Hz (Chen et al., 2014) to 1–70 Hz (Uppalapati et al., 19 Sep 2025); where line noise is suppressed explicitly, a notch filter is applied around 60 Hz and sometimes around its second harmonic (117–123 Hz) (Zhang et al., 18 Jan 2026, Chen et al., 2014).
- High-pass filtering: To remove baseline drift, with cutoffs at 1 Hz in seizure prediction contexts (Zhang et al., 18 Jan 2026).
- Artifact correction: Ranges from none (for streamlined, hardware-oriented workflows) (Ingolfsson et al., 2021) to comprehensive pipelines including manual rejection, independent component analysis (ICA), and removal of drowsy/contaminated segments in biomarker discovery (Uppalapati et al., 19 Sep 2025).
- Normalization: Channelwise z-score normalization is often employed, using the mean and standard deviation of each channel (Zhang et al., 18 Jan 2026).
- Epoching: Non-overlapping windows of fixed length are extracted for analysis, e.g., 1 s (Zarei et al., 2023), 5 s (Zhang et al., 18 Jan 2026), 8 s (Ingolfsson et al., 2021, Benfenati et al., 2024), 10 s (Chen et al., 2014).
- Feature extraction: Signals are either used directly (deep learning), decomposed via Discrete Wavelet Transform (DWT) (Ingolfsson et al., 2021), or reduced to spectral/spatial features (bandpower, connectivity metrics, line-length, etc.) (Uppalapati et al., 19 Sep 2025, Zarei et al., 2023).
Preprocessing procedures are summarized in the following table:
| Paper | Band-pass / notch filter | Artifact handling | Normalization | Epoch length |
|---|---|---|---|---|
| (Zhang et al., 18 Jan 2026) | 1 Hz high-pass; 57–63 Hz and 117–123 Hz notch | None specified | z-score (per channel) | 5 s |
| (Benfenati et al., 2024) | 0.5–50 Hz band-pass (5th order) | None (robust to artifacts) | None | 8 s |
| (Uppalapati et al., 19 Sep 2025) | 1–70 Hz band-pass; 60 Hz notch | Manual rejection, ICA, bad-channel handling | None | ~6 min blocks |
| (Zarei et al., 2023) | None specified | None (features only) | Quantile (per feature) | 1 s |
| (Ingolfsson et al., 2021) | None explicit (implicit in DWT) | None (DWT only) | None | 2–8 s |
| (Chen et al., 2014) | 0.5–70 Hz band-pass; 59–61 Hz notch | None | None | 10 s |
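The normalization and epoching steps can be sketched in a few lines. The example below is a minimal, standard-library illustration and not the pipeline of any cited paper: it applies channelwise z-scoring and then cuts non-overlapping fixed-length windows. The helper names, the synthetic two-channel signal, and the choice to drop a trailing partial window are all assumptions made for the example.

```python
import math
from statistics import fmean, pstdev
from typing import List, Sequence


def zscore_channel(samples: Sequence[float]) -> List[float]:
    """Subtract the channel mean and divide by the channel standard deviation."""
    mu = fmean(samples)
    sigma = pstdev(samples, mu)
    if sigma == 0:
        # A flat channel carries no information; map it to zeros.
        return [0.0 for _ in samples]
    return [(x - mu) / sigma for x in samples]


def epoch_channels(
    channels: Sequence[Sequence[float]], fs: int, epoch_s: float
) -> List[List[List[float]]]:
    """Split a [channel][sample] recording into non-overlapping epochs.

    Returns a nested list indexed as [epoch][channel][sample]. Samples that
    do not fill a complete window at the end are dropped (an assumption).
    """
    win = int(round(fs * epoch_s))
    n_samples = min(len(ch) for ch in channels)
    n_epochs = n_samples // win
    return [
        [list(ch[k * win:(k + 1) * win]) for ch in channels]
        for k in range(n_epochs)
    ]


# Synthetic stand-in for two EEG channels: 12 s of sinusoids at 256 Hz.
FS = 256
toy = [
    [3.0 + 2.0 * math.sin(2 * math.pi * 10 * t / FS) for t in range(12 * FS)],
    [-1.0 + 0.5 * math.sin(2 * math.pi * 4 * t / FS) for t in range(12 * FS)],
]
normalized = [zscore_channel(ch) for ch in toy]
epochs = epoch_channels(normalized, fs=FS, epoch_s=5)
print(len(epochs), len(epochs[0]), len(epochs[0][0]))
```

Here the statistics are computed over the whole channel before windowing. Computing them on training data only is the safer choice when the normalized epochs feed a train/test evaluation.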
4. Task-Specific Dataset Construction
Common tasks and their cohort definitions:
- Seizure prediction: Preictal versus interictal classification. Preictal data are typically defined as 5 s epochs drawn from the 30 min before onset (Zhang et al., 18 Jan 2026) or as the 10 s immediately before onset (Chen et al., 2014). Interictal epochs are taken from periods at least 4 h away from any seizure, with postictal buffers (e.g., 30 s) enforced (Zhang et al., 18 Jan 2026). Ictal states are often excluded from training in prediction paradigms.
- Seizure detection: Detection/classification of seizure segments; positive windows are any epochs overlapping with an annotated seizure (Benfenati et al., 2024, Ingolfsson et al., 2021). Extreme class imbalance is managed variously, e.g., via weighted sampling or majority-vote smoothing (Benfenati et al., 2024, Ingolfsson et al., 2021).
- Biomarker extraction: Selection of interictal, artifact-free, awake, and eyes-open blocks for electrophysiological severity indexing such as DS-Qi (Uppalapati et al., 19 Sep 2025).
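The labeling rules for the prediction task can be expressed as a small function. The sketch below uses the thresholds quoted above (a 30 min preictal horizon and a 4 h interictal separation). It assumes that seizure onsets and offsets have already been placed on one continuous per-patient timeline in seconds, and the function name and return values are illustrative rather than taken from any cited paper.

```python
from typing import List, Optional, Tuple

PREICTAL_S = 30 * 60          # preictal horizon: 30 min before onset
INTERICTAL_GAP_S = 4 * 3600   # interictal data must be >= 4 h from any seizure


def label_epoch(
    epoch_start: float,
    epoch_len: float,
    seizures: List[Tuple[float, float]],
) -> Optional[str]:
    """Return 'ictal', 'preictal', 'interictal', or None for excluded epochs.

    `seizures` holds (onset_s, offset_s) pairs on the same timeline as
    `epoch_start`.
    """
    epoch_end = epoch_start + epoch_len
    # Any overlap with an annotated seizure makes the epoch ictal.
    for onset, offset in seizures:
        if epoch_start < offset and epoch_end > onset:
            return "ictal"
    # Fully inside the horizon preceding some onset: preictal.
    for onset, _ in seizures:
        if onset - PREICTAL_S <= epoch_start and epoch_end <= onset:
            return "preictal"
    # Too close to a seizure to count as interictal: excluded.
    for onset, offset in seizures:
        if (epoch_end > onset - INTERICTAL_GAP_S
                and epoch_start < offset + INTERICTAL_GAP_S):
            return None
    return "interictal"


# One hypothetical seizure from t = 20000 s to t = 20060 s.
seizures = [(20000.0, 20060.0)]
for start in (0.0, 18000.0, 19000.0, 20010.0, 40000.0):
    print(start, label_epoch(start, 5.0, seizures))
```

In this sketch the 4 h exclusion zone is applied on both sides of each seizure, so a shorter postictal buffer such as 30 s is subsumed by it. A pipeline that uses only a short postictal buffer would need a separate, asymmetric rule.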
Dataset splits may be within-subject (e.g., random 70/30 train/test partition) (Zhang et al., 18 Jan 2026), leave-one-recording-out or nested cross-validation (Benfenati et al., 2024), or per-patient model evaluation protocols (Zarei et al., 2023).
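A leave-one-recording-out protocol can be sketched as a generator over folds. The example below is a minimal illustration, not the evaluation code of any cited study; it assumes that epochs have already been indexed and grouped by their source file, and the file names and indices are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_recording_out(
    epochs_by_file: Dict[str, List[int]],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_file, train_indices, test_indices) for each recording.

    Every fold tests on all epochs from one file and trains on the rest,
    so no recording contributes epochs to both sides of a fold.
    """
    for held_out in sorted(epochs_by_file):
        test_idx = list(epochs_by_file[held_out])
        train_idx = [
            i
            for name in sorted(epochs_by_file)
            if name != held_out
            for i in epochs_by_file[name]
        ]
        yield held_out, train_idx, test_idx


# Hypothetical grouping of epoch indices by source recording.
groups = {
    "example_01.edf": [0, 1, 2],
    "example_02.edf": [3, 4],
    "example_03.edf": [5, 6, 7, 8],
}
folds = list(leave_one_recording_out(groups))
for held_out, train_idx, test_idx in folds:
    print(held_out, len(train_idx), len(test_idx))
```

Grouping by recording keeps neighboring windows of the same file on one side of the split, which a random epoch-level partition does not guarantee.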
5. Derivative and Balanced Releases
To facilitate reproducible machine-learning (ML) benchmarking, the data are distributed both in their original form and as derivative releases:
- IEEE DataPort pre-processed version: Balanced, fixed-length preictal and ictal segments (4096 s per patient, 23 channels), with no additional filtering or artifact removal (Handa et al., 2023).
- Meta-EEG: Fixed-length fragments for ML (10 s pre-ictal, ictal, 10 s post-ictal, and non-seizure); not yet open-sourced (Handa et al., 2023).
- PhysioNet (original release): Original EDFs with annotation files, for customizable preprocessing and segmentation workflows (Handa et al., 2023).
6. Applications and Algorithmic Benchmarks
The dataset has catalyzed algorithmic development and benchmarking in:
- EEG decoding: Hierarchical convolutional fusion transformers (HCFT), achieving per-patient sensitivity of 99.10%, specificity of 98.82%, and 0.0236 FP/h (Zhang et al., 18 Jan 2026).
- Deep learning segmentation: BERT-based foundation models fine-tuned to CHB-MIT and evaluated with leave-one-out cross-validation (LOOCV), yielding sensitivity and false-positive rates described as suitable for clinical deployment (e.g., 0.23 FP/h at 72.6% sensitivity) (Benfenati et al., 2024).
- Classical ML pipelines: Support Vector Machines, Random Forest, Extra Trees, AdaBoost, etc., frequently achieving near-perfect sensitivity and fewer than one false positive per hour in subject-specific settings, including when constrained to only 4 temporal channels for wearable hardware prototyping (Ingolfsson et al., 2021).
- Feature embedding methods: Periodic embedding modules yielding 100% sensitivity and up to 99.02% specificity (SVM) for 1 s epoch-based detection (Zarei et al., 2023).
- Information-theoretic and connectivity analyses: L-SODA for preictal detection (97.1% sensitivity, mean onset latency 2.8 s) (Chen et al., 2014); DS-Qi for electrophysiological disease severity quantification (Uppalapati et al., 19 Sep 2025).
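The figures quoted above combine three metrics: sensitivity, specificity, and false positives per hour (FP/h). The sketch below shows one way to compute them from epoch-level labels. Conventions differ between papers, so the choices made here are assumptions: false positives are counted per epoch rather than per event, and the FP/h denominator is the total monitored time. The function name and the synthetic labels are illustrative.

```python
from typing import Dict, Sequence


def epoch_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], epoch_s: float
) -> Dict[str, float]:
    """Epoch-wise sensitivity, specificity, and false positives per hour.

    Labels are 1 for seizure epochs and 0 otherwise; `epoch_s` is the
    epoch length in seconds.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    hours = len(pairs) * epoch_s / 3600.0
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "fp_per_hour": fp / hours if hours else nan,
    }


# Synthetic example: one hour of 5 s epochs (720 in total), 4 seizure epochs.
y_true = [0] * 716 + [1] * 4
y_pred = [0] * 716 + [1, 1, 1, 0]   # 3 of the 4 seizure epochs detected
y_pred[10] = 1                      # two false alarms
y_pred[500] = 1
metrics = epoch_metrics(y_true, y_pred, epoch_s=5)
print(metrics)
```

Because results depend on these conventions and on the evaluation split, FP/h values from different studies are comparable only when the counting rule and denominator match.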
7. Research Impact and Limitations
The CHB-MIT database’s influence derives from its open availability, standardized annotation schema, and breadth across pediatric epilepsy phenotypes (Handa et al., 2023). It is the prevailing benchmark for pediatric seizure detection and EEG decoding (Benfenati et al., 2024). Limitations include high inter- and intra-patient variability, imbalanced class distributions, the absence of high-density or invasive channels, limited metadata on comorbidities and pharmacotherapy, and variability in montage (downstream studies use between 18 and 23 channels) (Zhang et al., 18 Jan 2026, Uppalapati et al., 19 Sep 2025). Average seizure duration, etiological homogeneity, and finer quantitative EEG state metadata are incompletely documented, so most research use cases require task-specific, post hoc curation (Zhang et al., 18 Jan 2026).
The dataset’s lasting value lies in catalyzing robust methodology development across signal processing, clinical BCI, and explainable AI for neurocritical care.