BenchECG: ECG Benchmark Suite

Updated 25 December 2025

BenchECG is a comprehensive benchmark suite for ECG analysis featuring standardized protocols across classification, detection, forecasting, and generation tasks.
It introduces the Feature-based Fréchet Distance (FFD) to robustly evaluate waveform accuracy and clinical signal morphology, overcoming limitations of conventional metrics.
BenchECG benchmarks state-of-the-art architectures and pioneers the Patch Step-by-Step Model (PSSM) that captures ECG’s quasi-periodic, multiscale structure.

BenchECG is a comprehensive benchmark suite and reference protocol for the evaluation of electrocardiogram (ECG) signal analysis, addressing the unique physiological, clinical, and methodological aspects of ECG data. It unifies downstream task definitions, establishes a semantically meaningful evaluation metric, benchmarks state-of-the-art time-series models, and introduces a specialized architecture tailored to ECG’s quasi-periodic, multiscale structure. BenchECG is motivated by the limitations of generic time-series frameworks, which often fail to address ECG-specific signal morphology, clinical event granularity, and the requirements of waveform-aware evaluation (Tang et al., 15 Jul 2025).

1. Task Taxonomy and Clinical Relevance

BenchECG operationalizes ECG analysis via four canonical downstream tasks, each mapped to explicit clinical and signal-processing challenges:

Classification: Single-lead or multi-lead ECG segment input, disease/event label output (e.g., atrial fibrillation, hyperkalemia). Key technical challenges are subtle morphological differences, pronounced class imbalance, and inter-patient heterogeneity.
Detection: Continuous ECG trace input, prediction of time-positions of events (P-wave onsets, QRS complexes, T-wave peaks). The primary obstacles are extreme class skew (positive samples <5%), noise, and baseline drift.
Forecasting: Given past tokens $\{s_1, \dots, s_n\}$, the model predicts future tokens $\{s_{n+1}, \dots, s_{n+n'}\}$. The challenge centers on the nonstationarity and multiscale periodicity inherent in physiological dynamics.
Generation: Surrogate/noisy ECG inputs (e.g., abdominal ECG, PPG) are mapped to target (physiological or denoised) ECG outputs, entailing difficulty in non-linear source separation, artifact removal, and waveform alignment.

Each task is standardized regarding input format (resampled to 100 Hz, zero-padded/cropped to 500 samples), data preprocessing (exclusion/interpolation protocols for artifact/missingness), and train/test splits (50% each, balanced via up/down-sampling). Datasets cover classic and recent consortia: CPSC2018/2021, AF, NIFEADB, RFAA, SPB (classification); MITDB, SVDB, NFE, FEPL, CPSC2020, MITPDB (detection); CPSC2019, DALIA, RDBH, MIMICSub, NFE, PTB (forecasting); and MITDB+noise, PTBXL+noise, ADFECGDB, FEPL, BIDMC, SST (generation) (Tang et al., 15 Jul 2025).
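The input standardization described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function name `standardize_ecg`, the use of linear interpolation for resampling, and the choice to crop from the start and pad at the end are all assumptions, since the source only states the target rate (100 Hz) and length (500 samples).

```python
import numpy as np

TARGET_FS = 100    # Hz, as stated in the protocol
TARGET_LEN = 500   # samples, i.e. 5 s at 100 Hz

def standardize_ecg(signal, fs):
    """Resample a 1-D ECG trace to 100 Hz, then crop or zero-pad to 500 samples."""
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) / fs * TARGET_FS))
    # Linear interpolation onto the new time grid (assumed resampling method;
    # it applies no anti-aliasing filter).
    t_in = np.arange(len(signal)) / fs
    t_out = np.arange(n_out) / TARGET_FS
    resampled = np.interp(t_out, t_in, signal)
    if n_out >= TARGET_LEN:
        return resampled[:TARGET_LEN]                   # crop (assumed: keep the start)
    return np.pad(resampled, (0, TARGET_LEN - n_out))   # zero-pad (assumed: at the end)

# A 10 s record at 360 Hz is cropped; a 2 s record at 250 Hz is padded.
long_rec = standardize_ecg(np.sin(np.linspace(0, 60, 3600)), fs=360)
short_rec = standardize_ecg(np.ones(500), fs=250)
```

Both outputs have exactly 500 samples. A real pipeline would likely use an anti-aliased resampler rather than plain interpolation.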

2. Limitations of Conventional Metrics and Feature-Based Fréchet Distance

Standard pointwise metrics such as mean squared error (MSE) are inadequate for ECG: they are insensitive to waveform semantics and over-penalize minor temporal shifts and amplitude variations. For instance, a predicted trace that is morphologically congruent but phase-shifted can yield a higher MSE than a semantically poor flat line, and R-peak amplitude outliers disproportionately affect rankings.
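The phase-shift failure mode is easy to reproduce on a toy signal. The sketch below uses a synthetic train of narrow Gaussian spikes as a stand-in for R-peaks (an assumption for illustration, not real ECG data) and compares the MSE of a copy delayed by 50 ms with the MSE of a flat line.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equal-length traces."""
    return float(np.mean((a - b) ** 2))

t = np.arange(500)                 # 5 s at 100 Hz
peaks = np.arange(50, 500, 100)    # one spike per second (synthetic rhythm)
# Narrow Gaussian spikes stand in for R-peaks; widths and heights are arbitrary.
target = np.exp(-0.5 * ((t[:, None] - peaks[None, :]) / 1.5) ** 2).sum(axis=1)

shifted = np.roll(target, 5)       # identical morphology, delayed by 5 samples (50 ms)
flat = np.zeros_like(target)       # no morphology at all

mse_shifted = mse(target, shifted)
mse_flat = mse(target, flat)
```

Because the shifted spikes barely overlap the originals, `mse_shifted` is close to twice `mse_flat`, so MSE ranks the flat line above the correctly shaped trace.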

To resolve these shortcomings, BenchECG formulates the Feature-based Fréchet Distance (FFD). ECG signals are projected via a fixed feature extractor $f:\mathbb{R}^t\to\mathbb{R}^k$ (a pre-trained transformer encoder) into feature distributions for real and generated signals, modeled as Gaussians $\mathcal{N}(\mu, \Sigma)$ and $\mathcal{N}(\hat\mu, \hat\Sigma)$, respectively. The FFD is:

$\mathrm{FFD}(e, \hat e) = \|\mu - \hat\mu\|^2 + \mathrm{Tr}\left(\Sigma + \hat\Sigma - 2(\Sigma \hat\Sigma)^{1/2}\right)$

where the means and covariances are estimated empirically over $N$ samples. FFD is robust to translational misalignment, reflects clinical waveform shape and periodicity, and decouples amplitude sensitivity from clinical morphology, unlike traditional pointwise metrics (Tang et al., 15 Jul 2025).
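The Fréchet term itself can be computed in a few lines once features are available. The sketch below assumes the features have already been extracted into two arrays of shape $(N, k)$; the function name `frechet_distance` is hypothetical, and the pre-trained encoder $f$ is not reproduced here. The trace of the matrix square root is obtained from the eigenvalues of $\Sigma\hat\Sigma$, one of several equivalent ways to evaluate it.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (N, k) feature sets."""
    mu = feats_real.mean(axis=0)
    mu_hat = feats_gen.mean(axis=0)
    sigma = np.atleast_2d(np.cov(feats_real, rowvar=False))
    sigma_hat = np.atleast_2d(np.cov(feats_gen, rowvar=False))
    # Tr((Sigma Sigma_hat)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of Sigma @ Sigma_hat, which are real and non-negative for
    # PSD inputs; clip tiny negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma @ sigma_hat).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = np.sum((mu - mu_hat) ** 2)
    return float(mean_term + np.trace(sigma) + np.trace(sigma_hat) - 2.0 * tr_sqrt)

# Placeholder features (random, standing in for encoder outputs).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 4))
d_same = frechet_distance(feats, feats)          # identical sets
d_shift = frechet_distance(feats, feats + 3.0)   # same covariance, shifted mean
```

With identical sets the distance vanishes up to round-off. Shifting every feature by 3 leaves the covariance unchanged, so only the mean term contributes: $4 \times 3^2 = 36$.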

3. Benchmarking State-of-the-Art Architectures

BenchECG systematically evaluates a representative suite of large time-series models, including:

Informer: Transformer encoder/decoder with ProbSparse attention ($d=512$, 6 layers, $m=10$).
Medformer: Multi-granularity patching, convolutional fusion ($d=256$, patch lengths 5, 10, 20).
UniTS: Masked reconstruction, multitask heads ($d=512$, 12 layers, $m=20$).
Timer: GPT-style decoder for next-token prediction ($d=768$, 24 layers, $m=10$).
ECGPT: Transformer decoder, pre-trained on ECG corpora.

Two regimes are tested for each large model: "raw" pre-training on generic time-series and ECG-specific extended pre-training (denoted *).

Performance is summarized in the following table (accuracy and F1: higher is better; FFD: lower is better):

Model	Classification (accuracy)	Detection (F1)	Forecasting (FFD)	Generation (FFD)
Informer	0.686	0.439	1.260	0.665
Medformer	0.766	0.326	1.105	0.601
UniTS	0.701	0.508	1.616	0.671
Timer	0.876	0.638	1.307	0.608
Timer*	0.932	0.734	0.322	0.407
PSSM	0.947	0.820	0.211	0.133

Models pre-trained only on general time-series data, without ECG-specific adaptation (e.g., Informer, UniTS), display pronounced deficits in both morphological accuracy and clinical waveform detection. ECG-focused pre-training substantially improves Timer (reported as Timer*), but the proposed PSSM achieves the best score in every column of the table (Tang et al., 15 Jul 2025).

4. The Patch Step-by-Step Model (PSSM)

PSSM is an explicit ECG-optimized encoder–decoder, hierarchically compressing and reconstructing temporal resolution to match the multiscale periodicity imposed by the cardiac conduction system.

Encoder (for $i=1\ldots L$):

Patch: adjacent token groups are averaged,

$\mathrm{Patch}(\bm e^i)_k = \frac{1}{2} (e^i_{2k-1} + e^i_{2k})$

$\mathrm{ConvBlock}^i$: each patch is fed through two Conv1D layers (ReLU, LayerNorm), doubling feature dimensionality,

$\bm e^{i+1} = \mathrm{ConvBlock}^i(\mathrm{Patch}(\bm e^i))$

Decoder (for $i=L+1\ldots 2L$):

Learnable UnPatch: splits each token back into two with weights $c_1, c_2$,

$(e^{i+1}_{2k-1}, e^{i+1}_{2k}) = (\mathrm{ConvBlock}^i(c_1\,e^i_k), \mathrm{ConvBlock}^i(c_2\,e^i_k))$

A final linear projection yields a feature-mapped output.

Novel elements are the iterative patch/unpatch mechanism bridging local (waveform-scale) and global (rhythm-scale) regularities, and ConvBlocks that encode periodicity with higher efficiency than transformers for quasi-periodic signals. Ablation confirms that removing hierarchical patching severely impairs task performance (e.g., detection F1 down by 70%). The layered, conduction-inspired cascade recapitulates salient ECG structure (Tang et al., 15 Jul 2025).
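The patch/unpatch cascade can be traced with a small NumPy forward pass. This is a shape-level sketch under several assumptions that the source does not specify: kernel size 3, the ReLU-then-LayerNorm ordering inside each ConvBlock, halving of the feature dimension in the decoder, per-feature vectors for $c_1$ and $c_2$, a scalar-to-vector input embedding, and random weights in place of trained ones. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # random weights stand in for trained parameters

def patch(e):
    """Average adjacent token pairs: (T, d) -> (T // 2, d)."""
    T, d = e.shape
    return e[: T - T % 2].reshape(T // 2, 2, d).mean(axis=1)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def conv1d(x, w):
    """Zero-padded, length-preserving convolution. x: (T, d_in), w: (K, d_in, d_out)."""
    K, T = w.shape[0], x.shape[0]
    xp = np.pad(x, ((K // 2, K // 2), (0, 0)))
    return sum(xp[k:k + T] @ w[k] for k in range(K))

def make_conv_block(d_in, d_out, kernel=3):
    """Two Conv1D layers, each followed by ReLU and LayerNorm (assumed order)."""
    w1 = rng.normal(scale=(kernel * d_in) ** -0.5, size=(kernel, d_in, d_out))
    w2 = rng.normal(scale=(kernel * d_out) ** -0.5, size=(kernel, d_out, d_out))
    def block(x):
        h = layer_norm(np.maximum(conv1d(x, w1), 0.0))
        return layer_norm(np.maximum(conv1d(h, w2), 0.0))
    return block

def unpatch(e, block, c1, c2):
    """Split each token into two via weights c1, c2 and interleave the results."""
    a, b = block(c1 * e), block(c2 * e)
    out = np.empty((2 * e.shape[0], a.shape[1]))
    out[0::2], out[1::2] = a, b
    return out

def pssm_forward(x, L=2, d0=8):
    e = x[:, None] * rng.normal(size=(1, d0))   # lift 1 channel to d0 features (assumed)
    d = d0
    for _ in range(L):                          # encoder: halve length, double features
        e = make_conv_block(d, 2 * d)(patch(e))
        d *= 2
    for _ in range(L):                          # decoder: double length, halve features
        c1, c2 = rng.normal(size=d), rng.normal(size=d)
        e = unpatch(e, make_conv_block(d, d // 2), c1, c2)
        d //= 2
    return e @ rng.normal(size=d)               # final linear projection to one channel

y = pssm_forward(np.sin(np.linspace(0, 20, 500)))
```

With $L=2$ the sequence length goes 500, 250, 125, 250, 500, so the output aligns with the input. Deeper cascades need an input length divisible by $2^L$, because this `patch` drops a trailing odd token.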

5. Comparative Insights and Impact

Key outcomes of the BenchECG protocol are:

Metric advancement: FFD remains stable under ±50 ms waveform shifts, whereas MSE increases more than 5-fold. FFD therefore tracks the waveform semantics used in human clinical interpretation more closely than pointwise error does.
Model specialization: PSSM consistently outperforms Medformer (by 80–150% relative improvement in F1/FFD) and even large transformer models with ECG-focused pre-training (Timer*, 8–84% improvement).
Generalization limitation of generic models: Informer, UniTS, Timer (without ECG-specific adaptation) fail to capture quasi-periodic ECG intricacies.
Ablation effects: Hierarchical patching is critical for recognizing multi-scale dependencies, as confirmed by ablation.
Attention phenomena: Only the ECG-fine-tuned Timer* demonstrates attention maps with periodic peaks, whereas raw Timer lacks such specificity.

BenchECG creates an infrastructure for reproducible, clinically meaningful research in ECG model development, providing standardized data handling, explicit evaluation protocols, and clear guidance for future foundation model training and assessment (Tang et al., 15 Jul 2025).

6. Relation to Other ECG Benchmarks

Other ECG benchmarks either aggregate diverse datasets for cross-center foundation model evaluation (e.g., OpenECG (Wan et al., 2 Mar 2025)) or focus on a single task such as survival analysis (Lukyanenko et al., 24 Jun 2024), anomaly detection (Jiang et al., 2023), or waveform delineation (Park et al., 24 Jul 2025). BenchECG is distinguished by centering its suite on the interplay between clinical task diversity, semantically aligned metrics, and ECG-specific model architecture. It complements, rather than replaces, dataset-centric efforts by enabling direct side-by-side comparison of general-purpose and ECG-specialized models under controlled, clinically calibrated conditions, and should be seen in the context of the ongoing expansion of ECG benchmarking infrastructure (Tang et al., 15 Jul 2025).

7. Future Directions

Immediate extensions include broadening the data sources to cover greater demographic and device heterogeneity, refining the FFD feature extractor via larger transformer encoders, systematically evaluating pre-training regimes (contrastive, self-supervised, generative), and integrating ensemble metrics that reflect inter-observer agreement. A plausible implication is that the BenchECG protocol could serve as the backbone for a next-generation, open-access ECG foundation model consortium, anchoring methodological standards in ECG signal analysis (Tang et al., 15 Jul 2025).