
Automatic diagnosis of the 12-lead ECG using a deep neural network (1904.01949v2)

Published 2 Apr 2019 in cs.LG, eess.SP, and stat.ML

Abstract: The role of automatic electrocardiogram (ECG) analysis in clinical practice is limited by the accuracy of existing models. Deep Neural Networks (DNNs) are models composed of stacked transformations that learn tasks by example. This technology has recently achieved striking success in a variety of tasks, and there are great expectations about how it might improve clinical practice. Here we present a DNN model trained on a dataset with more than 2 million labeled exams analyzed by the Telehealth Network of Minas Gerais and collected under the scope of the CODE (Clinical Outcomes in Digital Electrocardiology) study. The DNN outperforms cardiology resident medical doctors in recognizing 6 types of abnormalities in 12-lead ECG recordings, with F1 scores above 80% and specificity over 99%. These results indicate that ECG analysis based on DNNs, previously studied in a single-lead setup, generalizes well to 12-lead exams, taking the technology closer to standard clinical practice.

Citations (589)

Summary

  • The paper introduces a ResNet-based deep learning model to automatically diagnose 12-lead ECG abnormalities from a dataset of over 2.3 million recordings.
  • It demonstrates robust performance with F1 scores above 0.80 and near-perfect specificity, outperforming human evaluators in classification accuracy.
  • Key innovations include standardized ECG preprocessing, a multi-stage label generation process, and an efficient network design with reduced model complexity.

The paper "Automatic diagnosis of the 12-lead ECG using a deep neural network" (Automatic diagnosis of the 12-lead ECG using a deep neural network, 2019) presents a deep learning approach for classifying multiple abnormalities from standard 12-lead electrocardiograms (S12L-ECGs). This work focuses on adapting architectures successful in computer vision, specifically Residual Networks (ResNets), to the one-dimensional, multi-channel nature of ECG data and evaluating their performance on a large clinical dataset.

Methodology and Data

The paper utilized a substantial dataset derived from the Clinical Outcomes in Digital Electrocardiology (CODE) study, encompassing over 2.3 million S12L-ECGs collected from more than 1.6 million patients via the Telehealth Network of Minas Gerais (TNMG) in Brazil. The ECGs typically had durations of 7 to 10 seconds and were sampled at rates between 300 and 600 Hz.

Data Preprocessing:

Input ECG signals were standardized prior to model training:

  1. Resampling: All ECG recordings were resampled to a uniform 400 Hz.
  2. Padding: Each lead was zero-padded to a fixed length of 4096 samples. This standardization ensures consistent input dimensions for the neural network.
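
A minimal sketch of this preprocessing, assuming signals arrive as (samples, 12) NumPy arrays with a known original sampling rate; the resampling method and the placement of the padding are assumptions, since the paper specifies only the target rate and length:

import numpy as np
from scipy.signal import resample_poly
from fractions import Fraction

TARGET_FS = 400    # Hz, per the paper
TARGET_LEN = 4096  # samples, per the paper

def preprocess_ecg(signal, fs):
    """Resample a (samples, 12) ECG to 400 Hz and zero-pad to 4096 samples."""
    # Polyphase resampling with a rational up/down factor (method is an assumption)
    frac = Fraction(TARGET_FS, int(fs)).limit_denominator()
    resampled = resample_poly(signal, frac.numerator, frac.denominator, axis=0)
    # Zero-pad (or truncate) each lead to a fixed length of 4096 samples
    out = np.zeros((TARGET_LEN, signal.shape[1]), dtype=np.float32)
    n = min(TARGET_LEN, resampled.shape[0])
    out[:n] = resampled[:n]
    return out

At 400 Hz, a 10-second exam yields 4000 samples, so the 4096-sample target leaves a small zero-padded margin for all recordings.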

Label Generation:

Generating reliable ground truth labels for the large training and validation sets (98% of the data) involved a multi-stage process, addressing the challenge of large-scale clinical data annotation:

  1. NLP Extraction: Medical diagnoses were extracted from cardiologist text reports using a Lazy Associative Classifier.
  2. Automated Diagnosis: Automated interpretations were obtained using the University of Glasgow (Uni-G) ECG analysis program (statements and Minnesota codes).
  3. Consensus Labeling: Initial labels were accepted if the NLP-extracted medical diagnosis agreed with at least one automated diagnosis method.
  4. Rule-Based Filtering: Labels were rejected based on conflicts with standard ECG criteria (e.g., rejecting Sinus Bradycardia (SB) if heart rate > 50 bpm).
  5. Sensitivity Rules: Specific rules were applied to accept certain labels based on medical diagnosis reliability for specific conditions (e.g., Right Bundle Branch Block (RBBB), First-degree Atrioventricular Block (1dAVb)).
  6. Manual Review: Ambiguous cases (~1.5% of the dataset) underwent manual review by supervised medical students.
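
A schematic of the consensus and rule-based filtering steps, using the Sinus Bradycardia heart-rate rule stated above; the data representation, field names, and the omission of the sensitivity rules and manual review are assumptions for illustration:

def consensus_label(nlp_dx, unig_dx, minnesota_dx, heart_rate):
    """Accept an abnormality code only if sources agree and no rule rejects it.

    nlp_dx, unig_dx, minnesota_dx: sets of abnormality codes from the NLP
    extraction and the two automated interpreters (hypothetical representation).
    """
    accepted = set()
    for dx in nlp_dx:
        # Consensus: the NLP diagnosis must agree with at least one automated source
        if dx not in unig_dx and dx not in minnesota_dx:
            continue
        # Rule-based filtering, e.g. reject Sinus Bradycardia if heart rate > 50 bpm
        if dx == "SB" and heart_rate > 50:
            continue
        accepted.add(dx)
    return accepted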

Test Set Annotation:

An independent test set of 827 ECGs from distinct patients was curated. Labels for this set were established through annotation by two certified cardiologists, with disagreements resolved by a third senior specialist. Annotators were provided with Uni-G measurements and selected diagnoses from a predefined list, avoiding reliance on NLP for this critical evaluation set.

Deep Neural Network Architecture

The core of the classification system is a Deep Neural Network (DNN) based on the Residual Network (ResNet) architecture, adapted for 1D signal processing.

Architecture Details:

  • Input: 12-lead ECG data, preprocessed to 12 x 4096 tensors.
  • Initial Layer: A standard convolutional layer.
  • Residual Blocks: Four residual blocks follow the initial layer. Each block contains two convolutional layers.
    • Convolution: Filters with a length of 16 were used.
    • Filter Depth: The number of filters started at 64 and doubled every second residual block (e.g., 64, 64, 128, 128).
    • Subsampling: Feature map dimensions were reduced by a factor of 4 within each residual block, likely through strided convolutions or pooling layers.
    • Normalization & Activation: Batch Normalization (BN) followed by Rectified Linear Unit (ReLU) activation was applied after each convolution. The specific implementation used pre-activation ResNets, where BN and ReLU precede the convolution within the residual unit.
    • Regularization: Dropout was applied after the ReLU activation.
    • Skip Connections: Identity mappings (skip connections) characteristic of ResNets were implemented. Max Pooling and 1x1 Convolutions were used within the skip connections to ensure dimensional compatibility when feature map sizes or depths changed between blocks.
  • Final Layer: A fully connected (Dense) layer with a Sigmoid activation function. This output layer produces probabilities for each of the target abnormalities, suitable for multi-label classification (an ECG can present multiple conditions simultaneously).

Model Size: The authors noted that their final architecture used significantly fewer layers and parameters (approximately one-quarter) compared to a previous ResNet applied to single-lead ECGs, despite being trained on a much larger dataset. This suggests efficiency in the architectural choices for the 12-lead context.

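The following Keras sketch is consistent with the architecture described above; it is an illustration, not the authors' released code. The dropout rate, the shortcut construction, the pooling before the dense layer, and the per-block subsampling factor are assumptions or simplifications.
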
import tensorflow as tf

def residual_block(input_tensor, filters, kernel_size=16, subsample=False):
    # Pre-activation path: BN and ReLU come before each convolution
    x = tf.keras.layers.BatchNormalization()(input_tensor)
    x = tf.keras.layers.Activation('relu')(x)

    # A strided first convolution performs the subsampling when requested
    # (the paper subsamples by 4 per block; stride 2 here is a simplification)
    stride = 2 if subsample else 1

    # First convolution
    x = tf.keras.layers.Conv1D(filters, kernel_size, strides=stride, padding='same', kernel_initializer='he_normal')(x)

    # Dropout after activation for regularization (rate 0.2 is an assumed example)
    x = tf.keras.layers.Dropout(0.2)(x)

    # Second pre-activation and convolution
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Activation('relu')(x)
    x = tf.keras.layers.Conv1D(filters, kernel_size, strides=1, padding='same', kernel_initializer='he_normal')(x)

    # Shortcut connection: adjust dimensions when length or depth changes
    shortcut = input_tensor
    if subsample or input_tensor.shape[-1] != filters:
        # A strided 1x1 convolution matches both length and depth here; the
        # paper combines Max Pooling and 1x1 convolutions for the same purpose
        shortcut = tf.keras.layers.Conv1D(filters, 1, strides=stride, padding='same', kernel_initializer='he_normal')(input_tensor)

    # Add shortcut to main path
    return tf.keras.layers.Add()([x, shortcut])

num_classes = 6  # 1dAVb, RBBB, LBBB, SB, AF, ST

inputs = tf.keras.Input(shape=(4096, 12))
x = tf.keras.layers.Conv1D(64, 16, padding='same')(inputs)  # initial convolutional layer

# Four residual blocks; the filter count doubles every second block (64, 64, 128, 128)
x = residual_block(x, 64)
x = residual_block(x, 64, subsample=True)
x = residual_block(x, 128)
x = residual_block(x, 128, subsample=True)

x = tf.keras.layers.GlobalAveragePooling1D()(x)  # pooling before the dense layer (an assumption)
outputs = tf.keras.layers.Dense(num_classes, activation='sigmoid')(x)  # one sigmoid per abnormality

model = tf.keras.Model(inputs, outputs)

Training and Evaluation

Training:

  • Optimizer: Adam optimizer was employed.
  • Loss Function: The model was trained to minimize the average binary cross-entropy loss across the different output labels: $L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \left[ y_{ij} \log(\hat{y}_{ij}) + (1-y_{ij}) \log(1-\hat{y}_{ij}) \right]$, where $N$ is the batch size, $C$ is the number of classes (6), $y_{ij}$ is the ground truth label, and $\hat{y}_{ij}$ is the model's prediction.
  • Learning Rate: An initial learning rate of 0.001 was used, with a reduction by a factor of 10 if the validation loss plateaued for 7 epochs.
  • Initialization: Weights were initialized using He initialization; biases were set to zero.
  • Epochs: Training proceeded for 50 epochs, and the model checkpoint with the lowest validation loss was selected for final evaluation.
  • Hyperparameter Tuning: An iterative manual search (~30 iterations) explored parameters like network depth, kernel sizes, batch sizes, learning rates, optimizers, activation functions, and dropout rates, guided by performance on the validation set.
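
A minimal Keras training setup reflecting these choices might look like the following; the callback settings mirror the stated schedule, while the batch size, checkpoint filename, and data variable names are assumptions:

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Adam, initial learning rate 0.001
    loss='binary_crossentropy',  # averaged binary cross-entropy over the 6 labels
)

callbacks = [
    # Reduce the learning rate by a factor of 10 after 7 epochs without improvement
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=7),
    # Keep the checkpoint with the lowest validation loss for final evaluation
    tf.keras.callbacks.ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
]

# x_train/y_train and x_val/y_val are hypothetical (N, 4096, 12) inputs and (N, 6) multi-hot labels
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=32, callbacks=callbacks)

Note that He initialization and zero biases are already the behavior of the architecture sketch above (he_normal kernels; Keras initializes biases to zero by default).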

Evaluation Metrics:

Performance was primarily assessed using:

  • F1 Score: The harmonic mean of precision and recall, chosen for its robustness in scenarios with potential class imbalance: $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  • Precision (PPV): $\frac{TP}{TP + FP}$
  • Recall (Sensitivity): $\frac{TP}{TP + FN}$
  • Specificity: $\frac{TN}{TN + FP}$

Thresholding: For each abnormality, the optimal decision threshold (converting the sigmoid output probability to a binary classification) was determined by maximizing the F1 score on the independent test set.
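
Per-class threshold selection by F1 maximization could be implemented as in the following sketch; the array names are hypothetical:

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_thresholds(y_true, y_prob):
    """Pick, per class, the probability threshold that maximizes the F1 score.

    y_true: (N, C) binary ground truth; y_prob: (N, C) sigmoid outputs.
    """
    thresholds = []
    for j in range(y_true.shape[1]):
        precision, recall, thr = precision_recall_curve(y_true[:, j], y_prob[:, j])
        # F1 at every candidate threshold (the final precision/recall point has no threshold)
        f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
        thresholds.append(thr[np.argmax(f1)])
    return np.array(thresholds)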

Comparison: The DNN's performance was benchmarked against human evaluators: two 4th-year cardiology residents, two 3rd-year emergency residents, and two 5th-year medical students. Each human evaluated half of the test set.

Stability and Robustness: Model stability was assessed by training 10 models with different random initializations and evaluating the distribution of performance metrics (e.g., Micro Average Precision). Robustness was checked using alternative data splits (random 90/5/5, patient-stratified 90/5/5, chronological 90/5/5).
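
The stability check can be sketched as a loop over random seeds; build_and_train is a hypothetical helper wrapping the architecture and training steps above:

import numpy as np
import tensorflow as tf
from sklearn.metrics import average_precision_score

scores = []
for seed in range(10):
    tf.keras.utils.set_random_seed(seed)  # different random initialization per run
    model = build_and_train(x_train, y_train, x_val, y_val)  # hypothetical helper
    y_prob = model.predict(x_test)
    # Micro average precision over all classes, as in the paper's stability analysis
    scores.append(average_precision_score(y_test, y_prob, average='micro'))
print(np.mean(scores), np.std(scores))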

Error Analysis: A cardiologist reviewed misclassifications made by both the DNN and the human evaluators, categorizing errors into types such as measurement errors, noise-related errors, conceptual errors, and attention errors.

Results and Clinical Implications

The paper reported strong performance for the DNN in classifying the six targeted abnormalities: First-degree Atrioventricular Block (1dAVb), Right Bundle Branch Block (RBBB), Left Bundle Branch Block (LBBB), Sinus Bradycardia (SB), Atrial Fibrillation (AF), and Sinus Tachycardia (ST).

Key Findings:

  • The DNN achieved F1 scores exceeding 0.80 for all six abnormalities on the independent test set.
  • Specificity was consistently high, exceeding 0.99 for all classes, indicating a low false-positive rate.
  • When compared using the F1 score, the DNN outperformed the cardiology residents, emergency residents, and medical students for the evaluated abnormalities.
  • Inter-rater agreement between the two cardiologist annotators for the test set was substantial (median Kappa = 0.86). Agreement between the DNN and cardiologists was also high.
  • Error analysis indicated that the DNN was less susceptible to errors caused by noise in the ECG signal compared to human evaluators.

Clinical and Practical Implications:

  • Automation Potential: The high accuracy, especially the high specificity, suggests feasibility for automating the initial interpretation of a large volume of S12L-ECGs, potentially alleviating clinician workload.
  • Enhanced Access: Such technology could significantly improve access to reliable ECG interpretation in primary care settings, emergency departments, and resource-limited regions lacking immediate access to cardiologists.
  • Decision Support: The DNN can serve as a valuable decision support tool, flagging abnormalities for confirmation or further review by human experts, potentially reducing diagnostic errors, particularly by less experienced personnel.
  • End-to-End Learning: The success validates the end-to-end learning paradigm for S12L-ECGs, where the network learns discriminative features directly from raw signal data, bypassing traditional manual feature extraction steps.
  • Limitations: The paper was limited to six common abnormalities. Performance on other important conditions (e.g., myocardial infarction subtypes, hypertrophy) and the classification of "normal" ECGs requires further investigation. Integration into real-time clinical workflows necessitates prospective validation. Expert oversight remains crucial for complex or ambiguous cases.

Implementation Considerations

Implementing a similar system involves several key aspects:

  • Data Requirements: Access to a large, well-annotated dataset of 12-lead ECGs is crucial. The labeling strategy used in the paper highlights the complexities and potential solutions for generating ground truth at scale.
  • Computational Resources: Training deep models like ResNets requires significant GPU resources. However, the paper suggests that optimized architectures for 12-lead ECGs might be feasible without excessively large models. Inference, once trained, is typically less demanding.
  • Preprocessing Pipeline: A robust preprocessing pipeline (resampling, padding, potentially filtering) is essential for consistent model input.
  • Frameworks: Standard deep learning frameworks like TensorFlow or PyTorch can be used to implement the ResNet architecture. Libraries for signal processing (e.g., SciPy, WFDB) are needed for data handling.
  • Multi-Label Classification: The output layer and loss function must be configured for multi-label classification (Sigmoid activation, binary cross-entropy loss).
  • Validation and Thresholding: Careful validation on an independent test set and appropriate threshold selection based on desired performance characteristics (e.g., maximizing F1, or prioritizing sensitivity/specificity depending on the clinical application) are critical deployment steps.
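
As a final step, the reported per-class metrics can be computed from thresholded predictions with scikit-learn; this is a sketch in which y_true, y_prob, and thresholds are hypothetical arrays produced by the steps above:

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_pred = (y_prob >= thresholds).astype(int)  # apply per-class thresholds

for j, name in enumerate(['1dAVb', 'RBBB', 'LBBB', 'SB', 'AF', 'ST']):
    tn, fp, fn, tp = confusion_matrix(y_true[:, j], y_pred[:, j]).ravel()
    print(name,
          'F1={:.3f}'.format(f1_score(y_true[:, j], y_pred[:, j])),
          'precision={:.3f}'.format(precision_score(y_true[:, j], y_pred[:, j])),
          'recall={:.3f}'.format(recall_score(y_true[:, j], y_pred[:, j])),
          'specificity={:.3f}'.format(tn / (tn + fp)))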

This research demonstrates the successful application of deep learning, specifically ResNet architectures, to the automated multi-label classification of abnormalities in standard 12-lead ECGs using a large clinical dataset. The high performance achieved, particularly the specificity and favorable comparison to resident physicians on F1 scores, underscores the potential of this technology as a valuable tool in clinical workflows, particularly for improving diagnostic access and efficiency.