GAF-FusionNet: Multimodal ECG Analysis
- The paper presents a dual-branch network that combines 1D temporal signals and GAF-transformed spatial features through a dual-layer split-attention mechanism.
- Temporal dynamics are captured with 1D CNN and bidirectional LSTM, while spatial patterns are extracted using a 2D CNN on GAF images.
- Ablation studies confirm that the integrated attention and dual-modal fusion significantly boost ECG classification accuracy on multiple benchmark datasets.
GAF-FusionNet is a multimodal deep learning framework for electrocardiogram (ECG) analysis that integrates both temporal and spatial information by jointly processing the raw 1D ECG signal and a Gramian Angular Field (GAF) image encoding of the same segment. The network architecture relies on a dual-branch system: one branch applies 1D convolutional and recurrent neural layers to the raw signal, while the other processes a GAF-transformed image with a 2D CNN. Feature fusion is accomplished via a dual-layer cross-channel split-attention mechanism, enabling adaptive inter- and intra-modal integration for improved classification performance across diverse ECG datasets (Qin et al., 2024).
1. Background and Motivation
Conventional deep learning approaches for ECG classification are typically unimodal, relying either on the raw time-series representation or on simple feature-fusion strategies. Many do not fully exploit the complementary temporal and spatial features that may be distributed across different representations of the same ECG segment. GAF-FusionNet addresses these challenges by encoding each ECG window as both a normalized time series and its GAF image. This dual representation allows the model to leverage fine-grained temporal patterns (such as the shape of QRS complexes) and global dynamic features (such as rhythm regularity), with the GAF providing a 2D structure that makes temporal dependencies more accessible to standard CNN architectures (Qin et al., 2024).
2. Gramian Angular Field (GAF) Transformation
The GAF transformation converts a 1D time series $X = \{x_1, \dots, x_T\}$ into a 2D image. The transformation comprises three stages:
- Rescaling: Each value is normalized to $[-1, 1]$:
  $$\tilde{x}_i = \frac{(x_i - \max(X)) + (x_i - \min(X))}{\max(X) - \min(X)}$$
- Angular Encoding: The normalized value is mapped to an angle:
  $$\phi_i = \arccos(\tilde{x}_i), \quad \tilde{x}_i \in [-1, 1]$$
- Gramian Matrix Computation:
  - Gramian Summation Field (used in GAF-FusionNet): $G^{S}_{ij} = \cos(\phi_i + \phi_j)$
  - Gramian Difference Field (alternative, not used in the current implementation): $G^{D}_{ij} = \sin(\phi_i - \phi_j)$
The matrix $G^{S} \in \mathbb{R}^{T \times T}$ serves as a 2D image input, preserving global and local temporal structure for the spatial-branch CNN (Qin et al., 2024).
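The three stages can be sketched in NumPy. The $[-1, 1]$ rescaling and the summation/difference fields follow the standard GAF definitions; the `gaf` helper name and the `kind` parameter are illustrative, not from the paper:

```python
import numpy as np

def gaf(x, kind="summation"):
    """Encode a 1D series as a Gramian Angular Field image.

    Rescales x to [-1, 1], maps each value to an angle via arccos,
    then builds the summation field cos(phi_i + phi_j) or the
    difference field sin(phi_i - phi_j).
    """
    x = np.asarray(x, dtype=float)
    # Stage 1: rescale to [-1, 1] so arccos is well defined.
    x_min, x_max = x.min(), x.max()
    x_tilde = (2 * x - x_max - x_min) / (x_max - x_min)
    x_tilde = np.clip(x_tilde, -1.0, 1.0)  # guard against rounding error
    # Stage 2: angular encoding.
    phi = np.arccos(x_tilde)
    # Stage 3: Gramian matrix (pairwise trigonometric sums/differences).
    if kind == "summation":
        return np.cos(phi[:, None] + phi[None, :])  # GASF
    return np.sin(phi[:, None] - phi[None, :])      # GADF

# A length-T series yields a T x T image for the spatial branch.
g = gaf(np.sin(np.linspace(0, 2 * np.pi, 64)))
```

Note that the GASF is symmetric by construction, which is why a single channel suffices as CNN input.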
3. Network Architecture and Feature Fusion
GAF-FusionNet consists of parallel temporal and spatial branches followed by dual-layer cross-channel split attention for feature fusion.
Temporal Branch:
- Input: raw ECG segment $x \in \mathbb{R}^{T}$.
- Processing: Several 1D convolutional layers with ReLU nonlinearity, followed by a bidirectional LSTM and global average pooling, producing a feature vector $f_t$.
Spatial Branch:
- Input: GAF image $G^{S} \in \mathbb{R}^{T \times T}$.
- Processing: 2D CNN (ResNet-34 or custom), ending with global average pooling to produce a feature vector $f_s$.
Dual-Layer Split-Attention Fusion:
- Layer 1: Intra-modality self-attention refines each branch's features $f_t$ and $f_s$:
  $$\tilde{f}_m = \mathrm{softmax}\!\left(\frac{Q_m K_m^{\top}}{\sqrt{d}}\right) V_m, \quad m \in \{t, s\}$$
- Layer 2: Cross-modality attention, in which each modality queries the other:
  $$\hat{f}_t = \mathrm{softmax}\!\left(\frac{Q_t K_s^{\top}}{\sqrt{d}}\right) V_s, \qquad \hat{f}_s = \mathrm{softmax}\!\left(\frac{Q_s K_t^{\top}}{\sqrt{d}}\right) V_t$$
- Fused outputs:
  $$z = \left[\hat{f}_t \,\|\, \hat{f}_s\right]$$
- Classification: the concatenated vector $z$ is passed to an MLP classifier with softmax for the final prediction.
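The two fusion layers can be sketched in NumPy using generic scaled dot-product attention. The channel count, feature dimension, and the use of the raw branch features as queries, keys, and values are illustrative assumptions, not the paper's exact parameterization (which would include learned projections):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # feature dimension (illustrative)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Stand-ins for the branch outputs, treated as sequences of channel
# vectors (n_channels x d); shapes are illustrative, not the paper's.
f_t = rng.standard_normal((8, d))  # temporal-branch features
f_s = rng.standard_normal((8, d))  # spatial (GAF) branch features

# Layer 1: intra-modality self-attention refines each branch.
f_t1 = attend(f_t, f_t, f_t)
f_s1 = attend(f_s, f_s, f_s)

# Layer 2: cross-modality attention lets each branch query the other.
f_t2 = attend(f_t1, f_s1, f_s1)
f_s2 = attend(f_s1, f_t1, f_t1)

# Fusion: pool each branch and concatenate before the MLP head.
z = np.concatenate([f_t2.mean(axis=0), f_s2.mean(axis=0)])
```

The cross-attention step is what distinguishes this fusion from naive concatenation: each modality's representation is re-weighted by its affinity with the other modality before the classifier sees it.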
This design enables nuanced modeling of both within-modality dependencies and inter-modality interactions, advancing beyond naive concatenation or averaging (Qin et al., 2024).
4. Training Methodology and Datasets
The network is trained end-to-end on standard ECG benchmarks (ECG200, ECG5000, MIT-BIH Arrhythmia Database) with the following protocol:
- Preprocessing: Butterworth bandpass filtering, z-score normalization, segmentation into windows.
- Optimizer: Adam with cosine-annealing learning rate scheduling, batch size of 64.
- Spatial CNN Backbone: ResNet-34 (pretrained on ImageNet).
- Regularization and Augmentation: Early stopping based on validation loss; data augmentation arises from window overlap.
- Loss Function: Cross-entropy, with categorical labels.
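The normalization and segmentation steps above can be sketched as follows. The Butterworth filtering stage is omitted here; the window length, stride, and `preprocess` helper are illustrative choices, with the overlap between consecutive windows providing the augmentation mentioned above:

```python
import numpy as np

def preprocess(signal, win=360, stride=180):
    """Z-score normalize an ECG record and cut it into overlapping
    fixed-length windows (stride < win yields overlap-based
    augmentation). Window length and stride are illustrative."""
    x = np.asarray(signal, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)  # z-score normalization
    n = 1 + (len(x) - win) // stride
    return np.stack([x[i * stride : i * stride + win] for i in range(n)])

# e.g. 10 s of signal at a 360 Hz sampling rate (as in MIT-BIH).
windows = preprocess(np.random.randn(3600))
```

Each window would then be fed to the temporal branch directly and, via the GAF transform, to the spatial branch.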
GAF-FusionNet demonstrates robust generalization across small to large datasets, consistently outperforming state-of-the-art methods in classification accuracy (ECG200: 94.5%, ECG5000: 96.9%, MIT-BIH: 99.6%) (Qin et al., 2024).
5. Empirical Performance and Ablation Analysis
Comparative evaluation demonstrates that GAF-FusionNet outperforms leading models such as LSTM-FCN, Informer, Attention-CNN, and Multi-Scale CNN on all standard ECG datasets. Ablation studies reveal:
- Removing dual-layer attention reduces MIT-BIH accuracy by 1.8%.
- Removing the cross-channel module reduces accuracy by 1.5%.
- Using only the time series or only the GAF branch reduces accuracy by 2.6% and 2.1%, respectively.
| Method | ECG200 Acc. | ECG5000 Acc. | MIT-BIH Acc. |
|---|---|---|---|
| DNN (raw) | 88.5 % | 93.2 % | 95.7 % |
| LSTM-FCN | 91.0 % | 94.1 % | 96.3 % |
| Informer | 91.5 % | 94.8 % | 97.1 % |
| Attention-CNN | 92.0 % | 95.3 % | 97.5 % |
| Multi-Scale CNN | 92.5 % | 95.7 % | 97.8 % |
| GAF-FusionNet | 94.5 % | 96.9 % | 99.6 % |
Performance gains trace directly to the hybrid exploitation of GAF-based spatial information, temporal encoding, and adaptive attention-based feature fusion (Qin et al., 2024).
6. Limitations and Future Directions
Despite superior benchmark performance, GAF-FusionNet currently faces two primary limitations:
- Generality to Clinical Settings: Experiments are limited to public benchmarks. Extending the model to real-world clinical ECG datasets—characterized by broader noise/artifact distributions—remains an open challenge.
- Computational Overhead: The spatial branch (2D CNN on GAF images) and the dual-layer attention module introduce additional computational cost, particularly in large-scale deployment.
Future research directions include validation on prospective clinical and wearable-device ECG streams, exploration of lightweight attention and pruning techniques, and interpretability tools such as Grad-CAM and attention visualizations to facilitate clinical adoption (Qin et al., 2024).