ECG Encoder Architectures
- An ECG encoder is a computational module that transforms multichannel ECG waveforms into dense latent representations using deep neural architectures.
- It employs techniques such as transformer models, CNNs, autoencoders, and contrastive learning to improve diagnostic accuracy and efficiency.
- State-of-the-art encoders improve robustness and interpretability through domain-specific preprocessing, discrete tokenization, and efficient hardware integration.
An electrocardiogram (ECG) encoder is a computational module that transforms raw multichannel ECG waveforms—typically 10-second, 12-lead signals sampled at 500 Hz—into dense latent representations suitable for downstream tasks such as automated diagnosis, report generation, classification, zero-shot retrieval, cross-modal alignment, or generative modeling. Advanced ECG encoders now incorporate deep neural architectures (1D Vision Transformers, CNNs, autoencoders, quantizers), explicit multimodal alignment, discrete tokenization, or attention mechanisms. Encoder design determines the tractability, robustness, scalability, and interpretability of modern cardiovascular AI pipelines.
1. Architectural Foundations of ECG Encoders
State-of-the-art ECG encoders deploy deep neural backbones, with two dominant classes:
- Transformer-based encoders: "ECG-Chat" applies a 12-layer 1D Vision Transformer (1D-ViT) with 0.1 s patches (patch size 50 at 500 Hz), hidden size 768, 12 self-attention heads, and feed-forward inner dimension 3072. Each raw ECG is zero-padded or truncated to 10 s, segmented, linearly projected, and passed through positional embeddings. The [CLS] token is then mapped through a frozen 2-layer adapter to a 512-dimensional latent space for alignment and downstream tasks (Zhao et al., 16 Aug 2024). "K-MERL" uses a ViT-Tiny backbone with explicit lead- and segment-tokenization for arbitrary lead inputs (Liu et al., 25 Feb 2025).
- CNN/Autoencoder-based encoders: "ASCNet-ECG" uses 1D convolutions (e.g., three blocks with kernels of size 16, progressively downsampling) followed by attention modules and skip connections for denoising (Badiger et al., 2023). VAE-based approaches segment representative beats and project them via multi-layer 1D CNNs and fully-connected heads into low-dimensional latent spaces (e.g., 25- or 30-dimensional, with KL regularization scheduling) (Harvey et al., 3 Oct 2024, Harvey et al., 31 Jul 2025, Kuznetsov et al., 2020).
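As a concrete illustration, the patch-embedding front end described for the 1D-ViT can be sketched in a few lines of numpy. The projection weights here are randomly initialized (untrained), and treating each (lead, patch) pair as one token is an assumption; the cited papers do not fully specify lead handling.

```python
import numpy as np

# Dimensions from the ECG-Chat description:
# 10 s of 12-lead ECG at 500 Hz, 0.1 s patches (patch size 50), hidden size 768.
N_LEADS, FS, DURATION = 12, 500, 10
PATCH, HIDDEN = 50, 768

rng = np.random.default_rng(0)
ecg = rng.standard_normal((N_LEADS, FS * DURATION))    # (12, 5000) raw waveform

# Segment each lead into non-overlapping 0.1 s patches.
n_patches = ecg.shape[1] // PATCH                      # 100 patches per lead
patches = ecg.reshape(N_LEADS, n_patches, PATCH)       # (12, 100, 50)
patches = patches.reshape(N_LEADS * n_patches, PATCH)  # (1200, 50) token sequence

# Linear projection into the transformer's hidden space (untrained weights).
W = rng.standard_normal((PATCH, HIDDEN)) * PATCH**-0.5
tokens = patches @ W                                   # (1200, 768)

# Prepend a [CLS] token and add positional embeddings before the encoder stack.
cls = rng.standard_normal((1, HIDDEN))
pos = rng.standard_normal((tokens.shape[0] + 1, HIDDEN)) * 0.02
x = np.concatenate([cls, tokens], axis=0) + pos        # (1201, 768) encoder input
print(x.shape)
```

In the full model, `x` would pass through the 12 self-attention layers, and the final [CLS] embedding through the frozen 2-layer adapter to the 512-dimensional alignment space.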
Most encoders augment base architectures with domain-specific processing. For example, waveform-data enhancement (WDE) appends quantitative interval and amplitude features to text for contrastive alignment (Zhao et al., 16 Aug 2024), while segment/lead masking, channel/spatial attention, and lead-specific embeddings significantly enhance robustness (Liu et al., 25 Feb 2025, Badiger et al., 2023).
2. Contrastive and Multimodal Learning Strategies
Modern ECG encoders increasingly operate in dual-encoder schemes to align ECG signals with free-text medical reports via contrastive learning objectives, primarily InfoNCE or variants:
- Contrastive InfoNCE Loss: In "ECG-Chat", normalized ECG embeddings $e_i$ and text embeddings $t_i$ within each batch of size $N$ optimize

$$\mathcal{L}_{e \to t} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(e_i \cdot t_i / \tau)}{\sum_{j=1}^{N} \exp(e_i \cdot t_j / \tau)},$$

with a symmetric text→ECG loss $\mathcal{L}_{t \to e}$ and combined objective $\mathcal{L} = \tfrac{1}{2}(\mathcal{L}_{e \to t} + \mathcal{L}_{t \to e})$. An auxiliary autoregressive captioning loss ($\mathcal{L}_{\text{cap}}$) supports report generation. WDE prevents collapse when reports repeat by injecting patch-level waveform statistics (Zhao et al., 16 Aug 2024).
- Multimodal Self-Supervision: Cross-modal frameworks "ETP" (Liu et al., 2023), "METS" (Li et al., 2023), and "K-MERL" (Liu et al., 25 Feb 2025) employ lead-aware masking, textual entity mining via LLMs, or fusion via transformer query networks, aligning semantics between cardiac signals and structured diagnosis entities.
These protocols achieve robust out-of-distribution and zero-shot transfer: "ECG-Chat" retrieval R@1 on PTB-XL (ECG→Report) reaches 64.7%, while "K-MERL" delivers +16% AUC gain in single-lead zero-shot settings compared to prior MERL (Liu et al., 25 Feb 2025, Zhao et al., 16 Aug 2024).
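The symmetric InfoNCE objective used by these dual-encoder schemes can be sketched as follows; batch size, embedding dimension, and temperature are illustrative, and the embeddings are random stand-ins for encoder outputs.

```python
import numpy as np

def info_nce(ecg_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of L2-normalized ECG/text embeddings."""
    e = ecg_emb / np.linalg.norm(ecg_emb, axis=1, keepdims=True)
    t = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = e @ t.T / tau                    # (N, N) cosine-similarity matrix

    def xent(m):
        m = m - m.max(axis=1, keepdims=True)  # numerical stability
        logp = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))        # matched pairs sit on the diagonal

    # Average the ECG->text (rows) and text->ECG (columns) directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
N, D = 8, 512
base = rng.standard_normal((N, D))
# Paired embeddings (small perturbation) vs. unrelated embeddings
loss_aligned = info_nce(base + 0.1 * rng.standard_normal((N, D)), base)
loss_random = info_nce(rng.standard_normal((N, D)), base)
print(loss_aligned < loss_random)
```

As expected, matched ECG/text pairs produce a much lower loss than random pairings, which is what drives the cross-modal alignment during pretraining.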
3. Discrete Encoding, Tokenization, and LLM Bridging
A growing paradigm transforms ECG signals into interpretable discrete representations for direct ingestion by LLMs:
- Universal language tokenization: "ECG-aBcDe" encodes key-point voltages and intervals into alternating lowercase and uppercase tokens over 26 bins per feature, yielding a universal ECG "language" for direct input to any LLM. This enables bidirectional conversion, explicit duration representation, and interpretable attention heatmaps. Cross-dataset BLEU-4 is 2.8–3.9× higher than prior baselines (Xia et al., 16 Sep 2025).
- Discrete symbolic quantization: "DiagECG" and "ECG-Byte" compress continuous ECG embeddings using finite scalar quantization (FSQ) or byte-pair encoding. FSQ packs each time step into a D-dimensional cube with L levels per dimension (e.g., D = 4, L = 16), yielding K = L^D = 65,536 ECG tokens. ECG-Byte leverages BPE (3,500 merges) for 6–12× compression, yielding a 3,756-token vocabulary directly consumed by LLMs with efficient trie-based interpretability (Yang et al., 21 Aug 2025, Han et al., 18 Dec 2024).
- LLM integration and multimodal fusion: Discrete ECG tokens extend the LLM vocabulary. Training end-to-end with the text backbone frozen and only LoRA adapters tuned enables instruction tuning, report generation, and ECG-QA, yielding high EM, BLEU, and QA metrics at reduced training time and data requirements (Han et al., 18 Dec 2024, Yang et al., 21 Aug 2025).
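A minimal sketch of the FSQ-style discretization described above, with D = 4 dimensions and L = 16 levels as cited; the tanh bounding function and uniform level placement are assumptions, since the summary does not specify them.

```python
import numpy as np

D, L = 4, 16                 # 4 dimensions, 16 levels each -> 16**4 = 65536 tokens

def fsq_encode(z):
    """Quantize a (T, D) embedding sequence to one integer token per time step.
    Each dimension is squashed to (-1, 1), rounded to one of L levels, and the
    D level indices are combined as a base-L (mixed-radix) integer."""
    levels = np.clip(np.round((np.tanh(z) + 1) / 2 * (L - 1)), 0, L - 1).astype(int)
    return levels @ (L ** np.arange(D))       # (T,) token ids in [0, L**D)

def fsq_decode(tokens):
    """Recover the quantized lattice points from token ids."""
    idx = (tokens[:, None] // (L ** np.arange(D))) % L
    return idx / (L - 1) * 2 - 1              # back to grid values in [-1, 1]

rng = np.random.default_rng(0)
z = rng.standard_normal((100, D))             # e.g. 100 encoder time steps
tokens = fsq_encode(z)
z_hat = fsq_decode(tokens)
print(tokens.shape, int(tokens.max()))
```

Because the codebook is an implicit fixed grid rather than a learned embedding table, FSQ avoids the codebook-collapse issues of VQ-style quantizers, and the resulting token ids can be appended directly to an LLM vocabulary.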
4. Robustness, Data Efficiency, and Specialized Feature Extraction
Robust signal encoding underpins encoder performance in clinical and noisy environments:
- Artefact-invariant encoding: DP encoding emits signed spikes at zero crossings of the first and second derivatives, conferring invariance to baseline drift, shift, and gain rescaling. AUC remains at 0.91 under strong artefacts, compared to 0.62 for the raw input (Shea et al., 26 Apr 2024).
- Autoencoder-derived features for small datasets: Stochastic and β-scheduled VAEs (SAE, Aβ-VAE, Cβ-VAE) reconstruct representative beats with near-signal-noise MAE (~15.7 μV), outperforming PCA and facilitating robust downstream prediction with minimal training data. Coupling SAE codes and summary ECG features gives AUROC 0.901 for reduced LVEF, nearly matching CNN-based pipelines (Harvey et al., 3 Oct 2024, Harvey et al., 31 Jul 2025).
- Knowledge-driven zero-shot transfer: Multimodal approaches ("K-MERL", "ECG-Chat") enable generalization across missing leads, unseen diagnoses, or new languages by integrating structured diagnostic knowledge during pretraining and fine-tuning, with dynamic masking, cardiac entity mining, and cross-attention fusion (Liu et al., 25 Feb 2025, Zhao et al., 16 Aug 2024).
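The artefact-invariance property of derivative-based encoding is easy to demonstrate. The sketch below emits signed spikes only at first-derivative zero crossings (a simplification of the cited DP encoding, which also uses the second derivative) and produces identical output under gain rescaling and baseline shift.

```python
import numpy as np

def dp_encode(x):
    """Emit signed spikes at first-derivative zero crossings (simplified DP
    encoding): output depends only on where the slope changes sign, not on
    amplitude or baseline."""
    s = np.sign(np.diff(x))
    crossings = np.where(np.diff(s) != 0)[0] + 1
    # +1 spike at local maxima (slope + -> -), -1 at local minima (- -> +)
    return [(int(i), int(np.sign(s[i - 1]))) for i in crossings]

t = np.linspace(0, 1, 500)
sig = np.sin(2 * np.pi * 3 * t)                # stand-in periodic waveform
distorted = 5.0 * sig + 2.5                    # gain rescaling + baseline shift
print(dp_encode(sig) == dp_encode(distorted))
```

The spike train is unchanged because scaling by a positive gain and adding a constant offset leave the sign pattern of the derivative intact, which is the mechanism behind the reported robustness to electrode artefacts.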
5. Pretraining, Augmentation, and Performance Benchmarks
Encoder pretraining protocols critically impact representation quality:
- Dataset scale: "ECG-Chat" shows retrieval R@1 scales from 0.86% at 80k pairs to 64.7% at 805k pairs; model size has less impact than data scale (Zhao et al., 16 Aug 2024).
- Augmentation optimization: Contrastive representation studies demonstrate that augmentation intensity (e.g., Gaussian noise σ ≈ 0.1–0.2, permutation of m = 4–10 segments, time-warping factor tuned per dataset) markedly improves generalizability; augmentation that is too weak or too strong harms performance (Soltanieh et al., 2022).
- Empirical results: Benchmarks span linear-probe F1/AUC, cross-modal retrieval, zero-shot classification on PTB-XL, CPSC2018, MIT-BIH, NLG metrics (BLEU-4, ROUGE-L), and computational efficiency (e.g., ECG-Byte trains 3× faster, with 48% less data than two-stage approaches). See table below for selected results.
| Encoder / Task | Downstream Metric | Value(s) |
|---|---|---|
| ECG-Chat (PTB-XL test) | ECG→Report R@1 | 64.7% |
| ECG-Chat (PTB-XL) | F1 / AUC (Super) | 72.2 / 90.59 |
| ECG-aBcDe (PTB-XL/MIMIC) | BLEU-4 (in/cross-dataset) | 42.88 / 30.76 |
| DP encoding (PTB-XL) | AUC under shift/rescaling | 0.91 |
| SAE+LGBM (LVEF pred) | AUROC | 0.901 |
| MCMA (single-lead→12) | F1 (lead I/II input) | 0.8319 / 0.7824 |
| Q-Heart (QA-EM) | Exact Match accuracy | +4% over SOTA |
| DiagECG (PTB-XL) | QA-Verify EM / BLEU-4 | 72.66% / 33.60 |
| ECG-Byte (PTB-XL) | BLEU-4 / BertScore F1 | 13.93 / 92.53 |
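The augmentation settings cited above (Gaussian noise σ ≈ 0.1–0.2, permutation of m = 4–10 segments) can be sketched as two simple view-generating transforms; the function names and the stand-in trace are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.1):
    """Additive Gaussian noise (sigma around 0.1-0.2 per the cited study)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def permute_segments(x, m=4):
    """Split the trace into m segments and shuffle their order."""
    segments = np.array_split(x, m, axis=-1)
    order = rng.permutation(m)
    return np.concatenate([segments[i] for i in order], axis=-1)

# Stand-in single-lead trace (10 s at 500 Hz)
ecg = np.sin(np.linspace(0, 20 * np.pi, 5000))
view1 = jitter(ecg, sigma=0.1)
view2 = permute_segments(ecg, m=6)
print(view1.shape, view2.shape)
```

In a contrastive pipeline, `view1` and `view2` of the same recording would form a positive pair; tuning `sigma` and `m` within the cited ranges controls how hard the positives are.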
6. Interpretability, Token-Level Reasoning, and Hardware Integration
Recent encoders enhance interpretability and practical deployment:
- Token-level attention visualization: ECG-aBcDe enables bidirectional mapping between ECG signals and their tokenized representations, allowing attention weights to reveal which segments the LLM focuses on (e.g., RR-interval tokens), closely mirroring clinical reasoning (Xia et al., 16 Sep 2025).
- Overcomplete basis learning: NRCED learns an empirically overcomplete basis in the ECG spectrogram space, enabling both forward/reverse reconstruction and unsupervised classification via basis coefficients (Banta et al., 2020).
- Efficient hardware encoding: ECG-TEM uses sub-Nyquist integrate-and-fire time encoding machines with robust pre-filtering, compressing ECG streams by 20×–40× while retaining full reconstruction and heart-rate tracking. Analog hardware prototype draws <5 mW, validating the approach for long-term wearable monitoring (Naaman et al., 22 May 2024).
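A toy integrate-and-fire time encoder illustrates the sub-Nyquist principle behind ECG-TEM: spike times, not uniform samples, carry the information. This sketch is a deliberate simplification (single threshold, positive-biased stand-in signal, no pre-filtering) rather than the cited hardware design.

```python
import numpy as np

def if_encode(x, fs, threshold):
    """Integrate-and-fire time encoding: accumulate the signal and emit a
    spike time whenever the running integral crosses the threshold, then
    reset by subtracting the threshold."""
    spikes, acc = [], 0.0
    for n, v in enumerate(x):
        acc += v / fs                 # rectangular-rule integration
        if acc >= threshold:
            spikes.append(n / fs)     # spike time in seconds
            acc -= threshold
    return spikes

fs = 500
t = np.arange(0, 10, 1 / fs)
# Positive-biased stand-in "ECG" envelope (IF encoding needs a biased input)
sig = 1.0 + 0.5 * np.sin(2 * np.pi * 1.2 * t)
spikes = if_encode(sig, fs, threshold=0.25)
print(len(sig), len(spikes))
```

The 5000 input samples collapse to a few dozen spike times whose local density tracks signal amplitude, which is the source of the reported 20×–40× compression in the hardware implementation.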
7. Trends, Limitations, and Future Directions
The field is progressing toward universal, robust, and semantically aligned ECG encoding frameworks:
- Trends: Direct integration with LLMs via universal tokenizations (Xia et al., 16 Sep 2025, Han et al., 18 Dec 2024), knowledge-driven multimodal frameworks (Zhao et al., 16 Aug 2024, Liu et al., 25 Feb 2025), artefact-invariant signal encoding (Shea et al., 26 Apr 2024), and hardware-support for ultra-low-power continuous monitoring (Naaman et al., 22 May 2024).
- Limitations: Some transformer encoders fail to capture ultra-long dependencies (> 10 s), discrete tokenizations may lose fine-grained waveform detail, and multimodal alignment requires large paired datasets. Rare pathologies and non-standard lead configurations remain challenging (Zhao et al., 16 Aug 2024, Liu et al., 25 Feb 2025).
- Future Directions: Universal interpretable ECG-language representations, continual/few-shot adaptation, graph-augmented retrieval and reasoning, integration with wearable hardware, and multi-modal fusion for comprehensive patient-centric AI diagnostics are active areas of development.
These rapid advances in ECG encoder technologies are fundamentally reshaping both clinical cardiovascular AI workflows and biomedical language modeling, bridging the gap between signal-level measurements and human-interpretable diagnostics (Zhao et al., 16 Aug 2024, Xia et al., 16 Sep 2025).