- The paper introduces SCAR, a framework that uses adversarial masking to remove key ECG tokens, simulating critical missing data.
- It utilizes a semantically supervised adaptive selector to reweight remaining signals, ensuring robust cross-modal semantic alignment.
- Experiments on multiple datasets show significant AUROC and CMRS improvements, indicating enhanced robustness under random and adversarial missingness.
Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis
Problem Setting and Motivation
Recent work on ECG–language pretraining jointly models cardiac signals and clinical text, yielding models capable of zero-shot diagnosis that reduce reliance on annotated labels and support cross-dataset transfer. However, these methods typically assume fully observed ECG input and disregard clinically common scenarios in which diagnostically critical leads or temporal segments are missing due to artifacts or acquisition faults. As depicted in Figure 1, random or mild missingness often induces only limited semantic drift, whereas removing primary diagnostic evidence causes severe misalignment in the shared representation space.
Figure 1: Motivation for robust ECG-language alignment under diagnostically critical missingness.
This problem necessitates new approaches explicitly designed to maintain cross-modal semantic alignment even when the cues supporting a given diagnosis are partially or severely missing.
SCAR: Adversarial Masking and Semantic Compensation
The core contribution is SCAR (Semantic Compensation via Adversarial Removal), a robust pretraining framework that enforces semantic alignment between ECG signals and clinical text under adversarial missingness. SCAR incorporates two key mechanisms:
- Differentiable Adversarial Masker: During pretraining, instead of applying naive random mask corruption, a learned adversarial masker (parameterized via a Gumbel–Sigmoid relaxation for differentiability) aggressively removes those spatio-temporal ECG tokens that are most alignment-critical to the diagnostic semantics, subject to a fixed masking budget.
- Semantically Supervised Adaptive Selector: After adversarial masking, an adaptive selector reweights the remaining ECG tokens, promoting the aggregation of secondary but still discriminative morphological cues. This selector is not conditioned on text at test time; its parameters are optimized via report-level contrastive supervision and full-view consistency alignment.
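The two mechanisms above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the hard top-k budget rule, and the masked-softmax pooling are all assumptions chosen for illustration, and a real system would run this inside an autodiff framework (for example with a straight-through estimator) so that gradients reach the masker.

```python
import math
import random

def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def gumbel_sigmoid(logits, tau=1.0, rng=random):
    """Relaxed Bernoulli 'remove' probability per token (Gumbel-Sigmoid)."""
    eps = 1e-9
    probs = []
    for logit in logits:
        # Clamp uniforms away from 0 and 1 so both logs are finite.
        u1 = min(max(rng.random(), eps), 1.0 - eps)
        u2 = min(max(rng.random(), eps), 1.0 - eps)
        g1 = -math.log(-math.log(u1))
        g2 = -math.log(-math.log(u2))
        probs.append(_sigmoid((logit + g1 - g2) / tau))
    return probs

def budgeted_keep_mask(remove_probs, budget):
    """Assumed budget rule: drop the int(budget * n) most removable tokens."""
    n = len(remove_probs)
    k = int(budget * n)
    order = sorted(range(n), key=lambda i: remove_probs[i], reverse=True)
    removed = set(order[:k])
    return [0 if i in removed else 1 for i in range(n)]

def selector_pool(tokens, scores, keep):
    """Masked softmax over kept tokens, then a weighted average of their vectors."""
    kept = [i for i, flag in enumerate(keep) if flag]
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    dim = len(tokens[0])
    return [sum(exps[i] / total * tokens[i][d] for i in kept) for d in range(dim)]

# Hypothetical usage: 4 tokens, 25% masking budget.
rng = random.Random(0)
masker_logits = [100.0, -100.0, -100.0, -100.0]  # token 0 is "alignment-critical"
keep = budgeted_keep_mask(gumbel_sigmoid(masker_logits, tau=1.0, rng=rng), budget=0.25)
tokens = [[9.0, 9.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = selector_pool(tokens, scores=[0.0, 0.0, 0.0, 0.0], keep=keep)
```

In this toy setup the masker removes token 0 and the selector averages the three remaining tokens. In the framework described above the selector scores would be learned from report-level supervision rather than fixed.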
The joint training objective is formulated as a constrained min–max game: the masker maximizes cross-modal misalignment, while the ECG encoder and selector minimize it by compensating with available evidence. The report encoder provides global semantic targets via contrastive learning, and a consistency loss enforces that the masked-view ECG embedding remains close to the full-view embedding.
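One plausible way to write this objective is sketched below. The notation is assumed for illustration and does not come from the source: $f_\theta$ is the ECG encoder, $s_\psi$ the selector, $g$ the report encoder, $m_\phi(x) \in [0,1]^N$ the keep mask over $N$ tokens, $\rho$ the masking budget, and $\lambda$ the consistency weight. Whether the masker's objective also includes the consistency term is not stated here, so the sketch leaves it out.

```latex
\begin{aligned}
z_{\text{full}} &= s_\psi\big(f_\theta(x)\big), \qquad
z_{\text{mask}} = s_\psi\big(f_\theta(m_\phi(x) \odot x)\big) \\
\text{masker:}\quad
  & \max_{\phi}\; \mathcal{L}_{\text{align}}\big(z_{\text{mask}},\, g(t)\big)
  \quad \text{s.t.}\quad
  \tfrac{1}{N}\textstyle\sum_{i=1}^{N}\big(1 - m_{\phi,i}(x)\big) \le \rho \\
\text{encoder, selector:}\quad
  & \min_{\theta,\psi}\; \mathcal{L}_{\text{align}}\big(z_{\text{mask}},\, g(t)\big)
  + \lambda\, \mathcal{L}_{\text{cons}}\big(z_{\text{mask}},\, z_{\text{full}}\big)
\end{aligned}
```

Here $\mathcal{L}_{\text{align}}$ stands for the ECG–report contrastive loss and $\mathcal{L}_{\text{cons}}$ for the distance between masked-view and full-view embeddings.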
Figure 2: SCAR architecture employs adversarial masking of critical tokens and adaptive aggregation for robust ECG–text semantics.
Experimental Setup and Metrics
SCAR is pretrained on MIMIC-IV-ECG (800K+ records) and evaluated on PTB-XL, CPSC2018, and Chapman–Shaoxing–Ningbo (CSN) under two transfer protocols: zero-shot classification (prompt-based inference) and linear probing. Missingness is simulated during both training and evaluation along the lead and temporal axes, using random and adversarial (hard) masking schemes.
Because standard metrics can be misleading under missingness, the authors introduce the Counterfactual Missingness Resolution Score (CMRS), a semantic robustness metric that quantifies how well oracle (full-view) diagnostic semantics are preserved under missingness. It uses predictions from an independent, strong full-view reference model as privileged targets.
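As a rough illustration of random missingness along the two axes, the sketch below zeroes out whole leads and one temporal window. Zero-filling, the 12-lead layout, and all function names are assumptions for this example; the source does not specify how masked regions are represented. A hard (adversarial) scheme would replace the random choices with positions selected by a learned masker.

```python
import random

def mask_leads(ecg, lead_indices):
    """Zero out whole leads. `ecg` is a list of leads, each a list of samples."""
    drop = set(lead_indices)
    return [[0.0] * len(lead) if i in drop else list(lead)
            for i, lead in enumerate(ecg)]

def mask_time(ecg, start, end):
    """Zero out samples in [start, end) on every lead."""
    return [[0.0 if start <= t < end else v for t, v in enumerate(lead)]
            for lead in ecg]

def random_missingness(ecg, n_leads, span, rng=random):
    """Drop `n_leads` random leads and one random temporal window of length `span`."""
    leads = rng.sample(range(len(ecg)), n_leads)
    start = rng.randrange(0, len(ecg[0]) - span + 1)
    masked = mask_time(mask_leads(ecg, leads), start, start + span)
    return masked, leads, start

# Hypothetical usage: a 12-lead recording with 10 samples per lead.
ecg = [[1.0] * 10 for _ in range(12)]
masked, dropped_leads, window_start = random_missingness(
    ecg, n_leads=3, span=4, rng=random.Random(0)
)
```

Both helpers return new lists, so the full-view recording stays available for any comparison against the masked view.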
Results
Linear Probing
SCAR shows clear improvements in transfer performance, especially under low-label regimes (1%, 10% supervision), achieving up to a 13.24-point AUROC increase over the previous state-of-the-art (MERL, MELP) on PTBXL-Rhythm and similar margins on CPSC2018 and CSN.
Zero-Shot and Robustness Evaluation
Under both random and hard (adversarial) missingness, SCAR consistently surpasses prior baselines:
- Under random missingness, SCAR achieves AUROC/CMRS of 88.82/79.68 (PTBXL-Rhythm) and 78.08/80.25 (CSN), a 5–15 point improvement over MELP/MERL.
- Under hard missingness, where primary evidence is systematically ablated, SCAR’s advantage grows: for example, it reaches AUROC/CMRS of 83.12/68.45 on PTBXL-Rhythm, while prior baselines generally decline more sharply.
These robustness gains are particularly evident in CMRS, where SCAR often doubles the baseline score, indicating that its embeddings retain much more of the original diagnostic semantic content when input evidence is compromised.
Semantic and Token-Level Probing
Embedding visualizations (Figure 3) show that SCAR produces more discriminative clusters and better inter-class separability under missingness than MELP. Token-level importance maps (Figure 4) show that, after adversarial removal, the selector reallocates weight to secondary morphological regions, which supports the effectiveness of the semantically supervised adaptive compensation mechanism.
Figure 3: SCAR’s representations yield more compact intra-class clusters and greater class separation under missingness.
Figure 4: The model compensates for adversarially masked primary evidence by reallocating attention to preserved tokens that still carry diagnostic information.
Ablations and Analysis
Ablation studies indicate that each component (adversarial masking, semantic consistency regularization, and adaptive selection) contributes to robustness. Variants that omit adversarial masking or adaptive selection show significant drops (up to 50 points in CMRS under hard missingness).
Further analyses show that the method’s robustness generalizes to both lead and temporal missingness, is stable across sensible hyperparameter settings, and incurs only moderate additional computational cost over prior ECG-language foundation models.
Practical and Theoretical Implications
SCAR reframes robust multimodal pretraining, especially for biomedical time series, by treating semantic resilience not merely as performance under average-case corruption but as the preservation of semantic consistency with full-view supervision under worst-case adversarial missingness.
Practically, SCAR’s design suits deployment in clinical settings where lead dropouts or signal corruption are frequent. It offers greater reliability in label-limited and open-set situations and better supports downstream diagnostic tasks under degraded acquisition. Its semantic robustness to adversarial, context-sensitive missingness also raises the bar for foundation model evaluation in clinical AI.
Theoretically, SCAR’s methodology points to a family of adversarial robustification/supervision paradigms that could be extended to other modalities, multiview or multimodal data, or more general settings of distributional shift with variable information loss. The notion of “semantic compensation” via adversarial missingness masks aligns naturally with research in robust representation learning and missing modality imputation, inviting further work on task-aware masking policies and adaptive inference.
Conclusion
SCAR establishes a new training and evaluation paradigm for robust multimodal alignment under partial observation in ECG–language models. By combining adversarial removal of critical evidence with semantically supervised compensatory aggregation, SCAR achieves substantial improvements in zero-shot and transfer performance under both average-case and adversarial missingness. CMRS, introduced as a semantic robustness metric, complements conventional AUROC and offers a more rigorous and clinically meaningful assessment standard. The framework is a step toward practical, resilient medical AI foundation models and motivates future extensions to other domains with complex, multimodal, and inherently incomplete signals.