Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis

Published 2 Apr 2026 in cs.MM | (2604.01498v1)

Abstract: Recent ECG–language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose SCAR, a robust ECG–language pretraining framework for Semantic Compensation via Adversarial Removal. SCAR improves robustness by explicitly training the model to remain semantically aligned under semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce the Counterfactual Missingness Resolution Score (CMRS), which quantifies how well features preserve diagnostic semantics under missingness. Experiments on 6 datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.


Authors (4)

Summary

The paper introduces SCAR, a framework that uses adversarial masking to remove key ECG tokens, simulating critical missing data.
It utilizes a semantically supervised adaptive selector to reweight remaining signals, ensuring robust cross-modal semantic alignment.
Experiments on multiple datasets show AUROC and CMRS improvements under both random and adversarial missingness, indicating enhanced robustness.

Problem Setting and Motivation

Recent works on ECG–language pretraining utilize joint modeling of cardiac signals and clinical text to yield models capable of zero-shot diagnosis, reducing reliance on annotated labels and supporting cross-dataset transfer. However, these methods typically assume fully observed ECG input, disregarding clinically common scenarios where diagnostically critical leads or temporal segments are missing due to artifacts or acquisition faults. As depicted in Figure 1, random or mild missingness often induces only limited semantic drift, but the removal of primary diagnostic evidence precipitates severe misalignment in the shared representation space.

Figure 1: Motivation for robust ECG-language alignment under diagnostically critical missingness.

This problem necessitates new approaches explicitly designed to maintain cross-modal semantic alignment even when the cues supporting a given diagnosis are partially or severely missing.

SCAR: Adversarial Masking and Semantic Compensation

The core contribution is SCAR (Semantic Compensation via Adversarial Removal), a robust pretraining framework that enforces semantic alignment between ECG signals and clinical text under adversarial missingness. SCAR incorporates two key mechanisms:

Differentiable Adversarial Masker: During pretraining, instead of applying naive random mask corruption, a learned adversarial masker (parameterized via a Gumbel–Sigmoid relaxation for differentiability) aggressively removes those spatio-temporal ECG tokens that are most alignment-critical to the diagnostic semantics, subject to a fixed masking budget.
Semantically Supervised Adaptive Selector: After adversarial masking, an adaptive selector reweights the remaining ECG tokens, promoting the aggregation of secondary but still discriminative morphological cues. This selector is not conditioned on text at test time; its parameters are optimized via report-level contrastive supervision and full-view consistency alignment.
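
The two mechanisms above can be illustrated with a minimal NumPy sketch of the forward pass only. It assumes per-token removal logits and selector scores are already available; the function names, the top-k way of enforcing the budget, the temperature `tau`, and the toy shapes are all illustrative assumptions, not the paper's implementation. In actual training the hard top-k step would need a differentiable surrogate (for example a straight-through estimator), which is not shown here.

```python
import numpy as np

def gumbel_sigmoid(logits, tau, rng):
    """Relaxed Bernoulli sample in (0, 1): sigmoid((logits + logistic noise) / tau)."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)  # difference of two Gumbel samples is Logistic(0, 1)
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))

def budgeted_removal(removal_logits, budget, tau=0.5, rng=None):
    """Return a boolean array that is True for the `budget` tokens chosen for removal."""
    rng = np.random.default_rng(0) if rng is None else rng
    soft = gumbel_sigmoid(removal_logits, tau, rng)
    removed = np.zeros(removal_logits.shape, dtype=bool)
    removed[np.argsort(-soft)[:budget]] = True  # keep only the top-`budget` removal scores
    return removed

def select_and_pool(tokens, selector_scores, removed):
    """Softmax-reweight the visible tokens only and pool them into one embedding."""
    scores = np.where(removed, -np.inf, selector_scores)
    scores = scores - scores[~removed].max()  # numerical stability
    weights = np.exp(scores)                  # removed tokens get weight exactly 0
    weights /= weights.sum()
    return weights @ tokens, weights

# Toy example: 8 spatio-temporal tokens with 4-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
removal_logits = rng.normal(size=8)   # would come from the masker network
selector_scores = rng.normal(size=8)  # would come from the selector network
removed = budgeted_removal(removal_logits, budget=3, rng=rng)
embedding, weights = select_and_pool(tokens, selector_scores, removed)
```

Here the masker's budget is expressed as a fixed count of removed tokens, and the selector never assigns weight to a removed token, so the pooled embedding is built from visible evidence only.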

The joint training objective is formulated as a constrained min–max game: the masker maximizes cross-modal misalignment, while the ECG encoder and selector minimize it by compensating with available evidence. The report encoder provides global semantic targets via contrastive learning, and a consistency loss enforces that the masked-view ECG embedding remains close to the full-view embedding.
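
One way to write this game down is sketched below. All symbols are assumptions introduced for illustration: $f_\theta$ is the ECG encoder together with the selector, $g$ the report encoder, $m_\psi(x) \in [0,1]^N$ the masker's removal mask over $N$ tokens, $\rho$ the masking budget, and $\lambda$ a consistency weight. The summary does not state whether the masker also sees the consistency term, so the two players are written separately.

```latex
\tilde{x} = x \odot \bigl(1 - m_\psi(x)\bigr), \qquad
\frac{1}{N}\sum_{i=1}^{N} m_\psi(x)_i \le \rho

\text{masker:}\quad
\max_{\psi}\; \mathbb{E}_{(x,t)}\,
  \mathcal{L}_{\mathrm{align}}\bigl(f_\theta(\tilde{x}),\, g(t)\bigr)

\text{encoder and selector:}\quad
\min_{\theta}\; \mathbb{E}_{(x,t)}\Bigl[
  \mathcal{L}_{\mathrm{align}}\bigl(f_\theta(\tilde{x}),\, g(t)\bigr)
  + \lambda\, \mathcal{L}_{\mathrm{cons}}\bigl(f_\theta(\tilde{x}),\, f_\theta(x)\bigr)
\Bigr]
```

Here $\mathcal{L}_{\mathrm{align}}$ stands for the report-level contrastive loss and $\mathcal{L}_{\mathrm{cons}}$ for the masked-view versus full-view consistency loss described above.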

Figure 2: SCAR architecture employs adversarial masking of critical tokens and adaptive aggregation for robust ECG–text semantics.

Experimental Setup and Metrics

SCAR is pretrained on MIMIC-IV-ECG (800K+ records) and evaluated on PTB-XL, CPSC2018, and Chapman–Shaoxing–Ningbo datasets under two transfer protocols: zero-shot classification (prompt-based inference) and linear probing. During both training and evaluation, missingness is simulated along both the lead and temporal axes (with random and adversarial/hard masking schemes).
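
As a rough illustration of the random scheme, joint lead and temporal missingness can be simulated by zeroing out a few leads and one contiguous time segment. The function name, the zero-fill convention, the single-segment choice, and the 12-lead, 5000-sample shape are assumptions for this sketch; the hard (adversarial) scheme depends on a trained masker and is not shown.

```python
import numpy as np

def simulate_missingness(ecg, n_missing_leads, segment_len, rng):
    """Zero out random leads and one random contiguous time segment.

    ecg: array of shape (leads, samples).
    Returns the masked copy and a boolean visibility mask of the same shape.
    """
    n_leads, n_samples = ecg.shape
    visible = np.ones_like(ecg, dtype=bool)

    # Lead-axis missingness: drop whole leads, e.g. detached electrodes.
    dropped = rng.choice(n_leads, size=n_missing_leads, replace=False)
    visible[dropped, :] = False

    # Temporal-axis missingness: drop one contiguous window across all leads.
    start = rng.integers(0, n_samples - segment_len + 1)
    visible[:, start:start + segment_len] = False

    return np.where(visible, ecg, 0.0), visible

rng = np.random.default_rng(0)
ecg = rng.normal(size=(12, 5000))  # assumed: 12 leads, 10 s at 500 Hz
masked, visible = simulate_missingness(ecg, n_missing_leads=3, segment_len=1000, rng=rng)
```

Returning the visibility mask alongside the masked signal lets a downstream encoder distinguish missing samples from true zero-valued samples.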

As standard metrics can be misleading under missingness, the authors introduce the Counterfactual Missingness Resolution Score (CMRS)—a semantic robustness metric quantifying preservation of oracle (full-view) diagnostic semantics under missingness, using predictions from an independent strong full-view reference model as privileged targets.
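
The exact CMRS formula is not given in this summary, so the sketch below is not CMRS. It only illustrates the general idea of scoring masked-view predictions against a reference model's full-view predictions rather than against ground-truth labels. The function name, the thresholded agreement measure, and the toy probabilities are hypothetical.

```python
import numpy as np

def full_view_agreement(masked_probs, reference_probs, threshold=0.5):
    """Fraction of (record, label) decisions where the masked-view model
    agrees with the full-view reference model. Hypothetical proxy, not CMRS."""
    masked_dec = np.asarray(masked_probs) >= threshold
    reference_dec = np.asarray(reference_probs) >= threshold
    return float((masked_dec == reference_dec).mean())

# Toy multi-label probabilities for 3 records and 2 diagnoses (made-up values).
reference = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]])  # reference model, full view
masked = np.array([[0.8, 0.3], [0.6, 0.7], [0.4, 0.9]])     # evaluated model, masked view
score = full_view_agreement(masked, reference)  # 4 of 6 decisions agree
```

A score of this kind can stay high even when ground-truth accuracy is imperfect, because it measures how much of the full-view diagnostic reading survives the masking.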

Results

Linear Probing

SCAR shows clear improvements in transfer performance, especially under low-label regimes (1%, 10% supervision), achieving up to a 13.24-point AUROC increase over the previous state-of-the-art (MERL, MELP) on PTBXL-Rhythm and similar margins on CPSC2018 and CSN.

Zero-Shot and Robustness Evaluation

Under both random and hard (adversarial) missingness, SCAR consistently surpasses prior baselines:

Under random missingness, SCAR achieves AUROC/CMRS of 88.82/79.68 (PTBXL-Rhythm) and 78.08/80.25 (CSN), a 5–15 point improvement over MELP/MERL.
Under hard missingness, where primary evidence is systematically ablated, SCAR’s advantage grows: e.g., AUROC/CMRS of 83.12/68.45 on PTBXL-Rhythm, with prior baselines generally declining more sharply.

These robustness gains are particularly evident in CMRS, where SCAR often doubles the baseline score, indicating that its embeddings retain much more of the original diagnostic semantic content when input evidence is compromised.

Semantic and Token-Level Probing

Embedding visualizations (Figure 3) demonstrate that SCAR produces globally more discriminative clusters and improved inter-class separability under missingness compared to MELP. Token-level importance maps (Figure 4) show that, after adversarial removal, the selector appropriately reallocates weight to secondary morphological regions, verifying the effectiveness of the semantically supervised adaptive compensation mechanism.

Figure 3: SCAR’s representations yield more compact intra-class clusters and greater class separation under missingness.

Figure 4: The model compensates for adversarially masked primary evidence by reallocating attention to the remaining informative tokens.

Ablations and Analysis

Ablation studies validate that each component—adversarial masking, semantic consistency regularization, and adaptive selection—contributes to robustness. Variants omitting adversarial masking or adaptive selection see significant drops (up to 50 points in CMRS under hard missingness).

Further analyses show that the method’s robustness generalizes to both lead and temporal missingness, is stable across sensible hyperparameter settings, and incurs only moderate additional computational cost over prior ECG-language foundation models.

Practical and Theoretical Implications

SCAR reframes robust multimodal pretraining, especially in biomedical time series, by operationalizing semantic resilience not merely as performance under average case corruption, but as the preservation of semantic consistency with full-view supervision under worst-case adversarial missingness.

Practically, SCAR’s design suits deployment in clinical settings where lead dropouts or signal corruption are frequent, conferring greater reliability in label-limited and open-set situations and better supporting downstream diagnostic tasks under degraded acquisition. Its semantic robustness to adversarial, context-sensitive missingness also sets a more demanding standard for evaluating foundation models in clinical AI.

Theoretically, SCAR’s methodology points to a family of adversarial robustification/supervision paradigms that could be extended to other modalities, multiview or multimodal data, or more general settings of distributional shift with variable information loss. The notion of “semantic compensation” via adversarial missingness masks aligns naturally with research in robust representation learning and missing modality imputation, inviting further work on task-aware masking policies and adaptive inference.

Conclusion

SCAR establishes a new training and evaluation paradigm for robust multimodal alignment under partial observation in ECG–language models. By combining adversarial removal of critical evidence with semantically supervised compensatory aggregation, SCAR achieves substantial improvements in zero-shot and transfer performance under both average and adversarial missingness scenarios. Its introduction of CMRS as a semantic robustness metric complements conventional AUROC, offering a more rigorous and clinically meaningful assessment standard. This framework is a step toward practical, resilient medical AI foundation models, and motivates future extensions to other domains with complex, multimodal, and inherently incomplete signals.
