HEART-ViT: Vision Transformers for Cardiac Analysis

Updated 30 December 2025
  • HEART-ViT is a framework that integrates Vision Transformers with dynamic pruning and modal decomposition to handle both computer vision and biomedical challenges under limited data resources.
  • It employs Hessian-guided token and head pruning for adaptive inference, achieving up to 49.4% FLOPs reduction and significantly lower latency on benchmark datasets.
  • In biomedical applications, HEART-ViT uses Higher-Order DMD to reduce noise and enhance data augmentation, leading to improved cardiac pathology detection and heart failure prediction.

HEART-ViT is a family of frameworks leveraging Vision Transformers (ViTs), modal decomposition (notably Higher-Order Dynamic Mode Decomposition, HODMD), loss-aware attention/token pruning, and joint self-supervised and regression objectives to address both computer vision and biomedical tasks under limited computational or data regimes. It encompasses two main branches: (1) Hessian-guided dynamic pruning for efficient and adaptive ViT inference (Uddin et al., 23 Dec 2025), and (2) physics-informed cardiac pathology recognition and heart failure prediction from echocardiography using modal decomposition and ViT-based models, with specialized augmentation and SSL strategies (Bell-Navas et al., 2024, Bell-Navas et al., 10 Apr 2025).

1. Second-Order Sensitivity-Guided Optimization of Vision Transformers

HEART-ViT introduces a unified, second-order framework for dynamic attention head and token pruning in ViTs. Unlike conventional approaches that use first-order heuristics, it estimates curvature-weighted sensitivity of intermediate activations—token embeddings and attention heads—via efficient Hessian-vector products (HVP).

Formally, for a pretrained ViT $f_\theta(x)$ with expected loss $\mathcal{L}(\theta)$, removing any intermediate activation $z$ (a token embedding or attention head) incurs a second-order loss increment:

$$\Delta\mathcal{L}_z \approx \frac{1}{2}\, z(x)^\top \mathcal{H}_z(x)\, z(x)$$

where the unnormalized sensitivity score is $S_z(x) = z(x)^\top \mathcal{H}_z(x)\, z(x)$. This framework allows explicit input-adaptive pruning under a loss budget and unifies token/head importance ranking.
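The curvature-weighted score can be illustrated on a toy quadratic loss, where the Hessian is known in closed form and the Hessian-vector product can be checked by finite differences. Everything below (the loss, dimensions, and the finite-difference HVP) is an illustrative sketch, not the paper's autodiff implementation:

```python
import numpy as np

# Toy quadratic loss L(z) = 0.5 z^T A z + b^T z, so the true Hessian is A.
rng = np.random.default_rng(0)
d = 8
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)          # symmetric positive-definite Hessian
b = rng.standard_normal(d)

def grad_L(z):
    return A @ z + b             # analytic gradient of the toy loss

def hvp(z, v, eps=1e-5):
    # Hessian-vector product via central finite differences:
    # H v ≈ (∇L(z + εv) − ∇L(z − εv)) / (2ε)
    return (grad_L(z + eps * v) - grad_L(z - eps * v)) / (2 * eps)

z = rng.standard_normal(d)       # the activation being scored
S = float(z @ hvp(z, z))         # sensitivity S_z = z^T H z
S_exact = float(z @ A @ z)       # closed-form check for the toy problem
print(S, S_exact)
```

In practice the HVP would be computed by double backward through the network, never materializing the full Hessian; the finite-difference version above only serves to verify the score on a problem where the Hessian is known.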

2. Dynamic Pruning Algorithm and Complexity Characterization

The pruning process in HEART-ViT consists of three stages:

  • Calibration: Computes expected second-order sensitivities for all heads and tokens over a small batch, derives thresholds by percentile or loss budget.
  • Inference: For each test sample, recomputes sensitivities, standardizes, selects and hard-masks top tokens/heads per layer.
  • Fine-tuning (optional): Soft gates refine gradients using standardized sensitivities, with annealing to achieve hard masks.
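The calibrate-then-mask loop above can be sketched as follows; the score distribution, token count, and thresholding details are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def standardize(s):
    return (s - s.mean()) / (s.std() + 1e-8)

# Calibration: expected sensitivities over a small batch -> keep threshold.
calib_scores = rng.gamma(2.0, 1.0, size=(32, 197))   # (batch, tokens), stand-in
keep_ratio = 0.5                                     # e.g. 50% token pruning
threshold = np.quantile(standardize(calib_scores.mean(axis=0)),
                        1.0 - keep_ratio)

# Inference: recompute per-sample scores, hard-mask tokens below threshold.
sample_scores = rng.gamma(2.0, 1.0, size=197)
mask = standardize(sample_scores) >= threshold        # True = token kept
print(mask.sum(), "of", mask.size, "tokens kept")
```

Because the threshold is fixed at calibration but scores are recomputed per sample, the number of surviving tokens adapts to each input, which is the input-adaptive behavior described above.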

Table: Representative Inference and Calibration Steps

| Stage | Inputs/Operations | Outputs |
|---|---|---|
| Calibration | Batch $\mathcal{B}$, HVP per token/head | Sensitivity map |
| Inference | Single sample, cache activations, HVP | Hard masks |
| Fine-tuning | Training minibatch, soft gating over sensitivities | Updated gates |

Pruning tokens yields a quadratic reduction in FLOPs (scaling with the token keep ratio $\alpha^2$), while head pruning achieves linear savings (scaling with the head keep ratio $\beta$), indicating token pruning dominates computational and latency improvements.
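The quadratic-vs-linear split can be seen in a back-of-envelope FLOPs model of one ViT block. The cost constants and the accounting below are simplifying assumptions (they ignore softmax, norms, and biases), not the paper's exact numbers:

```python
# Rough per-block FLOPs for ViT-B/16-like dimensions.
n, d, heads = 197, 768, 12       # tokens, embedding dim, attention heads

def block_flops(n_kept, heads_kept):
    beta = heads_kept / heads                         # linear head savings
    attn = beta * (4 * n_kept * d * d                 # Q, K, V, out projections
                   + 2 * n_kept * n_kept * d)         # quadratic-in-tokens part
    mlp = 8 * n_kept * d * d                          # two 4x-expansion linears
    return attn + mlp

dense = block_flops(n, heads)
pruned = block_flops(int(0.5 * n), heads // 2)        # 50/50 token/head pruning
print(f"FLOPs kept: {pruned / dense:.2%}")
```

Only the $n^2 d$ attention term shrinks quadratically with the token keep ratio; the projections and MLP shrink linearly, which is why token pruning drives most of the savings.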

3. Experimental Benchmarks and Edge-Device Deployment

Extensive evaluation on ImageNet-100 and ImageNet-1K using ViT-B/16 and DeiT-B/16 demonstrates that HEART-ViT achieves up to 49.4% FLOPs reduction, 36% lower latency, and 46% higher throughput compared to dense baselines.

  • Accuracy is preserved or exceeded after fine-tuning; e.g., for symmetric 50/50 pruning on ImageNet-100, accuracy after fine-tuning reaches 91.00% vs. the 89.83% dense baseline.
  • Asymmetric schedules (e.g., 40% token, 60% head pruning) yield superior accuracy–efficiency trade-offs.
  • On Jetson AGX Orin, latency is reduced substantially, from 23 ms for the dense baseline to 13.8 ms at 40/60 pruning, with accuracy maintained.

Table: Symmetric vs. Asymmetric Pruning, ImageNet-100, ViT-B/16

| Strategy | Ratio | Baseline Acc | After Pruning | After FT | ΔFLOPs↓ | ΔAcc (Final–Base) |
|---|---|---|---|---|---|---|
| Symmetric | 50/50 | 89.83% | 85.83% | 91.00% | 49.4% | +1.17% |
| Asymmetric | 40/60 | 89.83% | 86.52% | 90.57% | 39.8% | +0.74% |

4. Physics-Informed Modal Decomposition for Cardiac Echocardiography

In biomedical applications, HEART-ViT couples Higher-Order DMD with Vision Transformers to enable discriminative analysis and prediction on scarce echocardiography datasets (Bell-Navas et al., 2024, Bell-Navas et al., 10 Apr 2025).

The HODMD module:

  • Reduces noise and reveals dominant spatio-temporal modes from frame sequences via snapshot tensor formation, HOSVD-based compression, and eigen-decomposition.
  • Augments the training pool substantially via multiple reconstructions and raw modal images. Typical parameter choices include SVD/DMD tolerances ($\epsilon_{\rm SVD} = \epsilon_{\rm DMD} = 5 \times 10^{-4}$), delay selection, and amplitude thresholding.
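The decomposition step can be illustrated with a minimal exact-DMD sketch on synthetic single-frequency data; HODMD extends this by stacking delayed snapshots and applying HOSVD compression. The synthetic snapshots, noise level, and dimensions below are illustrative assumptions, with the truncation mirroring the $\epsilon_{\rm SVD} = 5 \times 10^{-4}$ tolerance:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 4 * np.pi, 100)
a1, a2 = rng.standard_normal(50), rng.standard_normal(50)
clean = np.outer(a1, np.sin(t)) + np.outer(a2, np.cos(t))
X = clean + 1e-4 * rng.standard_normal(clean.shape)   # noisy snapshot matrix

X1, X2 = X[:, :-1], X[:, 1:]                          # time-shifted pairs
U, s, Vh = np.linalg.svd(X1, full_matrices=False)
r = int(np.sum(s / s[0] > 5e-4))                      # tolerance-based truncation
U, s, Vh = U[:, :r], s[:r], Vh[:r]
Atilde = (U.conj().T @ X2 @ Vh.conj().T) / s          # reduced linear operator
eigvals, W = np.linalg.eig(Atilde)
modes = X2 @ (Vh.conj().T / s) @ W                    # spatial DMD modes
print(r, "modes retained; |eigenvalues| =", np.abs(eigvals))
```

The tolerance discards noise-level singular values, which is the denoising effect exploited here: reconstructions from the retained modes suppress speckle while preserving the dominant periodic dynamics (eigenvalues on the unit circle for a sustained oscillation).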

ViT backbones for medical use are specialized, using modules such as Shifted Patch Tokenization (SPT) to encode local spatial context and Locality Self-Attention (LSA) to concentrate model capacity on local neighborhoods, which is critical for overcoming limited sample sizes.
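Shifted Patch Tokenization can be sketched as concatenating the input with diagonally shifted copies channel-wise before patchification, so each token carries information about its spatial neighborhood. Using `np.roll` for the shifts (instead of zero-padded cropping) and the shapes below are simplifying assumptions:

```python
import numpy as np

def spt_tokens(img, patch=4, shift=2):
    # img: (H, W, C). Concatenate img with 4 diagonally shifted copies,
    # then split into non-overlapping patches and flatten each to a token.
    shifts = [(shift, shift), (shift, -shift), (-shift, shift), (-shift, -shift)]
    stack = [img] + [np.roll(img, s, axis=(0, 1)) for s in shifts]
    x = np.concatenate(stack, axis=-1)                # (H, W, 5C)
    H, W, C5 = x.shape
    x = x.reshape(H // patch, patch, W // patch, patch, C5)
    tokens = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C5)
    return tokens                                     # (num_patches, patch²·5C)

img = np.random.default_rng(3).standard_normal((32, 32, 1))
tok = spt_tokens(img)
print(tok.shape)
```

A linear projection of these enriched tokens then feeds the Transformer; LSA complements this by masking self-token attention and learning the softmax temperature, sharpening attention on local neighbors.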

5. Self-Supervised Learning and Regression for Scarce Datasets

HEART-ViT extends with Masked Autoencoders (MAE) for joint self-supervised reconstruction and regression:

  • A single-channel input (ROI) is divided into patches, randomly masking 75% during training.
  • Shared encoder processes unmasked patches; lightweight decoder reconstructs masked data.
  • The joint loss combines MSE terms for masked reconstruction and regression (with loss weight $\alpha = 0.1$).
  • No separate pretraining is required; a standard Transformer and the AdamW optimizer with cosine decay are used.
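Under the settings above, the joint objective can be sketched as follows. All tensors are random stand-ins for encoder/decoder outputs, and treating $\alpha$ as the weight on the regression term (with the target in months) is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
num_patches, patch_dim = 64, 256
mask = rng.random(num_patches) < 0.75            # ~75% of patches masked

target = rng.standard_normal((num_patches, patch_dim))    # ground-truth patches
recon = target + 0.1 * rng.standard_normal(target.shape)  # decoder output stand-in
y_true, y_pred = 24.0, 22.5                      # hypothetical months-to-HF values

alpha = 0.1
loss_rec = np.mean((recon[mask] - target[mask]) ** 2)     # masked patches only
loss_reg = (y_pred - y_true) ** 2                          # regression MSE
loss = loss_rec + alpha * loss_reg
print(round(float(loss), 3))
```

Computing the reconstruction term on masked patches only is the standard MAE convention; the single weighted sum lets one backward pass train both heads jointly, which is why no separate pretraining stage is needed.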

This approach yields an RMSE of 4.53 months in heart failure prediction, outperforming both standard ViTs and CNNs, and achieves inference speeds of ≈9.7 ms/image (105 fps) (Bell-Navas et al., 10 Apr 2025).

6. Comparative Performance and Contextual Insights

In cardiac pathology classification (Bell-Navas et al., 2024):

  • Using HODMD-reconstructed frames, HEART-ViT attains per-image accuracy up to 61.1%, per-sequence accuracy up to 77.0%, and mean F₁ = 0.68.
  • Sequence-level fusion (average or max) improves robustness; the best per-class recalls range 0.71–0.86, demonstrating balanced detection across all four states.
  • Compared to CNNs (ResNet50-v2, Inception-v3, VGG16), HEART-ViT achieves superior accuracy and real-time prediction speeds.

A plausible implication is that modal decomposition augments data diversity and noise resilience beyond what conventional augmentation achieves, while SPT+LSA enable efficient training of compact ViT models even on sample regimes orders of magnitude below conventional benchmarks.

7. Limitations and Future Research Directions

HEART-ViT, in both its computer vision and biomedical branches, is constrained by:

  • Dependency on the computational cost of HODMD/HOSVD (potential bottleneck for real-time medical deployment).
  • Current applicability is limited to specific imaging modalities (murine PLAX views; four cardiac pathologies; three HF classes). Extension to multi-view human datasets and additional modalities (ECG, Doppler echo) is planned.
  • Statistical significance validation across larger, multi-site cohorts is outstanding.
  • Potential development includes end-to-end learnable modal filtering, efficient GPU-based DMD algorithms, and lightweight on-device implementations (Bell-Navas et al., 10 Apr 2025, Bell-Navas et al., 2024, Uddin et al., 23 Dec 2025).

In summary, HEART-ViT frameworks combine principled, loss-aware pruning and physics-based modal decomposition with ViTs and tailored SSL modules to deliver adaptive, accurate, and latency-efficient performance for both vision transformer optimization and real-time biomedical analysis under resource and data constraints.