HyperAdAgFormer: Adaptive MIL Transformer
- The paper introduces HyperAdAgFormer, which uses patient-specific hypernetwork parameters for adaptive feature aggregation in MIL-based medical prediction.
- It employs a four-module architecture combining ResNet feature extraction, tabular data representation, hypernetwork parameter synthesis, and a Transformer for dynamic aggregation.
- Experimental results show improved performance with F1 of 0.570 and AUC of 0.710, underscoring the benefits of integrating clinical context into attention mechanisms.
The Hypernetwork-based Adaptive Aggregation Transformer (HyperAdAgFormer) is a multimodal deep learning framework for Multiple-instance Learning (MIL) that introduces patient-specific, tabular-conditioned adaptation in feature aggregation for medical decision prediction. HyperAdAgFormer was developed to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) based on both image data and tabular clinical variables, leveraging a hypernetwork to tailor aggregation and classification for each patient (Shiku et al., 29 Jan 2026).
1. Architectural Design
HyperAdAgFormer consists of four main modules: instance feature extraction, tabular-data representation, hypernetwork parameter synthesis, and adaptive aggregation via a Transformer.
- Instance Feature Extraction: Each patient ("bag") $X^i$ comprises cross-sectional CT image patches $\{x_j^i\}_{j=1}^{n_i}$, preprocessed by center-cropping around vessel contours, resizing to a fixed input resolution, and normalization with ImageNet statistics. Each patch is processed with a ResNet-18 (ImageNet-pretrained, truncated before its final classifier), mapping it to a $d$-dimensional feature vector $\mathbf{e}_j^i$ ($d = 512$ for ResNet-18's penultimate features).
- Tabular Data Representation: Each patient is described by a 20-dimensional vector $\mathbf{t}^i$, including variables such as age, sex, heart failure presence, and left ventricular ejection fraction. Missing values are imputed via the training-set median, and all features are z-score normalized.
- Hypernetwork: A three-layer MLP ($256$ hidden units, three output heads) receives $\mathbf{t}^i$ and emits patient-specific parameters: the Tabular-conditioned Transformation Parameters (TCTP) $\boldsymbol{\tau}^i$, the linear classifier weight $\mathbf{w}^i$, and the bias $b^i$. Initialization of the hypernetwork adheres to the scheme from Chang et al. (ICLR'20), promoting stable optimization for generated weights.
- Adaptive Aggregation Transformer (AdAgFormer): A learnable aggregation token $\mathbf{a}$ is modulated by the TCTP $\boldsymbol{\tau}^i$ to yield an adapted token $\tilde{\mathbf{a}}^i$, which is stacked with the instance features $\mathbf{e}_1^i, \dots, \mathbf{e}_{n_i}^i$. The stack is processed by a one-layer Transformer encoder (8 heads, 2-layer FFN with $4d$ hidden units, dropout 0.1, LayerNorm, residual connections). The output at the aggregation-token index is the patient's bag representation $\mathbf{z}^i$.
- Patient-specific Classification:

$$\hat{y}^i = \sigma\!\left((\mathbf{w}^i)^\top \mathbf{z}^i + b^i\right),$$

with binary output via the sigmoid function $\sigma$.
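The four modules above can be sketched end to end. The following NumPy mock-up is an illustrative sketch only, not the paper's implementation: the additive TCTP modulation, the hypernetwork head shapes, and the single-head attention (standing in for the one-layer Transformer encoder) are all assumptions, and the instance features are random stand-ins for ResNet-18 outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_tab = 512, 20  # feature dim (ResNet-18 penultimate), tabular dim

def mlp_params(sizes):
    """Random-normal parameters for a small MLP (illustrative init only)."""
    return [(rng.normal(0, 0.02, (m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def mlp(x, params):
    """Forward pass of an MLP with ReLU between layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypernetwork: tabular vector t -> (tau, w, b).
# A shared trunk with three output heads is assumed.
trunk    = mlp_params([d_tab, 256, 256])
head_tau = mlp_params([256, d])   # TCTP, modulates the aggregation token
head_w   = mlp_params([256, d])   # patient-specific classifier weight
head_b   = mlp_params([256, 1])   # patient-specific classifier bias

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(E, t, agg_token, K, V):
    """One patient's bag: E is (n_instances, d), t is (d_tab,)."""
    h = mlp(t, trunk)
    tau, w, b = mlp(h, head_tau), mlp(h, head_w), mlp(h, head_b)
    a_tilde = agg_token + tau                 # assumed additive modulation
    scores = (E @ K) @ a_tilde / np.sqrt(d)   # single-head attention scores
    attn = softmax(scores)
    z = attn @ (E @ V)                        # aggregated bag representation
    logit = w @ z + b[0]
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid probability

E = rng.normal(size=(9, d))                   # e.g. 9 CT-slice features
t = rng.normal(size=d_tab)                    # tabular clinical vector
agg_token = rng.normal(size=d)
K = rng.normal(0, 0.02, (d, d))               # key projection
V = rng.normal(0, 0.02, (d, d))               # value projection
p = forward(E, t, agg_token, K, V)
```

In the real model the key/value projections, aggregation token, and hypernetwork are trained jointly; this sketch only traces the data flow from tabular vector to patient-specific prediction.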
2. Mathematical Formulation
- Hypernetwork Mapping:

$$H(\mathbf{t}^i) = \left(\boldsymbol{\tau}^i, \mathbf{w}^i, b^i\right),$$

wherein the hypernetwork $H$ produces the TCTP $\boldsymbol{\tau}^i$, the classifier weight $\mathbf{w}^i$, and the bias $b^i$ from the tabular vector $\mathbf{t}^i$.
- Adaptive Attention: The attention score of instance $j$ to the adapted token is

$$a_j^i = \mathrm{softmax}_j\!\left(\left(\tilde{\mathbf{a}}^i\right)^\top \mathbf{K}(\mathbf{e}_j^i)\right),$$

where $\mathbf{K}(\cdot)$ is the key projection. For a single head with value projection $\mathbf{V}(\cdot)$, the updated bag feature is

$$\mathbf{z}^i = \sum_{j=1}^{n_i} a_j^i \, \mathbf{V}(\mathbf{e}_j^i),$$

though in implementation, the transformed aggregation token is read out directly.
- Loss: Binary cross-entropy over bags:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y^i \log \hat{y}^i + \left(1 - y^i\right)\log\left(1 - \hat{y}^i\right)\right],$$

where $\hat{y}^i = \sigma\!\left((\mathbf{w}^i)^\top \mathbf{z}^i + b^i\right)$ and $\sigma$ denotes the sigmoid. No additional regularization is applied beyond Adam weight decay.
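As a quick numeric check of the loss above, a minimal NumPy computation of the bag-level binary cross-entropy (the labels and probabilities here are arbitrary illustrative values):

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over bags; eps avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

y = np.array([1.0, 0.0, 1.0])   # bag labels
p = np.array([0.9, 0.2, 0.6])   # predicted probabilities
loss = bce(y, p)                # -(ln 0.9 + ln 0.8 + ln 0.6) / 3 ≈ 0.2798
```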
3. Training and Implementation Protocols
- Preprocessing: CT patches are center-cropped, resized, and normalized. Tabular variables are imputed and z-scored.
- Batching: Full-bag processing for each patient with batch size 16; all CT slices of every patient (range 9–635, mean ≈230) are included.
- Optimization: Adam optimizer with weight decay; early stopping on validation AUC with a patience of 50 epochs. Training uses five-fold cross-validation with a 3:1:1 train/val/test split.
- Forward–Backward Pass: The training process involves extracting instance features, generating patient-specific hypernetwork parameters, transformer-based feature aggregation, per-patient classification, and gradient-driven updates. No instance subsampling is used.
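The early-stopping protocol can be sketched generically. This loop and its toy AUC curve are illustrative assumptions, not the authors' code; `step_fn` stands for one training pass over all patient bags and `eval_auc_fn` for validation-set evaluation.

```python
import numpy as np

def train_with_early_stopping(step_fn, eval_auc_fn, patience=50, max_epochs=1000):
    """Early stopping on validation AUC: stop once `patience` epochs pass
    without improvement, mirroring the protocol described above."""
    best_auc, best_epoch = -np.inf, 0
    for epoch in range(max_epochs):
        step_fn(epoch)                       # one pass over all patient bags
        auc = eval_auc_fn(epoch)
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            break                            # patience exhausted
    return best_auc, best_epoch

# Toy AUC curve: steady improvement, then a noisy plateau below the peak.
rng = np.random.default_rng(1)
curve = np.concatenate([np.linspace(0.5, 0.71, 60),
                        0.65 + 0.005 * rng.standard_normal(200)])
best_auc, best_epoch = train_with_early_stopping(lambda e: None,
                                                 lambda e: curve[e])
```

With this curve the loop keeps the peak at epoch 59 and stops 50 epochs later, returning the best validation AUC seen.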
4. Experimental Results
The primary dataset comprises 493 patients, each with 9–635 CT slices. The following summarizes comparative and ablation results for bag-level F1 and AUC metrics:
| Method | F1 | AUC |
|---|---|---|
| TableMLP (tabular) | 0.464 | 0.578 |
| Output+Max pooling | 0.474 | 0.606 |
| Feature+Mean pooling | 0.508 | 0.642 |
| Feature+Max pooling | 0.479 | 0.549 |
| Feature+Attention (Ilse’18) | 0.498 | 0.659 |
| Feature+Transformer (static) | 0.544 | 0.667 |
| Concat late-fusion | 0.558 | 0.688 |
| Gated Attention Fusion | 0.547 | 0.686 |
| MultimodalTransformer | 0.521 | 0.676 |
| M3IFusion | 0.517 | 0.667 |
| HEALNet (cross-attn) | 0.492 | 0.642 |
| HyperAdAgFormer (full) | 0.570 | 0.710 |
Ablation experiments:
- Without the hypernetwork ("Feature+Transformer"): F1/AUC = 0.544/0.667
- Hypernetwork restricted to the classifier ($\mathbf{w}^i$, $b^i$ only): F1/AUC = 0.557/0.692
- Full pipeline ($\boldsymbol{\tau}^i$, $\mathbf{w}^i$, $b^i$): F1/AUC = 0.570/0.710
Qualitative attention analysis shows that for low-risk patients (e.g., high LVEF, no HF), HyperAdAgFormer focuses attention on subtle, small calcified regions that static attention modules tend to ignore.
5. Contextual Significance and Distinctive Features
HyperAdAgFormer addresses the heterogeneity in clinical practice by enabling per-patient adaptation of MIL aggregation based on extensive tabular context. The TCTP vector $\boldsymbol{\tau}^i$ modulates the aggregation token, thus shaping the instance attention profile according to patient-specific factors. Simultaneously, patient-specific classifier weights and biases ($\mathbf{w}^i$, $b^i$) allow downstream decision criteria to be tailored beyond simple static thresholds. This design contrasts with prior multimodal MIL approaches, which typically employ static pooling or fusion schemes without individualized adaptation based on clinical data.
The ablation study quantifies the incremental value of each hypernetwork output: most notably, adapting the aggregation token (via $\boldsymbol{\tau}^i$) in addition to the classifier weights is critical for the best performance.
6. Limitations and Future Research Directions
HyperAdAgFormer presently relies on complete tabular data with missing variables handled by median imputation only. A plausible implication is that more principled probabilistic handling of missingness (e.g., variational imputation) could further improve robustness and generalization, especially in real-world electronic health record scenarios. The model employs a shallow (single-layer) Transformer and a basic hypernetwork MLP, leaving the exploration of deeper/larger architectures unaddressed. The private dataset (493 patients) limits broad external validation, necessitating future studies on larger, multi-center cohorts. Future work is planned to extend HyperAdAgFormer with explicit modeling or marginalization of missing tabular features and to introduce patient-specific uncertainty quantification in both aggregation and prediction (Shiku et al., 29 Jan 2026).