HyperAdAgFormer: Adaptive MIL Transformer
- The paper introduces HyperAdAgFormer, which uses patient-specific hypernetwork parameters for adaptive feature aggregation in MIL-based medical prediction.
- It employs a four-module architecture combining ResNet feature extraction, tabular data representation, hypernetwork parameter synthesis, and a Transformer for dynamic aggregation.
- Experimental results show improved performance with F1 of 0.570 and AUC of 0.710, underscoring the benefits of integrating clinical context into attention mechanisms.
The Hypernetwork-based Adaptive Aggregation Transformer (HyperAdAgFormer) is a multimodal deep learning framework for Multiple-instance Learning (MIL) that introduces patient-specific, tabular-conditioned adaptation in feature aggregation for medical decision prediction. HyperAdAgFormer was developed to estimate the necessity of debulking coronary artery calcifications from computed tomography (CT) based on both image data and tabular clinical variables, leveraging a hypernetwork to tailor aggregation and classification for each patient (Shiku et al., 29 Jan 2026).
1. Architectural Design
HyperAdAgFormer consists of four main modules: instance feature extraction, tabular-data representation, hypernetwork parameter synthesis, and adaptive aggregation via a Transformer.
- Instance Feature Extraction: Each patient ("bag") $X^i$ comprises cross-sectional CT image patches $\{x_j^i\}_{j=1}^{n_i}$, preprocessed by center-cropping around vessel contours, resizing to a fixed input resolution, and normalization with ImageNet statistics. Each patch is processed with a ResNet-18 (ImageNet-pretrained, truncated before its final classifier), mapping it to a $d$-dimensional feature vector $\mathbf{e}_j^i$ ($d = 512$ for ResNet-18's penultimate features).
- Tabular Data Representation: Each patient is described by a 20-dimensional vector $\mathbf{t}^i$, including variables such as age, sex, heart failure presence, and left ventricular ejection fraction. Missing values are imputed via the training-set median, and all features are z-score normalized.
- Hypernetwork: A three-layer MLP ($256$ hidden units, three output heads) receives $\mathbf{t}^i$ and emits patient-specific parameters: the Tabular-conditioned Transformation Parameters (TCTP) $\boldsymbol{\tau}^i$, the linear classifier weight $\mathbf{w}^i$, and the bias $b^i$. Initialization of the hypernetwork adheres to the scheme from Chang et al. (ICLR'20), promoting stable optimization for generated weights.
- Adaptive Aggregation Transformer (AdAgFormer): A learnable aggregation token $\mathbf{a}$ is modulated by the TCTP $\boldsymbol{\tau}^i$ to yield an adapted token $\tilde{\mathbf{a}}^i$, which is stacked with the instance features $\mathbf{e}_1^i, \dots, \mathbf{e}_{n_i}^i$. The stack is processed by a one-layer Transformer encoder (8 heads, 2-layer FFN with $4d$ hidden units, dropout 0.1, LayerNorm, residual connections). The output at the aggregation-token index is the patient's bag representation $\mathbf{z}^i$.
- Patient-specific Classification:

$$\hat{y}^i = \sigma\!\left((\mathbf{w}^i)^\top \mathbf{z}^i + b^i\right),$$

with binary output via the sigmoid function $\sigma$.
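The four modules above can be sketched end to end. The following NumPy mock-up is an illustrative sketch only, not the paper's implementation: the additive TCTP modulation, the hypernetwork head shapes, and the single-head attention (standing in for the one-layer Transformer encoder) are all assumptions, and the instance features are random stand-ins for ResNet-18 outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_tab = 512, 20  # feature dim (ResNet-18 penultimate), tabular dim

def mlp_params(sizes):
    """Random-normal parameters for a small MLP (illustrative init only)."""
    return [(rng.normal(0, 0.02, (m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def mlp(x, params):
    """Forward pass of an MLP with ReLU between layers, linear output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Hypernetwork: tabular vector t -> (tau, w, b).
# A shared trunk with three output heads is assumed.
trunk    = mlp_params([d_tab, 256, 256])
head_tau = mlp_params([256, d])   # TCTP, modulates the aggregation token
head_w   = mlp_params([256, d])   # patient-specific classifier weight
head_b   = mlp_params([256, 1])   # patient-specific classifier bias

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(E, t, agg_token, K, V):
    """One patient's bag: E is (n_instances, d), t is (d_tab,)."""
    h = mlp(t, trunk)
    tau, w, b = mlp(h, head_tau), mlp(h, head_w), mlp(h, head_b)
    a_tilde = agg_token + tau                 # assumed additive modulation
    scores = (E @ K) @ a_tilde / np.sqrt(d)   # single-head attention scores
    attn = softmax(scores)
    z = attn @ (E @ V)                        # aggregated bag representation
    logit = w @ z + b[0]
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid probability

E = rng.normal(size=(9, d))                   # e.g. 9 CT-slice features
t = rng.normal(size=d_tab)                    # tabular clinical vector
agg_token = rng.normal(size=d)
K = rng.normal(0, 0.02, (d, d))               # key projection
V = rng.normal(0, 0.02, (d, d))               # value projection
p = forward(E, t, agg_token, K, V)
```

In the real model the key/value projections, aggregation token, and hypernetwork are trained jointly; this sketch only traces the data flow from tabular vector to patient-specific prediction.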
2. Mathematical Formulation
- Hypernetwork Mapping:

$$H(\mathbf{t}^i) = \left(\boldsymbol{\tau}^i, \mathbf{w}^i, b^i\right),$$

wherein the hypernetwork $H$ produces the TCTP $\boldsymbol{\tau}^i$, the classifier weight $\mathbf{w}^i$, and the bias $b^i$ from the tabular vector $\mathbf{t}^i$.
- Adaptive Attention: The attention score of instance $j$ to the adapted token is

$$a_j^i = \mathrm{softmax}_j\!\left(\left(\tilde{\mathbf{a}}^i\right)^\top \mathbf{K}(\mathbf{e}_j^i)\right),$$

where $\mathbf{K}(\cdot)$ is the key projection. For a single head with value projection $\mathbf{V}(\cdot)$, the updated bag feature is

$$\mathbf{z}^i = \sum_{j=1}^{n_i} a_j^i \, \mathbf{V}(\mathbf{e}_j^i),$$

though in implementation, the transformed aggregation token is read out directly.
- Loss: Binary cross-entropy over bags:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y^i \log \hat{y}^i + \left(1 - y^i\right)\log\left(1 - \hat{y}^i\right)\right],$$

where $\hat{y}^i = \sigma\!\left((\mathbf{w}^i)^\top \mathbf{z}^i + b^i\right)$ and $\sigma$ denotes the sigmoid. No additional regularization is applied beyond Adam weight decay.
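As a quick numeric check of the loss above, a minimal NumPy computation of the bag-level binary cross-entropy (the labels and probabilities here are arbitrary illustrative values):

```python
import numpy as np

def bce(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over bags; eps avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

y = np.array([1.0, 0.0, 1.0])   # bag labels
p = np.array([0.9, 0.2, 0.6])   # predicted probabilities
loss = bce(y, p)                # -(ln 0.9 + ln 0.8 + ln 0.6) / 3 ≈ 0.2798
```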
3. Training and Implementation Protocols
- Preprocessing: CT patches are center-cropped, resized, and normalized. Tabular variables are imputed and z-scored.
- Batching: Full-bag processing for each patient with batch size 16; all CT slices of every patient (range 9–635, mean ≈230) are included.
- Optimization: Adam optimizer with weight decay; early stopping on validation AUC with a patience of 50 epochs. Training uses five-fold cross-validation with a 3:1:1 train/val/test split.
- Forward–Backward Pass: The training process involves extracting instance features, generating patient-specific hypernetwork parameters, transformer-based feature aggregation, per-patient classification, and gradient-driven updates. No instance subsampling is used.
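The early-stopping protocol can be sketched generically. This loop and its toy AUC curve are illustrative assumptions, not the authors' code; `step_fn` stands for one training pass over all patient bags and `eval_auc_fn` for validation-set evaluation.

```python
import numpy as np

def train_with_early_stopping(step_fn, eval_auc_fn, patience=50, max_epochs=1000):
    """Early stopping on validation AUC: stop once `patience` epochs pass
    without improvement, mirroring the protocol described above."""
    best_auc, best_epoch = -np.inf, 0
    for epoch in range(max_epochs):
        step_fn(epoch)                       # one pass over all patient bags
        auc = eval_auc_fn(epoch)
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            break                            # patience exhausted
    return best_auc, best_epoch

# Toy AUC curve: steady improvement, then a noisy plateau below the peak.
rng = np.random.default_rng(1)
curve = np.concatenate([np.linspace(0.5, 0.71, 60),
                        0.65 + 0.005 * rng.standard_normal(200)])
best_auc, best_epoch = train_with_early_stopping(lambda e: None,
                                                 lambda e: curve[e])
```

With this curve the loop keeps the peak at epoch 59 and stops 50 epochs later, returning the best validation AUC seen.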
4. Experimental Results
The primary dataset comprises 493 patients, each with 9–635 CT slices. The following summarizes comparative and ablation results for bag-level F1 and AUC metrics:
| Method | F1 | AUC |
|---|---|---|
| TableMLP (tabular) | 0.464 | 0.578 |
| Output+Max pooling | 0.474 | 0.606 |
| Feature+Mean pooling | 0.508 | 0.642 |
| Feature+Max pooling | 0.479 | 0.549 |
| Feature+Attention (Ilse’18) | 0.498 | 0.659 |
| Feature+Transformer (static) | 0.544 | 0.667 |
| Concat late-fusion | 0.558 | 0.688 |
| Gated Attention Fusion | 0.547 | 0.686 |
| MultimodalTransformer | 0.521 | 0.676 |
| M3IFusion | 0.517 | 0.667 |
| HEALNet (cross-attn) | 0.492 | 0.642 |
| HyperAdAgFormer (full) | 0.570 | 0.710 |
Ablation experiments:
- Without the hypernetwork ("Feature+Transformer"): F1/AUC = 0.544/0.667
- Hypernetwork restricted to the classifier ($\mathbf{w}^i$, $b^i$ only): F1/AUC = 0.557/0.692
- Full pipeline ($\boldsymbol{\tau}^i$, $\mathbf{w}^i$, $b^i$): F1/AUC = 0.570/0.710
Qualitative attention analysis shows that for low-risk patients (e.g., high LVEF, no HF), HyperAdAgFormer focuses attention on subtle, small calcified regions that static attention modules tend to ignore.
5. Contextual Significance and Distinctive Features
HyperAdAgFormer addresses the heterogeneity in clinical practice by enabling per-patient adaptation of MIL aggregation based on extensive tabular context. The TCTP vector $\boldsymbol{\tau}^i$ modulates the aggregation token, thus shaping the instance attention profile according to patient-specific factors. Simultaneously, patient-specific classifier weights and biases ($\mathbf{w}^i$, $b^i$) allow downstream decision criteria to be tailored beyond simple static thresholds. This design contrasts with prior multimodal MIL approaches, which typically employ static pooling or fusion schemes without individualized adaptation based on clinical data.
The ablation study quantifies the incremental value of each hypernetwork output: most notably, adapting the aggregation token (via $\boldsymbol{\tau}^i$) in addition to the classifier weights is critical for the best performance.
6. Limitations and Future Research Directions
HyperAdAgFormer presently relies on complete tabular data with missing variables handled by median imputation only. A plausible implication is that more principled probabilistic handling of missingness (e.g., variational imputation) could further improve robustness and generalization, especially in real-world electronic health record scenarios. The model employs a shallow (single-layer) Transformer and a basic hypernetwork MLP, leaving the exploration of deeper/larger architectures unaddressed. The private dataset (493 patients) limits broad external validation, necessitating future studies on larger, multi-center cohorts. Future work is planned to extend HyperAdAgFormer with explicit modeling or marginalization of missing tabular features and to introduce patient-specific uncertainty quantification in both aggregation and prediction (Shiku et al., 29 Jan 2026).