
Nested MIL with Attention

Updated 20 January 2026
  • The paper introduces a nested MIL framework that employs dedicated attention mechanisms at every hierarchy level to enhance prediction accuracy and interpretability.
  • NMIA is a hierarchical model that uses multi-level feature embedding and aggregation to capture complex dependencies in weakly supervised, bag-of-bags data.
  • Empirical evaluations demonstrate that NMIA outperforms traditional MIL approaches, particularly in tasks requiring structured latent label inference and rule-based aggregation.

Nested Multiple Instance Learning with Attention (NMIA) extends the canonical Multiple Instance Learning (MIL) paradigm to address weakly supervised problems with complex hierarchical structure, where only bag-of-bags labels are available and neither instance nor inner-bag labels are observed. NMIA introduces $J$ levels of bag nesting and employs a dedicated attention mechanism at each level. This framework enables not only accurate prediction of the outermost bag labels but also interpretable soft predictions of latent labels at lower levels. The original model formulation and empirical analysis are detailed in "Nested Multiple Instance Learning with Attention Mechanisms" (Fuster et al., 2021).

1. Hierarchical Weak Supervision and Formal Setup

NMIA formalizes a setting where only the label $y \in \{0,1\}$ of a single outermost bag $X$ is observed, but the data structure is intrinsically hierarchical:

  • Level 1 (Innermost): Instances $x_{1,k,l} \in \mathbb{R}^D$, grouped into inner-bags.
  • Levels $2,\dots,J-1$: Each level $j$ comprises bags $X_{j,k}$ of elements from level $j-1$.
  • Level $J$ (Outermost): The top-level bag $X_{J,1}$ contains the inner-bags $X_{J-1,k}$ for $k=1,\dots,K_{J-1}$.

Notation:

  • $x_{j,k,l}$: $l$-th element of the $k$-th bag at level $j$ (a raw instance for $j=1$; the embedding of a sub-bag for $j>1$).
  • $X_{j,k} = \{x_{j,k,l} \mid l=1,\dots,L_{j,k}\}$: $k$-th bag at level $j$, with $L_{j,k}$ elements.
  • $X = X_{J,1} = \{X_{J-1,k} \mid k=1,\dots,K_{J-1}\}$: the single outermost bag.
  • $y^j_{k,l} \in \{0,1\}$: latent label of $x_{j,k,l}$ (not observed).
  • Under standard MIL ($J=1$), $y = \max_l y^1_{1,l}$.

This nested organization generalizes MIL such that models can capture complex dependencies, like grouping similar instances or enforcing relational bag rules.
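The nested setup can be made concrete with a tiny synthetic example. The sketch below (assuming NumPy; the variable names and the two-level max rule are illustrative choices, while the paper's experiments also use richer rules such as requiring two positives in the same inner-bag) builds a $J=2$ bag-of-bags and derives the observed label from latent labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Level 1: instances in R^D, grouped into K1 inner-bags of varying size.
D, K1 = 4, 3
inner_bags = [rng.normal(size=(rng.integers(2, 5), D)) for _ in range(K1)]

# Latent instance labels y^1_{k,l} (never observed during training).
latent = [rng.integers(0, 2, size=len(bag)) for bag in inner_bags]

# Latent inner-bag labels y^2_k via the max rule, then the observed
# outer-bag label y as the max over inner-bag labels.
inner_labels = [int(y.max()) for y in latent]
y = max(inner_labels)
print(y in (0, 1))  # True
```

Only `y` would be available to the learner; the per-instance and per-inner-bag labels remain latent.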

2. Model Architecture and Attention Mechanisms

NMIA employs a multi-tiered process for representation and aggregation, parameterized as follows:

2.1 Instance-level Feature Embedding

Each raw instance $x_{1,k,l}$ is embedded:

$h_{1,k,l} = f(x_{1,k,l}; \theta_f)$

where $f$ is typically a CNN or MLP.
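As an illustration, $f$ can be instantiated as a small MLP. The sketch below is a hypothetical stand-in for the paper's embedder (layer sizes and parameter names are ours):

```python
import numpy as np

def embed(x, W1, b1, W2, b2):
    """f(x; theta_f): maps a raw instance in R^D to an embedding in R^M
    via one tanh hidden layer."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
D, H, M = 8, 16, 5                       # illustrative sizes
theta_f = (rng.normal(size=(D, H)), np.zeros(H),
           rng.normal(size=(H, M)), np.zeros(M))

x = rng.normal(size=D)                   # one raw instance x_{1,k,l}
h = embed(x, *theta_f)                   # its embedding h_{1,k,l}
print(h.shape)                           # (5,)
```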

2.2 Attention from Instance to Inner-bag

Attention scores for each instance in its inner-bag:

$a_{1,k,l} = \exp(w^\top h_{1,k,l} + b)$

$\alpha_{1,k,l} = \frac{a_{1,k,l}}{\sum_{m=1}^{L_{1,k}} a_{1,k,m}}$

with $w \in \mathbb{R}^M$, $b \in \mathbb{R}$.

A gated-attention variant is also considered:

$a_{1,k,l} = \exp\left[w^\top \left(\tanh(V h_{1,k,l}) \odot \sigma(U h_{1,k,l})\right)\right]$

with $V, U \in \mathbb{R}^{L \times M}$ (so $w \in \mathbb{R}^L$ in this variant), $\odot$ denoting element-wise multiplication, and $\sigma$ the sigmoid function.
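Both attention forms reduce to a softmax over per-element scores within a bag. A minimal NumPy sketch (helper names are ours; shapes follow the definitions above):

```python
import numpy as np

def attention_weights(H, w, b):
    """Plain attention over one inner-bag.
    H: (L_k, M) stacked embeddings; returns (L_k,) weights summing to 1."""
    scores = H @ w + b
    scores = scores - scores.max()        # stabilize exp
    a = np.exp(scores)
    return a / a.sum()

def gated_attention_weights(H, w, V, U):
    """Gated variant: scores w^T (tanh(V h) * sigmoid(U h))."""
    gate = np.tanh(H @ V.T) * (1.0 / (1.0 + np.exp(-(H @ U.T))))
    scores = gate @ w
    scores = scores - scores.max()
    a = np.exp(scores)
    return a / a.sum()

rng = np.random.default_rng(1)
L_dim, M = 6, 4                           # attention size L, embedding size M
H = rng.normal(size=(3, M))               # an inner-bag of 3 embeddings
alpha = attention_weights(H, rng.normal(size=M), 0.0)
alpha_g = gated_attention_weights(H, rng.normal(size=L_dim),
                                  rng.normal(size=(L_dim, M)),
                                  rng.normal(size=(L_dim, M)))
print(np.isclose(alpha.sum(), 1.0), np.isclose(alpha_g.sum(), 1.0))
```

Subtracting the maximum score before exponentiation leaves the normalized weights unchanged but avoids overflow.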

2.3 Inner-Bag Representation Aggregation

Weighted sum for each inner-bag:

$m_{1,k} = \sum_{l=1}^{L_{1,k}} \alpha_{1,k,l} h_{1,k,l}$

2.4 Attention from Inner-bag to Outer-bag

Aggregation to the outer-bag:

$b_k = \exp(v^\top m_{1,k} + c)$

$\beta_k = \frac{b_k}{\sum_{n=1}^{K_1} b_n}$

with $v \in \mathbb{R}^M$, $c \in \mathbb{R}$.

Final bag-of-bags embedding:

$M = \sum_{k=1}^{K_1} \beta_k m_{1,k}$
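Sections 2.2–2.4 together form a two-level attention pooling operator. A compact NumPy sketch of this pipeline, assuming the instance embeddings are already computed (function and variable names are ours):

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def nmia_pool(inner_bags, w, b, v, c):
    """Two-level attention pooling.
    inner_bags: list of (L_k, M) arrays of embeddings h_{1,k,l}.
    Returns the bag-of-bags embedding M plus both attention maps."""
    m, alphas = [], []
    for H in inner_bags:                   # level 1: pool instances
        alpha = softmax(H @ w + b)         # alpha_{1,k,l}
        alphas.append(alpha)
        m.append(alpha @ H)                # m_{1,k}
    m = np.stack(m)
    beta = softmax(m @ v + c)              # level 2: beta_k over inner-bags
    M_emb = beta @ m                       # bag-of-bags embedding M
    return M_emb, alphas, beta

rng = np.random.default_rng(2)
M_dim = 4
bags = [rng.normal(size=(rng.integers(2, 5), M_dim)) for _ in range(3)]
M_emb, alphas, beta = nmia_pool(bags, rng.normal(size=M_dim), 0.0,
                                rng.normal(size=M_dim), 0.0)
print(M_emb.shape, np.isclose(beta.sum(), 1.0))
```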

2.5 Classification Head

Prediction via:

$\hat{y} = \Theta_c(M; \theta_c)$

where $\hat{y} \in [0,1]$ is the predicted probability.
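The paper leaves the form of $\Theta_c$ flexible; a minimal assumption is a single logistic layer on $M$, as sketched here (names are ours):

```python
import numpy as np

def classify(M_emb, w_c, b_c):
    """Theta_c(M; theta_c) as a logistic layer; returns y_hat in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(M_emb @ w_c + b_c)))

rng = np.random.default_rng(3)
M_emb = rng.normal(size=4)                 # a bag-of-bags embedding
y_hat = classify(M_emb, rng.normal(size=4), 0.0)
print(0.0 <= y_hat <= 1.0)                 # True
```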

3. Training Objective and Optimization

The model is trained end-to-end with the combined parameters $\theta = \{\theta_f, w, b, v, c, \theta_c, \dots\}$, minimizing binary cross-entropy on outer-bag labels with optional $\ell_2$ regularization:

$L(\theta) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})] + \lambda \|\theta\|_2^2$

Early stopping is typically applied using a held-out validation set.

The full pipeline, from instance embedding to nested attention aggregation, is differentiable and thus amenable to optimization by SGD or Adam.
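The objective can be sketched numerically as follows (the function name and the flattened-parameter convention are ours; the small epsilon guards the logarithms):

```python
import numpy as np

def nmia_loss(y, y_hat, theta_flat, lam=1e-4):
    """Binary cross-entropy on the outer-bag label plus an l2 penalty.
    theta_flat: all model parameters concatenated into one vector."""
    eps = 1e-12                                   # numerical safety for log
    bce = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return bce + lam * np.sum(theta_flat ** 2)

theta = np.array([0.5, -1.0, 2.0])
loss_confident = nmia_loss(1, 0.99, theta)        # confident, correct
loss_wrong = nmia_loss(1, 0.01, theta)            # confident, wrong
print(loss_confident < loss_wrong)                # True
```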

4. Latent Label Prediction via Hierarchical Attention

Although supervision is available only at the outer bag level, NMIA leverages nested attention for latent label inference:

  • Instance-level score $\alpha_{1,k,l}$: Measures the likelihood that instance $x_{1,k,l}$ is positive within its inner-bag; thresholding $\alpha_{1,k,l} > \tau_1$ yields the latent positive assignment $\hat{y}^1_{k,l} = 1$.
  • Inner-bag score $\beta_k$: Indicates inner-bag $X_{1,k}$'s contribution to a positive outer label; thresholding $\beta_k > \tau_2$ yields the latent positive inner-bag assignment $\hat{y}^2_k = 1$.

This nested inference enables partial recovery of latent structure, as shown in medical whole-slide imaging (WSI) examples where attention highlights candidate lesions and regions.
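This read-out amounts to simple thresholding of the attention maps. In the sketch below, the choice of $\tau_1, \tau_2$ is left open; defaulting to the uniform weight $1/L$ is our illustrative heuristic, not a prescription from the paper:

```python
import numpy as np

def latent_labels(alphas, beta, tau1=None, tau2=None):
    """alphas: list of per-inner-bag attention vectors; beta: (K,) weights.
    Defaults each threshold to the uniform weight (assumption)."""
    inst = []
    for alpha in alphas:
        t1 = tau1 if tau1 is not None else 1.0 / len(alpha)
        inst.append((alpha > t1).astype(int))       # y-hat^1_{k,l}
    t2 = tau2 if tau2 is not None else 1.0 / len(beta)
    bag = (beta > t2).astype(int)                   # y-hat^2_k
    return inst, bag

alphas = [np.array([0.7, 0.2, 0.1]), np.array([0.34, 0.33, 0.33])]
beta = np.array([0.9, 0.1])
inst, bag = latent_labels(alphas, beta)
print(inst[0].tolist(), bag.tolist())   # [1, 0, 0] [1, 0]
```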

5. Computational Workflow

The NMIA training/inference procedure is directly expressed in the following pseudocode:

Given nested dataset {X_i, y_i} for i = 1...N:
  Initialize θ
  repeat for epoch = 1...MaxEpochs:
    for each minibatch of outer-bags {X_i, y_i}:
      for each bag X_i:
        # Level 1: embed instances
        for k = 1...K1, l = 1...L_{1,k}:
          h_{1,k,l} ← f(x_{1,k,l}; θ_f)
        # Attention + aggregation at level 1
        for each inner-bag k:
          a_{1,k,l} ← exp(wᵀ h_{1,k,l} + b)
          α_{1,k,l} ← a_{1,k,l} / sum_l a_{1,k,l}
          m_{1,k} ← sum_l α_{1,k,l} h_{1,k,l}
        # Attention + aggregation at level 2
        for k = 1...K1:
          b_k ← exp(vᵀ m_{1,k} + c)
        β_k ← b_k / sum_k b_k
        M ← sum_k β_k m_{1,k}
        ŷ_i ← Θ_c(M; θ_c)
      Compute loss L = −sum_i [y_i log ŷ_i + (1−y_i) log(1−ŷ_i)] + λ‖θ‖²
      Back-propagate ∇_θ L, update θ by SGD/Adam
    Validate on held-out bags; apply early stopping

At inference, the same forward pass provides y^\hat{y} and attention maps {α,β}\{\alpha,\beta\}, supporting both outer prediction and interpretability for inner structure via attention thresholding.

6. Empirical Evaluation and Comparative Results

NMIA was evaluated on two-level (MNIST, PCAM) and three-level (MNIST "odd-only" rule) benchmarks, compared with alternative MIL architectures:

| Dataset/Experiment | MI | MIA | NMI | NMIA |
|---|---|---|---|---|
| MNIST Exp1 (single-instance→bag) | 0.929 | 0.957 | 0.923 | 0.959 |
| MNIST Exp2 (≥2 positives in same inner-bag) | 0.345 | 0.472 | 0.855 | 0.921 |
| MNIST Exp3 (3-level "odd-only" rule) | N/A | N/A | 0.556 | 0.836 |
| PCAM Exp1 (standard MIL) | 0.957 | 0.973 | 0.964 | 0.978 |
| PCAM Exp2 (≥2 metastatic patches/region) | 0.290 | 0.286 | 0.700 | 0.734 |
  • In easy tasks (Exp1), all models perform well, with NMIA slightly outperforming alternatives.
  • For rule-based tasks requiring the grouping of positives (Exp2), conventional MI/MIA architectures fail, while NMI and NMIA model the required relations, with NMIA achieving superior F1.
  • The three-level hierarchy (Exp3) demonstrates only the NMIA architecture's capacity to learn complex hierarchical rules (e.g., aggregating presence/absence across nested levels).
  • Qualitative attention visualizations confirm that $\alpha$ scores highlight salient instances ("9" digits, metastatic regions) and $\beta$ scores pinpoint relevant inner-bags.

A plausible implication is that NMIA enhances interpretability for nested weakly-supervised problems and is advantageous where ground-truth is available only at the highest level, but models or applications demand finer-grained insight into hierarchical structure.

NMIA generalizes attention-based MIL architectures via explicit hierarchical nesting, combining soft attention for instance selection with multi-level aggregation. This approach is especially pertinent for domains such as computational pathology, vision, and any application where entities are naturally grouped and only coarse labels are available. The nesting and attention extensibility allow NMIA to subsume previous MIL variants (mean aggregation, single-level attention) and outperform them in tasks necessitating hierarchical inference (Fuster et al., 2021).

The framework's broad applicability suggests future directions in further hierarchy modeling, explainable machine learning, and adaptation to domains with complex nested-label structures.

References

  1. Fuster et al. (2021). "Nested Multiple Instance Learning with Attention Mechanisms."
