
Nested MIL with Attention

Updated 20 January 2026
  • The paper introduces a nested MIL framework that employs dedicated attention mechanisms at every hierarchy level to enhance prediction accuracy and interpretability.
  • NMIA is a hierarchical model that uses multi-level feature embedding and aggregation to capture complex dependencies in weakly supervised, bag-of-bags data.
  • Empirical evaluations demonstrate that NMIA outperforms traditional MIL approaches, particularly in tasks requiring structured latent label inference and rule-based aggregation.

Nested Multiple Instance Learning with Attention (NMIA) extends the canonical Multiple Instance Learning (MIL) paradigm to address weakly supervised problems with complex hierarchical structure, where only bag-of-bags labels are available and neither instance nor inner-bag labels are observed. NMIA introduces $J$ levels of bag nesting and employs a dedicated attention mechanism at each level. This framework enables not only accurate prediction of the outermost bag labels but also interpretable soft predictions of latent labels at lower levels. The original model formulation and empirical analysis are detailed in "Nested Multiple Instance Learning with Attention Mechanisms" (Fuster et al., 2021).

1. Hierarchical Weak Supervision and Formal Setup

NMIA formalizes a setting where only the label $y \in \{0,1\}$ of a single outermost bag $X$ is observed, but the data structure is intrinsically hierarchical:

  • Level 1 (Innermost): Instances $x_{1,k,l} \in \mathbb{R}^D$, grouped into inner-bags.
  • Levels $2,\dots,J-1$: Each level $j$ comprises bags $X_{j,k}$ of elements from level $j-1$.
  • Level $J$ (Outermost): The top-level bag $X_{J,1}$ contains the inner-bags $X_{J-1,k}$ for $k=1,\dots,K_{J-1}$.

Notation:

  • $x_{j,k,l}$: $l$-th element of the $k$-th bag at level $j$ (a raw instance for $j=1$; the embedding of a sub-bag for $j>1$).
  • $X_{j,k} = \{x_{j,k,l} \mid l=1,\dots,L_{j,k}\}$: $k$-th bag at level $j$, with $L_{j,k}$ elements.
  • $X = X_{J,1} = \{X_{J-1,k} \mid k=1,\dots,K_{J-1}\}$: the single outermost bag.
  • $y^j_{k,l} \in \{0,1\}$: latent label of $x_{j,k,l}$ (not observed).
  • Under standard MIL ($J=1$), $y = \max_l y^1_{1,l}$.

This nested organization generalizes MIL such that models can capture complex dependencies, like grouping similar instances or enforcing relational bag rules.
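The nested setup can be made concrete with a tiny synthetic example. The sketch below (assuming NumPy; the variable names and the two-level max rule are illustrative choices, while the paper's experiments also use richer rules such as requiring two positives in the same inner-bag) builds a $J=2$ bag-of-bags and derives the observed label from latent labels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Level 1: instances in R^D, grouped into K1 inner-bags of varying size.
D, K1 = 4, 3
inner_bags = [rng.normal(size=(rng.integers(2, 5), D)) for _ in range(K1)]

# Latent instance labels y^1_{k,l} (never observed during training).
latent = [rng.integers(0, 2, size=len(bag)) for bag in inner_bags]

# Latent inner-bag labels y^2_k via the max rule, then the observed
# outer-bag label y as the max over inner-bag labels.
inner_labels = [int(y.max()) for y in latent]
y = max(inner_labels)
print(y in (0, 1))  # True
```

Only `y` would be available to the learner; the per-instance and per-inner-bag labels remain latent.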

2. Model Architecture and Attention Mechanisms

NMIA employs a multi-tiered process for representation and aggregation, parameterized as follows:

2.1 Instance-level Feature Embedding

Each raw instance $x_{1,k,l}$ is embedded:

$h_{1,k,l} = f(x_{1,k,l}; \theta_f)$

where $f$ is typically a CNN or MLP.
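As an illustration, $f$ can be instantiated as a small MLP. The sketch below is a hypothetical stand-in for the paper's embedder (layer sizes and parameter names are ours):

```python
import numpy as np

def embed(x, W1, b1, W2, b2):
    """f(x; theta_f): maps a raw instance in R^D to an embedding in R^M
    via one tanh hidden layer."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
D, H, M = 8, 16, 5                       # illustrative sizes
theta_f = (rng.normal(size=(D, H)), np.zeros(H),
           rng.normal(size=(H, M)), np.zeros(M))

x = rng.normal(size=D)                   # one raw instance x_{1,k,l}
h = embed(x, *theta_f)                   # its embedding h_{1,k,l}
print(h.shape)                           # (5,)
```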

2.2 Attention from Instance to Inner-bag

Attention scores for each instance in its inner-bag:

$a_{1,k,l} = \exp(w^\top h_{1,k,l} + b)$

$\alpha_{1,k,l} = \frac{a_{1,k,l}}{\sum_{m=1}^{L_{1,k}} a_{1,k,m}}$

with $w \in \mathbb{R}^M$, $b \in \mathbb{R}$.

A gated-attention variant is also considered:

$a_{1,k,l} = \exp\left[w^\top \left(\tanh(V h_{1,k,l}) \odot \sigma(U h_{1,k,l})\right)\right]$

with $V, U \in \mathbb{R}^{L \times M}$ (so $w \in \mathbb{R}^L$ in this variant), $\odot$ denoting element-wise multiplication, and $\sigma$ the sigmoid function.
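Both attention forms reduce to a softmax over per-element scores within a bag. A minimal NumPy sketch (helper names are ours; shapes follow the definitions above):

```python
import numpy as np

def attention_weights(H, w, b):
    """Plain attention over one inner-bag.
    H: (L_k, M) stacked embeddings; returns (L_k,) weights summing to 1."""
    scores = H @ w + b
    scores = scores - scores.max()        # stabilize exp
    a = np.exp(scores)
    return a / a.sum()

def gated_attention_weights(H, w, V, U):
    """Gated variant: scores w^T (tanh(V h) * sigmoid(U h))."""
    gate = np.tanh(H @ V.T) * (1.0 / (1.0 + np.exp(-(H @ U.T))))
    scores = gate @ w
    scores = scores - scores.max()
    a = np.exp(scores)
    return a / a.sum()

rng = np.random.default_rng(1)
L_dim, M = 6, 4                           # attention size L, embedding size M
H = rng.normal(size=(3, M))               # an inner-bag of 3 embeddings
alpha = attention_weights(H, rng.normal(size=M), 0.0)
alpha_g = gated_attention_weights(H, rng.normal(size=L_dim),
                                  rng.normal(size=(L_dim, M)),
                                  rng.normal(size=(L_dim, M)))
print(np.isclose(alpha.sum(), 1.0), np.isclose(alpha_g.sum(), 1.0))
```

Subtracting the maximum score before exponentiation leaves the normalized weights unchanged but avoids overflow.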

2.3 Inner-Bag Representation Aggregation

Weighted sum for each inner-bag:

$m_{1,k} = \sum_{l=1}^{L_{1,k}} \alpha_{1,k,l} h_{1,k,l}$

2.4 Attention from Inner-bag to Outer-bag

Aggregation to the outer-bag:

$b_k = \exp(v^\top m_{1,k} + c)$

$\beta_k = \frac{b_k}{\sum_{n=1}^{K_1} b_n}$

with $v \in \mathbb{R}^M$, $c \in \mathbb{R}$.

Final bag-of-bags embedding:

$M = \sum_{k=1}^{K_1} \beta_k m_{1,k}$
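Sections 2.2–2.4 together form a two-level attention pooling operator. A compact NumPy sketch of this pipeline, assuming the instance embeddings are already computed (function and variable names are ours):

```python
import numpy as np

def softmax(s):
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

def nmia_pool(inner_bags, w, b, v, c):
    """Two-level attention pooling.
    inner_bags: list of (L_k, M) arrays of embeddings h_{1,k,l}.
    Returns the bag-of-bags embedding M plus both attention maps."""
    m, alphas = [], []
    for H in inner_bags:                   # level 1: pool instances
        alpha = softmax(H @ w + b)         # alpha_{1,k,l}
        alphas.append(alpha)
        m.append(alpha @ H)                # m_{1,k}
    m = np.stack(m)
    beta = softmax(m @ v + c)              # level 2: beta_k over inner-bags
    M_emb = beta @ m                       # bag-of-bags embedding M
    return M_emb, alphas, beta

rng = np.random.default_rng(2)
M_dim = 4
bags = [rng.normal(size=(rng.integers(2, 5), M_dim)) for _ in range(3)]
M_emb, alphas, beta = nmia_pool(bags, rng.normal(size=M_dim), 0.0,
                                rng.normal(size=M_dim), 0.0)
print(M_emb.shape, np.isclose(beta.sum(), 1.0))
```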

2.5 Classification Head

Prediction via:

$\hat{y} = \Theta_c(M; \theta_c)$

where $\hat{y} \in [0,1]$ is the predicted probability.
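The paper leaves the form of $\Theta_c$ flexible; a minimal assumption is a single logistic layer on $M$, as sketched here (names are ours):

```python
import numpy as np

def classify(M_emb, w_c, b_c):
    """Theta_c(M; theta_c) as a logistic layer; returns y_hat in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(M_emb @ w_c + b_c)))

rng = np.random.default_rng(3)
M_emb = rng.normal(size=4)                 # a bag-of-bags embedding
y_hat = classify(M_emb, rng.normal(size=4), 0.0)
print(0.0 <= y_hat <= 1.0)                 # True
```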

3. Training Objective and Optimization

The model is trained end-to-end with the combined parameters $\theta = \{\theta_f, w, b, v, c, \theta_c, \dots\}$, minimizing binary cross-entropy on outer-bag labels with optional $\ell_2$ regularization:

$L(\theta) = -[y \log \hat{y} + (1-y) \log(1-\hat{y})] + \lambda \|\theta\|_2^2$

Early stopping is typically applied using a held-out validation set.

The full pipeline, from instance embedding to nested attention aggregation, is differentiable and thus amenable to optimization by SGD or Adam.
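The objective can be sketched numerically as follows (the function name and the flattened-parameter convention are ours; the small epsilon guards the logarithms):

```python
import numpy as np

def nmia_loss(y, y_hat, theta_flat, lam=1e-4):
    """Binary cross-entropy on the outer-bag label plus an l2 penalty.
    theta_flat: all model parameters concatenated into one vector."""
    eps = 1e-12                                   # numerical safety for log
    bce = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    return bce + lam * np.sum(theta_flat ** 2)

theta = np.array([0.5, -1.0, 2.0])
loss_confident = nmia_loss(1, 0.99, theta)        # confident, correct
loss_wrong = nmia_loss(1, 0.01, theta)            # confident, wrong
print(loss_confident < loss_wrong)                # True
```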

4. Latent Label Prediction via Hierarchical Attention

Although supervision is available only at the outer bag level, NMIA leverages nested attention for latent label inference:

  • Instance-level score $\alpha_{1,k,l}$: Measures the likelihood that instance $x_{1,k,l}$ is positive within its inner-bag; thresholding $\alpha_{1,k,l} > \tau_1$ yields the latent positive assignment $\hat{y}^1_{k,l} = 1$.
  • Inner-bag score $\beta_k$: Indicates inner-bag $X_{1,k}$'s contribution to a positive outer label; thresholding $\beta_k > \tau_2$ yields the latent positive inner-bag assignment $\hat{y}^2_k = 1$.

This nested inference enables partial recovery of latent structure, as shown in medical whole-slide imaging (WSI) examples where attention highlights candidate lesions and regions.
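This read-out amounts to simple thresholding of the attention maps. In the sketch below, the choice of $\tau_1, \tau_2$ is left open; defaulting to the uniform weight $1/L$ is our illustrative heuristic, not a prescription from the paper:

```python
import numpy as np

def latent_labels(alphas, beta, tau1=None, tau2=None):
    """alphas: list of per-inner-bag attention vectors; beta: (K,) weights.
    Defaults each threshold to the uniform weight (assumption)."""
    inst = []
    for alpha in alphas:
        t1 = tau1 if tau1 is not None else 1.0 / len(alpha)
        inst.append((alpha > t1).astype(int))       # y-hat^1_{k,l}
    t2 = tau2 if tau2 is not None else 1.0 / len(beta)
    bag = (beta > t2).astype(int)                   # y-hat^2_k
    return inst, bag

alphas = [np.array([0.7, 0.2, 0.1]), np.array([0.34, 0.33, 0.33])]
beta = np.array([0.9, 0.1])
inst, bag = latent_labels(alphas, beta)
print(inst[0].tolist(), bag.tolist())   # [1, 0, 0] [1, 0]
```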

5. Computational Workflow

The NMIA training/inference procedure is directly expressed in the following pseudocode:

Given nested dataset {X_i, y_i} for i = 1...N:
  Initialize θ
  repeat for epoch = 1...MaxEpochs:
    for each minibatch of outer-bags {X_i, y_i}:
      for each bag X_i:
        # Level 1: embed instances
        for k = 1...K1, l = 1...L_{1,k}:
          h_{1,k,l} ← f(x_{1,k,l}; θ_f)
        # Attention + aggregation at level 1
        for each inner-bag k:
          a_{1,k,l} ← exp(wᵀ h_{1,k,l} + b)
          α_{1,k,l} ← a_{1,k,l} / sum_l a_{1,k,l}
          m_{1,k} ← sum_l α_{1,k,l} h_{1,k,l}
        # Attention + aggregation at level 2
        for k = 1...K1:
          b_k ← exp(vᵀ m_{1,k} + c)
        β_k ← b_k / sum_k b_k
        M ← sum_k β_k m_{1,k}
        ŷ_i ← Θ_c(M; θ_c)
      Compute loss L = −sum_i [y_i log ŷ_i + (1−y_i) log(1−ŷ_i)] + λ‖θ‖²
      Back-propagate ∇_θ L, update θ by SGD/Adam
    Validate on held-out bags; apply early stopping

At inference, the same forward pass provides y^\hat{y} and attention maps {α,β}\{\alpha,\beta\}, supporting both outer prediction and interpretability for inner structure via attention thresholding.

6. Empirical Evaluation and Comparative Results

NMIA was evaluated on two-level (MNIST, PCAM) and three-level (MNIST "odd-only" rule) benchmarks, compared with alternative MIL architectures:

| Dataset/Experiment | MI | MIA | NMI | NMIA |
|---|---|---|---|---|
| MNIST Exp1 (single-instance→bag) | 0.929 | 0.957 | 0.923 | 0.959 |
| MNIST Exp2 (≥2 positives in same inner-bag) | 0.345 | 0.472 | 0.855 | 0.921 |
| MNIST Exp3 (3-level "odd-only" rule) | N/A | N/A | 0.556 | 0.836 |
| PCAM Exp1 (standard MIL) | 0.957 | 0.973 | 0.964 | 0.978 |
| PCAM Exp2 (≥2 metastatic patches/region) | 0.290 | 0.286 | 0.700 | 0.734 |
  • In easy tasks (Exp1), all models perform well, with NMIA slightly outperforming alternatives.
  • For rule-based tasks requiring the grouping of positives (Exp2), conventional MI/MIA architectures fail, while NMI and NMIA model the required relations, with NMIA achieving superior F1.
  • The three-level hierarchy (Exp3) demonstrates only the NMIA architecture's capacity to learn complex hierarchical rules (e.g., aggregating presence/absence across nested levels).
  • Qualitative attention visualizations confirm that $\alpha$ scores highlight salient instances ("9" digits, metastatic regions) and $\beta$ scores pinpoint relevant inner-bags.

A plausible implication is that NMIA enhances interpretability for nested weakly-supervised problems and is advantageous where ground-truth is available only at the highest level, but models or applications demand finer-grained insight into hierarchical structure.

NMIA generalizes attention-based MIL architectures via explicit hierarchical nesting, combining soft attention for instance selection with multi-level aggregation. This approach is especially pertinent for domains such as computational pathology, vision, and any application where entities are naturally grouped and only coarse labels are available. The nesting and attention extensibility allow NMIA to subsume previous MIL variants (mean aggregation, single-level attention) and outperform them in tasks necessitating hierarchical inference (Fuster et al., 2021).

The framework's broad applicability suggests future directions in further hierarchy modeling, explainable machine learning, and adaptation to domains with complex nested-label structures.

References

  1. Fuster et al. (2021). "Nested Multiple Instance Learning with Attention Mechanisms."
