Probabilistic Spatial Attention MIL

Updated 6 May 2026

The paper introduces a probabilistic spatial attention mechanism that embeds local context and quantifies uncertainty within the MIL framework.
It employs distance-aware priors and spatial pruning to boost computational efficiency and reliability in analyzing high-resolution images.
Empirical results on medical imaging datasets demonstrate state-of-the-art performance and improved interpretability using PSA-MIL.

Probabilistic Spatial Attention Multiple Instance Learning (PSA-MIL) is a class of deep learning frameworks developed to advance multiple instance learning (MIL) in spatially structured domains, particularly medical imaging. PSA-MIL models combine probabilistic formulations of attention mechanisms with spatial priors, enabling data-driven adaptation to local context, explicit quantification of uncertainty, and computational scalability for high-resolution data such as whole slide images (WSIs). These methods address the limitations of conventional attention-MIL techniques, which typically ignore spatial relationships or treat attention deterministically, by integrating distance-aware priors, spatial regularization, and variational inference to achieve state-of-the-art performance and improved interpretability (Schmidt et al., 2023, Peled et al., 20 Mar 2025, Castro-Macías et al., 20 Jul 2025).

1. Foundations and Motivation

The MIL paradigm represents each sample (“bag”) as a collection of unlabeled instances (e.g., tissue tiles in digital pathology or slices in CT scans), where only the aggregate label is known. Classic attention-MIL approaches aggregate instance-level features via attention weights but treat each instance independently, disregarding spatial arrangement. In high-dimensional spatial data—such as WSIs—neglecting local continuity leads to suboptimal detection of tissue structure and impairs model confidence estimation. PSA-MIL frameworks are developed to overcome these weaknesses by:

Embedding spatial relationships directly into the attention calculation, allowing attention to adapt to both the instance content and its spatial context.
Employing probabilistic mechanisms for attention, enabling instance-level uncertainty quantification and Bayesian regularization, which improves reliability when data is scarce or noisy (Schmidt et al., 2023, Castro-Macías et al., 20 Jul 2025).

2. Probabilistic Formulation of Attention

PSA-MIL architectures formally recast the canonical attention mechanism as probabilistic inference, where attention coefficients correspond to the posterior probabilities of selecting a key instance for a given query. For example, with query $q_i$ and keys $\{k_j\}$, the generative process introduces a categorical latent variable $t$ (indicating key selection) with prior $p(t_j=1) = \pi_j$ and Gaussian likelihood $p(q_i|t_j=1) = \mathcal{N}(q_i|k_j,\sigma^2 I)$ (Peled et al., 20 Mar 2025). The posterior is:

$p(t_j=1|q_i) = \frac{\pi_j \, \mathcal{N}(q_i|k_j,\sigma^2 I)}{\sum_{j'} \pi_{j'} \mathcal{N}(q_i|k_{j'},\sigma^2 I)}$

This reduces to the classic softmax self-attention under uniform prior and standard scaling:

$p(t_j=1|q_i) = \frac{\exp(q_i^\top k_j/\sqrt{d_k})}{\sum_{j'} \exp(q_i^\top k_{j'}/\sqrt{d_k})}$
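
The reduction can be checked by expanding the Gaussian exponent. The short derivation below is a sketch under one added assumption that the text does not state explicitly, namely that all keys share the same norm:

$\mathcal{N}(q_i|k_j,\sigma^2 I) \propto \exp\left(-\frac{\|q_i-k_j\|^2}{2\sigma^2}\right) = \exp\left(-\frac{\|q_i\|^2}{2\sigma^2}\right)\exp\left(\frac{q_i^\top k_j}{\sigma^2}\right)\exp\left(-\frac{\|k_j\|^2}{2\sigma^2}\right)$

The first factor does not depend on $j$ and cancels in the posterior ratio. With $\pi_j$ uniform and $\|k_j\|$ constant across keys, the prior and the last factor cancel as well, leaving

$p(t_j=1|q_i) = \frac{\exp(q_i^\top k_j/\sigma^2)}{\sum_{j'} \exp(q_i^\top k_{j'}/\sigma^2)}$

which matches the scaled dot-product form when $\sigma^2 = \sqrt{d_k}$.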

The innovation of PSA-MIL is in making the prior $\pi_j$ non-uniform and spatially informed.
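
The posterior view of attention can be illustrated numerically. The following NumPy sketch is not taken from the cited papers: the function names, the toy data, and the equal-norm normalisation of the keys are assumptions made for illustration. It computes the posterior for a single query and compares the uniform-prior case with scaled dot-product softmax.

```python
import numpy as np

def posterior_attention(q, K, prior, sigma2):
    """Posterior p(t_j = 1 | q) under the likelihood N(q | k_j, sigma2 * I)."""
    # Log-likelihood up to a constant shared by all keys.
    sq_dist = np.sum((K - q) ** 2, axis=1)
    log_post = np.log(prior) - sq_dist / (2.0 * sigma2)
    log_post -= log_post.max()  # stabilise the exponentials
    w = np.exp(log_post)
    return w / w.sum()

def softmax_attention(q, K):
    """Scaled dot-product attention weights for a single query."""
    scores = K @ q / np.sqrt(K.shape[1])
    scores -= scores.max()
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
d_k, n = 8, 5
K = rng.normal(size=(n, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # assumption: equal-norm keys
q = rng.normal(size=d_k)

# Uniform prior with sigma^2 = sqrt(d_k): should coincide with softmax attention.
uniform = np.full(n, 1.0 / n)
post_uniform = posterior_attention(q, K, uniform, sigma2=np.sqrt(d_k))
soft = softmax_attention(q, K)

# Non-uniform prior: mass shifts toward the key favoured a priori.
skewed = np.array([0.6, 0.1, 0.1, 0.1, 0.1])
post_skewed = posterior_attention(q, K, skewed, sigma2=np.sqrt(d_k))
```

Under these assumptions `post_uniform` and `soft` agree to numerical precision, while `post_skewed` places more weight on the first key than `post_uniform` does, which is the degree of freedom a spatial prior exploits.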

3. Incorporating Spatial Priors

Spatial attention in PSA-MIL is realized by learning priors that decay with spatial distance between instances. The prior is parameterized with functions such as exponential, Gaussian, or Cauchy decays, $f(d_{ij}|\theta)$, where $d_{ij}$ is the Euclidean distance between tiles $i$ and $j$, and $\theta$ are learnable parameters. The spatially-aware attention is given by:

$p(t_j=1|q_i) = \frac{f(d_{ij}|\theta) \, \exp(q_i^\top k_j/\sqrt{d_k})}{\sum_{j'} f(d_{ij'}|\theta) \, \exp(q_i^\top k_{j'}/\sqrt{d_k})}$

This probabilistically integrates local spatial context at the attention layer, allowing the heads to specialize to different spatial scales (Peled et al., 20 Mar 2025).
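
A minimal NumPy sketch of distance-weighted attention is given below. The exact parameterisation of each decay family is an assumption (the text names the families but not their formulas), as are the function names `decay` and `spatial_attention` and the toy coordinates. A single scalar `theta` plays the role of the learnable length scale of one head.

```python
import numpy as np

def decay(dist, theta, kind="exponential"):
    """Distance-decay prior f(d | theta); theta > 0 acts as a length scale."""
    if kind == "exponential":
        return np.exp(-dist / theta)
    if kind == "gaussian":
        return np.exp(-(dist ** 2) / (2.0 * theta ** 2))
    if kind == "cauchy":
        return 1.0 / (1.0 + (dist / theta) ** 2)
    raise ValueError(f"unknown decay kind: {kind}")

def spatial_attention(Q, K, coords, theta, kind="exponential"):
    """Row-normalised attention weighted by a distance-decay prior."""
    d_k = Q.shape[1]
    # Pairwise Euclidean distances d_ij between tile coordinates.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = decay(dist, theta, kind) * np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Three tiles on a line; identical content so only the prior shapes attention.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
Q = K = np.zeros((3, 4))
narrow = spatial_attention(Q, K, coords, theta=1.0)
wide = spatial_attention(Q, K, coords, theta=10.0)
```

With identical tile content, `narrow` concentrates the first tile's attention on itself and its immediate neighbour, whereas `wide` spreads it more evenly. This is one way to picture heads specialising to different spatial scales.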

An alternative approach places a smoothness prior directly on the (logit) attention vector $\mathbf{f}$, using the graph Laplacian $\mathbf{L}$ derived from adjacency matrix $\mathbf{A}$:

$p(\mathbf{f}) \propto \exp\left(-\tfrac{1}{2} \, \mathbf{f}^\top \mathbf{L} \, \mathbf{f}\right)$

This formulation encourages attention values to vary smoothly across adjacent spatial locations and is used within a variational inference framework (Castro-Macías et al., 20 Jul 2025).
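
The effect of a Laplacian-based smoothness prior can be seen on a small tile grid. The sketch below assumes a 4-neighbour adjacency and the unnormalised Laplacian (degree matrix minus adjacency); both choices, and the helper names, are illustrative rather than taken from the cited work. It evaluates the quadratic form of the attention logits with the Laplacian, which equals the sum of squared differences across adjacent tiles.

```python
import numpy as np

def grid_adjacency(h, w):
    """4-neighbour adjacency matrix for an h x w grid of tiles (row-major)."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:  # right neighbour
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < h:  # neighbour below
                A[i, i + w] = A[i + w, i] = 1.0
    return A

def smoothness_energy(f, A):
    """Quadratic form f^T L f with the unnormalised Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    return float(f @ L @ f)

A = grid_adjacency(3, 3)
constant = np.ones(9)
# Checkerboard logits: every pair of adjacent tiles disagrees by 2.
checker = np.array([1.0 if (r + c) % 2 == 0 else -1.0
                    for r in range(3) for c in range(3)])

energy_constant = smoothness_energy(constant, A)
energy_checker = smoothness_energy(checker, A)
```

A constant attention map has zero energy and is therefore not penalised, while the checkerboard map, which changes sign across each of the 12 grid edges, receives a large energy and hence low prior density.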

4. Model Architecture and Training

PSA-MIL employs a modular architecture:

Instance Feature Extractor: A pre-trained CNN or transformer backbone encodes instances (e.g., ResNet50 for 512-dimensional tile embeddings) (Castro-Macías et al., 20 Jul 2025).
Spatial Attention Module: Implements spatially-modified attention via multi-head self-attention layers. Each head has its own learnable spatial decay parameter, and softmax normalization ensures output weights sum to one across all instances.
Aggregation and Prediction: Bag-level representations are constructed as weighted sums over instance features. These are passed to an MLP classifier for final label prediction (Peled et al., 20 Mar 2025, Castro-Macías et al., 20 Jul 2025).
Training Objective: The total objective combines a supervised loss (e.g., cross-entropy or negative log-likelihood) with regularization penalties:
- KL divergence between variational and spatial priors for attention (if using variational distributions).
- A diversity loss to encourage multi-head attention diversity by penalizing low entropy in the kernel density of learned spatial decay parameters (Peled et al., 20 Mar 2025).

Training is end-to-end with the Adam optimizer. In the variational formulation, stochastic attention samples are drawn with the reparameterization trick so that gradients can be backpropagated through the sampling step.
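
The sampling and aggregation steps can be sketched in a few lines of NumPy. The diagonal Gaussian over attention logits, the function names, and the toy shapes are assumptions for illustration. NumPy does not compute gradients, so the sketch shows only the forward pass that an automatic-differentiation framework would backpropagate through.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_attention_logits(mu, log_var, rng):
    """Reparameterised draw: f = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def bag_embedding(H, logits):
    """Bag representation as the attention-weighted sum of instance features."""
    a = softmax(logits)
    return a @ H, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # 6 instances, 4-dimensional features (toy sizes)
mu = np.zeros(6)              # hypothetical variational mean of the logits
log_var = np.full(6, -2.0)    # hypothetical variational log-variance

logits = sample_attention_logits(mu, log_var, rng)
z, a = bag_embedding(H, logits)
```

Because the noise `eps` is drawn independently of `mu` and `log_var`, the sampled logits are a deterministic, differentiable function of the variational parameters, which is what makes gradient-based training possible.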

5. Computational Scalability and Spatial Pruning

To mitigate the quadratic complexity ($O(N^2)$ in the number of instances $N$) of full self-attention, PSA-MIL adopts a spatial pruning strategy. A threshold $\tau$ on the decay function is used to limit the receptive field for each query, retaining only nearby keys:

$\mathcal{J}_i = \{ j : f(d_{ij}|\theta) \geq \tau \}$

This restriction typically reduces cost to $O(NK)$, where $K$ is the number of neighbors within the threshold radius. This pruning is dynamic, as the cutoff radius implied by $\tau$ adapts as spatial decay parameters are updated during training (Peled et al., 20 Mar 2025).
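
A minimal sketch of threshold-based pruning follows, assuming an exponential decay $\exp(-d/\theta)$, for which the cutoff radius has the closed form $-\theta \ln \tau$. The function names, the grid layout, and the threshold value are hypothetical.

```python
import numpy as np

def cutoff_radius(theta, tau):
    """Distance at which the exponential decay exp(-d / theta) falls to tau."""
    return -theta * np.log(tau)

def pruned_neighbors(coords, theta, tau):
    """For each tile i, indices j with decay >= tau, i.e. d_ij <= cutoff radius."""
    radius = cutoff_radius(theta, tau)
    # Dense distances keep the sketch short; a spatial index (e.g. a k-d tree)
    # would be needed to avoid the O(N^2) cost that this step still pays.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return [np.flatnonzero(row <= radius) for row in dist]

# 10 x 10 grid of tile coordinates with unit spacing (row-major order).
coords = np.array([(x, y) for x in range(10) for y in range(10)], dtype=float)
near = pruned_neighbors(coords, theta=1.0, tau=0.2)  # radius about 1.61
far = pruned_neighbors(coords, theta=2.0, tau=0.2)   # radius about 3.22
```

With `theta=1.0` an interior tile keeps only itself and its 8 surrounding tiles. Doubling `theta` doubles the radius and enlarges the retained set, which illustrates how the receptive field would follow the decay parameter as it is learned.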

6. Uncertainty Quantification and Interpretability

The probabilistic nature of PSA-MIL enables robust quantification of predictive uncertainty. In formulations that define a posterior (Gaussian or variational) over attention scores, both mean and variance maps are available at the instance level. At prediction time, multiple samples of the attention vector $\mathbf{f}$ are drawn to compute bag-level class probabilities and uncertainty estimates:

$p(y|X) \approx \frac{1}{S} \sum_{s=1}^{S} p(y|X, \mathbf{f}^{(s)}), \quad \mathbf{f}^{(s)} \sim q(\mathbf{f}|X)$

Mean attention highlights likely positive regions, while variance identifies ambiguous or unreliable predictions, enabling localization and quality control, particularly valuable in medical imaging tasks where interpretability and reliability are required (Schmidt et al., 2023, Castro-Macías et al., 20 Jul 2025).
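
The Monte Carlo procedure can be sketched as below. The diagonal Gaussian posterior over attention logits and the logistic bag classifier with weights `w` and bias `b` are hypothetical stand-ins chosen to keep the example self-contained. They are not the architecture of any cited paper.

```python
import numpy as np

def mc_bag_prediction(H, mu, sigma, w, b, n_samples, rng):
    """Monte Carlo estimate of the bag probability and attention statistics."""
    probs = np.empty(n_samples)
    attn = np.empty((n_samples, H.shape[0]))
    for s in range(n_samples):
        f = mu + sigma * rng.standard_normal(mu.shape)  # one draw of the logits
        a = np.exp(f - f.max())
        a /= a.sum()                                    # attention weights
        z = a @ H                                       # bag embedding
        probs[s] = 1.0 / (1.0 + np.exp(-(z @ w + b)))   # logistic classifier
        attn[s] = a
    return probs.mean(), probs.var(), attn.mean(axis=0), attn.var(axis=0)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 4))   # 8 instances, 4-dimensional features (toy sizes)
w = rng.normal(size=4)
mu = rng.normal(size=8)

# Uncertain attention posterior versus a degenerate (zero-variance) one.
p_mean, p_var, a_mean, a_var = mc_bag_prediction(H, mu, np.ones(8), w, 0.0, 200, rng)
_, p_var_det, _, a_var_det = mc_bag_prediction(H, mu, np.zeros(8), w, 0.0, 200, rng)
```

Here `a_mean` and `a_var` play the role of the instance-level mean and variance maps, and `p_var` is a simple bag-level uncertainty score. When the posterior collapses to a point (`sigma` equal to zero), both variances vanish, as expected.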

7. Empirical Results and Limitations

PSA-MIL has been evaluated extensively on high-resolution WSI and CT datasets, including TCGA-CRC, TCGA-STAD, PANDA, RSNA, and CAMELYON16, achieving state-of-the-art AUCs and competitive accuracy/F1 against both contextual (TransMIL, GTP, SM-MIL, Bayes-MIL) and non-contextual (ABMIL, CLAM, DTFD-MIL, IBMIL) baselines. Spatially-aware variants (with Gaussian or exponential priors) consistently outperform standard attention-MIL (Peled et al., 20 Mar 2025, Castro-Macías et al., 20 Jul 2025). PSA-MIL models also demonstrate reduced FLOPs and parameter counts due to spatial pruning.

Empirical findings confirm that uncertainty estimates provided by PSA-MIL correlate with the risk of incorrect predictions. For example, misclassified bags show elevated prediction variance, and flagging high-uncertainty cases boosts effective performance metrics (Schmidt et al., 2023).

Current limitations include the use of a single-layer attention module and restriction to simple decay families for the spatial prior. Future directions include exploring deeper attention stacks, more expressive kernel families, and leveraging histology-driven anatomical priors for further refinement (Peled et al., 20 Mar 2025).

References:

"Probabilistic Attention based on Gaussian Processes for Deep Multiple Instance Learning" (Schmidt et al., 2023)
"PSA-MIL: A Probabilistic Spatial Attention-Based Multiple Instance Learning for Whole Slide Image Classification" (Peled et al., 20 Mar 2025)
"Probabilistic smooth attention for deep multiple instance learning in medical imaging" (Castro-Macías et al., 20 Jul 2025)