Coarse Guidance Network (CGN)
- Coarse Guidance Network (CGN) is a module that injects coarse spatial context into high-resolution patch features to enhance slide-level predictions in MIL frameworks.
- It remaps instance features to a coarse grid using field-of-view driven binning and processes them through a lightweight convolutional head to compute a guidance map.
- Empirical evaluations show that incorporating CGNs improves biomarker classification AUCs while maintaining low parameter and computational overhead.
A Coarse Guidance Network (CGN) is a module designed to learn and inject spatial contextual information at a coarser scale into high-magnification instance features within Multiple-Instance Learning (MIL) frameworks for whole-slide image (WSI) analysis. The CGN operates via grid-based remapping of instance features and a lightweight convolutional head to produce a coarse guidance map, which is then used to modulate the instance features before final attention-based aggregation. This approach enables progressive multi-scale context modeling in computational pathology tasks, offering a parameter-efficient mechanism for slide-level prediction enhancement while maintaining computational tractability (Wu et al., 2 Feb 2026).
1. Architectural Overview
The CGN processes high-magnification patch features $H \in \mathbb{R}^{N \times D}$ (one row per patch) and their normalized spatial coordinates $P \in [0,1]^{N \times 2}$. Its core workflow includes three sequential steps:
- Grid-based Remapping: High-magnification features $H$ are aggregated into a 3D coarse feature map $F_s$ based on spatial bin assignments determined by a selectable field-of-view (FOV) parameter $s$.
- Convolutional Guidance Head: $F_s$ is processed by two 3×3 convolutions with ReLU activations and a 1×1 convolution with Sigmoid activation to yield the coarse guidance map $G_s$.
- Patch-level Gating: $G_s$ is flattened and indexed by each patch's bin assignment to obtain per-patch guidance $g$, which gates each corresponding row in $H$, resulting in modulated features $\tilde{H}$.
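The workflow can be summarized as:
patch features + coordinates -> grid-based remapping (FOV) -> coarse feature map -> convolutional guidance head -> coarse guidance map -> patch-level gating -> modulated features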
2. Grid-based Remapping
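Instance features and coordinates are mapped to a coarse grid via field-of-view driven binning. For each instance $i$ with feature vector $h_i$ and normalized coordinates $(x_i, y_i) \in [0,1]^2$, the grid cell assignment $(u_i, v_i)$ is determined as: $u_i = \lfloor x_i W_s \rfloor$, $v_i = \lfloor y_i H_s \rfloor$, where the grid width $W_s$ and height $H_s$ are set by $s$, the selected FOV (a larger FOV gives a coarser grid).
The feature map $F_s$ is computed by averaging all high-magnification vectors falling into each bin: $F_s[v, u] = \frac{1}{|\Omega_{u,v}|} \sum_{i \in \Omega_{u,v}} h_i$, where $\Omega_{u,v}$ collects all instances assigned to grid cell $(u, v)$. In vectorized notation, with $H$ the $N \times D$ matrix of instance features and $A \in \{0,1\}^{H_s W_s \times N}$ the assignment matrix ($A_{k,i} = 1$ if and only if instance $i$ falls in cell $k$),
$\bar{F}_s = \mathrm{diag}(A \mathbf{1})^{-1} A H$ (over non-empty cells),
followed by reshaping $\bar{F}_s$ from $H_s W_s \times D$ to $H_s \times W_s \times D$.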
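The binning and per-cell averaging can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(x, y)` coordinate order, the choice to leave empty cells at zero, and in particular the `grid_shape_from_fov` helper (which assumes the grid size is the slide extent divided by the FOV, rounded up) are all assumptions made for the example.

```python
import math
import numpy as np

def grid_shape_from_fov(slide_w, slide_h, fov):
    """ASSUMPTION: one grid cell covers fov x fov pixels of the slide."""
    return math.ceil(slide_h / fov), math.ceil(slide_w / fov)  # (rows, cols)

def remap_to_grid(feats, coords, grid_h, grid_w):
    """Average instance features into a coarse (grid_h, grid_w, D) map.

    feats:  (N, D) high-magnification patch features.
    coords: (N, 2) normalized (x, y) coordinates in [0, 1].
    Returns the coarse map and each instance's flat cell index.
    """
    d = feats.shape[1]
    # Floor the scaled coordinates; clip so x == 1.0 or y == 1.0 lands in the last cell.
    col = np.clip(np.floor(coords[:, 0] * grid_w).astype(int), 0, grid_w - 1)
    row = np.clip(np.floor(coords[:, 1] * grid_h).astype(int), 0, grid_h - 1)
    cell = row * grid_w + col                       # flat cell index per instance

    sums = np.zeros((grid_h * grid_w, d))
    np.add.at(sums, cell, feats)                    # scatter-add features per cell
    counts = np.bincount(cell, minlength=grid_h * grid_w)
    # Mean over occupied cells; empty cells stay at zero (an assumption).
    means = sums / np.maximum(counts, 1)[:, None]
    return means.reshape(grid_h, grid_w, d), cell

# Toy usage: three patches on a 2 x 2 grid.
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
coords = np.array([[0.1, 0.1], [0.2, 0.3], [0.9, 0.9]])
coarse, cell = remap_to_grid(feats, coords, grid_h=2, grid_w=2)
```

The returned `cell` index is what later lets each patch look up its own entry in the flattened guidance map.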
3. Convolutional Guidance Computation
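After remapping, the coarse feature map $F_s$ is passed through three sequential convolutions:
$Z_1 = \mathrm{ReLU}(\mathrm{Conv}_{3 \times 3}(F_s))$
$Z_2 = \mathrm{ReLU}(\mathrm{Conv}_{3 \times 3}(Z_1))$
$G_s = \sigma(\mathrm{Conv}_{1 \times 1}(Z_2))$
Here, the same hidden channel width $C_h$ is used for all CGN blocks. No self-attention or Transformer module is included; the head is purely convolutional.
The guidance map $G_s$ is flattened to length $H_s W_s$ (the number of grid cells), and each instance $i$ gathers its coarse guidance value $g_i$ according to its assigned index. The final gated features are $\tilde{h}_i = g_i \odot h_i$, where $h_i$ is the feature vector of instance $i$.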
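A small NumPy sketch of the guidance head and the gather-and-gate step is shown below. It assumes a single-channel guidance map (one scalar gate per cell), zero padding so the grid size is preserved, and randomly initialized weights; the helper names (`conv2d_same`, `guidance_head`, `gate_features`, `init_params`) and the hidden width of 16 are hypothetical choices for illustration, not values from the source.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Zero-padded 'same' convolution. x: (H, W, Cin); w: (k, k, Cin, Cout); b: (Cout,)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3])) + b
    for dy in range(k):
        for dx in range(k):
            # Contribution of kernel tap (dy, dx) to every output position.
            out += xp[dy:dy + h, dx:dx + wd, :] @ w[dy, dx]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def guidance_head(coarse, params):
    """Two 3x3 conv + ReLU layers, then a 1x1 conv + sigmoid."""
    z = np.maximum(conv2d_same(coarse, *params["conv1"]), 0.0)
    z = np.maximum(conv2d_same(z, *params["conv2"]), 0.0)
    return sigmoid(conv2d_same(z, *params["conv3"]))    # (H, W, 1), values in (0, 1)

def gate_features(feats, guidance, cell):
    """Gather each patch's gate by flat cell index and scale its feature row."""
    g = guidance.reshape(-1, guidance.shape[-1])[cell]  # (N, 1)
    return g * feats

def init_params(d, hidden, rng):
    """Random weights for the three layers (illustrative initialization only)."""
    def layer(k, cin, cout):
        return rng.normal(0.0, 0.1, (k, k, cin, cout)), np.zeros(cout)
    return {"conv1": layer(3, d, hidden),
            "conv2": layer(3, hidden, hidden),
            "conv3": layer(1, hidden, 1)}

# Toy usage: a 4 x 5 coarse map with D = 8 and six patches.
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 5, 8))
params = init_params(d=8, hidden=16, rng=rng)
guidance = guidance_head(coarse, params)
feats = rng.normal(size=(6, 8))
cell = np.array([0, 0, 3, 7, 19, 19])
gated = gate_features(feats, guidance, cell)
```

Patches that share a grid cell receive the same gate, which is how coarse context reaches the high-magnification features.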
4. Integration with Attention-based MIL
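In standard attention-based MIL settings, instance embeddings $H$ propagate through an attention aggregator $\mathcal{A}$ to yield slide-level predictions: $\hat{y} = \mathcal{A}(H)$. Installing a CGN at scale $s$ updates $H$ as: $H' = \mathrm{CGN}_s(H, P)$, where $P$ holds the normalized patch coordinates and $\mathrm{CGN}_s$ denotes remapping, guidance computation, and gating at that scale.
Stacking multiple CGNs (for example, at FOVs $s_1, \dots, s_K$ such as 1536, 2048, and 3072) results in a progressive series of residual updates: $H^{(k)} = \mathrm{CGN}_{s_k}(H^{(k-1)}, P)$ for $k = 1, \dots, K$, with $H^{(0)} = H$. The final $H^{(K)}$ is then input to attention modules such as ABMIL, DSMIL, CLAM-SB, or CLAM-MB, which conduct the slide-level aggregation.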
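The stacking pattern can be illustrated with a short NumPy sketch. Several things here are assumptions rather than details from the source: the residual update is taken to be `H <- H + g * H`, each CGN is replaced by a toy stand-in (`make_toy_cgn`) that gates a patch by the sigmoid of its cell's mean activation, and the aggregator is a plain ABMIL-style attention pooling with hypothetical weights `v` and `w`.

```python
import numpy as np

def make_toy_cgn(cell, n_cells):
    """Hypothetical stand-in for a CGN: per-patch gate from its cell's mean activation."""
    def gate_fn(h):
        sums = np.zeros(n_cells)
        np.add.at(sums, cell, h.mean(axis=1))
        counts = np.maximum(np.bincount(cell, minlength=n_cells), 1)
        return 1.0 / (1.0 + np.exp(-(sums / counts)[cell]))   # (N,), in (0, 1)
    return gate_fn

def apply_cgns(feats, gate_fns):
    """Apply the CGNs in sequence; each one sees the previous output.

    ASSUMPTION: the residual update is H <- H + g * H.
    """
    h = feats
    for gate_fn in gate_fns:
        g = gate_fn(h)
        h = h + g[:, None] * h
    return h

def abmil_pool(feats, v, w):
    """ABMIL-style pooling: attention = softmax(w^T tanh(V^T h_i))."""
    scores = np.tanh(feats @ v) @ w
    scores = scores - scores.max()          # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()
    return attn @ feats, attn               # slide embedding (D,), weights (N,)

# Toy usage: five patches, two scales with different cell assignments.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
cgns = [make_toy_cgn(np.array([0, 0, 1, 2, 2]), 3),   # finer grid
        make_toy_cgn(np.array([0, 0, 0, 1, 1]), 2)]   # coarser grid
guided = apply_cgns(feats, cgns)
slide_embedding, attn = abmil_pool(guided, rng.normal(size=(4, 3)), rng.normal(size=3))
```

Because the update only rescales each row, the aggregator's interface is unchanged; any of the attention modules named above could take `guided` in place of the original features.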
5. Training Protocol and Hyperparameters
Training details for CGN-based models are as follows:
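- Losses: Biomarker tasks (ER, PR, HER2 status) use cross-entropy loss. Prognosis tasks (CRC Surv) use a negative log-likelihood loss (NLLSurvLoss) that combines censored and uncensored terms. In the usual discrete-time form of this loss, with per-bin hazards $h_k$, survival $S_k = \prod_{j \le k} (1 - h_j)$, event-time bin $t$, and censoring indicator $c \in \{0, 1\}$:
$L_{\text{uncens}} = -(1 - c)\,[\log S_{t-1} + \log h_t]$
$L_{\text{cens}} = -c \log S_t$
$L = L_{\text{cens}} + L_{\text{uncens}}$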
- Optimizer: AdamW with a cosine-decay learning-rate scheduler (initial learning rate as reported by Wu et al.).
- Early stopping: patience = 10.
- Epochs: maximum 150.
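- FOV choices: At 20×, e.g., 1536, 2048, and 3072 pixels (providing 3 CGNs).
- Hidden channels: the same hidden channel width is used in every CGN.
- Parameter and compute cost: Each CGN adds only a small number of parameters per scale. In the ablation below, going from one to three CGNs raises the reported parameter count from about 1.51M to about 2.18M, with FLOPs unchanged at about 17.7G.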
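For the prognosis loss, the following NumPy sketch implements the common discrete-time negative log-likelihood for a single slide. It is an assumption that the source uses exactly this unweighted form (some implementations additionally up-weight the uncensored term); the function name, argument names, and the `eps` stabilizer are illustrative.

```python
import numpy as np

def nll_surv_loss(hazards, label, censored, eps=1e-7):
    """Discrete-time survival negative log-likelihood for one slide.

    hazards:  (K,) per-bin hazard probabilities in (0, 1).
    label:    index of the time bin containing the event or censoring time.
    censored: 1 if the patient was censored, 0 if the event was observed.
    """
    surv = np.cumprod(1.0 - hazards)                   # S_k = prod_{j<=k} (1 - h_j)
    surv_prev = np.concatenate(([1.0], surv))[label]   # S_{label-1}, with S_{-1} = 1
    # Observed event: survive up to the previous bin, then fail in this one.
    uncens = -(1 - censored) * (np.log(surv_prev + eps) + np.log(hazards[label] + eps))
    # Censored: only known to have survived through this bin.
    cens = -censored * np.log(surv[label] + eps)
    return uncens + cens

# Toy usage with three time bins.
hazards = np.array([0.1, 0.2, 0.3])
loss_event = nll_surv_loss(hazards, label=1, censored=0)
loss_censored = nll_surv_loss(hazards, label=1, censored=1)
```

With the same predicted hazards, an observed event in bin 1 is penalized more than a censored observation in that bin, because the model must also assign probability to the failure itself.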
6. Empirical Performance and Ablation Results
Empirical studies isolating the CGN demonstrate a consistent benefit on multiple biomarker classification tasks. For instance, a single CGN (FOV=1536) added to ABMIL (using CONCH features) produces:
| System | ER AUC (%) | PR AUC (%) | HER2 AUC (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| ABMIL w/o CGN (single-scale 20×) | 87.22 | 84.14 | 80.06 | — | — |
| ABMIL + single CGN (FOV=1536) | 88.92 | 84.76 | 80.84 | ~1.51 | ~17.7 |
| ABMIL + three CGNs ([1536,2048,3072]) | 89.76 | 85.24 | 82.86 | ~2.18 | ~17.7 |
| ABMIL + five CGNs ([1024,1536,2048,2560,3072]) | 91.42 | 84.18 | 84.62 | — | — |
Adding at least one CGN raises slide-level AUC on all three tasks: a single scale gains +1.70pp (ER), +0.62pp (PR), and +0.78pp (HER2) over the ABMIL baseline. Stacking multiple CGNs for progressive multi-scale guidance improves ER and HER2 further (with five CGNs, +4.20pp for ER and +4.56pp for HER2), whereas PR peaks with three CGNs (+1.10pp) and falls back to +0.04pp with five. CGNs achieve these gains at lower parameter and compute cost than methods such as concatenation or cross-scale attention schemes (the parameter and FLOP figures for that comparison are reported by Wu et al.), while delivering larger accuracy improvements (+3.6pp ER, +4.05pp HER2).
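The per-task gains can be recomputed directly from the table above. The snippet below only does that arithmetic; the dictionary layout and variable names are illustrative.

```python
# AUC values (%) copied from the ablation table.
baseline = {"ER": 87.22, "PR": 84.14, "HER2": 80.06}
configs = {
    "single CGN": {"ER": 88.92, "PR": 84.76, "HER2": 80.84},
    "three CGNs": {"ER": 89.76, "PR": 85.24, "HER2": 82.86},
    "five CGNs": {"ER": 91.42, "PR": 84.18, "HER2": 84.62},
}

# Gain in percentage points over the ABMIL baseline, rounded to two decimals.
gains = {
    name: {task: round(auc - baseline[task], 2) for task, auc in aucs.items()}
    for name, aucs in configs.items()
}

for name, per_task in gains.items():
    print(name, per_task)
```

The output makes the non-monotonic PR trend visible alongside the steadily increasing ER and HER2 gains.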
7. Summary of Properties
A CGN remaps high-magnification features to a spatially coarse grid, applies a three-layer convolutional head to compute a coarse guidance map, reprojects this map back to the patch level to gate the D-dimensional features, and is trained end-to-end via the same MIL objectives. Each CGN block is lightweight (a small hidden channel width and few parameters per scale), incurs minimal additional computation, and has been shown to improve slide-level prediction performance in clinical biomarker and prognosis tasks (Wu et al., 2 Feb 2026).