Shape-Adapting Gated Experts (SAGE)
- The paper introduces SAGE, an input-adaptive neural architecture that leverages a dual-path design and dynamic gating to address cellular heterogeneity in high-resolution cancer imaging.
- It employs a hierarchical routing mechanism with shared-expert and semantic affinity gates to balance generalization and fine-grained specialization.
- Empirical results show state-of-the-art Dice scores on multiple histopathology benchmarks, while also revealing trade-offs in computational overhead and hyperparameter sensitivity.
Shape-Adapting Gated Experts (SAGE) is an input-adaptive neural architecture designed to address the challenge of cellular heterogeneity in computer-aided cancer detection. By enabling dynamic routing among heterogeneous expert modules, SAGE transforms traditional static hybrid architectures—such as CNN-Transformer networks—into flexible, computation-efficient systems that adapt to variations in shape and scale present in gigapixel whole slide images (WSIs). The core innovation centers on a dual-path design featuring hierarchical gating and a harmonization hub, yielding state-of-the-art results for colorectal lesion segmentation across multiple histopathology benchmarks (Thai et al., 23 Nov 2025).
1. Architecture and Dual-Path Design
SAGE is instantiated within a U-Net–style encoder (SAGE-UNet) that augments a ConvNeXt CNN + ViT backbone with a two-branch structure at each encoder layer $\ell$:
- Main Path: Processes the input via the original backbone transformation, preserving pretrained representations.
- Expert Path: Implements a gated Mixture-of-Experts (MoE), dynamically selecting and weighting outputs from a bank of $N$ expert modules $\{E_i\}_{i=1}^{N}$. The expert-path output at layer $\ell$ is given by

$$\mathbf{y}^{\mathrm{MoE}}_{\ell} = \sum_{i \in \mathcal{T}_{\ell}} w_{i,\ell}\, E_i(\mathbf{x}_{\ell}),$$

where $\mathcal{T}_{\ell}$ is the set of selected (Top-$K$) expert indices and the routing weights satisfy $w_{i,\ell} \ge 0$ and $\sum_{i \in \mathcal{T}_{\ell}} w_{i,\ell} = 1$.
A central component—the Shape-Adapting Hub (SA-Hub)—ensures compatibility among convolutional and attention-based experts. Each expert’s input and output are transformed via learned adapters to match the semantic and structural domains across the CNN and Transformer modules.
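The dual-path layer and SA-Hub adapters described above can be sketched as follows. This is a minimal NumPy illustration under assumed shapes: the linear "adapters", the toy ReLU expert body, and the fixed fusion coefficient `alpha` are illustrative stand-ins, not the paper's actual modules.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_expert, n_experts = 64, 32, 4

# Main path: the pretrained backbone transform, here a toy linear map.
W_main = rng.standard_normal((d_in, d_in)) * 0.1

# SA-Hub adapters: per-expert input/output projections that reconcile each
# expert's native width (d_expert) with the shared feature width (d_in).
W_in = rng.standard_normal((n_experts, d_in, d_expert)) * 0.1
W_out = rng.standard_normal((n_experts, d_expert, d_in)) * 0.1

def expert_forward(x, i):
    """One expert wrapped in its SA-Hub input/output adapters."""
    h = x @ W_in[i]          # adapt into the expert's domain
    h = np.maximum(h, 0.0)   # expert body (toy ReLU block)
    return h @ W_out[i]      # adapt back to the shared domain

def dual_path_layer(x, weights, selected):
    """Fuse main-path and expert-path outputs by a convex combination."""
    main = x @ W_main
    moe = sum(weights[i] * expert_forward(x, i) for i in selected)
    alpha = 0.5              # fusion coefficient (learned in the real model)
    return (1.0 - alpha) * main + alpha * moe

x = rng.standard_normal((8, d_in))                  # 8 tokens/pixels
y = dual_path_layer(x, {0: 0.7, 2: 0.3}, [0, 2])    # two selected experts
print(y.shape)  # (8, 64)
```

Because both branches live in the same $d_{\text{in}}$-dimensional space after the output adapters, convolutional and attention-based experts can be mixed freely in the bank.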
2. Hierarchical Routing and Gating Mechanisms
SAGE’s routing operates in two hierarchical stages:
- Shared-Expert Gate: A sigmoid-activated gate determines the preference for shared (general) versus fine-grained (specialized) experts, leveraging a globally pooled summary of the current feature map:

$$g_{\ell} = \sigma\!\left(\mathbf{w}_g^{\top}\,\mathrm{GAP}(\mathbf{x}_{\ell})\right),$$

where $\mathrm{GAP}(\cdot)$ denotes global average pooling and $\sigma(\cdot)$ the sigmoid function.
- Semantic Affinity Routing (SAR): Routing logits for the $N$ experts are computed via scaled dot-products between the pooled feature summary and learned expert keys, augmented with adaptive, input-dependent noise:

$$r_i = \frac{\mathbf{k}_i^{\top}\,\mathrm{GAP}(\mathbf{x}_{\ell})}{\sqrt{d}} + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}\!\left(0,\ \sigma_i^2(\mathbf{x}_{\ell})\right),$$

where $\mathbf{k}_i$ is the learned key embedding of expert $E_i$ and the noise variance $\sigma_i^2(\cdot)$ is predicted from the input.
These logits are modulated based on the shared-expert gate using a binary mask and log-priors.
- Top-K Expert Selection: The system selects the indices of the top $K$ experts based on the gated logits:

$$\mathcal{T}_{\ell} = \operatorname{TopK}\!\left(\{\tilde{r}_i\}_{i=1}^{N},\ K\right),$$

where $\tilde{r}_i$ is the modulated routing logit for expert $i$.
- Load-Balancing Loss: To avoid expert underutilization (router collapse), SAGE incorporates a balancing loss:

$$\mathcal{L}_{\mathrm{balance}} = N \sum_{i=1}^{N} f_i\, P_i,$$

where $f_i$ is the fraction of tokens assigned to expert $i$ and $P_i$ is its mean routing probability; this term enters the total objective weighted by a coefficient $\lambda$.
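The two-stage routing above can be sketched end to end in NumPy. This is a hedged illustration: the softplus noise scale, the exact log-prior modulation, and all dimensions are assumptions chosen to match the mechanisms named in the text (sigmoid shared gate, scaled dot-product logits, input-dependent noise, binary mask with log-priors, Top-$K$), not the paper's precise parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, n_shared, d, K = 6, 2, 16, 2
shared_mask = np.array([1] * n_shared + [0] * (n_experts - n_shared))  # m_i

expert_keys = rng.standard_normal((n_experts, d))  # learned keys k_i
w_gate = rng.standard_normal(d)                    # shared-expert gate weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def route(pooled, train=True):
    """pooled: globally pooled feature summary of the current layer."""
    # 1) Shared-expert gate: preference for shared vs. specialized experts.
    g = sigmoid(w_gate @ pooled)
    # 2) Semantic-affinity logits: scaled dot products plus adaptive noise.
    logits = expert_keys @ pooled / np.sqrt(d)
    if train:
        noise_scale = np.log1p(np.exp(logits))     # softplus (assumed form)
        logits = logits + rng.standard_normal(n_experts) * noise_scale * 0.1
    # 3) Modulate with the gate via the binary mask and log-priors.
    logits = logits + shared_mask * np.log(g + 1e-9) \
                    + (1 - shared_mask) * np.log(1 - g + 1e-9)
    # 4) Top-K selection, then convex weights over the survivors.
    top = np.argsort(logits)[-K:]
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()

# Switch-style load balancing: f = token fractions, P = mean routing probs.
def balance_loss(f, P):
    return n_experts * float(np.sum(f * P))

top, w = route(rng.standard_normal(d))
print(top, w)
```

Note that a perfectly uniform router (`f_i = P_i = 1/N` for all `i`) gives `balance_loss = 1`, the minimum of this objective, so any skew toward a few experts is penalized.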
3. Dynamic Routing Algorithm
The SAGE layer-wise forward pass consists of:
- Initial feature extraction via the network stem.
- For each encoder layer $\ell$:
- Compute the main path output.
- Globally pool features for gating.
- Evaluate the shared-expert gate and SAR logits.
- Apply hierarchical modulation and select the Top-$K$ experts.
- Use input/output adapters for selected experts.
- Fuse main and expert-path outputs via convex combination.
- Accumulate the balancing loss.
- The final encoded representation is decoded, and supervised losses (cross-entropy, Dice) plus balancing loss are backpropagated.
This procedure enforces adaptive allocation of computation and expert specialization, biasing the model toward shared or specialized modules based on input statistics and learned gating.
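The supervised objective in the final step (cross-entropy plus Dice, plus the weighted balancing term) can be sketched for the binary-segmentation case. The soft-Dice formulation and the weight `lam` are common conventions assumed here, not values taken from the paper.

```python
import numpy as np

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss on foreground probabilities (binary segmentation)."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def bce_loss(prob, target, eps=1e-9):
    """Pixel-wise binary cross-entropy."""
    return float(-(target * np.log(prob + eps)
                   + (1.0 - target) * np.log(1.0 - prob + eps)).mean())

def total_loss(prob, target, balance, lam=0.01):
    """Supervised losses plus the lambda-weighted balancing loss."""
    return bce_loss(prob, target) + dice_loss(prob, target) + lam * balance

rng = np.random.default_rng(2)
prob = rng.uniform(0.01, 0.99, (32, 32))                  # predicted map
target = (rng.uniform(size=(32, 32)) > 0.5).astype(float)  # ground truth
loss = total_loss(prob, target, balance=1.2)
print(float(loss))
```

A perfect prediction drives the Dice term to zero, so the balancing term (scaled by `lam`) dominates only when segmentation quality is already high.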
4. Empirical Evaluation
Datasets
SAGE-UNet was evaluated on three prominent histopathology datasets:
| Dataset | Images | Description |
|---|---|---|
| EBHI | 795 | Adenocarcinoma, H&E biopsies |
| DigestPath | 660 (32,000 patches) | WSIs, challenge benchmark |
| GlaS | 165 | MICCAI gland segmentation |
Metrics and Results
Performance was assessed using Dice Similarity (DSC), Intersection-over-Union (IoU), and pixel accuracy. Notable SOTA Dice scores with sigmoid gating:
| Dataset | SAGE-UNet DSC (%) |
|---|---|
| EBHI | 95.57 |
| DigestPath | 95.16 |
| GlaS | 94.17 |
These results exceed all static backbone and hybrid baselines, with up to +1.7% improvement over prior bests on GlaS (Thai et al., 23 Nov 2025).
Ablation Findings
- Sigmoid gating consistently outperforms softmax (e.g., EBHI 95.57% vs. 95.05%).
- Increasing $K$ in Top-$K$ selection from 1 to 4 improves DSC by +5.4%.
- Scaling the number of shared experts from 1 to 4 yields a further +0.47%.
- Ablating SA-Hub incurs feature-mismatch errors and a 1–2% DSC drop.
5. Analysis: Strengths and Limitations
Advantages
- Input-Adaptive Computation: Dynamically allocates FLOPs to regions/inputs of interest.
- Shape and Scale Adaptation: Convolutional experts focus on local structures; Transformer experts address global context, enhancing robustness to heterogeneity.
- Hierarchical Routing: The two-stage gating balances generalization (shared experts) and specialization, improving both domain adaptation and fine-grained accuracy.
- Interpretability: Activation maps (e.g., via Grad-CAM) reveal spatially variable expert usage.
Limitations
- Increased Overhead: Gating, adapters, and additional experts introduce parameter and latency costs.
- Hyperparameter Sensitivity: Proper tuning of $K$ (Top-$K$), the shared-expert count, and the balancing-loss weight $\lambda$ is required per application.
- Potential Underutilization: Highly homogeneous inputs may not fully leverage expert diversity.
- Router Collapse: An insufficient balancing weight $\lambda$ can lead to mode collapse, wherein only a subset of experts remains active.
6. Prospects for Extension and Application
Outlined future directions include:
- Expansion to three-dimensional and multi-modal imaging domains (e.g., simultaneous MRI + CT data).
- Incorporation of dynamic routing within decoder stages for complex, multi-class semantic segmentation.
- Meta-learning approaches to optimize gating and routing hyperparameters automatically.
- Investigation of alternative gating/routing strategies optimized for sparsity or differentiability (including Gumbel-TopK and reinforcement learning approaches).
- Development of resource-constrained variants with lightweight adapters for edge deployment.
A plausible implication is that such architecture generalizes beyond histopathology to any task characterized by marked input heterogeneity or multi-scale structure.
7. Related Research and Context
SAGE stands as a refinement of Mixture-of-Experts methods in visual recognition, extending static CNN-Transformer hybrids by equipping them with hierarchical, input-adaptive expert selection at fine spatial granularity. The architectural motif of dual-path feature fusion and structural/semantic harmonization via modular hubs reflects ongoing trends in biomedical image segmentation for performance and efficiency gains in high-resolution, variable-appearance domains (Thai et al., 23 Nov 2025).