
PGCD: Prototype-Guided Calibration Distillation

Updated 13 November 2025
  • The paper introduces PGCD, a novel method that mitigates catastrophic forgetting by leveraging prototype-to-feature affinity in class-incremental tasks.
  • The methodology integrates spatial and class-aware guidance with dual-aligned prototype distillation to adaptively weight the distillation losses.
  • Empirical results on BTCV reveal improved Dice scores, demonstrating effective retention of old-class information alongside the integration of new anatomical structures.

Prototype-Guided Calibration Distillation (PGCD) is a technique developed to address catastrophic forgetting in class-incremental learning scenarios, particularly in medical image segmentation. Unlike conventional distillation approaches, which apply uniform constraints across the entire feature space, PGCD adaptively calibrates the distillation process by leveraging the affinity between learned prototypes and feature activations. This methodology integrates spatial- and class-aware guidance into the knowledge transfer process, thereby preserving fine-grained knowledge of previously learned classes while mitigating the degradation of old information when new classes are introduced.

1. Motivation and Background

In class-incremental medical image segmentation (CIMIS), a segmentation model must learn new anatomical structures (classes) over time, adapting its knowledge without access to old-class labels. Standard knowledge distillation strategies—such as uniform pixelwise Kullback-Leibler (KL) divergence—fail to account for the spatial heterogeneity and class-specific relevance of features. These “one-size-fits-all” approaches often lead to suboptimal retention of old-class information, as equal penalization across all spatial locations may suppress distinctive cues crucial for accurate segmentation boundaries. Furthermore, purely prototype-based replay methods typically rely on aligning newly extracted (local) prototypes to previously aggregated (global) prototypes, neglecting the distributional and contextual drift that arises as the dataset evolves through incremental steps.

2. Key Principles and Mathematical Framework

PGCD operates by calibrating the strength of the distillation signal via prototype-to-feature similarity measures. For each spatial location in the feature map, PGCD estimates the affinity between the local feature vector and class prototypes—these prototypes (centroids) are maintained as moving averages of feature representations for each class across the training trajectory. The core components of the mathematical formulation are as follows:

Let $f^{(t-1)}_\theta$ and $f^{(t)}_\theta$ denote the frozen reference model (at the previous timestep) and the current model, respectively. The set of classes is partitioned as $C^{1:t-1}$ (old) and $C^t$ (new). Feature maps $F^\ast \in \mathbb{R}^{V \times K}$ are extracted, where $V$ is the number of spatial locations and $K$ is the channel dimension. For a given class $c$ and mini-batch, the local prototype is defined as $\hat{p}_c = \frac{1}{|F_c|} \sum_{f_i \in F_c} f_i$, where $F_c = \{ f_i \mid Y(i) = c \}$ and $Y(i)$ is the pseudo-label assignment. The global prototype $p_c$ is updated by a cumulative moving average, $p_c \leftarrow \frac{N_c^{pre} \cdot p_c^{pre} + |F_c| \cdot \hat{p}_c}{N_c^{pre} + |F_c|}$, where $N_c^{pre}$ is the running sample count prior to the update. The "background" prototype is recalculated at each step, treating label $0$ as an ordinary class in prototype extraction, to avoid the static-background artifact that arises from the evolving definition of "background."
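A minimal sketch of this prototype bookkeeping, assuming PyTorch tensors of flattened per-voxel features and pseudo-labels; the variable names (`features`, `labels`, `protos`, `counts`) are hypothetical and the paper's implementation may differ.

```python
import torch

def update_prototypes(features, labels, protos, counts):
    """Cumulative moving-average update of global class prototypes.

    features: (V, K) feature vectors for all spatial locations in the batch
    labels:   (V,)   pseudo-label assignment Y(i) per location
    protos:   dict {class_id: (K,) global prototype p_c}
    counts:   dict {class_id: running sample count N_c}
    """
    for c in labels.unique().tolist():
        mask = labels == c
        f_c = features[mask]                      # F_c = {f_i | Y(i) = c}
        p_hat = f_c.mean(dim=0)                   # local prototype \hat{p}_c
        n_new = mask.sum().item()                 # |F_c|
        if c not in protos:                       # first time this class is observed
            protos[c], counts[c] = p_hat, n_new
        else:
            n_pre = counts[c]
            protos[c] = (n_pre * protos[c] + n_new * p_hat) / (n_pre + n_new)
            counts[c] = n_pre + n_new
    return protos, counts
```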

For each voxel and class, PGCD computes the cosine similarity between the feature and the prototype, and uses this similarity as an attention weight to calibrate the region-wise distillation loss. Regions receiving higher affinity inherit stronger distillation constraints, focusing knowledge preservation on spatial locations that are more relevant to the old class.
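A sketch of the affinity-calibrated voxelwise distillation term for a single old-class prototype, assuming a temperature-scaled softmax and non-negative affinities used as attention weights; this illustrates the calibration mechanism rather than reproducing the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def calibrated_kl(logits_new, logits_old, features, proto_c, T=1.0):
    """Voxelwise KL distillation weighted by prototype-to-feature cosine affinity.

    logits_new, logits_old: (V, C) current-model / frozen-model logits
    features:               (V, K) current-model feature vectors
    proto_c:                (K,)   prototype of the old class of interest
    """
    # cosine similarity between each feature vector and the class prototype
    affinity = F.cosine_similarity(features, proto_c.unsqueeze(0), dim=1)   # (V,)
    weight = affinity.clamp(min=0)   # assumption: negative affinity contributes no constraint

    # per-voxel KL divergence between teacher and student distributions
    kl = F.kl_div(
        F.log_softmax(logits_new / T, dim=1),
        F.softmax(logits_old / T, dim=1),
        reduction="none",
    ).sum(dim=1)                                                            # (V,)

    # higher-affinity regions inherit a stronger distillation constraint
    return (weight * kl).sum() / weight.sum().clamp(min=1e-6)
```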

3. Algorithmic Workflow

The training procedure at incremental step $t$ can be summarized as:

  1. Freeze the previous model $f^{(t-1)}_\theta$; retrieve global prototypes $\{p_c\}$ for all old classes and the background.
  2. For each training batch:
     a. Extract feature maps $F^{t-1}$ from the frozen model; compute old local prototypes $\{\hat{p}_c^{\,t-1}\}$.
     b. Extract feature maps $F^{t}$ from the current model; compute current local prototypes $\{\hat{p}_c^{\,t}\}$.
     c. Update global prototypes via the cumulative moving average.
     d. Generate pseudo-labels and spatial region masks distinguishing old-class from current-class regions (a minimal sketch of this step follows the list).
     e. Compute voxelwise affinities and regionwise calibrated distillation losses.
     f. Aggregate the loss terms and backpropagate to update $\theta$.
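The pseudo-label and region-mask step might look as follows, assuming the frozen model emits logits over the background plus old classes and the incremental ground truth labels only the new classes (label 0 elsewhere); names and the thresholding convention are illustrative.

```python
import torch

def pseudo_labels_and_regions(logits_old, gt_new, old_classes, tau=0.7):
    """Build pseudo-labels and old/current region masks for one flattened batch.

    logits_old:  (V, C) frozen-model logits over background + old classes
    gt_new:      (V,)   ground truth containing only new-class labels (0 elsewhere)
    old_classes: list of old-class indices (excluding background)
    tau:         confidence threshold below which unlabeled voxels are ignored
    """
    prob_old = logits_old.softmax(dim=1)
    conf, pred_old = prob_old.max(dim=1)

    pseudo = torch.where(gt_new > 0, gt_new, pred_old)   # new ground truth overrides old predictions
    ignore = (gt_new == 0) & (conf < tau)                # low-confidence, unlabeled voxels

    old_region = torch.zeros_like(pseudo, dtype=torch.bool)
    for c in old_classes:
        old_region |= (pseudo == c)                      # voxels attributed to old classes
    cur_region = gt_new > 0                              # voxels claimed by the new classes
    return pseudo, old_region, cur_region, ignore
```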

PGCD supports integration with Dual-Aligned Prototype Distillation (DAPD), which supplements pixelwise logit-level distillation by enforcing alignment between local prototypes (batch-specific) and both prior global prototypes (long-term semantic centers) and local prototypes from the reference model (batch-contextual knowledge).
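A sketch of the dual alignment idea, assuming alignment is enforced with a cosine-distance penalty and per-term weights $\lambda_\mathrm{lg}$ (local-to-global) and $\lambda_\mathrm{ll}$ (local-to-local); the paper's exact distance and weighting may differ.

```python
import torch
import torch.nn.functional as F

def dapd_loss(local_cur, local_ref, global_protos, lam_lg=1.0, lam_ll=1.0):
    """Dual-Aligned Prototype Distillation: align batch prototypes to two references.

    local_cur:     dict {c: (K,)} batch-local prototypes from the current model
    local_ref:     dict {c: (K,)} batch-local prototypes from the frozen model
    global_protos: dict {c: (K,)} long-term global prototypes
    """
    loss, n = 0.0, 0
    for c, p_cur in local_cur.items():
        # local-to-global: pull batch prototypes toward long-term semantic centers
        if c in global_protos:
            loss = loss + lam_lg * (1 - F.cosine_similarity(p_cur, global_protos[c], dim=0))
        # local-to-local: match the frozen model's batch-contextual prototypes
        if c in local_ref:
            loss = loss + lam_ll * (1 - F.cosine_similarity(p_cur, local_ref[c], dim=0))
        n += 1
    return loss / max(n, 1)
```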

4. Loss Function Integration

PGCD defines two regionwise distillation losses: the old-region calibrated distillation loss ($L_\mathrm{orcd}$) and the current-region calibrated distillation loss ($L_\mathrm{crcd}$). The total objective at incremental step $t$ is $L_\mathrm{total} = L_\mathrm{ce} + \lambda_\mathrm{orcd} \cdot L_\mathrm{orcd} + \lambda_\mathrm{crcd} \cdot L_\mathrm{crcd} + L_\mathrm{dapd}$, where:

  • $L_\mathrm{ce}$ is the unbiased cross-entropy loss.
  • $L_\mathrm{orcd}$ and $L_\mathrm{crcd}$ are the old- and current-region calibrated KL-distillation losses, weighted by $\lambda_\mathrm{orcd}$ and $\lambda_\mathrm{crcd}$.
  • $L_\mathrm{dapd}$ is the DAPD loss, which regularizes prototype alignment.

Key hyperparameters include the regionwise loss weights ($\lambda_\mathrm{orcd}$, $\lambda_\mathrm{crcd}$), the prototype alignment weights ($\lambda_\mathrm{ll}$, $\lambda_\mathrm{lg}$ for DAPD), the pseudo-label uncertainty threshold ($\tau$, e.g., 0.7), and the learning rate (0.01).
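Assembling the objective is then a weighted sum of scalar loss tensors; the sketch below is a trivial combination with illustrative default weights (the $\lambda$ values are placeholders, not the paper's settings).

```python
def total_objective(l_ce, l_orcd, l_crcd, l_dapd, lam_orcd=1.0, lam_crcd=1.0):
    """L_total = L_ce + lam_orcd * L_orcd + lam_crcd * L_crcd + L_dapd."""
    return l_ce + lam_orcd * l_orcd + lam_crcd * l_crcd + l_dapd
```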

5. Implementation Details and Computational Considerations

PGCD and DAPD were implemented on a Swin UNETR backbone, with $K = 96$ feature channels and $V = H \cdot W \cdot D$ voxels per slice or 3D patch. Prototype storage requires a single $K$-dimensional vector per class (8–16 classes), yielding negligible memory overhead (<1 MB). Cosine affinity computation for each voxel is $O(K)$, and in-batch prototype updates are $O(K \cdot N_c)$. An efficient implementation uses in-place prototype updates and matrix multiplication for the affinity calculations; a batch size of 2 was adopted. Memory overhead is minimal, as only the per-batch local prototypes and the global prototypes are retained.
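A quick back-of-the-envelope check of the prototype memory claim, assuming fp32 storage and the upper end of the stated class count:

```python
# One K-dimensional fp32 prototype vector per class.
n_classes, K, bytes_per_float = 16, 96, 4
total_bytes = n_classes * K * bytes_per_float
print(total_bytes)  # 6144 bytes (~6 KB), comfortably under the stated 1 MB
```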

6. Empirical Results and Benchmarking

On BTCV (Beyond the Cranial Vault), with a challenging 4-4 split across eight abdominal organ classes, the addition of PGCD and DAPD improved the old-class Dice from approximately 89.3% (SDR baseline) to 92.5%, new-class Dice to 88.7%, and overall Dice to 90.6%. Ablation studies showed that aligning only to global prototypes yielded a +0.96% DSC gain, aligning only to local prototypes yielded +0.64%, and the combined dual alignment gave +1.48%. Across diverse incremental scenarios on BTCV and WORD, the combined PGCD + DAPD approach outperformed prior state-of-the-art methods (MiB, SDR, PLOP, CoNuSeg) by 1–2% absolute Dice on both old-class retention and aggregate segmentation performance. Qualitative analyses confirmed that organ boundaries (e.g., kidneys, spleen, liver) remain well preserved after multiple incremental learning steps, suggesting that prototype-calibrated distillation robustly preserves detailed anatomical knowledge across incremental updates.

PGCD addresses fundamental limitations of uniform distillation and naïve prototype-replay in class-incremental segmentation, offering a spatially and semantically adaptive transfer mechanism. Its deployment in medical imaging tasks demonstrates resilience to label drift, background evolution, and knowledge degradation common in continual learning. This approach establishes a general template for prototype-guided distillation strategies in structured prediction tasks where regionwise and classwise granularity is critical. A plausible implication is that prototype-guided calibration, when combined with dual-aligned prototype distillation, can be extended to multi-modal and cross-domain continual learning frameworks, further enhancing robustness to distributional and semantic shift (Zhu et al., 11 Nov 2025).
