Continual Alignment for SAM (CA-SAM)

  • The paper introduces CA-SAM, which utilizes a task-specific lightweight alignment layer and a VAE-based routing mechanism to adapt SAM for medical segmentation while preventing catastrophic forgetting.
  • CA-SAM achieves state-of-the-art segmentation performance with superior Avg-IoU and reduced GFLOPs compared to other continual learning methods.
  • Empirical evaluations on nine diverse medical imaging datasets demonstrate robust continual adaptation and nearly zero degradation on out-of-distribution tasks.

Continual Alignment for SAM (CA-SAM) is a continual learning paradigm designed to adapt the Segment Anything Model (SAM) to streaming medical image segmentation tasks while effectively mitigating catastrophic forgetting and maintaining state-of-the-art segmentation performance under strict parameter and computational constraints. CA-SAM is centered on the introduction of a lightweight, task-specific Alignment Layer and a VAE-based task routing mechanism, enabling robust continual adaptation across highly heterogeneous domains without replay or fine-tuning of the core SAM backbone (Wang et al., 21 Nov 2025).

1. Framework Architecture and Forward Pass

CA-SAM builds on a frozen SAM foundation, composed of a Vision Transformer (ViT)-based encoder $E(\cdot)$ and a mask decoder $D(\cdot)$, neither of which is updated during continual learning. For each incoming segmentation task $t$, a lightweight, trainable Alignment Layer $A_t(\cdot)$ is inserted between the encoder and decoder. Only $A_t$ is updated for task $t$; this decoupling preserves SAM’s strong zero-shot priors and computational efficiency. The forward pass for an image $I$ on task $t$ is

$$Z = E(I), \qquad \tilde{Z} = A_t(Z), \qquad \hat{y} = D(\tilde{Z}).$$

Each task $t$ also maintains a dedicated variational autoencoder $V_t$ that explicitly models and scores the distribution of encoder features, enabling accurate, automatic task routing at inference.
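
A minimal PyTorch sketch of this decoupling, assuming `encoder`, `decoder`, and `align_layers` stand in for $E$, $D$, and the per-task $A_t$ modules (the real SAM mask decoder also takes prompt embeddings, omitted here for brevity):

```python
import torch.nn as nn

class CASAMForward(nn.Module):
    """Frozen SAM encoder/decoder wrapped around trainable per-task
    alignment layers. Module names are assumptions, not the authors' code."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, align_layers):
        super().__init__()
        self.encoder = encoder                            # E(.), frozen
        self.decoder = decoder                            # D(.), frozen
        self.align_layers = nn.ModuleList(align_layers)   # one A_t per task
        for module in (self.encoder, self.decoder):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, image, task_id: int):
        z = self.encoder(image)                   # Z = E(I)
        z_tilde = self.align_layers[task_id](z)   # Z~ = A_t(Z)
        return self.decoder(z_tilde)              # y^ = D(Z~)
```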

2. Mathematical Formulation of the Alignment Layer

The Alignment Layer $A_t$ is realized as a compact stack of Channel-Attention Residual Blocks (CAResBlocks). For each spatial location $(h,w)$ and batch index $b$, the block operates on the encoder feature $Z_{b,:,h,w} \in \mathbb{R}^C$:

$$f_a = W f_e + b$$

where $f_e$ is the $C$-dimensional encoder feature, $W \in \mathbb{R}^{C \times C}$ and $b \in \mathbb{R}^C$ are trainable parameters, and additional channel-attention and residual connections further refine $f_a$. The transformed features $\tilde{Z}$ are computed as

$$\tilde{Z}_{:,h,w} = A_t(Z_{:,h,w}).$$

This architectural minimalism yields effective feature alignment while drastically reducing the parameter and compute overhead normally associated with full SAM adaptation.
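
A plausible CAResBlock realization is sketched below; the squeeze-and-excitation-style channel attention and the reduction ratio `r` are assumptions, since the text specifies only the per-location linear map, channel attention, and residual connection:

```python
import torch
import torch.nn as nn

class CAResBlock(nn.Module):
    """Channel-Attention Residual Block: per-pixel linear map f_a = W f_e + b,
    refined by channel attention and a residual connection (SE-style gating
    with reduction ratio r is an assumption)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        # A 1x1 conv applies W f_e + b independently at every location (h, w).
        self.linear = nn.Conv2d(channels, channels, kernel_size=1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                # squeeze spatial context
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
            nn.Sigmoid(),                           # per-channel gates in (0, 1)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, C, H, W)
        f_a = self.linear(z)         # f_a = W f_e + b at each location
        f_a = f_a * self.attn(f_a)   # channel-attention reweighting
        return z + f_a               # residual connection

def make_alignment_layer(channels: int = 256, depth: int = 3) -> nn.Module:
    """A_t as a compact CAResBlock stack; 256 channels matches SAM's image
    embedding, while the depth is an assumption (cf. the ablation in Sec. 6)."""
    return nn.Sequential(*[CAResBlock(channels) for _ in range(depth)])
```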

3. Continual Learning Protocol and Task Routing

CA-SAM operates in a pure continual learning setting: tasks $t = 1, \ldots, N$ arrive as a sequence of datasets $(\mathcal{D}_t^{tr}, \mathcal{D}_t^{te})$ with no access to prior data (no replay). For each task:

  • A unique alignment layer $A_t$ and VAE $V_t$ are instantiated and trained (with $E, D$ frozen).
  • The VAE models the global, attention-weighted encoder feature $f$, obtained via softmax attention pooling, and is trained with a $\beta$-weighted VAE objective (a sketch follows this list):

$$\alpha_{h,w} = \mathrm{softmax}\!\left(\frac{\|Z_{:,h,w}\|_2}{C \cdot T}\right), \qquad f = \sum_{h,w} \alpha_{h,w}\, Z_{:,h,w}$$

$$\mathcal{L}_{\mathrm{VAE}}(f) = \frac{1}{D}\,\|f - \hat{f}\|_2^2 + \frac{\beta}{2} \sum_{i=1}^{D} \left[\mu_i^2 + \sigma_i^2 - 1 - \log(\sigma_i^2)\right]$$
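
A sketch of the attention pooling and VAE objective under these definitions; the MLP widths and latent size are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_pool(z: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Parameter-free softmax attention pooling: (B, C, H, W) -> (B, C)."""
    b, c, h, w = z.shape
    flat = z.flatten(2)                               # (B, C, H*W)
    scores = flat.norm(dim=1) / (c * temperature)     # ||Z_{:,h,w}||_2 / (C*T)
    alpha = F.softmax(scores, dim=-1)                 # attention weights
    return torch.einsum("bn,bcn->bc", alpha, flat)    # f = sum_hw alpha * Z

class TaskVAE(nn.Module):
    """Per-task VAE V_t over pooled features; layer sizes are assumptions."""
    def __init__(self, dim: int = 256, hidden: int = 128, latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def loss(self, f: torch.Tensor, beta: float = 16.5) -> torch.Tensor:
        h = self.enc(f)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = F.mse_loss(self.dec(z), f)            # (1/D) ||f - f_hat||^2
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return recon + beta * kl                      # recon + (beta/2) * sum[.]
```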

At inference, for a test image $I$:

  1. Compute the encoder features $Z = E(I)$ and their attention-pooled global summary $f$.
  2. For each task $t$, evaluate $V_t$’s ELBO score $s_t$ on $f$.
  3. Select the task $t^* = \arg\min_t s_t$. If $s_{t^*} \leq \tau_{t^*}$ (a threshold calibrated during training), use $A_{t^*}$; otherwise, apply the identity alignment $A_{\mathrm{id}}(Z) = Z$, reverting to pure zero-shot SAM (see the routing sketch below).
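
Routing thus reduces to a score-and-threshold loop over the per-task VAEs. A sketch, assuming the VAE’s training loss doubles as the negative-ELBO score $s_t$ and `thresholds` holds the calibrated $\tau_t$ values:

```python
import torch

@torch.no_grad()
def route_task(f: torch.Tensor, vaes, thresholds) -> int:
    """Return the matched task index, or -1 for the identity alignment A_id.

    f: (1, C) pooled feature of one test image; `vaes` and `thresholds` are
    the per-task V_t and tau_t (names are assumptions).
    """
    scores = torch.stack([vae.loss(f) for vae in vaes])  # s_t, one per task
    t_star = int(scores.argmin())
    if scores[t_star] <= thresholds[t_star]:
        return t_star          # in-distribution: use A_{t*}
    return -1                  # no task matches: fall back to zero-shot SAM

# Usage: t = route_task(attention_pool(z), vaes, taus)
#        z_tilde = z if t == -1 else align_layers[t](z)
```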

4. Training, Inference, and Implementation Details

The CA-SAM training and inference workflow is summarized as follows:

Pseudocode Outline

  • For each task $t$:
    • Initialize $A_t$ (CAResBlock stack) and $V_t$ (MLP encoder/decoder).
    • Train $A_t$ by minimizing a standard segmentation loss (pixel-wise cross-entropy or Dice) on $\mathcal{D}_t^{tr}$, freezing $E$ and $D$.
    • Train $V_t$ on attention-pooled encoder outputs.
    • Calibrate the routing threshold $\tau_t$ as the 97th-percentile ELBO from $K$-fold cross-validation over $\mathcal{D}_t^{tr}$ (a calibration sketch follows this list).
  • At inference, compute $s_t$ for all $t$ and select $A_{t^*}$ or $A_{\mathrm{id}}$ as described above.
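
A sketch of the threshold calibration step, assuming scikit-learn’s `KFold`, an `(N, C)` array of pooled features for the task, and a caller-supplied `train_vae_fn` that fits a scoring VAE on a fold (a hypothetical helper):

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def calibrate_threshold(features: np.ndarray, train_vae_fn,
                        percentile: float = 97.0, k: int = 5) -> float:
    """Calibrate tau_t as the 97th-percentile held-out ELBO score."""
    held_out_scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(features):
        vae = train_vae_fn(features[train_idx])       # fit V_t on this fold
        with torch.no_grad():
            for f in features[val_idx]:               # score held-out features
                score = vae.loss(torch.from_numpy(f).float()[None])
                held_out_scores.append(float(score))
    return float(np.percentile(held_out_scores, percentile))
```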

Efficiency Benchmarks

  • Alignment Layer: 3.54 M trainable parameters (smaller than most adapters).
  • Training computational cost: 514 GFLOPs per $1 \times 1024 \times 1024$ image, a 25% reduction over other adapter schemes.
  • Plug-and-play: $A_t$ is inserted directly between encoder and decoder with no modification to the SAM code.
  • Key hyperparameters: Adam optimizer; learning rate $1\mathrm{e}{-4}$ for $A_t$ and $5\mathrm{e}{-4}$ for $V_t$; $\beta = 16.5$; batch size 6; 24 epochs for $A_t$ and 10 epochs for $V_t$; attention-pooling temperature $T = 1$ (collected in the configuration sketch below).
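
For reference, these settings translate into a per-task setup along these lines (a sketch; the constructors reuse the classes sketched earlier):

```python
import torch

# Reported settings: Adam; lr 1e-4 for A_t, 5e-4 for V_t; beta = 16.5;
# batch size 6; 24 epochs for A_t, 10 for V_t; pooling temperature T = 1.
BATCH_SIZE, BETA, TEMPERATURE = 6, 16.5, 1.0
EPOCHS_ALIGN, EPOCHS_VAE = 24, 10

align_layer = make_alignment_layer(channels=256, depth=3)  # A_t (assumed dims)
task_vae = TaskVAE(dim=256)                                # V_t

opt_align = torch.optim.Adam(align_layer.parameters(), lr=1e-4)
opt_vae = torch.optim.Adam(task_vae.parameters(), lr=5e-4)
```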

5. Experimental Setup and Main Results

CA-SAM is evaluated on a nine-dataset medical continual segmentation benchmark, with tasks covering modalities such as MR, CT, histopathology, endoscopy, and dental X-rays.

Datasets (in task order):

ACDC, EBHI-SEG, 56Nx, DN, Polyp, MSD_Prostate, MSD_Spleen, Promise12, STS-2D

Metrics:

IoU, Boundary IoU (BIoU), Last-IoU (post-stream), Avg-IoU, FF-IoU (average forgetting), evaluated both per-task and across the full sequence; additional zero-shot metrics over five out-of-distribution datasets.
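
Given an $N \times N$ accuracy matrix $R$, where $R[i, j]$ is the IoU on task $j$ after training through task $i$, the stream-level metrics can be computed as in the following sketch (the matrix convention and the exact forgetting definition are assumptions):

```python
import numpy as np

def stream_metrics(R: np.ndarray) -> dict:
    """Continual-learning summaries from an (N, N) IoU matrix R."""
    n = R.shape[0]
    last_iou = R[-1].mean()                          # Last-IoU: post-stream avg
    avg_iou = np.mean([R[i, : i + 1].mean() for i in range(n)])  # Avg-IoU
    # FF-IoU: average forgetting = peak earlier IoU minus final IoU
    ff_iou = np.mean([R[:-1, j].max() - R[-1, j] for j in range(n - 1)])
    return {"Last-IoU": last_iou, "Avg-IoU": avg_iou, "FF-IoU": ff_iou}
```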

Main Results:

  • Single-dataset adaptation: CA-SAM achieves 80.15% Avg-IoU and 66.52% Avg-BIoU with the lowest parameter and FLOP overhead.
  • Continual learning (exemplar-free): 76.12% Last-IoU, 76.90% Avg-IoU, and 1.43% FF-IoU, outperforming classical continual learning methods (e.g., LwF, EWC, ER, DER, L2P, MoDA) and rivaling upper-bound joint training.
  • Zero-shot: CA-SAM retains >99% of the original SAM’s IoU on unadapted domains, minimizing out-of-distribution (OOD) degradation.

| Method | Params | GFLOPs | Avg-IoU | Avg-BIoU |
|---|---|---|---|---|
| SAM zero-shot | 0 M | - | 55.08% | 37.67% |
| Decoder-tuning | 4.06 M | 669.8 | 70.40% | 53.86% |
| HQ-SAM | 5.14 M | 678.9 | 72.91% | 58.41% |
| SAMMed2D | 13.31 M | 728.2 | 75.17% | 58.97% |
| CA-SAM | 3.54 M | 514.3 | 80.15% | 66.52% |

6. Ablation Studies and Analysis

Multiple ablation studies empirically isolate the core contributions of CA-SAM:

  • Feature alignment: after $A_t$, the total-variation and Jensen-Shannon divergences between feature distributions and full fine-tuning baselines decrease by 3–5%.
  • Block depth: increasing the number of CAResBlocks in $A_t$ improves Avg-IoU from $\sim 72\%$ to $80\%$.
  • Pooling mechanism: parameter-free attention pooling achieves the highest task-wise IoU/BIoU compared to global average/mean pooling or flattening/CLS-token methods.
  • VAE $\beta$ coefficient: performance is stable for $\beta \gtrsim 7$, ensuring consistent segmentation and OOD accuracy.
  • Task-order robustness: CA-SAM’s Last-IoU varies by less than 0.5% across three random task orders, while other continual learning baselines fluctuate by over 10 points.
  • Routing threshold: the 97th-percentile ELBO threshold $\tau = p_{97}$ is empirically optimal, balancing seen-task IoU (76.02%) against OOD accuracy (99.42%).
  • Visualization: t-SNE projections show that $A_t$ produces well-clustered per-dataset feature manifolds.

7. Relation to Other SAM Continual Adaptation Methods

CA-SAM’s design is distinguished by minimalistic but highly effective domain alignment and a robust, probabilistic routing mechanism:

  • Unlike RegCL (Shu et al., 16 Jul 2025), which merges LoRA adapter parameters to contain model size at the potential cost of “washed-out” domain representations, CA-SAM retains explicit per-task alignment layers, ensuring task-optimal adaptation without interference.
  • Compared to MoDA (Yang et al., 9 Dec 2024) and GFT/GAT-based selector pools, CA-SAM uses a VAE-based router over attention-pooled features, which obviates auxiliary tokens and memory banks and reduces training complexity.
  • A plausible implication is that, while CA-SAM incurs a parameter cost linear in the number of tasks (one $A_t$ per task), its per-task FLOP and memory efficiency, together with its ability to recover zero-shot behaviour when no task distribution matches, offer a favorable trade-off for multi-institutional medical segmentation where data privacy precludes joint training.

Taken together, CA-SAM constitutes a distinct approach within the broader landscape of continual SAM adaptation: it achieves superior continual segmentation accuracy, nearly eliminates catastrophic forgetting, and maintains OOD generalization with extremely modest additional computational requirements (Wang et al., 21 Nov 2025).
