Continual Alignment for SAM (CA-SAM)

  • The paper introduces CA-SAM, which utilizes a task-specific lightweight alignment layer and a VAE-based routing mechanism to adapt SAM for medical segmentation while preventing catastrophic forgetting.
  • CA-SAM achieves state-of-the-art segmentation performance with superior Avg-IoU and reduced GFLOPs compared to other continual learning methods.
  • Empirical evaluations on nine diverse medical imaging datasets demonstrate robust continual adaptation and nearly zero degradation on out-of-distribution tasks.

Continual Alignment for SAM (CA-SAM) is a continual learning paradigm designed to adapt the Segment Anything Model (SAM) to streaming medical image segmentation tasks while effectively mitigating catastrophic forgetting and maintaining state-of-the-art segmentation performance under strict parameter and computational constraints. CA-SAM is centered on the introduction of a lightweight, task-specific Alignment Layer and a VAE-based task routing mechanism, enabling robust continual adaptation across highly heterogeneous domains without replay or fine-tuning of the core SAM backbone (Wang et al., 21 Nov 2025).

1. Framework Architecture and Forward Pass

CA-SAM builds on a frozen SAM foundation, composed of a Vision Transformer (ViT)-based encoder $E(\cdot)$ and a mask decoder $D(\cdot)$, neither of which is updated during continual learning. For each incoming segmentation task $t$, a lightweight, trainable Alignment Layer $A_t(\cdot)$ is inserted between the encoder and decoder. Only $A_t$ is updated for task $t$; this decoupling preserves SAM’s strong zero-shot priors and computational efficiency. The forward pass for an image $I$ on task $t$ is

$$Z = E(I), \qquad \tilde{Z} = A_t(Z), \qquad \hat{y} = D(\tilde{Z}).$$

Each task $t$ also maintains a dedicated variational autoencoder $V_t$ that explicitly models and scores the distribution of encoder features, enabling accurate, automatic task routing at inference.
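
A minimal PyTorch sketch of this decoupling, assuming `encoder`, `decoder`, and `align_layers` stand in for $E$, $D$, and the per-task $A_t$ modules (the real SAM mask decoder also takes prompt embeddings, omitted here for brevity):

```python
import torch.nn as nn

class CASAMForward(nn.Module):
    """Frozen SAM encoder/decoder wrapped around trainable per-task
    alignment layers. Module names are assumptions, not the authors' code."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module, align_layers):
        super().__init__()
        self.encoder = encoder                            # E(.), frozen
        self.decoder = decoder                            # D(.), frozen
        self.align_layers = nn.ModuleList(align_layers)   # one A_t per task
        for module in (self.encoder, self.decoder):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, image, task_id: int):
        z = self.encoder(image)                   # Z = E(I)
        z_tilde = self.align_layers[task_id](z)   # Z~ = A_t(Z)
        return self.decoder(z_tilde)              # y^ = D(Z~)
```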

2. Mathematical Formulation of the Alignment Layer

The Alignment Layer $A_t$ is realized as a compact stack of Channel-Attention Residual Blocks (CAResBlocks). For each spatial location $(h,w)$ and batch index $b$, the block operates on the encoder feature $Z_{b,:,h,w} \in \mathbb{R}^C$:

$$f_a = W f_e + b$$

where $f_e$ is the $C$-dimensional encoder feature, $W \in \mathbb{R}^{C \times C}$ and $b \in \mathbb{R}^C$ are trainable parameters, and additional channel-attention and residual connections further refine $f_a$. The transformed features $\tilde{Z}$ are computed as

$$\tilde{Z}_{:,h,w} = A_t(Z_{:,h,w}).$$

This architectural minimalism yields effective feature alignment while drastically reducing the parameter and compute overhead normally associated with full SAM adaptation.
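
A plausible CAResBlock realization is sketched below; the squeeze-and-excitation-style channel attention and the reduction ratio `r` are assumptions, since the text specifies only the per-location linear map, channel attention, and residual connection:

```python
import torch
import torch.nn as nn

class CAResBlock(nn.Module):
    """Channel-Attention Residual Block: per-pixel linear map f_a = W f_e + b,
    refined by channel attention and a residual connection (SE-style gating
    with reduction ratio r is an assumption)."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        # A 1x1 conv applies W f_e + b independently at every location (h, w).
        self.linear = nn.Conv2d(channels, channels, kernel_size=1)
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                # squeeze spatial context
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
            nn.Sigmoid(),                           # per-channel gates in (0, 1)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, C, H, W)
        f_a = self.linear(z)         # f_a = W f_e + b at each location
        f_a = f_a * self.attn(f_a)   # channel-attention reweighting
        return z + f_a               # residual connection

def make_alignment_layer(channels: int = 256, depth: int = 3) -> nn.Module:
    """A_t as a compact CAResBlock stack; 256 channels matches SAM's image
    embedding, while the depth is an assumption (cf. the ablation in Sec. 6)."""
    return nn.Sequential(*[CAResBlock(channels) for _ in range(depth)])
```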

3. Continual Learning Protocol and Task Routing

CA-SAM operates in a pure continual learning setting: tasks $t = 1, \ldots, N$ arrive as a sequence of datasets $(\mathcal{D}_t^{tr}, \mathcal{D}_t^{te})$ with no access to prior data (no replay). For each task:

  • A unique alignment layer $A_t$ and VAE $V_t$ are instantiated and trained (with $E, D$ frozen).
  • The VAE models the global, attention-weighted encoder feature $f$, obtained via softmax attention pooling, and is trained with a $\beta$-weighted VAE objective (a sketch follows this list):

$$\alpha_{h,w} = \mathrm{softmax}\!\left(\frac{\|Z_{:,h,w}\|_2}{C \cdot T}\right), \qquad f = \sum_{h,w} \alpha_{h,w}\, Z_{:,h,w}$$

$$\mathcal{L}_{\mathrm{VAE}}(f) = \frac{1}{D}\,\|f - \hat{f}\|_2^2 + \frac{\beta}{2} \sum_{i=1}^{D} \left[\mu_i^2 + \sigma_i^2 - 1 - \log(\sigma_i^2)\right]$$
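
A sketch of the attention pooling and VAE objective under these definitions; the MLP widths and latent size are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_pool(z: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Parameter-free softmax attention pooling: (B, C, H, W) -> (B, C)."""
    b, c, h, w = z.shape
    flat = z.flatten(2)                               # (B, C, H*W)
    scores = flat.norm(dim=1) / (c * temperature)     # ||Z_{:,h,w}||_2 / (C*T)
    alpha = F.softmax(scores, dim=-1)                 # attention weights
    return torch.einsum("bn,bcn->bc", alpha, flat)    # f = sum_hw alpha * Z

class TaskVAE(nn.Module):
    """Per-task VAE V_t over pooled features; layer sizes are assumptions."""
    def __init__(self, dim: int = 256, hidden: int = 128, latent: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def loss(self, f: torch.Tensor, beta: float = 16.5) -> torch.Tensor:
        h = self.enc(f)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = F.mse_loss(self.dec(z), f)            # (1/D) ||f - f_hat||^2
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1).mean()
        return recon + beta * kl                      # recon + (beta/2) * sum[.]
```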

At inference, for a test image $I$:

  1. Compute the encoder features $Z = E(I)$ and their attention-pooled global summary $f$.
  2. For each task $t$, evaluate $V_t$’s ELBO score $s_t$ on $f$.
  3. Select the task $t^* = \arg\min_t s_t$. If $s_{t^*} \leq \tau_{t^*}$ (a threshold calibrated during training), use $A_{t^*}$; otherwise, apply the identity alignment $A_{\mathrm{id}}(Z) = Z$, reverting to pure zero-shot SAM (see the routing sketch below).
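
Routing thus reduces to a score-and-threshold loop over the per-task VAEs. A sketch, assuming the VAE’s training loss doubles as the negative-ELBO score $s_t$ and `thresholds` holds the calibrated $\tau_t$ values:

```python
import torch

@torch.no_grad()
def route_task(f: torch.Tensor, vaes, thresholds) -> int:
    """Return the matched task index, or -1 for the identity alignment A_id.

    f: (1, C) pooled feature of one test image; `vaes` and `thresholds` are
    the per-task V_t and tau_t (names are assumptions).
    """
    scores = torch.stack([vae.loss(f) for vae in vaes])  # s_t, one per task
    t_star = int(scores.argmin())
    if scores[t_star] <= thresholds[t_star]:
        return t_star          # in-distribution: use A_{t*}
    return -1                  # no task matches: fall back to zero-shot SAM

# Usage: t = route_task(attention_pool(z), vaes, taus)
#        z_tilde = z if t == -1 else align_layers[t](z)
```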

4. Training, Inference, and Implementation Details

The CA-SAM training and inference workflow is summarized as follows:

Pseudocode Outline

  • For each task $t$:
    • Initialize $A_t$ (CAResBlock stack) and $V_t$ (MLP encoder/decoder).
    • Train $A_t$ by minimizing a standard segmentation loss (pixel-wise cross-entropy or Dice) on $\mathcal{D}_t^{tr}$, freezing $E$ and $D$.
    • Train $V_t$ on attention-pooled encoder outputs.
    • Calibrate the routing threshold $\tau_t$ as the 97th-percentile ELBO from $K$-fold cross-validation over $\mathcal{D}_t^{tr}$ (a calibration sketch follows this list).
  • At inference, compute $s_t$ for all $t$ and select $A_{t^*}$ or $A_{\mathrm{id}}$ as described above.
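
A sketch of the threshold calibration step, assuming scikit-learn’s `KFold`, an `(N, C)` array of pooled features for the task, and a caller-supplied `train_vae_fn` that fits a scoring VAE on a fold (a hypothetical helper):

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def calibrate_threshold(features: np.ndarray, train_vae_fn,
                        percentile: float = 97.0, k: int = 5) -> float:
    """Calibrate tau_t as the 97th-percentile held-out ELBO score."""
    held_out_scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(features):
        vae = train_vae_fn(features[train_idx])       # fit V_t on this fold
        with torch.no_grad():
            for f in features[val_idx]:               # score held-out features
                score = vae.loss(torch.from_numpy(f).float()[None])
                held_out_scores.append(float(score))
    return float(np.percentile(held_out_scores, percentile))
```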

Efficiency Benchmarks

  • Alignment Layer: 3.54 M trainable parameters (smaller than most adapters).
  • Training computational cost: 514 GFLOPs per $1 \times 1024 \times 1024$ image, a 25% reduction over other adapter schemes.
  • Plug-and-play: $A_t$ is inserted directly between encoder and decoder with no modification to the SAM code.
  • Key hyperparameters: Adam optimizer; learning rate $1\mathrm{e}{-4}$ for $A_t$ and $5\mathrm{e}{-4}$ for $V_t$; $\beta = 16.5$; batch size 6; 24 epochs for $A_t$ and 10 epochs for $V_t$; attention-pooling temperature $T = 1$ (collected in the configuration sketch below).
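
For reference, these settings translate into a per-task setup along these lines (a sketch; the constructors reuse the classes sketched earlier):

```python
import torch

# Reported settings: Adam; lr 1e-4 for A_t, 5e-4 for V_t; beta = 16.5;
# batch size 6; 24 epochs for A_t, 10 for V_t; pooling temperature T = 1.
BATCH_SIZE, BETA, TEMPERATURE = 6, 16.5, 1.0
EPOCHS_ALIGN, EPOCHS_VAE = 24, 10

align_layer = make_alignment_layer(channels=256, depth=3)  # A_t (assumed dims)
task_vae = TaskVAE(dim=256)                                # V_t

opt_align = torch.optim.Adam(align_layer.parameters(), lr=1e-4)
opt_vae = torch.optim.Adam(task_vae.parameters(), lr=5e-4)
```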

5. Experimental Setup and Main Results

CA-SAM is evaluated on a nine-dataset medical continual segmentation benchmark, with tasks covering modalities such as MR, CT, histopathology, endoscopy, and dental X-rays.

Datasets (in task order):

ACDC, EBHI-SEG, 56Nx, DN, Polyp, MSD_Prostate, MSD_Spleen, Promise12, STS-2D

Metrics:

IoU, Boundary IoU (BIoU), Last-IoU (post-stream), Avg-IoU, FF-IoU (average forgetting), evaluated both per-task and across the full sequence; additional zero-shot metrics over five out-of-distribution datasets.
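
Given an $N \times N$ accuracy matrix $R$, where $R[i, j]$ is the IoU on task $j$ after training through task $i$, the stream-level metrics can be computed as in the following sketch (the matrix convention and the exact forgetting definition are assumptions):

```python
import numpy as np

def stream_metrics(R: np.ndarray) -> dict:
    """Continual-learning summaries from an (N, N) IoU matrix R."""
    n = R.shape[0]
    last_iou = R[-1].mean()                          # Last-IoU: post-stream avg
    avg_iou = np.mean([R[i, : i + 1].mean() for i in range(n)])  # Avg-IoU
    # FF-IoU: average forgetting = peak earlier IoU minus final IoU
    ff_iou = np.mean([R[:-1, j].max() - R[-1, j] for j in range(n - 1)])
    return {"Last-IoU": last_iou, "Avg-IoU": avg_iou, "FF-IoU": ff_iou}
```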

Main Results:

  • Single-dataset adaptation: CA-SAM achieves 80.15% Avg-IoU and 66.52% Avg-BIoU with the lowest parameter and FLOP overhead.
  • Continual learning (exemplar-free): 76.12% Last-IoU, 76.90% Avg-IoU, and 1.43% FF-IoU, outperforming classical continual learning methods (e.g., LwF, EWC, ER, DER, L2P, MoDA) and rivaling upper-bound joint training.
  • Zero-shot: CA-SAM retains >99% of the original SAM’s IoU on unadapted domains, minimizing out-of-distribution (OOD) degradation.

| Method | Params | GFLOPs | Avg-IoU | Avg-BIoU |
|---|---|---|---|---|
| SAM zero-shot | 0 M | - | 55.08% | 37.67% |
| Decoder-tuning | 4.06 M | 669.8 | 70.40% | 53.86% |
| HQ-SAM | 5.14 M | 678.9 | 72.91% | 58.41% |
| SAMMed2D | 13.31 M | 728.2 | 75.17% | 58.97% |
| CA-SAM | 3.54 M | 514.3 | 80.15% | 66.52% |

6. Ablation Studies and Analysis

Multiple ablation studies empirically isolate the core contributions of CA-SAM:

  • Feature alignment: after $A_t$, the total-variation and Jensen-Shannon divergences between feature distributions and full fine-tuning baselines decrease by 3–5%.
  • Block depth: increasing the number of CAResBlocks in $A_t$ improves Avg-IoU from $\sim 72\%$ to $80\%$.
  • Pooling mechanism: parameter-free attention pooling achieves the highest task-wise IoU/BIoU compared to global average/mean pooling or flattening/CLS-token methods.
  • VAE $\beta$ coefficient: performance is stable for $\beta \gtrsim 7$, ensuring consistent segmentation and OOD accuracy.
  • Task-order robustness: CA-SAM’s Last-IoU varies by less than 0.5% across three random task orders, while other continual learning baselines fluctuate by over 10 points.
  • Routing threshold: the 97th-percentile ELBO threshold $\tau = p_{97}$ is empirically optimal, balancing seen-task IoU (76.02%) against OOD accuracy (99.42%).
  • Visualization: t-SNE projections show that $A_t$ produces well-clustered per-dataset feature manifolds.

7. Relation to Other SAM Continual Adaptation Methods

CA-SAM’s design is distinguished by minimalistic but highly effective domain alignment and a robust, probabilistic routing mechanism:

  • Unlike RegCL (Shu et al., 16 Jul 2025), which merges LoRA adapter parameters to contain model size at the potential cost of “washed-out” domain representations, CA-SAM retains explicit per-task alignment layers, ensuring task-optimal adaptation without interference.
  • Compared to MoDA (Yang et al., 9 Dec 2024) and GFT/GAT-based selector pools, CA-SAM uses a VAE-based router over attention-pooled features, which obviates auxiliary tokens and memory banks and reduces training complexity.
  • A plausible implication is that, while CA-SAM incurs a parameter cost linear in the number of tasks (one $A_t$ per task), its per-task FLOP and memory efficiency, together with its ability to recover zero-shot behaviour when no task distribution matches, offer a favorable trade-off for multi-institutional medical segmentation where data privacy precludes joint training.

Taken together, CA-SAM constitutes a distinct approach within the broader landscape of continual SAM adaptation: it achieves superior continual segmentation accuracy, nearly eliminates catastrophic forgetting, and maintains OOD generalization with extremely modest additional computational requirements (Wang et al., 21 Nov 2025).
