Continual Alignment for SAM (CA-SAM)
- The paper introduces CA-SAM, which utilizes a task-specific lightweight alignment layer and a VAE-based routing mechanism to adapt SAM for medical segmentation while preventing catastrophic forgetting.
- CA-SAM achieves state-of-the-art segmentation performance with superior Avg-IoU and reduced GFLOPs compared to other continual learning methods.
- Empirical evaluations on nine diverse medical imaging datasets demonstrate robust continual adaptation and nearly zero degradation on out-of-distribution tasks.
Continual Alignment for SAM (CA-SAM) is a continual learning paradigm designed to adapt the Segment Anything Model (SAM) to streaming medical image segmentation tasks while effectively mitigating catastrophic forgetting and maintaining state-of-the-art segmentation performance under strict parameter and computational constraints. CA-SAM is centered on the introduction of a lightweight, task-specific Alignment Layer and a VAE-based task routing mechanism, enabling robust continual adaptation across highly heterogeneous domains without replay or fine-tuning of the core SAM backbone (Wang et al., 21 Nov 2025).
1. Framework Architecture and Forward Pass
CA-SAM builds on a frozen SAM foundation, composed of a Vision Transformer (ViT)-based image encoder $E$ and a mask decoder $D$, neither of which is updated during continual learning. For each incoming segmentation task $t$, a lightweight, trainable Alignment Layer $A_t$ is inserted between the encoder and decoder. Only $A_t$ is updated for task $t$; this decoupling preserves SAM’s strong zero-shot priors and computational efficiency. The overall forward pass for an image $x$ on task $t$ composes these modules as $\hat{y} = D(A_t(E(x)))$. Each task also maintains a dedicated Variational Autoencoder $V_t$ that explicitly models and scores the distribution of encoder features, enabling accurate and automatic task routing at inference.
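A minimal PyTorch-style sketch of this decoupled design follows; the class and method names (`CASAMForward`, `add_task`) are illustrative rather than the authors' code, and SAM's prompt inputs to the decoder are omitted for brevity:

```python
from typing import Optional

import torch
import torch.nn as nn

class CASAMForward(nn.Module):
    """Frozen SAM backbone with a per-task Alignment Layer inserted in between."""

    def __init__(self, sam_encoder: nn.Module, sam_decoder: nn.Module):
        super().__init__()
        self.encoder = sam_encoder               # ViT image encoder E (frozen)
        self.decoder = sam_decoder               # mask decoder D (frozen)
        for p in self.parameters():
            p.requires_grad_(False)              # the SAM backbone is never updated
        self.alignment_layers = nn.ModuleDict()  # one trainable A_t per task

    def add_task(self, task_id: str, alignment_layer: nn.Module) -> None:
        """Register a new task's Alignment Layer; only its parameters are trained."""
        self.alignment_layers[task_id] = alignment_layer

    def forward(self, image: torch.Tensor, task_id: Optional[str] = None) -> torch.Tensor:
        feats = self.encoder(image)                        # E(x)
        if task_id is not None:                            # routed to a learned task
            feats = self.alignment_layers[task_id](feats)  # A_t(E(x))
        return self.decoder(feats)                         # D(...) -> predicted mask (zero-shot if no task)
```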
2. Mathematical Formulation of the Alignment Layer
The Alignment Layer $A_t$ is realized as a compact stack of Channel-Attention Residual Blocks (CAResBlocks). Each block refines the encoder feature map: a small trainable transform is applied at every spatial location, modulated by a channel-attention gate, and added back to the input through a residual connection, so that the aligned features remain close to SAM’s original representation. This architectural minimalism yields effective feature alignment while drastically reducing the parameter and compute overhead normally associated with full SAM adaptation.
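The exact block equations are not reproduced here; the following is a plausible sketch of a channel-attention residual block and the alignment stack, assuming a squeeze-and-excitation-style gate and SAM's 256-channel image embedding:

```python
import torch
import torch.nn as nn

class CAResBlock(nn.Module):
    """Illustrative channel-attention residual block (not the authors' exact design)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # lightweight per-location transform of the encoder features
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # squeeze-and-excitation style channel-attention gate
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        h = self.transform(f)          # refine features
        h = h * self.channel_gate(h)   # reweight channels
        return f + h                   # residual connection keeps features close to SAM's


class AlignmentLayer(nn.Module):
    """Compact stack of CAResBlocks inserted between SAM's encoder and decoder."""

    def __init__(self, channels: int = 256, depth: int = 2):
        super().__init__()
        self.blocks = nn.Sequential(*[CAResBlock(channels) for _ in range(depth)])

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.blocks(f)
```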
3. Continual Learning Protocol and Task Routing
CA-SAM operates in a pure continual learning setting: tasks arrive as a sequence of datasets with no access to prior data (no replay). For each task:
- A unique Alignment Layer $A_t$ and VAE $V_t$ are instantiated and trained (with the SAM encoder and decoder frozen).
- The VAE $V_t$ models a global, attention-weighted summary $g$ of the encoder features, obtained via parameter-free softmax attention pooling over spatial locations.
- $V_t$ is trained by maximizing the evidence lower bound (ELBO), $\mathcal{L}_{\mathrm{ELBO}}(g) = \mathbb{E}_{q(z \mid g)}\left[\log p(g \mid z)\right] - \mathrm{KL}\left(q(z \mid g)\,\|\,p(z)\right)$.
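A minimal sketch of such a per-task VAE over the pooled feature $g$, with the standard Gaussian ELBO as the training objective (the MLP widths and latent dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskVAE(nn.Module):
    """Small MLP VAE over the pooled global encoder feature g (illustrative sizes)."""

    def __init__(self, feat_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 128), nn.GELU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.GELU(),
                                 nn.Linear(128, feat_dim))

    def elbo(self, g: torch.Tensor) -> torch.Tensor:
        """Per-sample ELBO; maximized during training and reused as the routing score."""
        h = self.enc(g)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        recon = self.dec(z)
        # Gaussian reconstruction log-likelihood (up to an additive constant)
        recon_ll = -F.mse_loss(recon, g, reduction="none").sum(dim=-1)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return recon_ll - kl                                      # ELBO = E[log p(g|z)] - KL

# Training step for task t: minimize the negative ELBO over pooled features of its data
#   loss = -vae.elbo(g_batch).mean()
```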
At inference, for a test image $x$:
- Compute the encoder features $E(x)$ and their attention-pooled global summary $g$.
- For each learned task $t$, evaluate $V_t$’s ELBO score on $g$.
- Select the best-scoring task $t^\star = \arg\max_t \mathrm{ELBO}_{V_t}(g)$. If its score passes the threshold calibrated during training, use $A_{t^\star}$; otherwise, apply the identity alignment (the encoder features pass through unchanged), reverting to pure zero-shot SAM.
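The routing step can be sketched as follows; the exact form of the parameter-free attention pooling and the direction of the threshold comparison are assumptions, since the text above states only that pooled encoder features are scored by each task's VAE against a calibrated ELBO threshold:

```python
import torch

def attention_pool(feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Parameter-free softmax pooling over spatial locations (assumed form).

    feats: (C, H, W) encoder feature map -> (C,) global summary g.
    """
    f = feats.flatten(1)                                  # (C, H*W)
    weights = torch.softmax(f.norm(dim=0) / tau, dim=0)   # attend to salient locations
    return f @ weights                                    # attention-weighted average

@torch.no_grad()
def route(feats: torch.Tensor, vaes: dict, thresholds: dict, tau: float = 1.0):
    """Pick the task whose VAE best explains the pooled feature, else fall back to zero-shot."""
    g = attention_pool(feats, tau).unsqueeze(0)           # (1, C)
    scores = {t: vae.elbo(g).item() for t, vae in vaes.items()}
    best = max(scores, key=scores.get)
    if scores[best] >= thresholds[best]:                  # comparison direction assumed
        return best                                       # use alignment layer A_best
    return None                                           # identity alignment -> zero-shot SAM
```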
4. Training, Inference, and Implementation Details
The CA-SAM training and inference workflow is summarized as follows:
Pseudocode Outline
- For each task $t$:
- Initialize $A_t$ (CAResBlock stack) and $V_t$ (MLP encoder/decoder).
- Train $A_t$ by minimizing a standard segmentation loss (pixel-wise cross-entropy or Dice) on the task’s training data, keeping the SAM encoder and decoder frozen.
- Train $V_t$ on the attention-pooled encoder outputs.
- Calibrate the routing threshold as the 97th-percentile ELBO obtained from k-fold cross-validation over the task’s training data (see the sketch after this list).
- At inference, compute the ELBO scores for all learned tasks and select the appropriate $A_{t^\star}$, or the identity alignment, as described above.
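For the threshold-calibration step, a hedged sketch (the number of folds is not specified above and is set arbitrarily here; `train_vae_fn` and `TaskVAE.elbo` refer to the illustrative VAE sketched earlier):

```python
import numpy as np
import torch
from sklearn.model_selection import KFold

def calibrate_threshold(pooled_feats: torch.Tensor, train_vae_fn,
                        n_splits: int = 5, percentile: float = 97.0) -> float:
    """Routing threshold = 97th-percentile held-out ELBO over k-fold splits.

    pooled_feats: (N, C) attention-pooled encoder features for task t.
    train_vae_fn: callable that fits a TaskVAE on a subset and returns it.
    """
    held_out = []
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in splitter.split(pooled_feats):
        vae = train_vae_fn(pooled_feats[train_idx])            # fit on this fold's train split
        with torch.no_grad():
            held_out.append(vae.elbo(pooled_feats[val_idx]))   # score unseen samples
    scores = torch.cat(held_out).cpu().numpy()
    return float(np.percentile(scores, percentile))
```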
Efficiency Benchmarks
- Alignment Layer: 3.54 M trainable parameters (smaller than most adapters).
- Training computational cost: 514 GFLOPs per image, representing a 25% reduction over other adapter schemes.
- Plug-and-play: direct insertion of $A_t$ between the encoder and decoder with no modification to SAM code.
- Key hyperparameters: Adam optimizer (with separate learning rates for $A_t$ and $V_t$), batch size 6, 24 epochs for $A_t$, 10 epochs for $V_t$, and a fixed attention-pooling temperature $\tau$.
5. Experimental Setup and Main Results
CA-SAM is evaluated on a nine-dataset medical continual segmentation benchmark, with tasks covering modalities such as MR, CT, histopathology, endoscopy, and dental X-rays.
Datasets (in task order):
ACDC, EBHI-SEG, 56Nx, DN, Polyp, MSD_Prostate, MSD_Spleen, Promise12, STS-2D
Metrics:
IoU, Boundary IoU (BIoU), Last-IoU (post-stream), Avg-IoU, FF-IoU (average forgetting), evaluated both per-task and across the full sequence; additional zero-shot metrics over five out-of-distribution datasets.
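For reference, the stream-level summaries can be computed from a task-by-task IoU matrix along the following lines; these are the standard continual learning definitions and may differ in detail from the paper's exact FF-IoU formula:

```python
import numpy as np

def stream_metrics(iou: np.ndarray) -> dict:
    """Summarize a continual stream from iou[i, j] = IoU on task j after training task i.

    Last-IoU: mean IoU over all tasks after the final task.
    Avg-IoU : mean over the stream of per-step averages on tasks seen so far.
    FF-IoU  : average drop from each earlier task's best IoU to its final IoU.
    """
    T = iou.shape[0]
    last_iou = iou[T - 1, :].mean()
    avg_iou = np.mean([iou[i, : i + 1].mean() for i in range(T)])
    ff_iou = (np.mean([iou[:, j].max() - iou[T - 1, j] for j in range(T - 1)])
              if T > 1 else 0.0)
    return {"Last-IoU": last_iou, "Avg-IoU": avg_iou, "FF-IoU": ff_iou}
```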
Main Results:
- Single-dataset adaptation: CA-SAM achieves the best Avg-IoU (80.15%) and Avg-BIoU (66.52%) with the lowest parameter and FLOP overhead among the compared adapters (see the table below).
- Continual learning (exemplar-free): CA-SAM attains the strongest Last-IoU, Avg-IoU, and FF-IoU over the nine-task stream, outperforming classical continual learning methods (e.g., LwF, EWC, ER, DER, L2P, MoDA) and rivaling upper-bound joint training.
- Zero-shot: CA-SAM retains essentially all of the original SAM’s IoU on unadapted domains, minimizing out-of-distribution (OOD) degradation.
| Method | Params | GFLOPs | Avg-IoU | Avg-BIoU |
|---|---|---|---|---|
| SAM zero-shot | 0 M | – | 55.08% | 37.67% |
| Decoder-tuning | 4.06 M | 669.8 | 70.40% | 53.86% |
| HQ-SAM | 5.14 M | 678.9 | 72.91% | 58.41% |
| SAMMed2D | 13.31 M | 728.2 | 75.17% | 58.97% |
| CA-SAM | 3.54 M | 514.3 | 80.15% | 66.52% |
6. Ablation Studies and Analysis
Multiple ablation studies empirically isolate the core contributions of CA-SAM:
- Feature alignment: After applying $A_t$, the total-variation and Jensen–Shannon divergences between CA-SAM’s feature distributions and those of full fine-tuning baselines decrease substantially, indicating that the lightweight layer closes most of the domain gap.
- Block depth: Increasing the number of CAResBlocks in $A_t$ consistently improves Avg-IoU.
- Pooling mechanism: Parameter-free attention pooling achieves the highest task-wise IoU/BIoU compared with global average/mean pooling or flattening/CLS-token alternatives.
- VAE coefficient: Performance remains stable across a wide range of the VAE loss coefficient, ensuring consistent segmentation and OOD accuracy.
- Task order robustness: CA-SAM’s Last-IoU varies only marginally across three random task orders, whereas other continual learning baselines fluctuate by over $10$ points.
- Routing threshold: The 97th-percentile ELBO threshold is empirically optimal for balancing seen-task IoU and OOD accuracy.
- Visualization: t-SNE projections show that $A_t$ produces well-clustered feature manifolds per dataset.
7. Relation to Other SAM Continual Adaptation Methods
CA-SAM’s design is distinguished by minimalistic but highly effective domain alignment and a robust, probabilistic routing mechanism:
- Unlike RegCL (Shu et al., 16 Jul 2025), which merges LoRA adapter parameters to contain model size at a potential cost of “washed-out” domain representations, CA-SAM retains explicit per-task alignment layers, ensuring task-optimal adaptation without interference.
- Compared to MoDA (Yang et al., 9 Dec 2024) and GFT/GAT-based selector pools, CA-SAM uses a VAE-based attention-pooled feature router, which obviates the need for auxiliary tokens and memory banks and reduces training complexity.
- A plausible implication is that, while CA-SAM incurs a parameter cost that grows linearly with the number of tasks (one Alignment Layer and VAE per task), its per-task FLOP and memory efficiency and its ability to recover zero-shot behaviour when no learned distribution matches offer a favorable trade-off for multi-institutional medical segmentation, where data privacy precludes joint training.
Taken together, CA-SAM constitutes a distinct approach within the broader landscape of continual SAM adaptation: it achieves superior continual segmentation accuracy, nearly eliminates catastrophic forgetting, and maintains OOD generalization with extremely modest additional computational requirements (Wang et al., 21 Nov 2025).