nnSAM: Plug-and-Play SAM Extension
- nnSAM for medical segmentation integrates a frozen SAM encoder with nnUNet, adding level-set and curvature losses to enhance medical image segmentation.
- A second nnSAM line wraps SAM with an adversarial reinforcement-learning agent that optimizes point prompts, improving robustness in challenging prompt scenarios.
- Both frameworks enable plug-and-play adaptability and domain-specific regularization, substantially boosting performance in few-shot settings.
nnSAM (Plug-and-play Segment Anything Model) refers to a family of methods that extend the Segment Anything Model (SAM) with domain adaptation, robustness, or additional capabilities, typically in a modular, training-optional, or prompt-based fashion. Notably, the term "nnSAM" appears in two distinct but related research lines: (1) a model for medical image segmentation combining SAM and nnUNet with level-set/curvature supervision (Li et al., 2023), and (2) a plug-in adversarial agent for prompt optimization that boosts SAM's robustness (Liu et al., 23 Sep 2025). This article surveys the core technical frameworks, mathematical foundations, training paradigms, and benchmarking of "nnSAM" in both senses.
1. Architectural Foundations
The two principal variants of nnSAM target distinct settings but share a focus on plug-and-play enhancement of SAM via feature-fusion, auxiliary losses, or external optimization agents.
nnSAM for Medical Segmentation (Li et al., 2023)
- Backbone integration: nnSAM attaches a frozen SAM encoder (a ViT pre-trained on SA-1B) in parallel to an nnUNet encoder. SAM's output embeddings (computed from a 1024×1024 input) are bilinearly resized to match each nnUNet encoder level, then concatenated channel-wise (see the fusion sketch after this list).
- Fused decoder: The nnUNet decoder splits into (a) a segmentation head producing per-pixel class probabilities $p(x)$, and (b) a regression head predicting signed distance maps (level sets) $\phi(x)$, from which boundary curvature is derived.
- Training: Only nnUNet parameters (encoder–decoder, segmentation, regression heads) are updated. The SAM encoder’s weights remain frozen.
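A minimal PyTorch sketch of the fusion step, assuming a (B, C_sam, h, w) SAM embedding grid and illustrative module/tensor names (the actual nnSAM implementation may differ):

```python
import torch
import torch.nn.functional as F
from torch import nn

class SAMFeatureFusion(nn.Module):
    """Fuses frozen-SAM embeddings into one nnUNet encoder level by
    bilinear resizing and channel-wise concatenation."""

    def forward(self, unet_feat: torch.Tensor, sam_emb: torch.Tensor) -> torch.Tensor:
        # unet_feat: (B, C_unet, H, W) features from one nnUNet encoder stage.
        # sam_emb:   (B, C_sam, h, w) output of the frozen SAM ViT encoder.
        sam_resized = F.interpolate(
            sam_emb, size=unet_feat.shape[-2:], mode="bilinear", align_corners=False
        )
        # The decoder at this level then consumes C_unet + C_sam channels.
        return torch.cat([unet_feat, sam_resized], dim=1)

# Freezing the SAM encoder so that only nnUNet parameters are updated:
# for p in sam_encoder.parameters():
#     p.requires_grad_(False)
```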
nnSAM as an Adversarial Prompt Optimizer (Liu et al., 23 Sep 2025)
- Agent wrapping: SAM is wrapped at inference time by a “defender” agent trained to refine point prompt sets; a paired “attacker” agent synthesizes worst-case prompts during training.
- Prompt environment: Each image is represented as a dual-space graph $G = (V, E)$, with nodes as candidate prompts/patches. Node features combine DINOv2 semantic embeddings and 2D coordinates (a construction sketch follows this list).
- Plug-in operation: The defender agent is the only run-time addition; SAM’s architecture and weights remain unchanged.
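A minimal sketch of the dual-space graph construction, assuming precomputed DINOv2 patch embeddings; the kNN sparsification and the 50/50 semantic/spatial distance mix are illustrative choices, not the paper's exact recipe:

```python
import torch

def build_prompt_graph(patch_embeddings: torch.Tensor,
                       patch_coords: torch.Tensor,
                       k: int = 8,
                       eps: float = 1e-8):
    """Nodes are candidate point prompts; returns node features and a kNN
    adjacency under a metric mixing semantic and spatial distances.

    patch_embeddings: (N, D) DINOv2 features for N candidate patches.
    patch_coords:     (N, 2) normalized (x, y) patch centers.
    """
    feats = torch.cat([patch_embeddings, patch_coords], dim=1)   # (N, D + 2)
    sem = torch.cdist(patch_embeddings, patch_embeddings)        # semantic distances
    spa = torch.cdist(patch_coords, patch_coords)                # spatial distances
    dist = 0.5 * sem / (sem.max() + eps) + 0.5 * spa / (spa.max() + eps)
    # Connect each node to its k nearest neighbors under the mixed metric.
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]         # drop self-match
    adj = torch.zeros_like(dist)
    adj.scatter_(1, knn, 1.0)
    return feats, adj
```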
2. Mathematical Formulation and Optimization
Level-set and Curvature Supervision (Li et al., 2023)
To enforce anatomical priors, nnSAM augments the segmentation loss with boundary-shape losses using level sets and curvature:
- Signed distance function loss:

$$\mathcal{L}_{\mathrm{SDF}} = \frac{1}{N}\sum_{x}\left(\hat{\phi}(x) - \phi_{\mathrm{gt}}(x)\right)^{2},$$

where $\phi_{\mathrm{gt}}(x)$ is the signed Euclidean distance from pixel $x$ to the ground-truth boundary.
- Curvature loss: The predicted level set $\hat{\phi}$ is sharpened by a smoothed Heaviside function $H_{\varepsilon}(\hat{\phi})$, and the local curvature is

$$\kappa(\phi) = \nabla \cdot \left(\frac{\nabla \phi}{\lvert \nabla \phi \rvert}\right).$$

The curvature discrepancy is penalized as

$$\mathcal{L}_{\mathrm{curv}} = \frac{1}{N}\sum_{x}\left\lvert \kappa(\hat{\phi})(x) - \kappa(\phi_{\mathrm{gt}})(x) \right\rvert.$$

- Total loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{seg}} + \lambda_{1}\,\mathcal{L}_{\mathrm{SDF}} + \lambda_{2}\,\mathcal{L}_{\mathrm{curv}},$$

with $\mathcal{L}_{\mathrm{seg}}$ the standard nnUNet segmentation loss (Dice plus cross-entropy) and $\lambda_{1}, \lambda_{2}$ weighting the shape terms. A sketch of these losses follows.
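A minimal PyTorch sketch of the shape losses under the reconstructed forms above (MSE on the signed distance map, L1 on curvature), with curvature discretized by central finite differences; the exact discretization is an assumption:

```python
import torch
import torch.nn.functional as F

def sdf_loss(phi_pred: torch.Tensor, phi_gt: torch.Tensor) -> torch.Tensor:
    # MSE between predicted and ground-truth signed distance maps.
    return F.mse_loss(phi_pred, phi_gt)

def curvature(phi: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # phi: (B, 1, H, W). kappa = div(grad(phi) / |grad(phi)|),
    # approximated with central differences via torch.gradient.
    gy, gx = torch.gradient(phi, dim=(2, 3))
    norm = torch.sqrt(gx ** 2 + gy ** 2 + eps)
    nx, ny = gx / norm, gy / norm
    dnx_dx = torch.gradient(nx, dim=3)[0]
    dny_dy = torch.gradient(ny, dim=2)[0]
    return dnx_dx + dny_dy

def curvature_loss(phi_pred: torch.Tensor, phi_gt: torch.Tensor) -> torch.Tensor:
    # L1 penalty on the curvature discrepancy.
    return (curvature(phi_pred) - curvature(phi_gt)).abs().mean()

# total = seg_loss + lam1 * sdf_loss(phi_p, phi_g) + lam2 * curvature_loss(phi_p, phi_g)
```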
Adversarial DQN for Prompt Optimization (Liu et al., 23 Sep 2025)
Prompt refinement is framed as a two-player reinforcement learning game:
- State: $s_t = (G, m_t)$, where the binary mask $m_t \in \{0,1\}^{|V|}$ encodes the currently active prompts.
- Attacker action: Activates nodes (adds prompts), maximizing the decrease in mask quality.
- Defender action: Deactivates nodes (removes prompts), maximizing the recovery of mask quality.
- Reward: Let $q(M_t)$ denote the quality (IoU or Dice) of SAM's output mask $M_t$ at step $t$. Then
  - Attacker: $r^{A}_{t} = q(M_{t-1}) - q(M_{t})$
  - Defender: $r^{D}_{t} = q(M_{t}) - q(M_{t-1})$
Both agents are Deep Q-Networks (GCN-based, two layers of width 128). $Q$-learning is conducted with experience replay and the temporal-difference loss

$$\mathcal{L}_{\mathrm{TD}} = \left(r_t + \gamma \max_{a'} Q_{\theta^{-}}(s_{t+1}, a') - Q_{\theta}(s_t, a_t)\right)^{2},$$

where $\theta^{-}$ are the periodically updated target-network parameters. Only the defender agent is needed at inference. A sketch of the update follows.
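A minimal sketch of one TD update, assuming Q-networks that map a graph state to one Q-value per node and a replay batch of `(s, a, r, s_next, done)` tensors; the double-DQN target mirrors the stabilization noted in Section 3:

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, batch, gamma: float = 0.99) -> torch.Tensor:
    """One temporal-difference step for either agent (attacker or defender).

    q_net(s) and target_net(s) are assumed to return (B, N) per-node Q-values;
    the batch layout and gamma value are illustrative.
    """
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        # Double-DQN target: the online net selects, the target net evaluates.
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next
    return F.smooth_l1_loss(q_sa, target)
```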
3. Training Paradigms and Implementation
Data and Preprocessing (Li et al., 2023)
- Medical datasets: MR brain white-matter, CT heart, Chest X-ray lungs, CT liver.
- Few-shot regimes: Experiments performed with as few as 5–20 labeled images.
- nnUNet preprocessing: Intensity normalization, geometric augmentations.
- SAM preprocessing: Images resized to 1024×1024 for SAM, then embeddings up- or downsampled as needed (see the sketch after this list).
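A simplified sketch of the SAM-side preprocessing; it resizes directly to 1024×1024 rather than using SAM's resize-longest-side-then-pad scheme, and the normalization statistics are passed in rather than hard-coded:

```python
import torch
import torch.nn.functional as F

def sam_preprocess(image: torch.Tensor,
                   pixel_mean: torch.Tensor,
                   pixel_std: torch.Tensor) -> torch.Tensor:
    # image: (C, H, W) float tensor; returns a (1, C, 1024, 1024) batch.
    x = F.interpolate(image.unsqueeze(0), size=(1024, 1024),
                      mode="bilinear", align_corners=False)
    return (x - pixel_mean.view(1, -1, 1, 1)) / pixel_std.view(1, -1, 1, 1)
```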
nnSAM Adversarial Agent Training (Liu et al., 23 Sep 2025)
- Prompt graph construction: Grid or feature-matched candidate prompts with DINOv2 embeddings; adjacency encodes spatial + semantic proximity.
- Episodes: Each step alternates attacker and defender actions, updating the prompt set and querying SAM for mask quality (see the episode sketch after this list).
- Optimization: Adam, with the learning rate and discount factor $\gamma$ set as hyperparameters; $\epsilon$-greedy exploration annealed over training.
- Stability: Gradient clipping and double-DQN to avoid overestimation.
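A minimal sketch of one training episode under the alternating game; all interfaces (`graph_state`, the agents' `act` methods, and `sam_eval`, which queries SAM and scores the resulting mask) are hypothetical stand-ins:

```python
def run_episode(graph_state, attacker, defender, sam_eval, steps: int = 10):
    """Alternates attack and defense; each agent's reward is the signed
    change in mask quality, matching the rewards in Section 2."""
    transitions = []
    q_prev = sam_eval(graph_state.active_prompts())
    for _ in range(steps):
        # Attacker activates a node (adds a prompt) to degrade mask quality.
        a_atk = attacker.act(graph_state)
        graph_state.activate(a_atk)
        q_mid = sam_eval(graph_state.active_prompts())
        transitions.append(("attacker", a_atk, q_prev - q_mid))
        # Defender deactivates a node (removes a prompt) to recover quality.
        a_def = defender.act(graph_state)
        graph_state.deactivate(a_def)
        q_new = sam_eval(graph_state.active_prompts())
        transitions.append(("defender", a_def, q_new - q_mid))
        q_prev = q_new
    return transitions  # pushed to the replay buffer for both agents
```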
Runtime Usage
- Medical nnSAM: Operates as a single end-to-end U-Net model with frozen SAM embedding; suitable for batch inference and training on small medical datasets.
- Prompt optimizer nnSAM: Pure PyTorch wrapper; performs 50 defender steps (~0.1 s overhead), then calls SAM with the refined prompt set; SAM weights are never updated (a usage sketch follows).
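A usage sketch of the inference-time wrapper, assuming the `segment-anything` `SamPredictor` interface; `defender` and `graph_state` are the hypothetical objects from the sketches above:

```python
import numpy as np

def refine_and_segment(sam_predictor, defender, graph_state, n_steps: int = 50):
    """Runs defender refinement steps, then queries SAM once with the
    refined point-prompt set; SAM itself is never modified."""
    for _ in range(n_steps):
        node = defender.act(graph_state)   # pick the most harmful prompt
        if node is None:                   # defender may signal early stop
            break
        graph_state.deactivate(node)       # drop it from the prompt set
    coords, labels = graph_state.active_prompts()  # (K, 2) points, (K,) labels
    masks, scores, _ = sam_predictor.predict(
        point_coords=np.asarray(coords),
        point_labels=np.asarray(labels),
        multimask_output=False,
    )
    return masks[0]
```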
4. Comparative Quantitative Performance
Medical Segmentation Benchmarks (Li et al., 2023)
| Task | Method | Dice (%) | ASD (mm) |
|---|---|---|---|
| MR brain WM | nnSAM | 82.77 ± 10.12 | 1.14 ± 1.03 |
| MR brain WM | nnUNet | 79.25 ± 17.24 | 1.36 ± 1.63 |
| MR brain WM | AutoSAM | 77.44 ± 14.69 | 1.69 ± 1.55 |
| CT heart substructures | nnSAM | 94.19 ± 1.51 | 1.36 ± 0.42 |
| CT heart substructures | nnUNet | 93.76 ± 2.95 | 1.48 ± 0.65 |
| CT liver | nnSAM | 85.24 ± 23.74 | 6.18 ± 16.02 |
| CT liver | nnUNet | 83.69 ± 26.32 | 6.70 ± 15.66 |
| Chest X-ray lungs | nnSAM | 93.63 ± 1.49 | 1.47 ± 0.42 |
| Chest X-ray lungs | nnUNet | 93.01 ± 2.41 | 1.63 ± 0.57 |
- Few-shot: On brain WM with only 5 labeled samples, nnSAM yields an absolute Dice gain over nnUNet; the improvement persists with 10–20 samples.
- Interpretation: The largest benefit is observed under severe annotation scarcity, attributed to combined SAM features and level-set/curvature regularization.
Prompt Optimization Benchmarks (Liu et al., 23 Sep 2025)
| Dataset | mIoU Gain over Grid/Feature Prompts (%) |
|---|---|
| PASCAL VOC | +25.5 |
| ISIC | +9.2 |
| Kvasir | +23.4 |
- Ablation: Without adversarial attacker training, generalization degrades by 10% mIoU; dual-space graphs improve mIoU by 5% over semantic-only or spatial-only alternatives.
- Robustness: The defender agent tightens segmentation boundaries and prunes outliers, especially when facing noisy or adversarial prompt initializations.
5. Domain Adaptation, Scalability, and Extension
- Domain-agnostic design: The medical nnSAM exploits SAM's domain-agnostic embeddings while learning medical priors through the nnUNet backbone and level-set/curvature losses, enabling superior performance in limited-data domains.
- Plug-and-play inference: The prompt optimizer nnSAM acts as a drop-in front-end; no SAM re-training is required. Evaluation on natural, medical, and aerial imagery demonstrates generalization without retraining.
- Sample efficiency: Both lines of work highlight significant gains in limited annotation settings (e.g., 5–20 training samples in medical, 1-shot segmentation in prompt optimization).
- Computational profile: Overhead for the prompt optimizer (~0.1 s for the defender steps, with most time spent in the SAM call) is negligible in practice; the medical nnSAM runs end-to-end as a single segmentation network.
6. Limitations and Future Directions
- Level-set/curvature head: Assumes regular object shapes; may underperform for irregular structures such as tumors (Li et al., 2023), and currently limited to 2D.
- Prompt optimizer agent: Defender success depends on the quality of initial prompts; with semantically irrelevant or poorly distributed prompts, recovery is limited (Liu et al., 23 Sep 2025).
- 3D extension: Fusion of 3D SAM embeddings with 3D nnUNet architectures remains an unresolved technical challenge.
- Annotation minimization: While nnSAM methods operate in few-shot regimes, manual labels are still required; one- or zero-shot segmentation remains an important next step.
- Semantic generalization: Integration of prompt-free SAM finetuning or unsupervised shape priors, and further development of unsupervised or adaptive prompt generation methods, are posited as promising directions.
7. Significance and Broader Impact
The nnSAM frameworks presented here represent modular, plug-and-play strategies for bridging foundation models such as SAM with high-performance domain adaptation, whether for medical imaging or robust prompt optimization. By combining frozen foundation-model features, automated configuration, and domain-specific regularization, nnSAM consistently outperforms conventional architectures on small and heterogeneous datasets with minimal human tuning. The adversarial prompt-agent variant demonstrates that attacker/defender RL paradigms can enable training-free, robust, and generalizable segmentation without modifying SAM's backbone, facilitating practical deployment across domains.