nnSAM: Plug-and-Play SAM Extension

Updated 12 November 2025
  • The paper advances nnSAM by integrating a frozen SAM encoder with nnUNet using level-set and curvature losses to enhance medical segmentation.
  • It refines prompt optimization with an adversarial reinforcement learning agent that improves robustness against challenging prompt scenarios.
  • The nnSAM frameworks enable plug-and-play adaptability and domain-specific regularization, significantly boosting performance in few-shot settings.

nnSAM (Plug-and-play Segment Anything Model) refers to a family of methods that extend the Segment Anything Model (SAM) with domain adaptation, robustness, or additional capabilities, typically in a modular, training-optional, or prompt-based fashion. Notably, the term "nnSAM" appears in distinct but related research lines: (1) a model for medical image segmentation combining SAM and nnUNet with level-set/curvature supervision (Li et al., 2023), and (2) a plug-in adversarial agent for prompt optimization to boost SAM’s robustness (Liu et al., 23 Sep 2025). This article surveys the core technical frameworks, mathematical foundations, training paradigms, and benchmarking of “nnSAM” in both senses.

1. Architectural Foundations

The two principal variants of nnSAM target distinct settings but share a focus on plug-and-play enhancement of SAM via feature-fusion, auxiliary losses, or external optimization agents.

Medical nnSAM (Li et al., 2023):

  • Backbone integration: nnSAM attaches a frozen SAM encoder (a ViT pre-trained on SA-1B) in parallel to an nnUNet encoder. SAM’s output embeddings (originally $64\times64$ from a $1024\times1024$ input) are bilinearly resized to match each nnUNet encoder level, then concatenated channel-wise.
  • Fused decoder: The nnUNet decoder splits into (a) a segmentation head for per-pixel class probabilities $\{p_j(a,b)\}$, and (b) a regression head for predicting signed distance maps (level sets) $\phi'(a,b)$, from which boundary curvature is derived.
  • Training: Only nnUNet parameters (encoder–decoder, segmentation, regression heads) are updated. The SAM encoder’s weights remain frozen.

Prompt optimizer nnSAM (Liu et al., 23 Sep 2025):

  • Agent wrapping: SAM is wrapped at inference time by a “defender” agent trained to refine point prompt sets; a paired “attacker” agent synthesizes worst-case prompts during training.
  • Prompt environment: Each image is represented as a dual-space graph $G=(V,E)$, with nodes as candidate prompts/patches. Node features combine DINOv2 semantic embeddings and 2D coordinates.
  • Plug-in operation: The defender agent is the only run-time addition; SAM’s architecture and weights remain unchanged.
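
The backbone integration described above (bilinear resizing of the frozen SAM embedding to each nnUNet encoder level, followed by channel-wise concatenation) can be illustrated with a minimal NumPy sketch. The channel counts, the encoder-level shapes, the align-corners sampling convention, and the helper names `bilinear_resize` and `fuse` are assumptions made for illustration; they are not taken from the paper or from the nnUNet code base.

```python
import numpy as np

def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = feat.shape
    # Sample positions in the source grid (align-corners convention, assumed).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]   # vertical interpolation weights
    wx = (xs - x0)[None, None, :]   # horizontal interpolation weights
    top = feat[:, y0][:, :, x0] * (1 - wx) + feat[:, y0][:, :, x1] * wx
    bot = feat[:, y1][:, :, x0] * (1 - wx) + feat[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse(sam_embed, unet_feat):
    """Resize the SAM embedding to one encoder level and concatenate channels."""
    _, h, w = unet_feat.shape
    resized = bilinear_resize(sam_embed, h, w)
    return np.concatenate([unet_feat, resized], axis=0)

rng = np.random.default_rng(0)
# Placeholder shapes: 8 SAM channels on a 64x64 grid, two hypothetical encoder levels.
sam_embed = rng.standard_normal((8, 64, 64))
unet_levels = [rng.standard_normal((32, 128, 128)),
               rng.standard_normal((64, 32, 32))]
fused = [fuse(sam_embed, level) for level in unet_levels]
print([f.shape for f in fused])
```

Because the SAM encoder is frozen, its embedding can be computed once per image and reused at every encoder level; only the resize target changes.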

2. Mathematical Formulation and Optimization

To enforce anatomical priors, nnSAM augments the segmentation loss with boundary-shape losses using level sets and curvature:

  • Signed distance function loss:

$$\phi(a,b)= \begin{cases} -d(a,b) & \text{inside} \\ 0 & \text{boundary} \\ +d(a,b) & \text{outside} \end{cases}$$

where $d(a,b)$ is the Euclidean distance to the ground-truth boundary.

$$\mathrm{Loss}_l = \frac{1}{HWC}\sum_{a,b,j}\left(\phi_j(a,b)-\phi'_j(a,b)\right)^2$$

  • Curvature loss: The level set $\phi$ is sharpened by $\hat\phi = \sigma(-1000\,\phi)$, and the local curvature is

$$\kappa_{\hat\phi} = \frac{\left|(1+\hat\phi_a^2)\hat\phi_{bb} + (1+\hat\phi_b^2)\hat\phi_{aa} - 2\hat\phi_a\hat\phi_b\hat\phi_{ab}\right|}{2\left(1+\hat\phi_a^2+\hat\phi_b^2\right)^{3/2}}$$

Curvature discrepancy is penalized as:

$$\mathrm{Loss}_c = \frac{1}{HWC}\sum_{a,b,j}\left|\kappa_{\hat\phi_j}(a,b) - \kappa_{\hat\phi'_j}(a,b)\right|$$

  • Total loss:

$$\mathrm{Loss} = \lambda_1 \mathrm{Loss}_s + \lambda_2 \mathrm{Loss}_l + \lambda_3 \mathrm{Loss}_c$$

where $\mathrm{Loss}_s$ is the segmentation loss, with $\lambda_1 = 1$, $\lambda_2 = 0.1$, $\lambda_3 = 10^{-4}$.
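
To make the boundary-shape losses concrete, the following NumPy sketch computes a signed distance map, the level-set loss, the curvature loss, and the weighted total for a single class ($C=1$). It is an illustration under stated assumptions, not the authors' implementation: the brute-force distance computation, the 4-neighbour boundary definition, the finite-difference derivatives via `np.gradient`, the clipping inside the sigmoid, and the placeholder value for the segmentation loss are all choices made here.

```python
import numpy as np

def signed_distance(mask):
    """Signed distance map: negative inside, 0 on the boundary, positive outside.
    Brute force; assumes the mask contains at least one foreground pixel."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, mode="edge")
    # A foreground pixel is a boundary pixel if any 4-neighbour is background.
    all_nb_fg = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                 padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = mask & ~all_nb_fg
    by, bx = np.nonzero(boundary)
    yy, xx = np.indices(mask.shape)
    d = np.sqrt((yy[..., None] - by) ** 2 + (xx[..., None] - bx) ** 2).min(axis=-1)
    return np.where(mask, -d, d)

def curvature(phi, k=1000.0):
    """Curvature of the sharpened level set phi_hat = sigmoid(-k * phi)."""
    z = np.clip(-k * phi, -60.0, 60.0)     # clipping only prevents overflow in exp
    phi_hat = 1.0 / (1.0 + np.exp(-z))
    pa, pb = np.gradient(phi_hat)          # first derivatives along a (rows), b (cols)
    paa, pab = np.gradient(pa)             # second derivatives
    _, pbb = np.gradient(pb)
    num = np.abs((1 + pa**2) * pbb + (1 + pb**2) * paa - 2 * pa * pb * pab)
    return num / (2 * (1 + pa**2 + pb**2) ** 1.5)

def shape_losses(phi_true, phi_pred):
    """Level-set (MSE) and curvature (L1) losses, averaged over all pixels."""
    loss_l = np.mean((phi_true - phi_pred) ** 2)
    loss_c = np.mean(np.abs(curvature(phi_true) - curvature(phi_pred)))
    return loss_l, loss_c

def total_loss(loss_s, loss_l, loss_c, lambdas=(1.0, 0.1, 1e-4)):
    """Weighted sum with the lambda values quoted in the text."""
    l1, l2, l3 = lambdas
    return l1 * loss_s + l2 * loss_l + l3 * loss_c

mask = np.zeros((16, 16), dtype=bool)
mask[5:11, 5:11] = True                    # toy ground-truth square
phi = signed_distance(mask)
phi_pred = phi + 0.5                       # toy prediction: constant offset
loss_l, loss_c = shape_losses(phi, phi_pred)
loss = total_loss(loss_s=0.3, loss_l=loss_l, loss_c=loss_c)  # 0.3 is a placeholder
print(loss_l, loss_c, loss)
```

The small weight $\lambda_3 = 10^{-4}$ is consistent with the curvature term being computed on a sharply thresholded map, where second derivatives concentrate near the boundary and can be large.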

Prompt refinement is framed as a two-player reinforcement learning game:

  • State: $(G, \sigma_t)$, where $\sigma_t\in\{0,1\}^n$ encodes active prompts.
  • Attacker action: Activates nodes (adds prompts), maximizing decrease in mask quality.
  • Defender action: Deactivates nodes (removes prompts), maximizing recovery of mask quality.
  • Reward:
    • Attacker: $r_t^{\mathrm{atk}} = S(P_{t-1}) - S(P_t)$
    • Defender: $r_t^{\mathrm{def}} = S(P_t) - S(P_{t-1})$, where $P_t$ is the prompt set at step $t$ and $S$ is the resulting mask quality (IoU or Dice).

Both agents are Deep Q-Networks (GCN-based, two layers with width 128). $Q$-learning is conducted using experience replay and a temporal-difference loss:

$$\mathcal L = \left( r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1},a') - Q_\theta(s_t,a_t) \right)^2$$

where $\theta^-$ denotes the target-network parameters. Only the defender agent is needed at inference.
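
The reward definitions and the temporal-difference loss above can be traced with a small worked example in plain Python. The scores, Q-values, and function names below are invented for illustration, and treating a terminal step as having zero bootstrap value is an assumption not stated in the text.

```python
def iou(pred, target):
    """IoU of two sets of foreground pixel indices (one choice for the score S)."""
    union = pred | target
    return len(pred & target) / len(union) if union else 1.0

def attacker_reward(score_prev, score_curr):
    """r_atk = S(P_{t-1}) - S(P_t): positive when mask quality drops."""
    return score_prev - score_curr

def defender_reward(score_prev, score_curr):
    """r_def = S(P_t) - S(P_{t-1}): positive when mask quality recovers."""
    return score_curr - score_prev

def td_loss(q_sa, reward, next_q_target, gamma=0.99, done=False):
    """Squared TD error; next_q_target holds target-network Q-values for s_{t+1}."""
    bootstrap = 0.0 if done or not next_q_target else max(next_q_target)
    return (reward + gamma * bootstrap - q_sa) ** 2

# Toy episode: the attacker adds a misleading prompt, the defender removes it.
s0, s_after_attack, s_after_defense = 0.80, 0.55, 0.75
r_atk = attacker_reward(s0, s_after_attack)                # 0.80 - 0.55
r_def = defender_reward(s_after_attack, s_after_defense)   # 0.75 - 0.55
loss = td_loss(q_sa=0.1, reward=r_def, next_q_target=[0.3, 0.5])
print(r_atk, r_def, loss)
print(iou({1, 2, 3}, {2, 3, 4}))
```

For a single transition the two rewards are exact negatives of each other, which is what makes the attacker's training signal adversarial to the defender's.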

3. Training Paradigms and Implementation

Medical nnSAM:

  • Medical datasets: MR brain white-matter, CT heart, Chest X-ray lungs, CT liver.
  • Few-shot regimes: Experiments performed with as few as 5–20 labeled images.
  • nnUNet preprocessing: Intensity normalization, geometric augmentations.
  • SAM preprocessing: Images resized to $1024\times1024$ for SAM, then embeddings up/downsampled as needed.

Prompt optimizer nnSAM:

  • Prompt graph construction: Grid or feature-matched candidate prompts with DINOv2 embeddings; adjacency encodes spatial + semantic proximity.
  • Episodes: Each step alternates attacker/defender actions, updating the prompt set and querying SAM for mask quality.
  • Optimization: Adam, learning rate $1\times10^{-4}$, $\gamma=0.99$, annealed $\epsilon$-greedy exploration.
  • Stability: Gradient clipping, plus double-DQN to reduce Q-value overestimation.
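
One way to picture the prompt graph construction is the NumPy sketch below, which connects candidate prompts that are close either in image coordinates or in feature space. The text only states that adjacency encodes spatial and semantic proximity; the k-nearest-neighbour rule, the union of the two edge sets, the cosine distance, and the concatenated node features are assumptions, and random vectors stand in for DINOv2 embeddings.

```python
import numpy as np

def knn_adjacency(dist, k):
    """Symmetric k-nearest-neighbour adjacency from a pairwise distance matrix."""
    n = dist.shape[0]
    d = dist.astype(float)                 # astype returns a copy by default
    np.fill_diagonal(d, np.inf)            # exclude self-loops
    nearest = np.argsort(d, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return adj | adj.T

def dual_space_graph(coords, feats, k=4):
    """Union of spatial and semantic kNN edges; node features = [embedding, xy]."""
    spatial = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    semantic = 1.0 - unit @ unit.T         # cosine distance between embeddings
    adj = knn_adjacency(spatial, k) | knn_adjacency(semantic, k)
    node_feats = np.concatenate([unit, coords], axis=1)
    return adj, node_feats

rng = np.random.default_rng(0)
# 4x4 grid of candidate prompts with normalized coordinates.
axis = np.linspace(0.0, 1.0, 4)
coords = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, 2)
feats = rng.standard_normal((16, 8))       # stand-in for DINOv2 patch embeddings
adj, node_feats = dual_space_graph(coords, feats)
print(adj.shape, node_feats.shape, int(adj.sum()) // 2, "edges")
```

Dropping either `knn_adjacency` term yields the spatial-only or semantic-only graph, which is the kind of single-space baseline an ablation would compare against.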

Runtime Usage

  • Medical nnSAM: Operates as a single end-to-end U-Net model with frozen SAM embedding; suitable for batch inference and training on small medical datasets.
  • Prompt optimizer nnSAM: Pure PyTorch wrapper; performs ~50 defender steps (~0.1 s overhead), then calls SAM with the refined prompt set; SAM weights are never updated.
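
The run-time behaviour of the prompt optimizer can be sketched as a loop that deactivates prompts and then calls the unchanged segmenter once. Everything here is hypothetical: `refine_prompts`, `toy_q`, and `toy_segment` are illustrative stand-ins for the trained defender network and for SAM, and the greedy stopping rule (stop when no removal has a positive estimated value) is an assumption rather than the published policy.

```python
def refine_prompts(prompts, q_values, segment, max_steps=50):
    """Defender loop: remove the prompt with the highest removal value each step."""
    active = set(prompts)
    for _ in range(max_steps):
        if len(active) <= 1:
            break
        scores = q_values(active)          # {prompt: estimated gain from removing it}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= 0:         # no removal is expected to help
            break
        active.remove(candidate)
    kept = sorted(active)
    return segment(kept), kept             # the segmenter itself is never modified

def toy_q(active):
    # Stand-in for the defender DQN: flags points far from the centre (50, 50).
    return {p: abs(p[0] - 50) + abs(p[1] - 50) - 40 for p in active}

def toy_segment(points):
    # Stand-in for SAM: returns the bounding box of the prompts as a "mask".
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

prompts = [(48, 52), (55, 47), (50, 60), (5, 95), (99, 3)]
mask, kept = refine_prompts(prompts, toy_q, toy_segment)
print(kept)
print(mask)
```

In this toy run the two outlying points are pruned and the three central prompts are passed to the segmenter, mirroring the outlier-pruning behaviour attributed to the defender.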

4. Comparative Quantitative Performance

Medical nnSAM:

| Task | Method | Dice (%) | ASD (mm) |
|---|---|---|---|
| MR brain WM | nnSAM | 82.77 ± 10.12 | 1.14 ± 1.03 |
| MR brain WM | nnUNet | 79.25 ± 17.24 | 1.36 ± 1.63 |
| MR brain WM | AutoSAM | 77.44 ± 14.69 | 1.69 ± 1.55 |
| CT heart substructures | nnSAM | 94.19 ± 1.51 | 1.36 ± 0.42 |
| CT heart substructures | nnUNet | 93.76 ± 2.95 | 1.48 ± 0.65 |
| CT liver | nnSAM | 85.24 ± 23.74 | 6.18 ± 16.02 |
| CT liver | nnUNet | 83.69 ± 26.32 | 6.70 ± 15.66 |
| Chest X-ray lungs | nnSAM | 93.63 ± 1.49 | 1.47 ± 0.42 |
| Chest X-ray lungs | nnUNet | 93.01 ± 2.41 | 1.63 ± 0.57 |

  • Few-shot: On brain WM with 5 samples, nnSAM yields a +6.3% absolute Dice gain over nnUNet; the improvement persists with 10–20 samples.
  • Interpretation: The largest benefit is observed under severe annotation scarcity, attributed to combined SAM features and level-set/curvature regularization.
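
As a quick arithmetic check on the table above, the short Python snippet below computes the absolute Dice gain and the ASD reduction of nnSAM over nnUNet for each task, using only the tabulated means (standard deviations are ignored). The dictionary layout is a convenience chosen here.

```python
# Mean (Dice %, ASD mm) per task, copied from the table above.
results = {
    "MR brain WM":            {"nnSAM": (82.77, 1.14), "nnUNet": (79.25, 1.36)},
    "CT heart substructures": {"nnSAM": (94.19, 1.36), "nnUNet": (93.76, 1.48)},
    "CT liver":               {"nnSAM": (85.24, 6.18), "nnUNet": (83.69, 6.70)},
    "Chest X-ray lungs":      {"nnSAM": (93.63, 1.47), "nnUNet": (93.01, 1.63)},
}

gains = {}
for task, r in results.items():
    dice_gain = round(r["nnSAM"][0] - r["nnUNet"][0], 2)   # higher Dice is better
    asd_drop = round(r["nnUNet"][1] - r["nnSAM"][1], 2)    # lower ASD is better
    gains[task] = (dice_gain, asd_drop)
    print(f"{task}: Dice +{dice_gain:.2f} points, ASD -{asd_drop:.2f} mm")
```

The mean differences are largest for MR brain white matter and smallest for CT heart, and in every row they are well within the reported standard deviations, so they are best read as trends rather than as statistically established gaps.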

Prompt optimizer nnSAM:

| Dataset | mIoU gain over grid/feature prompts (%) |
|---|---|
| PASCAL VOC | +25.5 |
| ISIC | +9.2 |
| Kvasir | +23.4 |

  • Ablation: Without attacker training, generalization degrades by ~10% mIoU. Dual-space graphs improve mIoU by ~5% over either semantic-only or spatial-only (single-space) graphs.
  • Robustness: The defender agent tightens segmentation boundaries and prunes outliers, especially when facing noisy or adversarial prompt initializations.

5. Domain Adaptation, Scalability, and Extension

  • Domain-agnostic design: The medical nnSAM exploits SAM’s domain-agnostic embeddings while learning medical priors via the nnUNet+curvature losses, enabling superior performance in limited-data domains.
  • Plug-and-play inference: The prompt optimizer nnSAM acts as a drop-in front-end; no SAM re-training is required. Evaluation on natural, medical, and aerial imagery demonstrates generalization without retraining.
  • Sample efficiency: Both lines of work highlight significant gains in limited annotation settings (e.g., 5–20 training samples in medical, 1-shot segmentation in prompt optimization).
  • Computational profile: Overhead for the prompt optimizer (~0.1 s for the defender steps, with most time spent in the SAM call) is negligible for practical use; the medical nnSAM runs as a single end-to-end model.

6. Limitations and Future Directions

  • Level-set/curvature head: Assumes regular object shapes; may underperform for irregular structures such as tumors (Li et al., 2023), and currently limited to 2D.
  • Prompt optimizer agent: Defender success depends on the quality of initial prompts; with semantically irrelevant or poorly distributed prompts, recovery is limited (Liu et al., 23 Sep 2025).
  • 3D extension: Fusion of 3D SAM embeddings with 3D nnUNet architectures remains an unresolved technical challenge.
  • Annotation minimization: While nnSAM methods work in few-shot settings, manual labels are still needed; one/zero-shot segmentation remains an important next step.
  • Semantic generalization: Integration of prompt-free SAM finetuning or unsupervised shape priors, and further development of unsupervised or adaptive prompt generation methods, are posited as promising directions.

7. Significance and Broader Impact

The presented nnSAM frameworks represent modular, plug-and-play strategies to bridge foundational models (e.g., SAM) and high-performance domain adaptation (e.g., for medical imaging or robust prompt optimization). By combining frozen foundation model features, automated configuration, and domain-specific regularization, the medical nnSAM outperforms nnUNet and AutoSAM in the reported benchmarks on small and heterogeneous data with minimal human tuning. The adversarial prompt agent variant demonstrates that defense-for-attack RL paradigms can enable robust and generalizable segmentation without retraining or modifying SAM’s backbone, facilitating practical deployment across domains.
