Segment-It-All Model (SIAM)

Updated 4 July 2026

SIAM is a unified segmentation paradigm that replaces specialized pipelines with models using shared representations and decoders across multiple segmentation tasks.
It employs synthetic training with domain randomization and a handful of high-quality templates to achieve robust 3D whole-head segmentation with high Dice scores.
Architecturally, SIAM integrates dynamic conditioning and efficient multi-scale processing, enabling real-time, contrast-agnostic segmentation across various modalities.

“Segment It All Model” (SIAM) denotes a unifying segmentation paradigm in which a single model is designed to cover a broad segmentation scope rather than a narrowly delimited task. In the cited literature, the term appears in two closely related senses. First, it is the explicit name of a 3D whole-head segmentation framework for head and brain MRI segmentation from a small set of manually annotated templates (Valabregue et al., 4 May 2026). Second, it functions as a broader conceptual label for systems that unify multiple segmentation modalities under a shared architecture, as in RAP-SAM for real-time interactive segmentation, image panoptic segmentation, and video instance segmentation (Xu et al., 2024), and the One-Prompt Model for universal medical image segmentation conditioned by a single prompted exemplar (Wu et al., 2023). Across these works, the common objective is to replace task-specific pipelines with shared representations, shared decoders or conditioning paths, and training schemes that preserve generalization while reducing either latency, annotation cost, or template dependence.

1. Conceptual scope and problem formulation

The SIAM idea emerges from a common dissatisfaction with task-fragmented segmentation systems. In generic vision, the motivating problem is that most foundation-style segmentation methods rely on heavy encoder and decoder frameworks, which hinders their performance in real-time scenarios; real-time work, in turn, has often concentrated on semantic segmentation in specific environments and overlooked generalization across diverse scenarios (Xu et al., 2024). RAP-SAM addresses this by defining a real-time multi-purpose segmentation setting with three sub-tasks—interactive segmentation, panoptic segmentation, and video instance segmentation—handled by one end-to-end model with the same shared parameters and decoder.

In medical imaging, the problem is different but structurally analogous. One-Prompt Segmentation is introduced as a new conditioning paradigm for universal medical image segmentation. Instead of learning a task-specific function $y = f^d_\theta(x^d)$, or a meta-function requiring a labeled support set $S^d = \{(x^d_j, y^d_j)\}$, it learns a universal function $y = f_\theta(x^d_q, k^d)$, where $x^d_q$ is the query image and $k^d = \{x^d_c, p^d_c\}$, a template image $x^d_c$ paired with its prompt $p^d_c$, is a single prompted template per task (Wu et al., 2023). This changes the unit of conditioning from “one prompt per image” to “one prompted sample per task.”

In head and brain MRI, SIAM is the explicit model name for a synthetic-training framework that departs from both preprocessing-heavy classical neuroimaging pipelines and synthetic models trained on large banks of automatically labeled templates (Valabregue et al., 4 May 2026). Its central claim is that six high-quality, manually curated, high-resolution whole-head templates can suffice when combined with domain randomization in both intensity and shape domains.

A concise comparison clarifies the three usages.

Work	Core mechanism	Segmentation scope
RAP-SAM (Xu et al., 2024)	Shared prompt-driven decoder with dynamic convolution and light adapters	Interactive, image panoptic, video instance
One-Prompt Model (Wu et al., 2023)	Single prompted template with cross-attentional transfer	Universal medical image segmentation across unseen tasks
SIAM (Valabregue et al., 4 May 2026)	Few-template synthetic training with intensity and shape domain randomization	16-class 3D whole-head segmentation

This suggests that SIAM is best understood as a family of “segment-it-all” strategies rather than a single canonical architecture. The family resemblance lies in the attempt to unify segmentation targets, supervision protocols, and inference interfaces under one model.

2. Architectural organization

RAP-SAM is built around an efficient encoder, a lightweight neck, and a unified prompt-driven decoder (Xu et al., 2024). It uses lightweight CNN or mobile-friendly backbones such as ResNet-18/50, STDC-v1, SeaFormer, EdgeNeXt, and TopFormer, coupled with a feature pyramid neck enhanced by deformable convolutions to fuse multi-scale features into a single aligned map. For panoptic or interactive inputs, the encoder produces $F_{\text{img}} \in \mathbb{R}^{H/4 \times W/4 \times d}$; for video, the same backbone is applied per frame, yielding $F_{\text{vid}} \in \mathbb{R}^{T \times H/4 \times W/4 \times d}$. Visual prompts are converted by SAM’s prompt encoder into prompt queries $P_i \in \mathbb{R}^{K \times d}$, while learned object queries $Q_i \in \mathbb{R}^{N \times d}$ serve panoptic and video tasks. The decoder refines all queries together and then branches through two light adapters: an object adapter based on dynamic convolution and mask pooling, and a prompt adapter based on pixel-wise cross-attention.
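The tensor shapes above can be made concrete with a small NumPy sketch. It assumes dot-product mask prediction, a common design in query-based segmenters in which each query is compared with every feature location; the source does not confirm that RAP-SAM's mask head has exactly this form. The function name `predict_mask_logits` and all toy sizes are hypothetical.

```python
import numpy as np

def predict_mask_logits(queries: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Dot-product mask prediction: one logit map per query.

    queries:  (num_queries, d)
    features: (..., d), e.g. (H/4, W/4, d) for an image
              or (T, H/4, W/4, d) for a video clip
    returns:  (num_queries, ...) mask logits
    """
    # Contract the channel axis of the queries with the last axis of the features.
    return np.tensordot(queries, features, axes=([1], [-1]))

rng = np.random.default_rng(0)
H, W, T, d, K, N = 64, 96, 3, 8, 2, 5           # toy sizes, all hypothetical
F_img = rng.standard_normal((H // 4, W // 4, d))
F_vid = rng.standard_normal((T, H // 4, W // 4, d))
P = rng.standard_normal((K, d))                  # prompt queries
Q = rng.standard_normal((N, d))                  # object queries

# Prompt and object queries share one decoder, so they can be processed together.
img_logits = predict_mask_logits(np.concatenate([P, Q], axis=0), F_img)
vid_logits = predict_mask_logits(Q, F_vid)
print(img_logits.shape, vid_logits.shape)
```

Because the same function handles image and video features, the only difference between the two cases in this sketch is the extra leading time axis.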

One-Prompt uses an encoder-decoder with multi-scale skip connections in which both the query image and the template image pass through the same encoder (Wu et al., 2023). The decoder is a stack of One-Prompt Former blocks with two parallel branches, one for query features and one for template features, linked by extensive cross-attention. The conditioning core is the Prompt-Parser, which mixes prompt embeddings with template and query embeddings to produce an adaptive attentive mask. Prompt information is represented as two embeddings, $p^1$ and $p^2$, and supports Click, Bounding box, Doodle, and SegLab prompt types.

The 2026 SIAM framework adopts a different architectural regime because its primary challenge is volumetric robustness rather than promptability (Valabregue et al., 4 May 2026). Its segmentation engine is nnU-Net in the 3D full-resolution variant with a seven-block residual encoder. Inputs are resampled to a fixed isotropic voxel spacing and processed in fixed-size 3D patches, and the output is a 16-class dense segmentation with no post-processing. Inference uses standard nnU-Net sliding-window inference with 5-fold ensemble averaging.
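The fold-ensembling step can be illustrated with a minimal sketch. It assumes each of the five fold models has already produced a full-volume class-probability map (the sliding-window stitching is omitted), and that ensembling means averaging probabilities before taking the argmax. The helper names and toy volume size are hypothetical, and this is not nnU-Net's own code.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = 0) -> np.ndarray:
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def ensemble_segment(fold_probs: list) -> np.ndarray:
    """Average per-fold class probabilities, then take the argmax label.

    Each array in fold_probs has shape (num_classes, D, H, W).
    """
    mean_probs = np.mean(np.stack(fold_probs, axis=0), axis=0)
    return np.argmax(mean_probs, axis=0)

rng = np.random.default_rng(0)
num_classes, num_folds = 16, 5
# Stand-ins for the outputs of five independently trained fold models.
fold_probs = [softmax(rng.standard_normal((num_classes, 4, 4, 4)))
              for _ in range(num_folds)]
labels = ensemble_segment(fold_probs)
print(labels.shape)
```

Averaging probabilities rather than hard labels lets a confident fold outweigh several uncertain ones at a given voxel.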

Although these systems differ sharply in modality and interface, their architectural logic is parallel. RAP-SAM unifies tasks through shared queries and a shared decoder; One-Prompt unifies tasks through a prompted template and cross-attentional transfer; SIAM unifies targets through a single whole-head label space and a single contrast-agnostic 3D segmentation network.

3. Conditioning mechanisms and mathematical structure

RAP-SAM’s decoder replaces heavy per-pixel cross-attention with pooling-based dynamic convolution (Xu et al., 2024). Mask pooling aggregates the encoder features that fall within each predicted mask into one feature vector per query: over spatial positions for image panoptic segmentation, and over spatial positions and frames for video tubes. Query refinement then applies gated dynamic convolution to the pooled features and the queries.

The unified training objective is applied with Hungarian matching assigning predicted queries to image instances, stuff regions, video tubes, or prompt-conditioned masks.
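Mask pooling of the kind described in this section can be sketched generically as follows. The sketch assumes soft masks in [0, 1] and an area-normalized weighted average; whether RAP-SAM normalizes by mask area is not stated in the source, so that detail, the function name `mask_pool`, and the toy data are assumptions.

```python
import numpy as np

def mask_pool(features: np.ndarray, masks: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Average features under each soft mask.

    features: (..., d) with leading spatial or spatio-temporal axes
    masks:    (num_queries, ...) with values in [0, 1] over the same leading axes
    returns:  (num_queries, d), one pooled feature vector per query
    """
    d = features.shape[-1]
    flat_feats = features.reshape(-1, d)              # (positions, d)
    flat_masks = masks.reshape(masks.shape[0], -1)    # (num_queries, positions)
    weighted_sum = flat_masks @ flat_feats            # (num_queries, d)
    area = flat_masks.sum(axis=1, keepdims=True)      # mask area per query
    return weighted_sum / (area + eps)

# Image case: the top half carries feature [1, 0], the bottom half [0, 1].
feats_img = np.zeros((4, 4, 2))
feats_img[:2] = [1.0, 0.0]
feats_img[2:] = [0.0, 1.0]
masks_img = np.zeros((2, 4, 4))
masks_img[0, :2] = 1.0      # query 0 covers the top half
masks_img[1, 2:] = 1.0      # query 1 covers the bottom half
pooled_img = mask_pool(feats_img, masks_img)

# Video case: the same function pools over time as well as space.
feats_vid = np.stack([feats_img] * 3)             # (T=3, 4, 4, 2)
masks_vid = np.stack([masks_img] * 3, axis=1)     # (2, T=3, 4, 4)
pooled_vid = mask_pool(feats_vid, masks_vid)
print(pooled_img.round(3), pooled_vid.round(3))
```

The cost of this operation is linear in the number of feature positions per query, which is the kind of saving over dense cross-attention that the section attributes to the pooling-based design.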

One-Prompt formalizes universal medical segmentation as

$y = f_\theta(x^d_q, k^d)$

where $k^d = \{x^d_c, p^d_c\}$ is a single prompted template for task $d$ (Wu et al., 2023). Its Prompt-Parser combines the prompt embeddings with the template and query embeddings and applies Gaussian Masking to produce the adaptive attentive mask. This creates a prompt-conditioned template embedding that is then transferred to the query path by cross-attention. Training uses a simple sum of Dice loss and binary cross-entropy.
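A sum of soft Dice loss and binary cross-entropy for a foreground/background mask can be written in a few lines. This is a minimal NumPy sketch with unit weights on both terms, matching the "simple sum" wording; the smoothing constant `eps` and the function name are assumptions, and the original implementation may differ in reduction or smoothing details.

```python
import math
import numpy as np

def dice_bce_loss(probs: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Soft Dice loss plus binary cross-entropy for one binary mask.

    probs:  predicted foreground probabilities in [0, 1]
    target: binary ground-truth mask of the same shape
    """
    probs = np.clip(probs, eps, 1.0 - eps)     # keep the logarithms finite
    intersection = np.sum(probs * target)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    bce = -np.mean(target * np.log(probs) + (1.0 - target) * np.log(1.0 - probs))
    return float((1.0 - dice) + bce)

target = np.array([[0.0, 1.0], [1.0, 0.0]])
perfect = dice_bce_loss(target, target)
uncertain = dice_bce_loss(np.full_like(target, 0.5), target)
inverted = dice_bce_loss(1.0 - target, target)
print(perfect, uncertain, inverted)
```

The Dice term responds to region overlap and is insensitive to the number of background pixels, while the cross-entropy term supplies a per-pixel gradient; combining them is a common choice when foreground size varies widely across tasks.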

The 2026 SIAM model is not prompt-conditioned; its unification mechanism is synthetic training with explicit domain randomization (Valabregue et al., 4 May 2026). The image synthesis step samples tissue-specific Gaussian intensity distributions, each with its own mean and variance, and renders voxel intensities by partial-volume mixing. Shape randomization is applied through morphological operations and random affine plus elastic deformations. The segmentation network is optimized with combined soft Dice and cross-entropy, as in nnU-Net.
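The intensity half of this label-to-image synthesis can be sketched as follows. The sketch assumes per-tissue means and standard deviations drawn uniformly from hypothetical ranges, and a partial-volume map that gives each tissue's fraction per voxel; shape randomization is omitted. The function name and sampling ranges are illustrative and not taken from the paper.

```python
import numpy as np

def synthesize_image(pv_maps, rng, mean_range=(0.0, 1.0), std_range=(0.01, 0.1)):
    """Render one random-contrast image from partial-volume tissue maps.

    pv_maps: (num_tissues, D, H, W), non-negative fractions summing to 1 per voxel
    returns: (D, H, W) synthetic intensity volume
    """
    num_tissues = pv_maps.shape[0]
    # A fresh random contrast on every call: one Gaussian per tissue.
    means = rng.uniform(*mean_range, size=num_tissues)
    stds = rng.uniform(*std_range, size=num_tissues)
    noise = rng.standard_normal(pv_maps.shape)
    tissue_images = means[:, None, None, None] + stds[:, None, None, None] * noise
    # Partial-volume mixing: weight each tissue image by its voxel fraction.
    return np.sum(pv_maps * tissue_images, axis=0)

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(4, 4, 4))                  # toy 3-tissue label map
pv_maps = np.stack([(labels == k).astype(float) for k in range(3)])
image = synthesize_image(pv_maps, rng)
print(image.shape)
```

Calling the function repeatedly on the same label map yields images with unrelated contrasts, which is what lets a network trained on them become indifferent to the acquisition contrast.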

These formulations illustrate three different answers to the same structural question: how to condition a single segmentation model on heterogeneous tasks or domains. RAP-SAM uses learned queries and prompt embeddings, One-Prompt uses a prompted template exemplar, and SIAM uses synthetic domain expansion in shape and intensity space.

4. Training data regimes and supervision strategies

RAP-SAM is co-trained across COCO panoptic and YouTube-VIS 2019 with identical hyperparameters, while interactive segmentation is trained on COCO-derived SAM-like prompts generated from ground-truth masks (Xu et al., 2024). Pseudo-video training on COCO is added by spatially shifting masks so that the decoder encounters motion-like patterns. Training is implemented in PyTorch with MMDetection and uses distributed training on A100 GPUs with 2 images per GPU for 24 epochs. Optimization uses AdamW with weight decay, a 500-iteration warmup, and step decay at epochs 8 and 11; augmentation uses Large-Scale Jitter with a sampled scale range followed by a fixed-size crop.
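The pseudo-video idea can be illustrated with a toy sketch that turns one still image and its mask into a short clip by translation. It assumes the image and mask are shifted together by integer offsets and uses `np.roll`, which wraps content around the border; both choices are simplifications made for brevity and are not details given in the source.

```python
import numpy as np

def make_pseudo_clip(image: np.ndarray, mask: np.ndarray, shifts: list):
    """Build a pseudo-video by translating one image/mask pair.

    image:  (H, W, C)
    mask:   (H, W)
    shifts: list of (dy, dx) integer offsets, one per output frame
    returns: frames (T, H, W, C) and masks (T, H, W)
    """
    frames = [np.roll(image, shift=(dy, dx), axis=(0, 1)) for dy, dx in shifts]
    masks = [np.roll(mask, shift=(dy, dx), axis=(0, 1)) for dy, dx in shifts]
    return np.stack(frames), np.stack(masks)

image = np.zeros((8, 8, 3))
mask = np.zeros((8, 8))
image[2:4, 2:4] = 1.0          # a small bright square as the only object
mask[2:4, 2:4] = 1.0
clip_frames, clip_masks = make_pseudo_clip(image, mask, [(0, 0), (1, 2), (2, 4)])
print(clip_frames.shape, clip_masks.shape)
```

Each object keeps the same identity across the generated frames, so the clip provides tube-style supervision without any real video annotation.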

One-Prompt is trained on 64 open-source medical datasets and evaluated on 14 held-out datasets (Wu et al., 2023). The corpus spans fundus, ultrasound, histology, CT/MRI slices, X-ray panoramics, angiography video frames, dental CBCT, fetoscopy vessels, WBC microscopy, and gastrointestinal endoscopy. Each training dataset is divided into template, training, and validation splits, and in every training iteration a prompted template from the same dataset as the query image is randomly selected. The label semantics are normalized to the prompt-conditioned foreground/background notion rather than fixed class IDs.
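The per-iteration sampling rule, in which the prompted template always comes from the same dataset as the query image, can be sketched with the standard library. The `PromptedTemplate` type, the dictionary layout, the prompt encoding, and the dataset names are all hypothetical stand-ins for whatever data structures the original code uses.

```python
import random
from dataclasses import dataclass

@dataclass
class PromptedTemplate:
    image_id: str
    prompt: tuple    # e.g. ("click", (x, y)); this encoding is hypothetical

def sample_training_pair(datasets: dict, rng: random.Random):
    """Pick a query image and a prompted template from the same dataset."""
    name = rng.choice(sorted(datasets))
    split = datasets[name]
    query_id = rng.choice(split["train"])
    template = rng.choice(split["template"])
    return name, query_id, template

datasets = {
    "fundus": {
        "template": [PromptedTemplate("fundus/t0", ("click", (120, 88)))],
        "train": ["fundus/a", "fundus/b"],
    },
    "ultrasound": {
        "template": [PromptedTemplate("ultrasound/t0", ("bbox", (10, 20, 90, 140)))],
        "train": ["ultrasound/a"],
    },
}

rng = random.Random(0)
name, query_id, template = sample_training_pair(datasets, rng)
print(name, query_id, template.image_id)
```

Keeping the template and query within one dataset means the prompt always refers to the same kind of target as the query, which is what allows labels to be treated as prompt-conditioned foreground rather than fixed class IDs.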

The 2026 SIAM model is trained from only six manually curated high-resolution whole-head templates, drawn from one MIDA template and five additional multi-contrast subjects, including “skull” and “vasculature” cases (Valabregue et al., 4 May 2026). Each of three labeled template groups is used to precompute 1,000 synthetic image-label pairs offline, for a total of approximately 3,000 synthetic training examples. nnU-Net internal data augmentation is disabled because augmentation is performed entirely at the label-to-image generative stage. Training runs for 1,000 epochs with 5-fold cross-validation on NVIDIA A100-80GB hardware, requiring approximately two days per fold.

These supervision regimes differ in annotation economics. RAP-SAM relies on established vision datasets plus prompt simulation. One-Prompt relies on a large heterogeneous training corpus and over 3,000 clinician-labeled prompts. SIAM minimizes the number of templates but maximizes annotation quality and synthetic variability. A plausible implication is that the SIAM label can encompass both data-rich universal training and data-efficient synthetic generalization, provided the model covers a broad segmentation scope.

5. Empirical performance and benchmarking

RAP-SAM’s reported emphasis is the accuracy-speed trade-off in a shared multi-task setting (Xu et al., 2024). On an A100 GPU, RAP-SAM R18 achieves COCO-PQ 39.9, SQ 78.6, PQ_th 43.3, PQ_st 34.8, mIoU 52.7, COCO-SAM mIoU 38.7, with 60.5G FLOPs, 22.8M parameters, and 40.3 FPS. RAP-SAM R50 reaches COCO-PQ 46.9, SQ 80.8, PQ_th 51.6, PQ_st 39.8, mIoU 57.9, COCO-SAM mIoU 46.2, with 123.0G FLOPs, 47.2M parameters, and 35.1 FPS. Against Mask2Former R50, RAP-SAM R50 is reported as +4.0 PQ and +4.1 mIoU on COCO-SAM with +8.5 FPS advantage and lower FLOPs. On VIP-Seg validation, RAP-SAM R18 attains VPQ 32.5, STQ 33.7, FPS 30; on ADE20K panoptic, RAP-SAM R50 reaches PQ 38.3.

One-Prompt reports zero-shot transfer across 14 held-out medical tasks and interactive performance across 7 held-out datasets (Wu et al., 2023). In the efficiency comparison across 14 tasks, average Dice is 73.98 for One-Prompt, compared with 52.96 for ALPNet, 50.11 for PANet, 63.86 for HyperSegNas, and 64.66 for UniverSeg, while a task-specific TransUNet upper bound is 77.21. In zero-shot “segment everything,” where the template is prompted with a regular grid of foreground points yielding approximately 50 masks per image, One-Prompt attains the highest average Dice of 64.0% on 11 unseen datasets, exceeding the next best SAM-based competitor by 10.7%. The reported user-cost time is 2.28 seconds per image, versus 27.47 seconds for few/one-shot baselines.
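The regular grid of point prompts used in the "segment everything" setting can be generated as shown below. A 7 × 7 grid is assumed here only because it gives 49 points, close to the roughly 50 masks per image mentioned above; the actual grid size, the cell-centered placement, and the function name are assumptions.

```python
def grid_points(height: int, width: int, points_per_side: int) -> list:
    """Return (x, y) pixel coordinates on a regular grid, one per cell center."""
    step_y = height / points_per_side
    step_x = width / points_per_side
    return [
        (int((j + 0.5) * step_x), int((i + 0.5) * step_y))
        for i in range(points_per_side)
        for j in range(points_per_side)
    ]

points = grid_points(256, 256, 7)
print(len(points), points[0], points[-1])
```

Each point would then act as one click prompt on the template, producing one candidate mask per point before any duplicate masks are merged or discarded.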

The named SIAM framework is evaluated across eight heterogeneous datasets spanning T1-weighted, T2-weighted, and CT data, adult and neonatal cohorts, pathology, and test-retest settings (Valabregue et al., 4 May 2026). Accuracy is reported as cortical gray matter Dice on UltraCortex, HCP, and dHCP; as putamen and caudate-accumbens Dice against the MICCAI_2012 manual subcortical reference labels; and as skull Dice on a private skull test set for CT, UTE, FLAIR, and UNI acquisitions. Prediction consistency is assessed through HCP cross-contrast T1/T2 cortical gray matter Dice and through relative atrophy error on SynthAtrophy across atrophy levels.
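Dice overlap, the metric behind most of the figures in this section, can be computed per class from two integer label maps. The sketch below is a generic implementation; skipping classes that are absent from both maps is a convention chosen here, and the toy arrays are not data from any of the cited papers.

```python
import numpy as np

def dice_per_class(pred: np.ndarray, ref: np.ndarray, num_classes: int) -> dict:
    """Dice coefficient for each class present in the prediction or the reference."""
    scores = {}
    for c in range(num_classes):
        p = pred == c
        r = ref == c
        denom = p.sum() + r.sum()
        if denom == 0:
            continue    # class absent from both maps; Dice is undefined
        scores[c] = 2.0 * np.logical_and(p, r).sum() / denom
    return scores

ref = np.array([[0, 0, 1, 1],
                [2, 2, 2, 2]])
pred = np.array([[0, 1, 1, 1],
                 [2, 2, 2, 0]])
scores = dice_per_class(pred, ref, num_classes=4)
print(scores)
```

The same function applies unchanged to 3D volumes, since the comparison and the sums run over all array elements regardless of dimensionality. Cross-contrast consistency can be measured the same way by passing the T1-based and T2-based predictions for one subject as `pred` and `ref`.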

Taken together, these results show three distinct evaluation cultures. RAP-SAM is assessed by latency-aware multi-task vision benchmarks; One-Prompt by zero-shot transfer, interaction cost, and prompt-conditioned robustness; SIAM by contrast-agnostic whole-head accuracy, consistency, and morphometric sensitivity.

6. Limitations, misconceptions, and future directions

A recurrent misconception is that “segment it all” implies unconstrained segmentation without task specification. The cited works show otherwise. RAP-SAM still depends on task-appropriate query types and, for interactive use, point or box prompts encoded by SAM’s prompt encoder; text prompts are not used to condition segmentation (Xu et al., 2024). One-Prompt reduces interaction from per-image prompting to per-task prompting, but it still requires one prompted exemplar and remains sensitive to prompt quality and template choice (Wu et al., 2023). The 2026 SIAM model is broad in anatomical coverage but not open-ended: it outputs a fixed 16-class whole-head segmentation rather than arbitrary object categories (Valabregue et al., 4 May 2026).

Each variant has characteristic failure modes. RAP-SAM can miss instances in crowded scenes or heavy overlaps because of limited query or kernel budget, may degrade at higher resolutions because FLOPs scale with input size, and is constrained by memory if the number of queries increases (Xu et al., 2024). One-Prompt processes volumetric modalities slice-wise, leaving volumetric consistency to future work; sparse prompts may struggle with tiny or low-contrast targets; and domain shift across scanners or hospitals can degrade performance (Wu et al., 2023). SIAM may under-represent shape variability for extra-cerebral classes such as vessels, dura, and ventricles because it is trained from only six templates, and all models, including SIAM, struggle on severe deformations such as the “XXL ventricles” subgroup (Valabregue et al., 4 May 2026).

The future directions proposed across the papers are complementary. RAP-SAM identifies more prompt types, better distillation from visual foundation models, and mobile deployment (Xu et al., 2024). One-Prompt points to volumetric decoders, multi-prompt fusion, robust template selection, uncertainty-aware outputs, and extension beyond medical images toward a more general SIAM (Wu et al., 2023). The named SIAM framework describes a workflow for extending the label set by acquiring a few high-quality subjects, integrating new labels into the template volumes, updating shape domain-randomization rules, regenerating synthetic data, and retraining or fine-tuning nnU-Net (Valabregue et al., 4 May 2026).

In aggregate, SIAM denotes a research trajectory that attempts to collapse segmentation silos. In one branch, the emphasis is real-time multi-purpose vision under a shared decoder; in another, low-interaction universal medical segmentation via prompted exemplars; in a third, contrast-agnostic whole-head 3D segmentation from a handful of expertly labeled templates. The common principle is not a single architecture, but a commitment to broad segmentation coverage with a single model family, a unified computational graph, and a training strategy designed to preserve generalization across tasks, modalities, or acquisition conditions.