Foundation-Model Interactive Segmentation
- Foundation-model-driven interactive segmentation is an approach that uses large-scale pretrained visual models combined with user prompts to generate and refine segmentation masks.
- It leverages architectures like Vision Transformers, Swin Transformers, and convolutional encoders with multi-modal prompt inputs such as clicks, boxes, scribbles, and text.
- Applications span natural, medical, robotic, and remote sensing domains, offering efficiency gains, reduced annotation cost, and adaptable performance in diverse tasks.
Foundation-model-driven interactive segmentation refers to interactive image segmentation workflows in which a large-scale, pretrained vision foundation model (FM) serves as the backbone, enabling broad, adaptable, prompt-driven prediction of segmentation masks. In contrast to fully automatic or one-shot segmentation pipelines, the interactive setting leverages user-provided prompts (points, boxes, scribbles, text, or semantic queries) to incrementally refine segmentation outputs. The paradigm has rapidly expanded across domains such as natural images, medical imaging, robotics, and remote sensing, unifying backbones (e.g., SAM, DINOv3, Swin/SAM3) with diverse prompt interfaces to enable efficient, adaptable, and accurate interactive segmentation in data-scarce or open-vocabulary settings.
1. Architectural Principles and Model Variants
Foundation-model-driven interactive segmentation typically builds upon high-capacity, pretrained backbones such as Vision Transformers (ViTs), hierarchical Swin transformers, or convolutional 3D encoders, adapted to dense prediction by means of prompt encoders and mask decoders. The Segment Anything Model (SAM) and its medical, remote sensing, or open-vocabulary derivatives form the core of current frameworks (Zhou et al., 2024, Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026).
Core pipeline (see the sketch after this list):
- Image encoder: Transforms the image into dense, multi-scale features using a frozen or fine-tuned ViT (e.g., ViT-Base, Swin Transformer) (Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026).
- Prompt encoder: Converts various user interactions (points, boxes, mask strokes, language) into prompt tokens; common strategies include spatial embeddings for clicks/boxes and CLIP-style text embeddings for free-form prompts (Cheng et al., 2024, Heinemann et al., 2024).
- Mask decoder: Typically a lightweight transformer or U-Net structure; attends jointly to image and prompt features and produces one or multiple segmentation mask logits (Archit et al., 20 Jan 2025, He et al., 2024, Stenhede et al., 19 Sep 2025).
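A minimal sketch of this encoder/prompt-encoder/decoder wiring, assuming a SAM-style point-prompt interface; the module names, sizes, and attention layout are illustrative stand-ins, not the implementation of SAM or any cited derivative.

```python
import torch
import torch.nn as nn

class InteractiveSegmenter(nn.Module):
    """Illustrative FM-style pipeline: image encoder -> prompt encoder -> mask decoder.
    All module sizes are toy values, not those of SAM or its derivatives."""

    def __init__(self, embed_dim=256):
        super().__init__()
        # Stand-in for a (typically frozen) ViT/Swin image encoder producing a dense feature grid.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),  # patchify
            nn.GELU(),
        )
        # Prompt encoder: maps (x, y, label) click triples to prompt tokens.
        self.prompt_encoder = nn.Linear(3, embed_dim)
        # Lightweight decoder: cross-attends image tokens to prompt tokens.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.mask_head = nn.Conv2d(embed_dim, 1, kernel_size=1)

    def forward(self, image, clicks):
        # image: (B, 3, H, W); clicks: (B, N, 3) with normalized x, y and +/-1 label.
        feats = self.image_encoder(image)                     # (B, C, h, w)
        B, C, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)             # (B, h*w, C)
        prompts = self.prompt_encoder(clicks)                 # (B, N, C)
        fused, _ = self.cross_attn(tokens, prompts, prompts)  # prompt-conditioned image tokens
        fused = fused.transpose(1, 2).reshape(B, C, h, w)
        return self.mask_head(fused)                          # low-res mask logits


model = InteractiveSegmenter()
img = torch.randn(1, 3, 1024, 1024)
clicks = torch.tensor([[[0.42, 0.37, 1.0], [0.80, 0.55, -1.0]]])  # one positive, one negative click
logits = model(img, clicks)
print(logits.shape)  # torch.Size([1, 1, 64, 64])
```

In deployed systems the image encoder is a pretrained ViT or Swin backbone, and the decoder typically emits several candidate masks with quality scores rather than a single logit map.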
Unique architectural advances include:
- Adapter-based tuning for domain transfer without updating backbone weights (Wang et al., 30 Nov 2025); a minimal adapter sketch follows this list.
- Multi-granularity/multi-head decoders for hierarchical or semantic segmentation (e.g., HQ-SAM, supervoxel methods) (Zhou et al., 2024, He et al., 2024).
- Structured prompt-fusion via attention or cross-modal reasoning (e.g., SafeClick’s expert layers, prompt interaction fusion) (Gao et al., 23 Jun 2025).
- Specialized modules for language-only segmentation (e.g., LIMIS employing Grounded DINO and a language-guided prompt loop) (Heinemann et al., 2024).
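The adapter idea above can be sketched as a small bottleneck module wrapped around a frozen backbone block, so that only the adapter is trained for the new domain. The code below is a generic illustration under that assumption, not the actual adapter design of RS-ISRefiner or any other cited method.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so the frozen backbone's
        nn.init.zeros_(self.up.bias)    # behaviour is preserved at initialization

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen backbone block with a trainable adapter on its output."""

    def __init__(self, block, dim=768):
        super().__init__()
        self.block = block
        for p in self.block.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))


# Toy usage: a stand-in transformer block wrapped for adapter-only training.
backbone_block = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
adapted = AdaptedBlock(backbone_block, dim=768)
trainable = [p for p in adapted.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # only the adapter's parameters
out = adapted(torch.randn(2, 196, 768))   # (2, 196, 768)
```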
2. Prompt Modalities and Interactive Workflows
Prompt-driven interactivity distinguishes these systems from earlier “black-box” dense-prediction backbones. Prompts can take various forms:
- Clicks/points: The most widely supported prompt, with positive/negative ("foreground/background") annotation (Archit et al., 20 Jan 2025, Cheng et al., 2024); see the point-encoding sketch after this list.
- Bounding boxes: Specify spatial extent, often used for initialization or coarse guidance (Archit et al., 20 Jan 2025, Zhang et al., 20 Dec 2025).
- Masks/scribbles: Low-resolution or sparse signals capturing region-of-interest shape (Cheng et al., 2024).
- Text/language prompts: CLIP-style or explicit organ/object queries for zero-shot or open-vocabulary segmentation (Cheng et al., 2024, Heinemann et al., 2024, Jiang et al., 15 Jan 2026).
- Image-derived cues: Atlas priors (warped masks), gaze-tracking based points (robotics, egocentric vision), or previous mask estimates for iterative refinement (Zhang et al., 20 Dec 2025, Qian et al., 14 Jul 2025, Atoki et al., 24 Jul 2025).
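A minimal sketch of the click-prompt encoding referenced above: positive/negative points are mapped to prompt tokens via random Fourier positional features plus a learned label embedding. The dimensions and frequency scale are hypothetical choices in the spirit of SAM-style prompt encoders, not the parameters of any cited model.

```python
import math
import torch
import torch.nn as nn

class PointPromptEncoder(nn.Module):
    """Encodes clicks (x, y in [0, 1], label in {0, 1}) into prompt tokens:
    random Fourier features for position plus a learned label embedding."""

    def __init__(self, embed_dim=256, num_freqs=64, scale=10.0):
        super().__init__()
        # Fixed random projection for positional features (SAM uses a similar idea).
        self.register_buffer("freqs", torch.randn(2, num_freqs) * scale)
        self.pos_proj = nn.Linear(2 * num_freqs, embed_dim)
        self.label_embed = nn.Embedding(2, embed_dim)  # 0 = background, 1 = foreground

    def forward(self, coords, labels):
        # coords: (B, N, 2) normalized coordinates; labels: (B, N) integer click labels.
        angles = 2 * math.pi * coords @ self.freqs              # (B, N, num_freqs)
        pos = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (B, N, 2*num_freqs)
        return self.pos_proj(pos) + self.label_embed(labels)    # (B, N, embed_dim)


encoder = PointPromptEncoder()
coords = torch.tensor([[[0.42, 0.37], [0.80, 0.55]]])  # two clicks
labels = torch.tensor([[1, 0]])                         # foreground, background
print(encoder(coords, labels).shape)  # torch.Size([1, 2, 256])
```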
The interactive workflow generally involves:
- User supplies an initial prompt (e.g., click or text).
- Model outputs a segmentation mask.
- User refines with additional prompts (e.g., more clicks, corrective strokes, semantic queries).
- Model produces a new mask, and the cycle repeats until performance is sufficient (Archit et al., 20 Jan 2025, He et al., 2024, Heinemann et al., 2024).
Sophisticated frameworks simulate human-in-the-loop refinement, dynamically generating prompts based on error maps or uncertainty (e.g., SAT3D uses critic-driven uncertainty maps to guide subsequent corrections (Peiris et al., 11 Nov 2025); SafeClick fuses imperfect prompts via consensus (Gao et al., 23 Jun 2025)), and support multi-modal or hybrid prompts to match clinical or robotic workflow requirements (Zhang et al., 20 Dec 2025, Cheng et al., 2024).
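A minimal sketch of this click-and-refine cycle with a simulated user that places each corrective click inside the currently largest error region, loosely mirroring the error-driven prompting described above. The `model` callable and its (image, clicks, labels) interface are assumptions for illustration, not the API of any cited system.

```python
import numpy as np

def simulate_interactive_refinement(model, image, gt_mask, max_clicks=10, target_iou=0.9):
    """Query `model` with accumulated clicks, adding one corrective click per round
    inside the larger error region (a common simulated-user rule)."""
    clicks, labels = [], []

    # Round 0: a positive click at an arbitrary foreground pixel.
    ys, xs = np.nonzero(gt_mask)
    clicks.append((ys[len(ys) // 2], xs[len(xs) // 2]))
    labels.append(1)

    for _ in range(max_clicks):
        pred = model(image, clicks, labels) > 0.5               # assumed interface
        inter = np.logical_and(pred, gt_mask).sum()
        union = np.logical_or(pred, gt_mask).sum()
        if inter / max(union, 1) >= target_iou:
            break
        false_neg = np.logical_and(gt_mask, ~pred)              # missed foreground
        false_pos = np.logical_and(pred, ~gt_mask)              # spurious foreground
        if false_neg.sum() >= false_pos.sum():
            err, label = false_neg, 1                           # positive (add) click
        else:
            err, label = false_pos, 0                           # negative (remove) click
        ys, xs = np.nonzero(err)
        clicks.append((ys[len(ys) // 2], xs[len(xs) // 2]))     # a pixel inside the error region
        labels.append(label)

    return pred, clicks, labels


# Toy stand-in model: predicts a disk around the most recent positive click.
def toy_model(image, clicks, labels):
    h, w = image.shape[:2]
    yy, xx = np.mgrid[:h, :w]
    y0, x0 = [c for c, l in zip(clicks, labels) if l == 1][-1]
    return ((yy - y0) ** 2 + (xx - x0) ** 2) < 15 ** 2


gt = np.zeros((64, 64), dtype=bool); gt[20:50, 20:50] = True
pred, clicks, labels = simulate_interactive_refinement(toy_model, np.zeros((64, 64, 3)), gt)
print(len(clicks), "clicks used")
```

Loops of this form are also used for evaluation, reporting Dice or IoU as a function of the number of clicks.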
3. Training Protocols and Loss Functions
Training of foundation-model-driven interactive segmentation models varies in degree of adaptation:
- Frozen backbone, trainable head/adapters: Only the prompt encoder and decoder are updated, while the backbone remains frozen, for parameter efficiency and plug-and-play deployment in new domains (Wang et al., 30 Nov 2025, Gao et al., 23 Jun 2025); see the sketch after this list.
- Partial or full fine-tuning: All or selected backbone layers are updated, typically yielding substantial gains under large domain shifts (e.g., Medical SAM3, MedicoSAM, VISTA3D) (Jiang et al., 15 Jan 2026, He et al., 2024, Archit et al., 20 Jan 2025).
- Test-time registration/adaptation: AtlasSegFM fuses alignment-derived priors with foundation model predictions without retraining (Zhang et al., 20 Dec 2025).
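A minimal sketch of the first option (frozen backbone, trainable head), assuming a generic SAM-like model that exposes `image_encoder`, `prompt_encoder`, and `mask_decoder` attributes; the attribute names and sizes are hypothetical, not taken from a specific codebase.

```python
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    """Stand-in for a SAM-like model with the three standard components."""
    def __init__(self, dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.prompt_encoder = nn.Linear(3, dim)
        self.mask_decoder = nn.Conv2d(dim, 1, kernel_size=1)


def build_optimizer_frozen_backbone(model, lr=1e-4, weight_decay=1e-2):
    """Freeze the image encoder; optimize only the prompt encoder and mask decoder."""
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)


model = TinySegmenter()
optimizer = build_optimizer_frozen_backbone(model)
n_train = sum(p.numel() for g in optimizer.param_groups for p in g["params"])
print(f"trainable parameters: {n_train}")  # excludes the frozen encoder
```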
Loss functions are typically combinations of (weighted) Dice, cross-entropy/focal loss on predicted-vs-ground-truth masks, and possibly auxiliary losses over boundary, classification, IoU score, adversarial uncertainty, or detection/box prediction (for models also supporting detection or open-vocabulary prompts) (Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026, Peiris et al., 11 Nov 2025, Park et al., 28 Apr 2025).
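A minimal sketch of the widely used Dice plus cross-entropy combination on mask logits; the weights and smoothing constant are arbitrary illustration values, not those of any cited model.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, dice_weight=1.0, bce_weight=1.0, eps=1e-6):
    """Weighted sum of soft Dice loss and binary cross-entropy on mask logits.
    logits, target: (B, 1, H, W); target is a {0, 1} float mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    bce = bce.mean(dim=(1, 2, 3))
    return (dice_weight * dice + bce_weight * bce).mean()


logits = torch.randn(2, 1, 128, 128)
target = (torch.rand(2, 1, 128, 128) > 0.5).float()
print(dice_bce_loss(logits, target).item())
```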
Self-supervised or pseudo-mask schemes are widely used in domains lacking dense mask annotations (e.g., union/intersection mask pseudo-labeling in human-object interaction (Park et al., 28 Apr 2025), DINOv3-powered feature fusion for histology (Zhang et al., 15 Jan 2026)), and error-driven “simulated user” pipelines optimize for realistic annotation patterns (Ndir et al., 3 Oct 2025, Havrylov et al., 4 May 2025).
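One simple way to realize such a simulated-user prompt sampler during training is to bias positive clicks toward the interior of the ground-truth mask via a distance transform; the sketch below illustrates the idea and is not the specific sampling rule of the cited works.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def sample_training_click(gt_mask, rng, interior_bias=0.7):
    """Sample a positive training click inside a ground-truth mask.
    With probability `interior_bias`, pick the pixel farthest from the boundary
    (a deep-interior click); otherwise pick a uniformly random foreground pixel."""
    dist = distance_transform_edt(gt_mask)  # distance of each foreground pixel to the background
    if rng.random() < interior_bias:
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
    else:
        ys, xs = np.nonzero(gt_mask)
        i = rng.integers(len(ys))
        y, x = ys[i], xs[i]
    return int(y), int(x)


rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 25:50] = True
print(sample_training_click(mask, rng))  # e.g. a click near the blob's interior
```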
4. Application Domains and Representative Models
Foundation-model-driven interactive segmentation has achieved traction in multiple applications:
Natural and everyday scenes:
- SAM and HQ-SAM for generic interactive segmentation (Zhou et al., 2024).
- Seg2HOI for integrating foundation segmentation models with human-object interaction prediction, producing quadruplet outputs (box, interaction label, and union/intersection masks) (Park et al., 28 Apr 2025).
Medical imaging:
- MedicoSAM: Fine-tuned SAM for 2D/3D interactive segmentation across >15 million masks, yielding a consistent 3–5% Dice gain in interactive settings (Archit et al., 20 Jan 2025).
- Medical SAM3: Fully fine-tuned Swin/SAM3 architecture on 33 medical datasets, supporting prompt-driven, cross-modal, and cross-organ segmentation (Jiang et al., 15 Jan 2026).
- VISTA3D: Unified foundation model for 3D medical imaging integrating automatic, interactive, and zero-shot supervoxel segmentation (He et al., 2024).
- AtlasSegFM: One-shot atlas-guided pipeline fusing registration priors with foundation-model predictions, excelling at underrepresented or small anatomical structures (Zhang et al., 20 Dec 2025).
- SafeClick: Error-tolerant, plug-in decoder improving prompt robustness for SAM2/MedSAM2 through hierarchical expert consensus (Gao et al., 23 Jun 2025).
- SAT3D: 3D Swin backbone with uncertainty-aware training (adversarial critic), supporting prompt-driven segmentation in 3D Slicer and uncertainty-guided correction (Peiris et al., 11 Nov 2025).
- ENSAM: Lightweight 3D model with equivariant positional encoding, demonstrating fast, interactive segmentation training from scratch (Stenhede et al., 19 Sep 2025).
- LIMIS: Purely language-based interactive medical segmentation via LoRA-adapted Grounded DINO and text-to-mask loop (Heinemann et al., 2024).
Remote sensing:
- RS-ISRefiner: Adapter-based framework for click-driven segmentation in high-resolution earth imagery using hybrid convolutional/transformer adapters and modulation schemes (Wang et al., 30 Nov 2025).
Robotics and egocentric perception:
- rt-RISeg: Model-free robot pipeline using body-frame-invariant features to generate segmentation masks, which serve as high-quality prompts for subsequent foundation model refinement (Qian et al., 14 Jul 2025).
- Gaze-driven prompting and SAM for object segmentation in vision-assisted neuro-prosthetic scenarios (Atoki et al., 24 Jul 2025).
Histology and neuroscience:
- DINOv3-driven interactive brain region parcellation, leveraging multi-block feature fusion and a lightweight decoder fine-tuned on sparse scribbles (Zhang et al., 15 Jan 2026).
5. Quantitative Performance and Empirical Insights
Across evaluation studies, foundation-model-driven interactive segmentation achieves competitive (often near state-of-the-art) results with reduced annotation effort and high flexibility (the Dice and IoU figures below are standard overlap metrics; a reference computation follows the list):
- Medical: MedicoSAM achieved mean Dice improvements from 0.81 (SAM) to 0.84 (MedicoSAM, 2D initial point), with further gain after iterative corrections; VISTA3D outperformed nnU-Net and segment-anything baselines in several 3D and zero-shot settings, especially after few interactive prompts (Archit et al., 20 Jan 2025, He et al., 2024, Jiang et al., 15 Jan 2026). AtlasSegFM showed +38pp Dice on small, underrepresented structures compared to baseline FM (Zhang et al., 20 Dec 2025).
- Generic: HQ-SAM yielded +0.04–0.05 mIoU over regular SAM per click; Seg2HOI matched or exceeded detection-based HOI methods in both closed-set and zero-shot settings while adding instance mask prediction (Park et al., 28 Apr 2025).
- Remote sensing: RS-ISRefiner reduced click count and non-convergence in high-complexity scenes (Wang et al., 30 Nov 2025).
- Robotics: rt-RISeg→SAM improved overlap F₁ by +23 points over the foundation model alone, with robust boundary segmentation under real-world interactions (Qian et al., 14 Jul 2025).
- Efficiency: ENSAM achieved state-of-the-art interactive 3D Dice with ∼5.5M parameters, trained in six hours; inference-time prompt encoding strategies amortize repeated forward passes (Stenhede et al., 19 Sep 2025).
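For reference, the Dice and IoU numbers quoted above are plain set-overlap ratios on binary masks (pixel-level Dice is identical to pixel-level F1); a minimal computation:

```python
import numpy as np

def dice_score(pred, gt, eps=1e-6):
    """Dice coefficient between two binary masks (identical to pixel-level F1)."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def iou_score(pred, gt, eps=1e-6):
    """Intersection over union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)


pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool);   gt[15:45, 15:45] = True
print(f"Dice={dice_score(pred, gt):.3f}, IoU={iou_score(pred, gt):.3f}")
```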
6. Key Challenges, Limitations, and Future Directions
Major limitations include:
- Prompt ambiguity and robustness: Segmentation quality can degrade with imperfect user input; error-tolerant or consensus-based decoders (e.g., SafeClick) partially address this (Gao et al., 23 Jun 2025).
- Domain adaptation: Severe domain shifts (natural→medical, or static→egocentric scenes) significantly impair vanilla FM performance; full fine-tuning or adapter-based strategies are often required (Jiang et al., 15 Jan 2026, Wang et al., 30 Nov 2025).
- Semantic grounding: While class-agnostic masks are effective for structure delineation, open-vocabulary or text-driven segmentation remains challenging, with ongoing work integrating CLIP, LLMs, or language-guided decoders (Heinemann et al., 2024, Jiang et al., 15 Jan 2026).
- Data annotation cost: Large-scale interactive datasets with click/multi-modal annotations remain scarce; automated mask generation (leveraging foundation models) and simulation loops are key enablers (Cheng et al., 2024).
- Real-time/dense interaction: Model size and transformer inference cost limit low-latency deployment; lightweight and hardware-aware models are active research areas (Stenhede et al., 19 Sep 2025, Zhou et al., 2024).
Anticipated directions include end-to-end multi-modal adaptive learning, full integration of language and visual cues, in-context prompt adaptation (analogous to LLMs), explainability/uncertainty estimation, edge-compatible lightweight models, and domain-agnostic extensibility to new sensor types, imaging modalities, or interaction paradigms (Zhou et al., 2024).
7. Representative Models and Deployment Scenarios
| Model | Backbones | Domain(s) | Prompt Types | Distinctive Features |
|---|---|---|---|---|
| SAM, HQ-SAM | ViT | General (natural/medical) | Points, Boxes | Zero-shot, class-agnostic, editable output tokens |
| MedicoSAM, SAM3 | ViT, Swin | Multi-modal medical | Points, Boxes, Text | End-to-end fine-tuned, text-driven segmentation |
| VISTA3D, ENSAM | 3D CNNs/U-Nets | 3D medical | Clicks, Supported Class | Auto/interact/zero-shot, fast 3D inference |
| AtlasSegFM | FM + deformable reg | Medical | Mask, Box, Point | Atlas warping + FM fusion at test time |
| SafeClick | FM + plug-in decoder | Medical | Imperfect prompts | Collaborative expert/consensus for prompt resilience |
| RS-ISRefiner | ViT + adapters | Remote sensing | Click-based | Adapter tuning, hybrid attention |
| LIMIS | Grounded DINO+SAM | Medical/CT | Language-only | LoRA, language-driven HCI loop |
| rt-RISeg | Model-free/any FM | Robotics | Mask/box (from BFIF) | Motion-based segmentation → FM-driven refinement |
Successful deployments span point-and-click annotation tools, radiology pipelines (e.g., 3D Slicer plugins (Peiris et al., 11 Nov 2025)), atlas-based segmentation in underrepresented contexts (Zhang et al., 20 Dec 2025), robotic grasping in unstructured scenes (Qian et al., 14 Jul 2025), and gaze-based egocentric control (Atoki et al., 24 Jul 2025). Prompt-driven, foundation-model-powered segmentation is converging toward universal, interactive, and modal-agnostic frameworks across vision domains.
References: Key models and benchmarks: (Zhou et al., 2024, Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026, He et al., 2024, Zhang et al., 20 Dec 2025, Wang et al., 30 Nov 2025, Park et al., 28 Apr 2025, Gao et al., 23 Jun 2025, Peiris et al., 11 Nov 2025, Qian et al., 14 Jul 2025, Zhang et al., 15 Jan 2026, Havrylov et al., 4 May 2025, Heinemann et al., 2024, Cheng et al., 2024, Atoki et al., 24 Jul 2025, Stenhede et al., 19 Sep 2025).