
Foundation-Model Interactive Segmentation

Updated 22 January 2026
  • Foundation-model-driven interactive segmentation is an approach that uses large-scale pretrained visual models combined with user prompts to generate and refine segmentation masks.
  • It leverages architectures like Vision Transformers, Swin Transformers, and convolutional encoders with multi-modal prompt inputs such as clicks, boxes, scribbles, and text.
  • Applications span natural, medical, robotic, and remote sensing domains, offering efficiency gains, reduced annotation cost, and adaptable performance in diverse tasks.

Foundation-model-driven interactive segmentation refers to interactive image segmentation workflows in which a large-scale, pretrained vision foundation model (FM) serves as the backbone, enabling broad, adaptable, and prompt-driven prediction of segmentation masks. In contrast to one-shot or fully automatic segmentation pipelines, the interactive setting leverages user-provided prompts (such as points, boxes, scribbles, text, or semantic queries) to incrementally refine segmentation outputs. The paradigm has rapidly expanded across domains such as natural images, medical imaging, robotics, and remote sensing, unifying shared backbones (e.g., SAM, DINOv3, Swin/SAM3) with diverse prompt interfaces to enable efficient, adaptable, and accurate interactive segmentation in data-scarce or open-vocabulary tasks.

1. Architectural Principles and Model Variants

Foundation-model-driven interactive segmentation typically builds upon high-capacity, pretrained backbones such as Vision Transformers (ViTs), hierarchical Swin transformers, or convolutional 3D encoders, adapted to dense prediction by means of prompt encoders and mask decoders. The Segment Anything Model (SAM) and its medical, remote sensing, or open-vocabulary derivatives form the core of current frameworks (Zhou et al., 2024, Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026).

Core pipeline (sketched in the code below):

  • A high-capacity image encoder (ViT, Swin, or 3D CNN) computes dense image embeddings once per image.
  • A lightweight prompt encoder embeds user inputs (clicks, boxes, scribbles, masks, or text) into sparse and/or dense tokens.
  • A mask decoder fuses image and prompt embeddings to predict one or more candidate masks, optionally with confidence/IoU scores, and can be rerun cheaply as prompts are refined.
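A minimal sketch of this encoder / prompt-encoder / mask-decoder pattern, assuming a generic PyTorch-style interface; the class and method names here are illustrative, not the API of any particular SAM release:

```python
import torch
import torch.nn as nn


class InteractiveSegmenter(nn.Module):
    """Generic FM-driven interactive segmentation pipeline (illustrative)."""

    def __init__(self, image_encoder: nn.Module, prompt_encoder: nn.Module, mask_decoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder    # large pretrained ViT/Swin/3D-CNN backbone
        self.prompt_encoder = prompt_encoder  # embeds clicks, boxes, scribbles, masks, text
        self.mask_decoder = mask_decoder      # lightweight decoder fusing both streams

    @torch.no_grad()  # inference-time caching; drop the decorator for end-to-end fine-tuning
    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        # The heavy encoder runs once per image; the embedding is cached and
        # reused across all subsequent prompt refinements.
        return self.image_encoder(image)

    def forward(self, image_embedding: torch.Tensor, prompts: dict) -> torch.Tensor:
        # Prompt encoding and mask decoding are cheap and rerun at every interaction.
        sparse_tokens, dense_tokens = self.prompt_encoder(prompts)
        return self.mask_decoder(image_embedding, sparse_tokens, dense_tokens)
```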

Unique architectural advances include:

  • Adapter- and LoRA-based modules that inject domain-specific capacity into frozen backbones (e.g., RS-ISRefiner, LIMIS).
  • Plug-in or consensus decoders that improve robustness to imperfect prompts (e.g., SafeClick).
  • Uncertainty-aware training with adversarial critics to guide subsequent corrections (e.g., SAT3D).
  • Equivariant positional encodings and lightweight 3D designs for fast volumetric inference (e.g., ENSAM).
  • Fusion of registration/atlas priors with foundation-model predictions at test time (e.g., AtlasSegFM).

2. Prompt Modalities and Interactive Workflows

Prompt-driven interactivity distinguishes these systems from earlier “black-box” dense-prediction backbones. Prompts can take various forms:

  • Positive and negative points/clicks marking foreground and background.
  • Bounding boxes around target objects.
  • Scribbles or corrective strokes.
  • Coarse or prior masks (e.g., from registration or a previous iteration).
  • Text or semantic queries for open-vocabulary targets.
  • Gaze or other sensor-derived cues in egocentric settings.
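For illustration, these heterogeneous prompt types are typically normalized into a single container before being passed to the prompt encoder; the field names in the sketch below are assumptions, not a standard interface:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class Prompt:
    """Container for multi-modal user prompts; field names are illustrative."""
    points: List[Tuple[int, int]] = field(default_factory=list)   # (x, y) click coordinates
    point_labels: List[int] = field(default_factory=list)         # 1 = foreground, 0 = background
    box: Optional[Tuple[int, int, int, int]] = None                # (x0, y0, x1, y1)
    scribble: Optional[np.ndarray] = None                          # sparse stroke mask, H x W
    text: Optional[str] = None                                     # open-vocabulary query


# Example: a positive click combined with a text query for the same target.
prompt = Prompt(points=[(128, 96)], point_labels=[1], text="left kidney")
```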

The interactive workflow generally involves:

  1. User supplies an initial prompt (e.g., click or text).
  2. Model outputs a segmentation mask.
  3. User refines with additional prompts (e.g., more clicks, corrective strokes, semantic queries).
  4. Model produces a new mask, and the cycle repeats until the segmentation is satisfactory (Archit et al., 20 Jan 2025, He et al., 2024, Heinemann et al., 2024).

Sophisticated frameworks simulate human-in-the-loop refinement, dynamically generating prompts based on error maps or uncertainty (e.g., SAT3D uses critic-driven uncertainty maps to guide subsequent corrections (Peiris et al., 11 Nov 2025); SafeClick fuses imperfect prompts via consensus (Gao et al., 23 Jun 2025)), and support multi-modal or hybrid prompts to match clinical or robotic workflow requirements (Zhang et al., 20 Dec 2025, Cheng et al., 2024).
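
A condensed sketch of such a refinement loop with a simulated user, assuming a hypothetical `model.encode_image` / `model.predict` interface that returns a per-pixel probability map; the first-error-pixel click rule is a deliberate simplification of the error-driven strategies cited above (see the heuristic in Section 3):

```python
import numpy as np


def simulated_refinement_loop(model, image, gt_mask, max_clicks=10, target_iou=0.9):
    """Iteratively add corrective clicks where the prediction disagrees with ground truth."""
    embedding = model.encode_image(image)            # heavy encoder runs once and is cached
    prompts = {"points": [], "labels": []}
    pred = np.zeros_like(gt_mask, dtype=bool)

    for _ in range(max_clicks):
        error = np.logical_xor(pred, gt_mask)        # disagreement map
        if not error.any():
            break
        # Simplification: click the first erroneous pixel; realistic simulators
        # place the click at the centre of the largest error region.
        y, x = np.unravel_index(np.argmax(error), error.shape)
        prompts["points"].append((int(x), int(y)))
        prompts["labels"].append(1 if gt_mask[y, x] else 0)   # positive vs. negative click

        pred = model.predict(embedding, prompts) > 0.5        # cheap decoder-only rerun
        inter = np.logical_and(pred, gt_mask).sum()
        union = np.logical_or(pred, gt_mask).sum()
        if union and inter / union >= target_iou:
            break
    return pred
```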

3. Training Protocols and Loss Functions

Training of foundation-model-driven interactive segmentation models varies in degree of adaptation:

  • Zero-shot use of a frozen foundation model, relying solely on prompting.
  • Parameter-efficient adaptation, where adapters or LoRA modules are trained while the pretrained backbone stays frozen (e.g., RS-ISRefiner, LIMIS); a minimal sketch follows this list.
  • Full fine-tuning of backbone and decoder on large domain-specific corpora (e.g., MedicoSAM, Medical SAM3).
  • Training compact models from scratch when efficiency dominates (e.g., ENSAM).
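A minimal sketch of the parameter-efficient variant, assuming PyTorch modules whose adapter/decoder parameters can be identified by name; the keyword convention is an assumption for illustration, and real frameworks usually expose explicit adapter modules instead:

```python
import torch


def configure_peft(model: torch.nn.Module,
                   trainable_keywords=("adapter", "lora", "mask_decoder")):
    """Freeze the pretrained backbone; train only adapter/decoder parameters."""
    trainable = []
    for name, param in model.named_parameters():
        # Parameters whose names match a keyword stay trainable; everything else is frozen.
        param.requires_grad = any(key in name for key in trainable_keywords)
        if param.requires_grad:
            trainable.append(param)
    # Only the small trainable subset is handed to the optimizer.
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-2)
```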

Loss functions are typically combinations of (weighted) Dice, cross-entropy/focal loss on predicted-vs-ground-truth masks, and possibly auxiliary losses over boundary, classification, IoU score, adversarial uncertainty, or detection/box prediction (for models also supporting detection or open-vocabulary prompts) (Archit et al., 20 Jan 2025, Jiang et al., 15 Jan 2026, Peiris et al., 11 Nov 2025, Park et al., 28 Apr 2025).
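A sketch of such a composite objective in PyTorch; the weighting and hyperparameters below are illustrative defaults, not values prescribed by the cited works:

```python
import torch
import torch.nn.functional as F


def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss between mask logits and binary (float 0/1) ground-truth masks."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(-2, -1))
    union = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - (2.0 * inter + eps) / (union + eps)


def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss, down-weighting pixels that are already easy."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = torch.exp(-bce)                                    # probability of the true class
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean(dim=(-2, -1))


def interactive_seg_loss(logits, target, w_focal: float = 20.0, w_dice: float = 1.0):
    """Weighted focal + Dice combination; the 20:1 ratio is an illustrative choice."""
    return (w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)).mean()
```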

Self-supervised or pseudo-mask schemes are widely used in domains lacking dense mask annotations (e.g., union/intersection mask pseudo-labeling in human-object interaction (Park et al., 28 Apr 2025), DINOv3-powered feature fusion for histology (Zhang et al., 15 Jan 2026)), and error-driven “simulated user” pipelines optimize for realistic annotation patterns (Ndir et al., 3 Oct 2025, Havrylov et al., 4 May 2025).
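As one concrete heuristic for such a simulated user (an assumption for illustration, not the exact protocol of the cited works), the next corrective click can be placed at the interior point of the largest error component:

```python
import numpy as np
from scipy import ndimage


def next_corrective_click(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Return (x, y, label) for the next corrective click, or None if masks agree."""
    error = np.logical_xor(pred_mask, gt_mask)
    if not error.any():
        return None
    labels, n = ndimage.label(error)                                  # connected error regions
    sizes = ndimage.sum(error, labels, index=range(1, n + 1))
    largest = int(np.argmax(sizes)) + 1
    region = labels == largest
    # Click the point deepest inside the region, mimicking a careful annotator.
    dist = ndimage.distance_transform_edt(region)
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    label = 1 if gt_mask[y, x] else 0                                  # missed foreground vs. false positive
    return int(x), int(y), label
```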

4. Application Domains and Representative Models

Foundation-model-driven interactive segmentation has achieved traction in multiple applications:

Natural and everyday scenes:

  • SAM and HQ-SAM for generic interactive segmentation (Zhou et al., 2024).
  • Seg2HOI for integrating foundation segmentation models with human-object interaction prediction, producing quadruplet outputs (box, interaction label, and union/intersection masks) (Park et al., 28 Apr 2025).

Medical imaging:

  • MedicoSAM: Fine-tuned SAM for 2D/3D interactive segmentation across >15 million masks, yielding consistent 3–5% Dice gain in interactive settings (Archit et al., 20 Jan 2025).
  • Medical SAM3: Fully fine-tuned Swin/SAM3 architecture on 33 medical datasets, supporting prompt-driven, cross-modal, and cross-organ segmentation (Jiang et al., 15 Jan 2026).
  • VISTA3D: Unified foundation model for 3D medical imaging integrating automatic, interactive, and zero-shot supervoxel segmentation (He et al., 2024).
  • AtlasSegFM: One-shot atlas-guided pipeline fusing registration priors with foundation-model predictions, excelling at underrepresented or small anatomical structures (Zhang et al., 20 Dec 2025).
  • SafeClick: Error-tolerant, plug-in decoder improving prompt robustness for SAM2/MedSAM2 through hierarchical expert consensus (Gao et al., 23 Jun 2025).
  • SAT3D: 3D Swin backbone with uncertainty-aware training (adversarial critic), supporting prompt-driven segmentation in 3D Slicer and uncertainty-guided correction (Peiris et al., 11 Nov 2025).
  • ENSAM: Lightweight 3D model with equivariant positional encoding, demonstrating fast, interactive segmentation training from scratch (Stenhede et al., 19 Sep 2025).
  • LIMIS: Purely language-based interactive medical segmentation via LoRA-adapted Grounded DINO and text-to-mask loop (Heinemann et al., 2024).

Remote sensing:

  • RS-ISRefiner: Adapter-based framework for click-driven segmentation in high-resolution earth imagery using hybrid convolutional/transformer adapters and modulation schemes (Wang et al., 30 Nov 2025).

Robotics and egocentric perception:

  • rt-RISeg: Model-free robot pipeline using body-frame-invariant features to generate segmentation masks, which serve as high-quality prompts for subsequent foundation model refinement (Qian et al., 14 Jul 2025).
  • Gaze-driven prompting and SAM for object segmentation in vision-assisted neuro-prosthetic scenarios (Atoki et al., 24 Jul 2025).

Histology and neuroscience:

  • DINOv3-driven interactive brain region parcellation, leveraging multi-block feature fusion and lightweight decoder fine-tuned on sparse scribbles (Zhang et al., 15 Jan 2026).

5. Quantitative Performance and Empirical Insights

Across evaluation studies, foundation-model-driven interactive segmentation achieves competitive (often near state-of-the-art) results with reduced annotation effort and high flexibility:

  • Medical: MedicoSAM achieved mean Dice improvements from 0.81 (SAM) to 0.84 (MedicoSAM, 2D initial point), with further gain after iterative corrections; VISTA3D outperformed nnU-Net and segment-anything baselines in several 3D and zero-shot settings, especially after few interactive prompts (Archit et al., 20 Jan 2025, He et al., 2024, Jiang et al., 15 Jan 2026). AtlasSegFM showed +38pp Dice on small, underrepresented structures compared to baseline FM (Zhang et al., 20 Dec 2025).
  • Generic: HQ-SAM yielded +0.04–0.05 mIoU over regular SAM per click; Seg2HOI matched or exceeded detection-based HOI methods in both closed-set and zero-shot settings while adding instance mask prediction (Park et al., 28 Apr 2025).
  • Remote sensing: RS-ISRefiner reduced click count and non-convergence in high-complexity scenes (Wang et al., 30 Nov 2025).
  • Robotics: rt-RISeg→SAM improved overlap F₁ by +23 points over foundation model alone, with robust boundary segmentation under real-world interactions (Qian et al., 14 Jul 2025).
  • Efficiency: ENSAM achieved state-of-the-art interactive 3D Dice with ∼5.5M parameters, trained in six hours; inference-time prompt encoding strategies amortize repeated forward passes (Stenhede et al., 19 Sep 2025).

6. Key Challenges, Limitations, and Future Directions

Major limitations include:

  • Prompt ambiguity and robustness: Segmentation quality can degrade with imperfect user input; error-tolerant or consensus-based decoders (e.g., SafeClick) partially address this (Gao et al., 23 Jun 2025).
  • Domain adaptation: Severe domain shifts (natural→medical, or static→egocentric scenes) significantly impair vanilla FM performance; full fine-tuning or adapter-based strategies are often required (Jiang et al., 15 Jan 2026, Wang et al., 30 Nov 2025).
  • Semantic grounding: While class-agnostic masks are effective for structure delineation, open-vocabulary or text-driven segmentation remains challenging, with ongoing work integrating CLIP, LLMs, or language-guided decoders (Heinemann et al., 2024, Jiang et al., 15 Jan 2026).
  • Data annotation cost: Large-scale interactive datasets with click/multi-modal annotations remain scarce; automated mask generation (leveraging foundation models) and simulation loops are key enablers (Cheng et al., 2024).
  • Real-time/dense interaction: Model size and transformer inference cost limit low-latency deployment; lightweight and hardware-aware models are active research areas (Stenhede et al., 19 Sep 2025, Zhou et al., 2024).

Anticipated directions include end-to-end multi-modal adaptive learning, full integration of language and visual cues, in-context prompt adaptation (analogous to LLMs), explainability/uncertainty estimation, edge-compatible lightweight models, and domain-agnostic extensibility to new sensor types, imaging modalities, or interaction paradigms (Zhou et al., 2024).

7. Representative Models and Deployment Scenarios

| Model | Backbones | Domain(s) | Prompt Types | Distinctive Features |
|---|---|---|---|---|
| SAM, HQ-SAM | ViT | General (natural/medical) | Points, boxes | Zero-shot, class-agnostic, editable output tokens |
| MedicoSAM, SAM3 | ViT, Swin | Multi-modal medical | Points, boxes, text | End-to-end fine-tuned, text-driven segmentation |
| VISTA3D, ENSAM | 3D CNNs/U-Nets | 3D medical | Clicks, supported classes | Auto/interactive/zero-shot, fast 3D inference |
| AtlasSegFM | FM + deformable registration | Medical | Mask, box, point | Atlas warping + FM fusion at test time |
| SafeClick | FM + plug-in decoder | Medical | Imperfect prompts | Collaborative expert consensus for prompt resilience |
| RS-ISRefiner | ViT + adapters | Remote sensing | Clicks | Adapter tuning, hybrid attention |
| LIMIS | Grounded DINO + SAM | Medical/CT | Language only | LoRA, language-driven HCI loop |
| rt-RISeg | Model-free / any FM | Robotics | Mask/box (from BFIF) | Motion-based segmentation → FM-driven refinement |

Successful deployments span point-and-click annotation tools, radiology pipelines (e.g., 3D Slicer plugins (Peiris et al., 11 Nov 2025)), atlas-based segmentation in underrepresented contexts (Zhang et al., 20 Dec 2025), robotic grasping in unstructured scenes (Qian et al., 14 Jul 2025), and gaze-based egocentric control (Atoki et al., 24 Jul 2025). Prompt-driven, foundation-model-powered segmentation is converging toward universal, interactive, and modal-agnostic frameworks across vision domains.


References (key models and benchmarks): Zhou et al., 2024; Archit et al., 20 Jan 2025; Jiang et al., 15 Jan 2026; He et al., 2024; Zhang et al., 20 Dec 2025; Wang et al., 30 Nov 2025; Park et al., 28 Apr 2025; Gao et al., 23 Jun 2025; Peiris et al., 11 Nov 2025; Qian et al., 14 Jul 2025; Zhang et al., 15 Jan 2026; Havrylov et al., 4 May 2025; Heinemann et al., 2024; Cheng et al., 2024; Atoki et al., 24 Jul 2025; Stenhede et al., 19 Sep 2025.
