Foundation Models for Interactive 3D Biomedical Image Segmentation

Updated 6 October 2025
  • Foundation models for interactive 3D biomedical image segmentation are large-scale, promptable neural networks that generalize across modalities like MRI, CT, and ultrasound.
  • They integrate dynamic prompt engineering and iterative feedback to refine volumetric segmentations efficiently, mimicking real-world annotation workflows.
  • These models demonstrate robust performance, domain adaptability, and streamlined clinical integration, significantly reducing manual annotation burdens.

Foundation models for interactive 3D biomedical image segmentation are large-scale, promptable neural architectures designed to generalize across modalities (MRI, CT, ultrasound, microscopy), anatomical structures, and segmentation tasks with minimal training or user intervention. Unlike task-specific or organ-specific systems, they aim for robustness to domain shifts, adaptability to novel structures, and operational efficiency in clinical practice, particularly in annotation and interactive correction workflows. Central to their utility is the ability to incorporate user guidance (points, scribbles, bounding boxes, or even natural language prompts) to iteratively refine volumetric segmentations in an efficient, accurate, and scalable manner.

1. Model Architectures and Fundamental Design Principles

Foundation models for interactive 3D biomedical segmentation employ diverse architectures, generally built on transformer backbones, advanced convolutional networks, or hybrid systems.

  • Transformer-Based Architectures: The Segment Anything Model (SAM) and its successors (e.g., SAM2, SAM-Med3D, BioSAM-2) use ViT-style encoder stacks to process high-resolution 2D/3D inputs, with prompts encoded separately and fused into the network via cross-attention or memory mechanisms. Adapters (e.g., ProMISe, MedSAM, MedicoSAM) and fine-tuning strategies help bridge domain gaps and inject depth awareness into originally 2D architectures (Lee et al., 15 Jan 2024, Li et al., 2023, Archit et al., 20 Jan 2025).
  • CNN and Hybrid Models: Architectures like LIM-Net (Shen et al., 11 Dec 2024) or VISTA3D (He et al., 7 Jun 2024) integrate 2D CNNs or 3D CNNs with prompt-aware modules, facilitating lightweight and fast inference on volumetric data. Multi-path encoders for different modalities (as in F3-Net (Otaghsara et al., 11 Jul 2025)) support robust processing of missing or heterogeneous inputs.
  • Unified Segmentation Pipelines: BiomedParse (Zhao et al., 21 May 2024) demonstrates a unified model for segmentation, detection, and recognition across many object types and modalities, leveraging joint training with both visual and text encoders. nnInteractive (Isensee et al., 11 Mar 2025) provides a natively 3D, prompt-dense (points, boxes, scribbles, lasso) pipeline directly integrated with clinical viewers.
  • Prompt Encoders and Integration: Different models support point, box, scribble, lasso, and text prompts, using relative positional encodings (e.g., ENSAM (Stenhede et al., 19 Sep 2025)) and latent cross-attention to connect user inputs to the image embeddings; a minimal sketch of this fusion step follows this list. Dynamic prompt fusion (including iterative or multi-round prompt strategies (Ndir et al., 3 Oct 2025)) allows for interactive refinement and realistic simulation of human correction workflows.
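
To make the fusion step concrete, the sketch below shows prompt tokens (encoded clicks, boxes, or text) attending over flattened image-patch embeddings via cross-attention. It is an illustrative PyTorch module under assumed names and dimensions (PromptCrossAttention, embed_dim=256), not the internals of any particular model.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Illustrative fusion of prompt tokens with volumetric image embeddings.

    Prompt tokens attend over flattened image-patch embeddings, mirroring the
    cross-attention fusion used by SAM-style promptable architectures.
    """
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, prompt_tokens, image_tokens):
        # prompt_tokens: (B, P, C); image_tokens: (B, N, C) with N = D*H*W patches
        fused, _ = self.attn(query=prompt_tokens, key=image_tokens, value=image_tokens)
        return self.norm(prompt_tokens + fused)

# Example: three point prompts attending over an 8x16x16 grid of patch embeddings.
prompts = torch.randn(1, 3, 256)
patches = torch.randn(1, 8 * 16 * 16, 256)
fused = PromptCrossAttention()(prompts, patches)  # (1, 3, 256)
```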

2. Interactive Training Strategies and Prompt Engineering

Simulating realistic, interactive correction during training is a cornerstone of recent foundation models.

  • Dynamic Prompt Generation: Simulated users provide corrective prompts targeting the largest error regions in the segmentation mask, determined via binary error maps and connected component analysis. A typical pipeline computes the (3D) Euclidean distance transform of error components to select click locations maximizing user impact (Ndir et al., 3 Oct 2025); see the first sketch after this list.
  • Iterative Correction and Multi-Round Training: Models are trained to process not just initial user prompts but also subsequent feedback, carrying forward segmentation predictions from previous rounds and generating new prompts based on remaining errors. This mimics real-world annotation workflows and enhances prompt-response efficiency.
  • Quality-Adaptive Merging: The Multi-Round Result Fusion (MRF) module (Shen et al., 11 Dec 2024) uses a quality assessment network to select or merge the best segmentation result per slice following each interaction, ensuring monotonic improvement and reducing superfluous interaction.
  • Content-Aware Cropping and Efficient Patch Sampling: Content-aware adaptive cropping processes only the relevant volume around the anatomy of interest, reducing memory consumption and allowing larger effective volumes (e.g., up to 192³) to be processed in a single training or inference step (Ndir et al., 3 Oct 2025); see the second sketch after this list.
  • Stochastic Interaction Sampling: To increase robustness, models can randomly activate “no-click” or “single-click” simulation in training, preventing overfitting to any single interaction regime.
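
A minimal sketch of the corrective-click simulation referenced above, assuming NumPy/SciPy and binary prediction/ground-truth masks; the function name and return convention are illustrative, not taken from a specific codebase.

```python
import numpy as np
from scipy import ndimage

def simulate_corrective_click(pred: np.ndarray, gt: np.ndarray):
    """Pick one corrective click in the largest error region (illustrative).

    pred, gt: binary 3D masks. Returns ((z, y, x), label) where label is 1 for
    a foreground (add) click and 0 for a background (remove) click.
    """
    error = pred != gt                                   # binary error map
    if not error.any():
        return None                                      # nothing left to correct
    labels, n = ndimage.label(error)                     # connected error components
    sizes = ndimage.sum(error, labels, index=range(1, n + 1))
    component = labels == (int(np.argmax(sizes)) + 1)    # largest error component
    # Euclidean distance transform: click the most interior voxel so the
    # correction has maximal impact on the surrounding error region.
    edt = ndimage.distance_transform_edt(component)
    click = np.unravel_index(np.argmax(edt), edt.shape)
    return click, int(gt[click] > 0)                     # missed foreground -> positive click
```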
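
Content-aware cropping can be sketched similarly: take a padded bounding box around the current region of interest (for example, a coarse mask or the prompted area) and cap each edge at a maximum length. The margin, cap, and centering policy below are illustrative choices.

```python
import numpy as np

def content_aware_crop(volume: np.ndarray, roi_mask: np.ndarray,
                       margin: int = 16, max_size: int = 192):
    """Crop a padded bounding box around a non-empty ROI mask (illustrative).

    Returns the cropped sub-volume plus the slicer, so predictions on the crop
    can be pasted back into the full volume afterwards.
    """
    coords = np.argwhere(roi_mask)                       # assumes roi_mask has voxels
    lo = np.maximum(coords.min(axis=0) - margin, 0)
    hi = np.minimum(coords.max(axis=0) + 1 + margin, volume.shape)
    for ax in range(3):                                  # cap each axis, centered on ROI
        if hi[ax] - lo[ax] > max_size:
            center = (hi[ax] + lo[ax]) // 2
            lo[ax] = max(center - max_size // 2, 0)
            hi[ax] = min(lo[ax] + max_size, volume.shape[ax])
    slicer = tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))
    return volume[slicer], slicer
```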

3. Adaptation to Volumetric and Multi-Modality Data

Foundation models bridge 2D-to-3D gaps and accommodate diverse imaging inputs through several mechanisms:

  • Volumetric Aggregation: Slice-wise models concatenate features from adjacent slices and process them in a 3D decoder to restore spatial context (Lee et al., 15 Jan 2024); a minimal sketch of this aggregation follows this list. Methods such as treating 2D slices as video frames and propagating prompts or masklets through the 3D volume are actively pursued in SAM2-based approaches (Shen et al., 5 Aug 2024, Yan et al., 6 Aug 2024). VISTA3D (He et al., 7 Jun 2024) and nnInteractive (Isensee et al., 11 Mar 2025) represent architectures built natively for 3D volumetric data.
  • Supervoxel-based Zero-Shot: 3D supervoxels generated from 2D pretrained features (e.g., from SAM) are used to impart zero-shot capability for novel anatomical structures in VISTA3D (He et al., 7 Jun 2024).
  • Flexible Modality Handling: In contexts with missing MRI sequences, F3-Net (Otaghsara et al., 11 Jul 2025) creates "zero-image" placeholders, with downstream feature maps set to zero, avoiding hallucination and maintaining segmentation quality across absent modalities; a simplified sketch follows this list.
  • Multi-Path Encoders: In F3-Net, each modality is processed in a dedicated encoder path, and their outputs are fused in a shared decoder, supporting simultaneous segmentation of multiple pathologies without retraining.
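
The slice-to-volume aggregation idea can be sketched as follows: per-slice features from a 2D encoder are stacked along the depth axis and refined by a small 3D decoder. Layer shapes and the module name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SliceToVolumeAggregator(nn.Module):
    """Illustrative 2D-to-3D bridge: stack per-slice features and decode in 3D."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, kernel_size=1),   # binary mask logits
        )

    def forward(self, slice_feats):
        # slice_feats: list of D tensors of shape (B, C, H, W) from a 2D encoder.
        volume = torch.stack(slice_feats, dim=2)     # (B, C, D, H, W)
        return self.decoder(volume)                  # (B, 1, D, H, W)

feats = [torch.randn(1, 64, 32, 32) for _ in range(16)]   # 16 slices
logits = SliceToVolumeAggregator()(feats)                  # (1, 1, 16, 32, 32)
```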
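
For missing modalities, a simplified stand-in for a multi-path encoder with zero-image handling is sketched below. Module names, layer sizes, and the direct zeroing of the missing path's features are assumptions made for brevity, not F3-Net's actual implementation.

```python
import torch
import torch.nn as nn

class MultiPathEncoder(nn.Module):
    """Illustrative multi-path encoder: one path per MRI sequence."""
    def __init__(self, modalities=("t1", "t1ce", "t2", "flair"), channels: int = 16):
        super().__init__()
        self.paths = nn.ModuleDict({
            m: nn.Conv3d(1, channels, kernel_size=3, padding=1) for m in modalities
        })

    def forward(self, inputs: dict):
        # inputs maps modality name -> (B, 1, D, H, W) tensor, or None if missing.
        feats = []
        for name, conv in self.paths.items():
            x = inputs.get(name)
            if x is None:
                # Missing sequence: emit zero features so the decoder sees
                # "no evidence" instead of hallucinated content.
                ref = next(v for v in inputs.values() if v is not None)
                feats.append(torch.zeros(ref.shape[0], conv.out_channels,
                                         *ref.shape[2:], device=ref.device))
            else:
                feats.append(conv(x))
        return torch.cat(feats, dim=1)   # fused downstream in a shared decoder

# Example: FLAIR absent for this case.
batch = {"t1": torch.randn(1, 1, 32, 64, 64), "t1ce": torch.randn(1, 1, 32, 64, 64),
         "t2": torch.randn(1, 1, 32, 64, 64), "flair": None}
fused = MultiPathEncoder()(batch)   # (1, 64, 32, 64, 64)
```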

4. Evaluation Metrics and Benchmarking

Standardized, multi-dimensional evaluation is critical for comparing segmentation models.

  • Overlap and Boundary Metrics: The Dice Similarity Coefficient (DSC) quantifies volumetric overlap; Normalized Surface Dice (NSD), Average Symmetric Surface Distance (ASD), and Hausdorff distance (HD95) assess boundary alignment (Li et al., 10 Jul 2024, He et al., 15 Jan 2025).
  • Area-Under-Curve (AUC) Metrics: In interactive settings, evaluating DSC and NSD after each interaction and calculating the AUC (cumulative score across a fixed number of correction rounds) captures both refinement speed and eventual segmentation quality (Ndir et al., 3 Oct 2025, Stenhede et al., 19 Sep 2025); see the sketch after this list.
  • Real-World Datasets and Scenarios: Models are benchmarked on diverse, large-scale datasets such as IMed-361M (Cheng et al., 19 Nov 2024), BiomedParseData (Zhao et al., 21 May 2024), BraTS, ISLES, and multicenter clinical cohorts (across CT, MRI, PET, ultrasound, microscopy), encompassing hundreds of anatomical targets and varied clinical settings.
  • Annotation Efficiency: A central objective is reducing the number of user corrections needed to reach high segmentation accuracy (e.g., 90% Dice), with some approaches reporting a 30–60% reduction compared to baselines (Wong et al., 19 Dec 2024, Shen et al., 11 Dec 2024).
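
The core quantities are simple to state in code. Below, Dice is the standard overlap ratio, and the interaction AUC is taken as the mean per-round score, one common normalization of the area under the score-versus-round curve; specific benchmarks may weight rounds differently.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient: DSC = 2 * |A intersect B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def interaction_auc(per_round_scores) -> float:
    """Mean score across a fixed number of correction rounds.

    Rewards models that reach high quality early, not only at the final round.
    """
    return float(np.mean(per_round_scores))

# Example: Dice after the initial prompt and four corrective clicks.
scores = [0.62, 0.78, 0.85, 0.88, 0.90]
print(interaction_auc(scores))   # 0.806
```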

5. Generalization, Adaptation, and Knowledge Retention

Ensuring long-term utility and adaptability is a major focus of foundation models:

  • Domain Adaptation and Few-Shot Approaches: Transfer learning and adapter tuning (ProMISe (Li et al., 2023), BioSAM-2 (Yan et al., 6 Aug 2024), FATE-SAM (He et al., 15 Jan 2025)) enable strong zero-shot and few-shot 3D segmentation, often leveraging a support library of annotated examples and attention-based fusion for segmenting unseen anatomical structures.
  • Sequential Fine-Tuning: MedSeqFT (Ye et al., 7 Sep 2025) formalizes sequential, knowledge-retentive adaptation for foundation models. Through Maximum Data Similarity (MDS) selection and LoRA-based knowledge distillation, models can be tuned to multiple, emergent tasks while minimizing catastrophic forgetting and retaining strong performance on earlier or unseen domains; a minimal LoRA sketch follows this list.
  • In-Context and Memory-Based Reasoning: MultiverSeg (Wong et al., 19 Dec 2024) introduces in-context guidance, accumulating segmented examples in a “context set” that is progressively used to inform the segmentation of new, related images with fewer interactions, amortizing annotation effort across datasets.
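
MedSeqFT's full recipe (MDS selection plus LoRA-based distillation) is paper-specific, but the LoRA building block itself is standard and can be sketched in a few lines: the pretrained weight is frozen and only a low-rank update is trained, which limits drift from the pretrained model during sequential adaptation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: y = W x + (alpha / r) * B(A x), with W frozen."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # start as a zero update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrap one projection layer of a pretrained encoder.
adapted = LoRALinear(nn.Linear(256, 256), rank=8)
out = adapted(torch.randn(2, 256))   # (2, 256)
```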

6. Efficiency, Scalability, and Clinical Integration

Deployment efficiency and seamless clinical integration are essential design criteria:

  • Model Size and Optimization: Lightweight models such as ENSAM (Stenhede et al., 19 Sep 2025) employ efficient SegResNet-based backbones, relative positional encodings, normalized attention, and advanced optimizers (Muon) to achieve strong performance under tight computational and memory budgets.
  • Integration with Clinical Platforms: Tools like nnInteractive (Isensee et al., 11 Mar 2025) are implemented as plugins for Napari and MITK, allowing direct adoption in established clinical and research workflows, supporting intuitive 2D interactions with full 3D segmentation outputs.
  • Robustness to User Variability: Support for diverse prompt types (points, scribbles, bounding boxes, text, lasso) and both positive/negative cues increases robustness and usability in practice, accommodating the variability in annotation styles and clinical needs (Isensee et al., 11 Mar 2025, Shen et al., 11 Dec 2024).

7. Limitations and Future Directions

While foundation models for interactive 3D segmentation have demonstrated substantial progress, several challenges and research frontiers persist:

  • Spatial Continuity and Context: Many methods still process 3D data slice-wise or in 2.5D, limiting spatial coherence. There is a drive toward truly volumetric attention and decoding architectures to improve 3D consistency (Lee et al., 15 Jan 2024, He et al., 7 Jun 2024, Yan et al., 6 Aug 2024).
  • Instance Segmentation and Semantic Labels: While semantic segmentation is mature, distinguishing multiple instances of the same object class (e.g., individual cells in pathology) or associating class-level semantics remains limited and usually requires post-processing (Zhao et al., 21 May 2024, Yan et al., 6 Aug 2024).
  • Universal Architecture and Training: Unified joint learning for segmentation, detection, and recognition across all relevant biomedical structures is possible only with integrated architectures capable of handling diverse input types and supervision (e.g., BiomedParse (Zhao et al., 21 May 2024)). Extension to interactive dialogue (e.g., natural language feedback) is an active area for expansion.
  • Generalization to Rare Conditions and Modalities: Domain adaptation to rare diseases or modalities with little labeled data remains an open issue. Efficient data-selection strategies (e.g., MDS in MedSeqFT (Ye et al., 7 Sep 2025)) and hybrid self-supervised approaches are a key research theme.
  • Annotation and Resource Constraints: Lightweight designs and memory-efficient propagation (as in LIM-Net (Shen et al., 11 Dec 2024) and content-aware cropping (Ndir et al., 3 Oct 2025)) are critical for deployment in settings with restricted resources, demanding continual advances in architectural and algorithmic efficiency.

Foundation models for interactive 3D biomedical image segmentation represent a convergence of advances in large-scale pretraining, prompt engineering, efficient and robust architecture design, and realistic simulation of clinical annotation workflows. As models continue to bridge domain gaps, support open-set operation, and minimize annotation burden, their impact across clinical research, diagnostic radiology, and multi-modal quantitative analysis will remain transformative. Recent benchmarks and competitive evaluations underscore both their capabilities and the remaining challenges, directing future efforts toward seamless, robust, and universally deployable segmentation solutions (Lee et al., 15 Jan 2024, Li et al., 2023, He et al., 7 Jun 2024, Isensee et al., 11 Mar 2025, Ndir et al., 3 Oct 2025).
