CellSAM: Robust Cell Segmentation

Updated 18 June 2026

CellSAM is a suite of methodologies integrating the Segment Anything Model for enhanced cell segmentation, counting, and tracking across microscopy images.
It employs a novel CellFinder detection head with optimized box prompts to reduce segmentation errors and enable high-throughput bioimaging.
The framework supports flexible architectures including dual-backbone fusion and zero-shot pipelines, achieving state-of-the-art performance on diverse benchmarks.

CellSAM defines a family of methodologies and architectures that leverage the Segment Anything Model (SAM) and its derivatives to enable robust, versatile cell instance segmentation, counting, and tracking in diverse microscopy imaging contexts. These approaches address limitations of specialist models by combining prompt-driven mask generation, domain-adapted detection heads, and, in some cases, dual-backbone fusion architectures or zero-shot prompting regimes. CellSAM not only attains state-of-the-art (SOTA) performance on canonical segmentation benchmarks across mammalian tissues, cultured cells, yeast, and bacteria, but also provides a universally deployable foundation for large-scale annotation, high-throughput biology, and downstream single-cell analytics (Israel et al., 2023, Archit et al., 18 Mar 2026, Fang et al., 23 Jul 2025).

1. Background and Motivation

Cell segmentation is a foundational task for quantitative microscopy, underpinning applications from spatial omics to drug discovery. Prevailing tools such as Cellpose and Mesmer rely on flow-field or centroid-boundary representations, which restrict their generalizability to specific imaging modalities or morphological regimes. These methods are sensitive to changes in cell morphology, imaging artifacts, and annotation sparsity, and so require either retraining or intensive manual tuning on new datasets (Israel et al., 2023). The advent of promptable vision foundation models, particularly SAM, enabled the design of "foundation models for cells": models that can be prompted for instance masks in a domain-agnostic or rapidly adaptable fashion.

CellSAM generalizes this paradigm by integrating a detection head (CellFinder) for automated prompt generation, providing end-to-end mask inference with minimal manual intervention (Israel et al., 2023, Archit et al., 18 Mar 2026). This allows adaptation to dense fields, multiclass cell types, and highly variable microscopy modalities, addressing limitations observed in earlier architectures that rigidly associate pixels with cells or cannot handle overlapping objects.

2. Core Architecture and Prompting Strategies

CellSAM architectures are typically constructed upon the original SAM backbone, which includes:

A ViT-based image encoder pretrained on natural images and a billion masks.
A prompt encoder capable of embedding points, boxes, or masks.
A lightweight mask decoder producing pixelwise mask logits and an auxiliary IoU confidence.

The key differentiator is CellFinder, a transformer-based object detector that shares the encoder weights and produces bounding box prompts for SAM. CellFinder resembles Anchor DETR: transformer queries predict candidate cells with box coordinates and confidence scores (Israel et al., 2023, Archit et al., 18 Mar 2026). During inference, CellFinder detections above a confidence threshold (e.g., 0.4) are converted directly into box prompts for SAM, which then generates the final masks. Non-maximum suppression is deliberately omitted so that overlapping or touching instances, which are prevalent in high-density imaging, are not discarded.
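
The thresholding step described above can be sketched as follows. This is a minimal illustration, not the CellSAM implementation: the `Detection` container, the function name, and the assumption that the detector emits normalized (cx, cy, w, h) boxes in the DETR style are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    # Hypothetical container. Normalized (cx, cy, w, h) in [0, 1] is an
    # assumed DETR-style output format, not a documented CellFinder API.
    cx: float
    cy: float
    w: float
    h: float
    score: float


def detections_to_box_prompts(detections, image_width, image_height,
                              score_threshold=0.4):
    """Keep detections above the threshold and return pixel-space
    (x0, y0, x1, y1) boxes. No non-maximum suppression is applied,
    so overlapping boxes are all kept."""
    prompts = []
    for det in detections:
        if det.score <= score_threshold:
            continue
        x0 = (det.cx - det.w / 2) * image_width
        y0 = (det.cy - det.h / 2) * image_height
        x1 = (det.cx + det.w / 2) * image_width
        y1 = (det.cy + det.h / 2) * image_height
        # Clip each corner to the image bounds.
        prompts.append((max(0.0, x0), max(0.0, y0),
                        min(float(image_width), x1),
                        min(float(image_height), y1)))
    return prompts


detections = [
    Detection(0.50, 0.50, 0.20, 0.20, 0.90),
    Detection(0.52, 0.50, 0.20, 0.20, 0.80),  # overlaps the first, still kept
    Detection(0.10, 0.10, 0.10, 0.10, 0.30),  # below threshold, dropped
]
boxes = detections_to_box_prompts(detections, 100, 100)
```

Each returned box would then be passed to the prompt encoder as one box prompt, producing one mask per surviving detection.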

Prompt selection is domain-optimized: box prompts outperform point or mask prompts in nearly all microscopy regimes (Israel et al., 2023, Fig. 2b), particularly in tissue and phase-contrast images, where they yield a twofold reduction in segmentation error relative to point-only strategies.

3. Variants: Dual-Backbone Fusions and Pure Zero-Shot Pipelines

Several extensions and adaptations of CellSAM have been evaluated. ScSAM (Fang et al., 23 Jul 2025) proposes fusing two frozen encoder backbones: the SAM encoder pretrained on natural images, and a masked autoencoder (MAE) trained on electron microscopy crops. This dual-stream architecture is unified via a Feature Alignment & Fusion Module (FAFM), which aligns the two encodings with small MLPs, combines them via channel attention, and further refines the fused feature space with cosine similarity–based prototype activation. This enables mask generation even in the face of extreme morphology class imbalance or highly variable background.
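
To make the three fusion steps concrete, the sketch below walks through alignment, channel attention, and cosine-similarity prototype activation on token features. It is an illustrative NumPy approximation under stated assumptions, not the ScSAM code: the function and parameter names, the single linear layer with ReLU standing in for each small MLP, the squeeze-and-gate form of channel attention, and the additive merge of the two gated streams are all assumptions.

```python
import numpy as np


def init_params(c_sam, c_mae, d, rng):
    """Random weights for the sketch (hypothetical shapes and names)."""
    return {
        "w_sam": rng.normal(scale=0.1, size=(c_sam, d)),
        "b_sam": np.zeros(d),
        "w_mae": rng.normal(scale=0.1, size=(c_mae, d)),
        "b_mae": np.zeros(d),
        "w_gate": rng.normal(scale=0.1, size=(2 * d, 2 * d)),
    }


def fuse_features(sam_feat, mae_feat, params, prototypes, eps=1e-8):
    """sam_feat: (N, C_sam), mae_feat: (N, C_mae) token features.
    Returns fused features (N, D) and prototype activations (N, K)."""
    d = params["w_sam"].shape[1]
    # 1. Alignment: project both streams to a shared width D.
    a = np.maximum(sam_feat @ params["w_sam"] + params["b_sam"], 0.0)
    b = np.maximum(mae_feat @ params["w_mae"] + params["b_mae"], 0.0)
    # 2. Channel attention: average over tokens, then a sigmoid gate
    #    per channel rescales the concatenated streams.
    stacked = np.concatenate([a, b], axis=1)           # (N, 2D)
    squeeze = stacked.mean(axis=0)                     # (2D,)
    gate = 1.0 / (1.0 + np.exp(-(squeeze @ params["w_gate"])))
    gated = stacked * gate
    fused = gated[:, :d] + gated[:, d:]                # (N, D)
    # 3. Prototype activation: cosine similarity of each token to each
    #    class prototype.
    f = fused / (np.linalg.norm(fused, axis=1, keepdims=True) + eps)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + eps)
    return fused, f @ p.T


rng = np.random.default_rng(0)
params = init_params(c_sam=8, c_mae=6, d=4, rng=rng)
fused, activation = fuse_features(
    rng.normal(size=(5, 8)), rng.normal(size=(5, 6)),
    params, prototypes=rng.normal(size=(3, 4)))
```

Because both encoders stay frozen, only the small projection, gate, and prototype parameters would need gradients in a setup of this kind.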

To adapt ScSAM to "CellSAM" for whole-cell segmentation, prototype sets are redefined for macroscopic cell classes (e.g., "cell interior," "boundary," "background"), and only the prompt encoder and mask decoder are fine-tuned on cell-level annotations while both backbones are kept frozen. This modularity enables efficient adaptation, converging in ≲50 epochs and under four hours on a single GPU, with only ~27M parameters updated (Fang et al., 23 Jul 2025).

Zero-shot strategies, such as subCellSAM (Hanimann et al., 19 Aug 2025), implement a three-stage pipeline using off-the-shelf SAM or its variants without fine-tuning. Prompt engineering is carried out by automated sampling of points and low-resolution prompt masks based on spatial and topological priors, performing iterative mask growth from nuclei seeds and using neighboring centroids to enforce repulsion and avoid instance merging. This enables accurate segmentation in data- and annotation-limited domains.
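
One way to read the "repulsion" idea is that each nucleus seed becomes a positive point prompt while its nearest neighbouring centroids become negative point prompts. The sketch below implements only that reading; it is an assumption about the mechanism, not the subCellSAM pipeline, and `build_point_prompts` and `num_neighbors` are hypothetical names. The label convention (1 for foreground, 0 for background) follows SAM's point prompts.

```python
import math


def build_point_prompts(centroids, num_neighbors=3):
    """For each nucleus centroid, return one positive point (the seed)
    and up to `num_neighbors` nearest other centroids as negative points."""
    prompts = []
    for i, seed in enumerate(centroids):
        others = [c for j, c in enumerate(centroids) if j != i]
        others.sort(key=lambda c: math.dist(seed, c))
        negatives = others[:num_neighbors]
        prompts.append({
            "points": [seed] + negatives,
            # 1 marks the seed as foreground, 0 marks neighbours as background.
            "labels": [1] + [0] * len(negatives),
        })
    return prompts


centroids = [(10, 10), (12, 10), (50, 50), (100, 100)]
prompts = build_point_prompts(centroids, num_neighbors=2)
```

The negative points discourage a growing mask from absorbing an adjacent cell, which is the failure mode that instance merging describes.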

4. Training Protocols, Datasets, and Losses

Training a typical CellSAM pipeline proceeds in two phases (Israel et al., 2023):

CellFinder is trained with AdamW, using joint fine-tuning of the ViT backbone and an Anchor DETR-style transformer decoder. Input images are preprocessed by normalization and patch tiling, and batch sizes are set by GPU memory constraints.
Subsequently, the mask decoder "neck" is fine-tuned with the rest of the model frozen, taking ground-truth boxes and masks as supervision. Losses include standard detection terms (cross-entropy and L1+GIoU for boxes) and per-pixel binary cross-entropy for masks.
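
The loss terms named above can be written out for a single box pair and a handful of mask pixels. This is a plain-Python sketch for illustration: the weights `l1_weight` and `giou_weight` are placeholder values rather than figures from the source, boxes are assumed to be (x0, y0, x1, y1) with positive area, and the classification cross-entropy term is omitted.

```python
import math


def giou(box_a, box_b):
    """Generalized IoU for two (x0, y0, x1, y1) boxes with positive area."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (enclose - union) / enclose


def box_loss(pred, target, l1_weight=5.0, giou_weight=2.0):
    """Weighted sum of the L1 coordinate error and (1 - GIoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1_weight * l1 + giou_weight * (1.0 - giou(pred, target))


def mask_bce(logits, targets):
    """Mean per-pixel binary cross-entropy on raw logits, using the
    numerically stable form max(x, 0) - x*t + log(1 + exp(-|x|))."""
    total = 0.0
    for x, t in zip(logits, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)
```

GIoU stays informative when the predicted and target boxes do not overlap at all, which plain IoU does not, and that is the usual reason for pairing it with L1 in DETR-style detectors.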

Training datasets cover a diverse array of public and internal resources: TissueNet, Phase400, Cellpose, DeepBacs, Omnipose, YeastNet, and DSB2018 for nuclei, among others. Preprocessing enforces consistent scaling (e.g., 512×512 or 1024×1024 patches), normalization, and channel format. The learning rate schedule, regularization, and number of fine-tuning epochs are tuned per dataset to maximize generalizability (Israel et al., 2023, Archit et al., 18 Mar 2026).
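
A minimal preprocessing sketch along these lines is shown below. The source does not specify the normalization scheme or padding strategy, so the percentile-based rescaling, zero padding, and non-overlapping tiling used here are assumptions, as are the function names.

```python
import numpy as np


def normalize(image, low_pct=1.0, high_pct=99.0):
    """Rescale intensities to [0, 1] between two percentiles
    (an assumed scheme; the source only says 'normalization')."""
    lo, hi = np.percentile(image, [low_pct, high_pct])
    scaled = (image.astype(np.float32) - lo) / max(hi - lo, 1e-8)
    return np.clip(scaled, 0.0, 1.0)


def tile_image(image, tile=512):
    """Zero-pad a (H, W) or (H, W, C) image to a multiple of `tile`
    and split it into non-overlapping tiles.
    Returns a list of ((row_offset, col_offset), patch) pairs."""
    h, w = image.shape[:2]
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    pad = [(0, pad_h), (0, pad_w)] + [(0, 0)] * (image.ndim - 2)
    padded = np.pad(image, pad, mode="constant")
    tiles = []
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            tiles.append(((y, x), padded[y:y + tile, x:x + tile]))
    return tiles


image = np.arange(600 * 700, dtype=np.float64).reshape(600, 700)
normalized = normalize(image)
tiles = tile_image(normalized, tile=512)
```

The stored offsets allow per-tile masks to be placed back into the full field of view after inference.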

ScSAM and its cell-level adaptation employ three loss terms: (1) cosine alignment of dual-stream features, (2) class-prototype contrastive (NTXent) loss to anchor clusters in the prompt space, and (3) per-class Dice on masks to mitigate class imbalance (Fang et al., 23 Jul 2025).
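
The three terms can be sketched in NumPy as follows. These are generic textbook forms chosen for illustration and may differ from the exact formulations in ScSAM: in particular, the contrastive term is written as a cross-entropy over temperature-scaled cosine similarities to class prototypes, which is an NT-Xent-style stand-in, and the `temperature` value is an assumption.

```python
import numpy as np


def _unit(x, eps=1e-8):
    """L2-normalize the rows of a 2-D array."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cosine_alignment_loss(feat_a, feat_b):
    """Mean (1 - cosine similarity) between paired rows of two streams."""
    return float(np.mean(1.0 - np.sum(_unit(feat_a) * _unit(feat_b), axis=1)))


def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    """Cross-entropy over temperature-scaled cosine similarities between
    each feature (N, D) and every class prototype (K, D)."""
    logits = _unit(features) @ _unit(prototypes).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_prob[np.arange(len(labels)), labels]))


def per_class_dice_loss(probs, onehot, eps=1e-6):
    """probs, onehot: (K, H, W). Dice is computed per class and then
    averaged, so rare classes weigh as much as common ones."""
    inter = (probs * onehot).sum(axis=(1, 2))
    denom = probs.sum(axis=(1, 2)) + onehot.sum(axis=(1, 2))
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(np.mean(1.0 - dice))
```

Averaging Dice per class rather than per pixel is what counteracts class imbalance: a class covering a small fraction of pixels contributes the same weight to the loss as the background.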

5. Quantitative Results and Benchmarking

CellSAM demonstrates consistent improvements over legacy SAM and competitive specialist models. For example, in (Israel et al., 2023):

Object-level F1 (IoU ≥ 0.6), the main instance segmentation metric, is systematically higher for box-prompted CellSAM than for Cellpose across tissue, cell culture, bacterial, yeast, and nuclear categories (see Fig. 2c).
Zero-shot evaluation on unseen LIVECell images yields >4× better F1 for CellSAM than Cellpose, despite substantial label noise in LIVECell (Israel et al., 2023).
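
For reference, an object-level F1 at a fixed IoU threshold can be computed as in the sketch below. It assumes integer label images with 0 as background, which cannot represent overlapping instances, and uses a simple greedy match; it is not the evaluation code from the cited paper. For thresholds above 0.5 the greedy match is exact, since each ground-truth object can then match at most one prediction.

```python
import numpy as np


def object_f1(gt_labels, pred_labels, iou_threshold=0.6):
    """Object-level F1 between two integer label images (0 = background).
    A ground-truth object is a true positive when its best unmatched
    prediction reaches the IoU threshold."""
    gt_ids = [i for i in np.unique(gt_labels) if i != 0]
    pred_ids = [i for i in np.unique(pred_labels) if i != 0]
    matched = set()
    tp = 0
    for g in gt_ids:
        g_mask = gt_labels == g
        best_iou, best_p = 0.0, None
        for p in pred_ids:
            if p in matched:
                continue
            p_mask = pred_labels == p
            inter = np.logical_and(g_mask, p_mask).sum()
            if inter == 0:
                continue
            iou = inter / np.logical_or(g_mask, p_mask).sum()
            if iou > best_iou:
                best_iou, best_p = iou, p
        if best_p is not None and best_iou >= iou_threshold:
            matched.add(best_p)
            tp += 1
    fp = len(pred_ids) - tp   # predictions with no matching object
    fn = len(gt_ids) - tp     # objects with no matching prediction
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


gt = np.zeros((10, 10), dtype=int)
gt[0:4, 0:4] = 1
gt[6:10, 6:10] = 2
pred = np.zeros((10, 10), dtype=int)
pred[0:4, 0:4] = 1    # perfect match, IoU = 1.0
pred[6:8, 6:10] = 2   # half of the object, IoU = 0.5, below threshold
score = object_f1(gt, pred)
```

In this toy case one object is matched and one is missed, giving one true positive, one false positive, and one false negative.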

The ScSAM-to-CellSAM adaptation achieves per-class Dice ranging from ~0.77 to ~0.98 on BetaSeg subcellular benchmarks (Fang et al., 23 Jul 2025). subCellSAM, operating in true zero-shot mode, produces DSC up to 0.901 and IoU up to 0.832 on BBBC008, surpassing CellSAM (as reported in that paper, likely reflecting reproducibility and prompt optimization differences) (Hanimann et al., 19 Aug 2025).

Practical benchmarking (Fig. 2b in (Israel et al., 2023)):

| Prompt Strategy | Object F1 (Tissue) | Object F1 (Cell culture) |
|-------------------------------|--------------------|--------------------------|
| Point (zero-shot, centroid) | Poor (low F1) | Poor (low F1) |
| Box (zero-shot, CellFinder) | High (SOTA) | High (SOTA) |

A plausible implication is that, across domains, prompt selection and detector quality are the primary drivers of realized mask accuracy, with box-prompted CellSAM outperforming conventional CNN-based pipelines on most benchmarks.

6. Practical Deployment, Applications, and Integration

CellSAM is designed for accessibility and throughput. It is integrated into DeepCell Label, where CellFinder proposals accelerate annotation by reducing the manual effort to 1–2 clicks per cell, halving human labor versus manual outlining (Israel et al., 2023). The codebase, dataset splits, and all model modifications are released under the MIT license, facilitating wide adoption.

The architecture is also amenable to human-in-the-loop and high-throughput deployment: on an A100 GPU, inference latency is ≈150 ms per 512×512 image for CellFinder, plus 50 ms for mask decoding for ~1,000 cells, scaling efficiently in Kubernetes clusters (Israel et al., 2023). The generalization to whole-cell and subcellular segmentation, as well as human tissue, yeast, bacteria, and even counting tasks (via the SAM-Counter adaptation (Mohammed et al., 24 Nov 2025)), demonstrates remarkable domain breadth.
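
As a back-of-the-envelope check on these figures, the sketch below converts the quoted latencies into throughput. It assumes the two stages run sequentially so that their times add per image, and that throughput scales linearly with the number of GPUs; neither assumption is stated in the source, and `images_per_hour` is a hypothetical helper.

```python
def images_per_hour(detect_ms=150.0, decode_ms=50.0, num_gpus=1):
    """Estimated throughput when detection and mask decoding run one
    after the other on each image and GPUs scale linearly."""
    per_image_s = (detect_ms + decode_ms) / 1000.0
    return num_gpus * 3600.0 / per_image_s


single_gpu = images_per_hour()
eight_gpus = images_per_hour(num_gpus=8)
```

Under these assumptions a single A100 processes about 5 images per second, or roughly 18,000 images of 512×512 pixels per hour.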

7. Limitations, Open Problems, and Future Directions

Key limitations of CellSAM approaches include:

Significant performance loss in extremely high-density cell clusters (>3,000 cells/FOV), where missed instance detections or merged masks can occur (Israel et al., 2023).
Lack of native support for 3D volumetric segmentation; successful extension would require 3D ViT backbones and/or efficient attention mechanisms such as FlashAttention.
Limited marker and modality transferability, as evidenced by the poor cross-stain generalization of SAM-Counter trained solely on DAPI (its mean absolute counting error rises above 600 on Cy3/AF488) (Mohammed et al., 24 Nov 2025).
Limitations of global optimization via integer linear programming (ILP) on large histopathology images, and the challenge of local intensity modeling for weak-contrast segmentation (Tyagi et al., 2023).

Proposed future work involves:

Volumetric CellSAM via Swin/Hyena 3D attention backbones and prompt-fusion for multi-channel imaging.
Vision-language CellSAM leveraging text prompts for semantic instance selection.
Few-shot adaptation and parameter-efficient fine-tuning techniques (e.g., adapters, LoRA).
Deep integration with single-cell omics assays via multimodal alignment of masks and spatial barcodes (Israel et al., 2023, Fang et al., 23 Jul 2025).

In summary, CellSAM, both as a concrete model and as an umbrella for foundation model–based cell segmentation, establishes a versatile, extensible standard capable of supporting rapid annotation, robust segmentation, and quantitative single-cell analysis across the full spectrum of modern bioimaging.