
FastSAM: Fast Segment Anything

Updated 3 July 2026
  • The paper introduces FastSAM, which achieves real-time segmentation by replacing SAM’s prompt-conditioned transformer with a two-stage instance segmentation pipeline.
  • The 2D model pairs a YOLOv8-seg backbone with prompt-guided mask selection; a separate volumetric variant, FastSAM-3D, targets 3D medical volumes.
  • Benchmark results show significant speed-up and a reduction in parameters, though fine-grained mask accuracy is traded off compared to the original SAM design.

Fast Segment Anything (FastSAM) denotes a family of architectures and workflows focused on accelerating the “segment anything” paradigm through radical simplification and architectural replacement of the original SAM design. FastSAM achieves real-time or near-real-time instance and promptable segmentation via two core instantiations: (1) an image segmentation model that eschews prompt-conditioned transformers in favor of a two-stage instance segmentation and mask selection pipeline, and (2) a volumetric variant, FastSAM-3D, for efficient, promptable 3D medical image segmentation deployed in FastSAM-3DSlicer. The following sections survey the full technical landscape, positioning FastSAM and FastSAM-3D in the broader context of foundation segmenters, highlighting architectural trade-offs, efficiency, accuracy, and workflow integration (Zhao et al., 2023, Sun et al., 2024, Shen et al., 2024).

1. Architectural Rationale and Core Problem Reformulation

FastSAM abandons the original SAM’s prompt-conditioned transformer design in favor of a two-stage instance segmentation and prompt-guided mask selection formulation. Instead of using a ViT-based encoder and promptable transformer decoder, FastSAM performs:

  1. All-Instance Segmentation: The image is passed through a CNN-based instance segmentation network (e.g., YOLOv8-seg), generating candidate masks, bounding boxes, and confidence scores for all visible objects or regions. This stage is class-agnostic and produces a dense set of candidate masks in a single forward pass.
  2. Prompt-Guided Selection: Given a user prompt (point, box, or text), relevant instance mask(s) are selected post hoc, typically by geometric or embedding-based matching (point-in-mask logic, maximized box IoU, or CLIP similarity for text prompts). The prompt handling thus becomes an efficient selection/filtering operation, decoupled from heavy recomputation (Zhao et al., 2023, Sun et al., 2024).

This decomposition is summarized by

$$\text{Segment Anything} \longrightarrow \text{All-Instance Segmentation} + \text{Prompt-Guided Selection}$$

yielding both the “SegAny” (promptable segmentation) and “SegEvery” (segment everything) workflows.
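The decomposition can be illustrated with a minimal sketch. The class name `TwoStageSegmenter`, its methods, and the stub `toy_segment_all` are hypothetical and not part of any FastSAM codebase; the stub stands in for a real detector so that the control flow is visible: one forward pass per image, then any number of prompts answered by lookup.

```python
from typing import Callable, List, Optional, Sequence

Mask = List[List[bool]]  # H x W grid of booleans


class TwoStageSegmenter:
    """Hypothetical wrapper: segment everything once, answer prompts by selection."""

    def __init__(self, segment_all: Callable[[object], Sequence[Mask]]):
        self.segment_all = segment_all
        self.forward_passes = 0
        self._masks: Optional[List[Mask]] = None

    def set_image(self, image: object) -> None:
        # Stage 1: a single forward pass yields all candidate masks.
        self._masks = list(self.segment_all(image))
        self.forward_passes += 1

    def segment_everything(self) -> List[Mask]:
        # "SegEvery": return every cached candidate mask.
        if self._masks is None:
            raise RuntimeError("call set_image first")
        return self._masks

    def segment_at_point(self, x: int, y: int) -> List[Mask]:
        # "SegAny": stage 2 is pure selection over the cached masks.
        return [m for m in self.segment_everything() if m[y][x]]


def toy_segment_all(image: object) -> List[Mask]:
    """Stand-in for a real detector: returns two fixed 2x2 masks."""
    left = [[True, False], [True, False]]
    right = [[False, True], [False, True]]
    return [left, right]


seg = TwoStageSegmenter(toy_segment_all)
seg.set_image(image=None)
everything = seg.segment_everything()
hits = [seg.segment_at_point(x, 0) for x in (0, 1, 0)]
print(seg.forward_passes)  # 1: three prompts, still one forward pass
```

In this arrangement both workflows share the same cached masks, which is why adding prompts does not trigger further network inference.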

2. Pipeline and Network Implementation

2.1. Image Segmentation Backbone

The primary FastSAM variant is realized via a YOLOv8-seg architecture, an anchor-free instance segmentation detector built to output:

  • Detection Branch: bounding boxes, class labels, and detection confidence
  • Segmentation Branch: $k$ prototype masks (default $k=32$) and corresponding per-instance coefficients

Each instance mask $m_i$ is constructed via

$$m_i = \sum_{j=1}^{k} \alpha_{ij} P_j$$

where $P_j$ are prototype masks and $\alpha_{ij}$ are predicted coefficients for instance $i$. This design, inspired by YOLACT, allows the entire set of candidate masks to be computed in a single inference step with subsequent mask selection requiring negligible additional computation (Zhao et al., 2023, Sun et al., 2024).
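As a rough illustration of the prototype combination, the NumPy sketch below forms all instance masks with a single tensor contraction. The function name `assemble_masks` and the sigmoid-plus-0.5-threshold binarization are assumptions made for this example (the latter follows YOLACT-style heads); the source does not specify how FastSAM binarizes the result.

```python
import numpy as np


def assemble_masks(prototypes: np.ndarray, coeffs: np.ndarray,
                   threshold: float = 0.5) -> np.ndarray:
    """Combine k prototype masks into n instance masks.

    prototypes: (k, H, W) array of prototype masks P_j
    coeffs:     (n, k) array of per-instance coefficients alpha_ij
    Returns a boolean array of shape (n, H, W).
    """
    # m_i = sum_j alpha_ij * P_j, evaluated for every instance at once.
    combined = np.einsum("nk,khw->nhw", coeffs, prototypes)
    # Assumption: squash with a sigmoid and threshold to get binary masks.
    probs = 1.0 / (1.0 + np.exp(-combined))
    return probs > threshold


# Toy example with k=2 prototypes on a 2x2 grid.
prototypes = np.array([
    [[4.0, -4.0], [4.0, -4.0]],   # P_1: responds to the left column
    [[-4.0, -4.0], [4.0, 4.0]],   # P_2: responds to the bottom row
])
coeffs = np.array([
    [1.0, 0.0],   # instance 0 uses only P_1
    [0.0, 1.0],   # instance 1 uses only P_2
    [1.0, 1.0],   # instance 2 mixes both
])
masks = assemble_masks(prototypes, coeffs)
print(masks.astype(int))
```

Because the prototypes are shared, the cost of adding another instance is one extra row in `coeffs` rather than another pass through the network.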

2.2. Prompt Types and Selection Logic

Prompt inputs supported natively include:

  • Point Prompts: Foreground/background points guide selection via inclusion/exclusion in predicted masks.
  • Box Prompts: Mask selection by maximizing box-mask IoU with the prompt bounding box.
  • Text Prompts: Masks are matched to CLIP text embeddings for semantic promptability (at additional computational cost).

Prompt handling operates as post-processing, not end-to-end joint learning as in SAM. This approach yields a runtime that is effectively independent of prompt count, in contrast to SAM’s prompt-conditioned transformer decoding (Zhao et al., 2023).
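The point and box rules can be sketched in a few lines of NumPy. This is one plausible reading of the selection logic described above, not the reference implementation: the function names are hypothetical, the box is rasterized to a mask before computing IoU, and coordinates are assumed to be integer pixel indices with exclusive box ends.

```python
import numpy as np


def select_by_points(masks: np.ndarray, points, labels) -> np.ndarray:
    """Indices of masks that contain every foreground point and no background point.

    masks:  (n, H, W) boolean candidate masks
    points: iterable of (x, y) pixel coordinates
    labels: 1 for foreground, 0 for background
    """
    keep = np.ones(len(masks), dtype=bool)
    for (x, y), label in zip(points, labels):
        hit = masks[:, y, x]                 # which masks cover this pixel
        keep &= hit if label == 1 else ~hit
    return np.flatnonzero(keep)


def select_by_box(masks: np.ndarray, box) -> int:
    """Index of the mask with the highest IoU against box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    box_mask = np.zeros(masks.shape[1:], dtype=bool)
    box_mask[y0:y1, x0:x1] = True
    inter = (masks & box_mask).sum(axis=(1, 2))
    union = (masks | box_mask).sum(axis=(1, 2))
    iou = inter / np.maximum(union, 1)       # guard against an empty union
    return int(np.argmax(iou))


# Two toy candidates on a 4x4 image.
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, :2, :2] = True    # top-left square
masks[1, :, 2:] = True     # right half

fg_only = select_by_points(masks, [(0, 0)], [1])
fg_and_bg = select_by_points(masks, [(3, 3), (3, 0)], [1, 0])
best_for_box = select_by_box(masks, (2, 0, 4, 4))
```

Both functions only index into masks that already exist, so their cost is small compared with the forward pass that produced the candidates.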

2.3. Volumetric FastSAM-3D

For medical imaging, FastSAM-3D adapts the SAM paradigm for volumetric images:

  • Architecture: Distilled from SAM-Med3D and built around a compact ViT encoder with 3D Sparse Flash Attention for efficient volume-level processing.
  • Workflow: Fully 3D operation with promptable segmentation at the volume level (not slice-by-slice), supporting rapid, interactive segmentation in practical clinical workflows (Shen et al., 2024).
  • Integration: FastSAM-3D is embedded in the FastSAM-3DSlicer extension, providing seamless 3D interaction, real-time feedback, and automated workflow steps on platforms such as 3D Slicer.

3. Training Strategies and Data Efficiency

FastSAM is trained as an instance segmentation detector on the SA-1B dataset, using only a 2% subsample ($1/50$ of the data):

$$|\mathcal{D}_{\text{FastSAM}}| = \frac{1}{50} |\mathcal{D}_{\text{SA-1B}}|$$

The network minimizes a combined objective of the form

$$\mathcal{L} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{mask}}$$

where $\mathcal{L}_{\text{det}}$ includes classification, box, and confidence losses, and $\mathcal{L}_{\text{mask}}$ supervises mask prediction. Notably, promptable segmentation is recovered from class-agnostic training by enabling selection logic at inference, thus streamlining both annotation requirements and network capacity (Zhao et al., 2023).

4. Efficiency Benchmarks and Performance Trade-offs

FastSAM achieves substantial gains in runtime, parameter count, and resource utilization compared to vanilla SAM. Major results include:

| Model | Params | FLOPs | SegAny mIoU (COCO) | SegAny latency (RTX 3090) | Memory |
|---|---|---|---|---|---|
| SAM-H | 641M | 5490G | 77.4 (box1 mIoU) | 461 ms | 7.46 GB |
| SAM-B | 94M | 746G | 75.1 | 1383 ms | 4.39 GB |
| FastSAM | 68M | 888G | 60.5 | 103 ms | 4.96 GB |

FastSAM is about $4.5\times$ faster than SAM-H for SegAny on an RTX 3090 (103 ms vs 461 ms) and […] faster in SegEvery-equivalent settings. Its parameter count is roughly 89% lower than that of SAM-H (68M vs 641M) (Sun et al., 2024).
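The SAM-H comparison can be re-derived from the table values. The short check below uses only the latency and parameter figures listed above; the dictionary layout and variable names are illustrative.

```python
# Values copied from the benchmark table above.
sam_h = {"params_m": 641, "latency_ms": 461}
fastsam = {"params_m": 68, "latency_ms": 103}

# Speed-up is the ratio of latencies; reduction is the relative drop in parameters.
speedup = sam_h["latency_ms"] / fastsam["latency_ms"]
param_reduction = 1 - fastsam["params_m"] / sam_h["params_m"]

print(f"{speedup:.2f}x faster, {param_reduction:.1%} fewer parameters")
# prints "4.48x faster, 89.4% fewer parameters"
```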

However, the accuracy trade-offs are marked: on COCO, FastSAM’s SegAny box mIoU is 60.5 (vs 77.4 for SAM-H), and its instance segmentation AP is consistently lower, especially for small objects and fine boundary details. The performance gap is most pronounced for promptable segmentation tasks; FastSAM’s most competitive results are in bounding box proposal recall, where its box-based detection head yields direct performance benefits (Sun et al., 2024, Zhao et al., 2023).

For volumetric segmentation, FastSAM-3D achieves:

  • […] seconds per volume on CPU
  • […] seconds per volume on GPU

on midrange hardware (AMD Ryzen 5 5500U, RTX 2060), outperforming other 3D models for full-volume medical image segmentation tasks (Shen et al., 2024).

5. Uncertainty Quantification in FastSAM-3D

A key innovation in FastSAM-3D is a practical uncertainty quantification method:

  1. The encoder is run once on the input volume to yield an embedding.
  2. The decoder is run $N$ times with sampled pseudo-prompts derived from the current segmentation mask.
  3. The ensemble of predicted logits $\{z_i\}_{i=1}^{N}$ is averaged to yield a consensus mask:

$$\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$$

  4. Voxel-wise uncertainty is estimated by the standard deviation across the ensemble:

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2}$$

This quantification enables users to target high-uncertainty regions for additional prompts, improving accuracy with minimal redundant interaction. Notably, this approach is operationally practical due to FastSAM-3D’s rapid decoder speed and precomputed embeddings (Shen et al., 2024).
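The aggregation step of this procedure can be sketched with NumPy, assuming the logits from the repeated decoder passes are already stacked into one array. The function name `ensemble_uncertainty`, the zero-logit threshold for the consensus mask, and the use of the population standard deviation are assumptions for illustration; the encoder, decoder, and pseudo-prompt sampling are not modeled here.

```python
import numpy as np


def ensemble_uncertainty(logits: np.ndarray):
    """Aggregate an ensemble of decoder outputs.

    logits: (N, D, H, W) array, one logit volume per decoder pass.
    Returns (consensus_mask, uncertainty), each of shape (D, H, W).
    """
    mean_logits = logits.mean(axis=0)
    # Assumption: a mean logit above 0 (probability above 0.5) counts as foreground.
    consensus_mask = mean_logits > 0.0
    # Voxel-wise spread across the ensemble (population standard deviation).
    uncertainty = logits.std(axis=0)
    return consensus_mask, uncertainty


# Toy ensemble: 4 passes over a 1x1x2 volume.
# Voxel 0 is predicted consistently; voxel 1 flips sign between passes.
logits = np.array([
    [[[2.0, 3.0]]],
    [[[2.0, -3.0]]],
    [[[2.0, 3.0]]],
    [[[2.0, -1.0]]],
])
consensus_mask, uncertainty = ensemble_uncertainty(logits)
most_uncertain = np.unravel_index(np.argmax(uncertainty), uncertainty.shape)
```

Here `most_uncertain` points at the voxel where the passes disagree most, which is the kind of location a user would be steered toward for the next prompt.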

6. Clinical and Research Workflow Integration

FastSAM-3D is distributed as part of FastSAM-3DSlicer, offering the following workflow automation:

  • Data Import: Preparation of DICOM/NIfTI data, automatic volume and segmentation node creation.
  • Prompt Management: Interactive point placement, coordinate conversion, and prompt accumulation.
  • Model Selection: Seamless switching among 2D/3D SAM variants.
  • Inference and Visualization: Automatic resizing, real-time masking, uncertainty map display, and mask export.
  • Automation: Dependency installation, user-oriented interface, and latent state handling.

This platform-level integration streamlines medical image segmentation, eliminates manual pre-processing, and supports in situ human-in-the-loop refinement (Shen et al., 2024).

7. Limitations, Design Trade-offs, and Open Directions

Key limitations of FastSAM and its derivatives include:

  • Quality vs Speed: Marked drop in promptable mask accuracy relative to SAM, especially for small/fine-grained objects.
  • Heuristic Prompt Handling: Selection logic may not fully exploit spatial or semantic context compared to learned prompt-conditioned decoding.
  • Boundary Coarseness: Prototype mask mechanisms yield less smooth boundaries than transformer-based decoders.
  • Specialization vs Generality: The instance segmentation paradigm is well-matched to “segment everything” and fast industrial tasks, but less flexible for open-vocabulary, language-driven, or highly prompt-adaptive workflows.
  • Medical Segmentation Metrics: The FastSAM-3DSlicer integration reports mainly on speed and qualitative performance; comprehensive metrics (Dice, IoU, HD95) are not detailed in the design papers (Shen et al., 2024).

Enhancements in calibration, prompt recommendation, backbone refinement, and multi-modal prompt support represent active research directions. Subsequent efficient SAM variants have further optimized the speed–accuracy trade-off through hybrid architectures, attention reparameterization, and model compression (Sun et al., 2024).


FastSAM stands as a speed-centric alternative to transformer-based segmentation foundation models, demonstrating that the “segment anything” capability can be captured through fast, detector-driven instance mask pipelines coupled with prompt-guided mask selection. In medical imaging, FastSAM-3D extends these principles to volumetric data with interactive-rate feedback and efficient uncertainty mechanisms, advancing practical segmentation deployment in both research and clinical environments.
