FastSAM: Fast Segment Anything
- The paper introduces FastSAM, which achieves real-time segmentation by replacing SAM’s prompt-conditioned transformer with a two-stage instance segmentation pipeline.
- It leverages a YOLOv8-seg backbone and prompt-guided mask selection for 2D images, while a separate volumetric variant, FastSAM-3D, handles 3D medical volumes.
- Benchmark results show significant speed-up and a reduction in parameters, though fine-grained mask accuracy is traded off compared to the original SAM design.
Fast Segment Anything (FastSAM) denotes a family of architectures and workflows focused on accelerating the “segment anything” paradigm through radical simplification and architectural replacement of the original SAM design. FastSAM achieves real-time or near-real-time instance and promptable segmentation via two core instantiations: (1) an image segmentation model that eschews prompt-conditioned transformers in favor of a two-stage instance segmentation and mask selection pipeline, and (2) a volumetric variant, FastSAM-3D, for efficient, promptable 3D medical image segmentation deployed in FastSAM-3DSlicer. The following sections survey the full technical landscape, positioning FastSAM and FastSAM-3D in the broader context of foundation segmenters, highlighting architectural trade-offs, efficiency, accuracy, and workflow integration (Zhao et al., 2023, Sun et al., 2024, Shen et al., 2024).
1. Architectural Rationale and Core Problem Reformulation
FastSAM abandons the original SAM’s prompt-conditioned transformer design in favor of a two-stage instance segmentation and prompt-guided mask selection formulation. Instead of using a ViT-based encoder and promptable transformer decoder, FastSAM performs:
- All-Instance Segmentation: The image is passed through a CNN-based instance segmentation network (e.g., YOLOv8-seg), generating candidate masks, bounding boxes, and confidence scores for all visible objects or regions. This stage is class-agnostic and produces a dense set of candidate masks in a single forward pass.
- Prompt-Guided Selection: Given a user prompt (point, box, or text), relevant instance mask(s) are selected post hoc, typically by geometric or embedding-based matching (point-in-mask logic, maximized box IoU, or CLIP similarity for text prompts). The prompt handling thus becomes an efficient selection/filtering operation, decoupled from heavy recomputation (Zhao et al., 2023, Sun et al., 2024).
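This decomposition can be summarized as
$$\{m_i\}_{i=1}^{N} = f_{\text{seg}}(I), \qquad M^{*} = \mathrm{Select}\big(\{m_i\}_{i=1}^{N},\, p\big),$$
where $I$ is the input image, $m_i$ are the class-agnostic candidate masks, and $p$ is the user prompt, yielding both the “SegAny” (promptable segmentation) and “SegEvery” (segment everything) workflows.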
2. Pipeline and Network Implementation
2.1. Image Segmentation Backbone
The primary FastSAM variant is realized via a YOLOv8-seg architecture, an anchor-free instance segmentation detector built to output:
- Detection Branch: bounding boxes, class labels, and detection confidence
- Segmentation Branch: prototype masks (default $k = 32$) and corresponding per-instance coefficients
Each instance mask is constructed via
$$M_i = \sum_{j=1}^{k} c_{ij}\, P_j,$$
where $P_j$ ($j = 1, \dots, k$) are prototype masks and $c_{ij}$ are predicted coefficients for instance $i$. This design, inspired by YOLACT, allows the entire set of candidate masks to be computed in a single inference step, with subsequent mask selection requiring negligible additional computation (Zhao et al., 2023, Sun et al., 2024).
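The prototype-and-coefficient assembly described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the FastSAM implementation: the function name `assemble_masks`, the sigmoid, and the 0.5 threshold are assumptions borrowed from common YOLACT-style practice, and the random inputs stand in for real network outputs.

```python
import numpy as np

def assemble_masks(prototypes, coeffs, threshold=0.5):
    """Combine k prototype masks into one binary mask per instance.

    prototypes: (k, H, W) array of prototype masks.
    coeffs:     (n, k) array of per-instance coefficients.
    """
    k, h, w = prototypes.shape
    # One matrix product gives the linear combination for every instance.
    logits = coeffs @ prototypes.reshape(k, h * w)   # shape (n, H*W)
    probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid (assumed)
    return (probs > threshold).reshape(-1, h, w)     # threshold (assumed)

# Random stand-ins for network outputs: 32 prototypes, 5 instances.
rng = np.random.default_rng(0)
protos = rng.normal(size=(32, 8, 8))
coeffs = rng.normal(size=(5, 32))
masks = assemble_masks(protos, coeffs)
print(masks.shape)  # (5, 8, 8)
```

Because all instances share the same prototypes, the cost of producing an extra candidate mask is one additional row in the matrix product, which is why the full candidate set is cheap to compute.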
2.2. Prompt Types and Selection Logic
Prompt inputs supported natively include:
- Point Prompts: Foreground/background points guide selection via inclusion/exclusion in predicted masks.
- Box Prompts: Mask selection by maximizing box-mask IoU with the prompt bounding box.
- Text Prompts: Masks are matched to CLIP text embeddings for semantic promptability (at additional computational cost).
Prompt handling operates as post-processing, not end-to-end joint learning as in SAM. This approach yields a runtime that is effectively independent of prompt count, in contrast to SAM’s prompt-conditioned transformer decoding (Zhao et al., 2023).
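The point and box selection rules can be illustrated with a small NumPy sketch operating on precomputed candidate masks. The function names `select_by_points` and `select_by_box` are hypothetical, and the scoring rules (foreground hits minus background hits; IoU between each mask and the filled prompt box) are one plausible reading of the logic described above, not a definitive reproduction of the authors' code.

```python
import numpy as np

def select_by_points(masks, points, labels):
    """Pick the mask that best agrees with the point prompts.

    masks:  (n, H, W) boolean candidate masks.
    points: list of (x, y) pixel coordinates.
    labels: 1 for foreground, 0 for background.
    """
    scores = np.zeros(len(masks), dtype=int)
    for (x, y), label in zip(points, labels):
        inside = masks[:, y, x].astype(int)
        # Foreground points reward containing masks, background points penalize them.
        scores += inside if label == 1 else -inside
    return int(np.argmax(scores))

def select_by_box(masks, box):
    """Pick the mask with the highest IoU against the filled prompt box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    box_area = (x2 - x1) * (y2 - y1)
    inter = masks[:, y1:y2, x1:x2].sum(axis=(1, 2))
    union = masks.sum(axis=(1, 2)) + box_area - inter
    return int(np.argmax(inter / union))

# Three toy candidate masks on a 10x10 image.
masks = np.zeros((3, 10, 10), dtype=bool)
masks[0, 1:4, 1:4] = True
masks[1, 5:9, 5:9] = True
masks[2, 0:5, 5:10] = True

print(select_by_points(masks, [(6, 6)], [1]))             # 1
print(select_by_points(masks, [(2, 2), (6, 6)], [1, 0]))  # 0
print(select_by_box(masks, (5, 0, 10, 5)))                # 2
```

Both functions only index into masks that already exist, so adding more prompts costs a few array lookups rather than another pass through the network.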
2.3. Volumetric FastSAM-3D
For medical imaging, FastSAM-3D adapts the SAM paradigm for volumetric images:
- Architecture: Distilled from SAM-Med3D and built around a compact ViT encoder with 3D Sparse Flash Attention for efficient volume-level processing.
- Workflow: Fully 3D operation with promptable segmentation at the volume level (not slice-by-slice), supporting rapid, interactive segmentation in practical clinical workflows (Shen et al., 2024).
- Integration: FastSAM-3D is embedded in the FastSAM-3DSlicer extension, providing seamless 3D interaction, real-time feedback, and automated workflow steps on platforms such as 3D Slicer.
3. Training Strategies and Data Efficiency
FastSAM is trained as an instance segmentation detector on the SA-1B dataset, using only a 2% subsample ($1/50$ of the data).
The network minimizes
$$\mathcal{L} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{mask}},$$
where $\mathcal{L}_{\text{det}}$ includes classification, box, and confidence losses, and $\mathcal{L}_{\text{mask}}$ supervises mask prediction. Notably, promptable segmentation is recovered from class-agnostic training by enabling selection logic at inference, thus streamlining both annotation requirements and network capacity (Zhao et al., 2023).
4. Efficiency Benchmarks and Performance Trade-offs
FastSAM achieves substantial gains in runtime, parameter count, and resource utilization compared to vanilla SAM. Major results include:
| Model | Params | FLOPs | SegAny box mIoU (COCO) | SegAny Latency (RTX 3090) | Memory |
|---|---|---|---|---|---|
| SAM-H | 641M | 5490G | 77.4 | 461 ms | 7.46 GB |
| SAM-B | 94M | 746G | 75.1 | 1383 ms | 4.39 GB |
| FastSAM | 68M | 888G | 60.5 | 103 ms | 4.96 GB |
FastSAM is roughly $4.5\times$ faster than SAM-H (SegAny segmentation on an RTX 3090: 103 ms vs 461 ms) and is also faster in equivalent SegEvery settings. Its parameter count is about 89% lower than SAM-H's (68M vs 641M) (Sun et al., 2024).
However, accuracy trade-offs are marked: on COCO, FastSAM’s box mIoU for SegAny is 60.5 (vs 77.4 for SAM-H), and AP for instance segmentation is consistently lower, especially for small objects and fine boundary details. The performance gap is most pronounced for promptable segmentation tasks; FastSAM’s most competitive results are in bounding box proposal recall, where its box-based detection head yields direct performance benefits (Sun et al., 2024, Zhao et al., 2023).
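The headline ratios can be checked directly from the table entries. The short calculation below uses only the SAM-H and FastSAM rows as listed; the variable names are illustrative.

```python
# Values copied from the SAM-H and FastSAM rows of the table above.
sam_h = {"params_m": 641, "latency_ms": 461, "box_miou": 77.4}
fastsam = {"params_m": 68, "latency_ms": 103, "box_miou": 60.5}

speedup = sam_h["latency_ms"] / fastsam["latency_ms"]
param_reduction = 1 - fastsam["params_m"] / sam_h["params_m"]
miou_gap = sam_h["box_miou"] - fastsam["box_miou"]

print(f"SegAny speed-up:     {speedup:.1f}x")          # 4.5x
print(f"Parameter reduction: {param_reduction:.1%}")   # 89.4%
print(f"Box mIoU gap:        {miou_gap:.1f} points")   # 16.9 points
```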
For volumetric segmentation, FastSAM-3D reports per-volume inference times, measured in seconds, on both CPU and GPU on midrange hardware (AMD Ryzen 5 5500U, RTX 2060), outperforming other 3D models for full-volume medical image segmentation tasks (Shen et al., 2024).
5. Uncertainty Quantification in FastSAM-3D
A key innovation in FastSAM-3D is a practical uncertainty quantification method:
- The encoder is run once on the input volume to yield an embedding.
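- The decoder is run $N$ times with sampled pseudo-prompts derived from the current segmentation mask.
- The ensemble of predicted logits $\{z_n\}_{n=1}^{N}$ is averaged to yield a consensus mask:
$$\bar{z}(v) = \frac{1}{N} \sum_{n=1}^{N} z_n(v)$$
- Voxel-wise uncertainty is estimated by the standard deviation across the ensemble:
$$u(v) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \big(z_n(v) - \bar{z}(v)\big)^2}$$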
This quantification enables users to target high-uncertainty regions for additional prompts, improving accuracy with minimal redundant interaction. Notably, this approach is operationally practical due to FastSAM-3D’s rapid decoder speed and precomputed embeddings (Shen et al., 2024).
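The ensemble procedure above can be sketched as follows. Everything model-specific here is a stand-in: `toy_decoder` is a hypothetical placeholder for a promptable 3D decoder, `sample_pseudo_prompts` uses uniform sampling inside the current mask as an assumed sampling scheme, and thresholding the mean logits at zero is an assumed way to binarize the consensus.

```python
import numpy as np

def sample_pseudo_prompts(mask, n_points, rng):
    """Draw voxel coordinates from inside the current mask (assumed sampling scheme)."""
    coords = np.argwhere(mask)
    idx = rng.choice(len(coords), size=n_points, replace=False)
    return coords[idx]

def toy_decoder(embedding, prompts):
    """Hypothetical decoder stand-in: logits fall off with distance to the nearest prompt."""
    axes = [np.arange(s) for s in embedding.shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)   # (D, H, W, 3)
    dists = np.linalg.norm(grid[..., None, :] - prompts, axis=-1).min(axis=-1)
    return embedding - dists

def ensemble_uncertainty(embedding, mask, decoder, n_runs=8, n_points=3, seed=0):
    """Reuse one embedding, run the decoder n_runs times, return consensus and std."""
    rng = np.random.default_rng(seed)
    logits = np.stack([
        decoder(embedding, sample_pseudo_prompts(mask, n_points, rng))
        for _ in range(n_runs)
    ])
    consensus = logits.mean(axis=0) > 0      # threshold at zero (assumed)
    uncertainty = logits.std(axis=0)         # voxel-wise standard deviation
    return consensus, uncertainty

# The "embedding" is computed once; here it is a constant toy volume.
embedding = np.full((6, 6, 6), 2.0)
mask = np.zeros((6, 6, 6), dtype=bool)
mask[2:4, 2:4, 2:4] = True
consensus, uncertainty = ensemble_uncertainty(embedding, mask, toy_decoder)
print(consensus.shape, uncertainty.shape)  # (6, 6, 6) (6, 6, 6)
```

Only the decoder sits inside the loop, which mirrors why the approach is practical: the expensive encoder pass is amortized over all ensemble members.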
6. Clinical and Research Workflow Integration
FastSAM-3D is distributed as part of FastSAM-3DSlicer, offering the following workflow automation:
- Data Import: Preparation of DICOM/NIfTI data, automatic volume and segmentation node creation.
- Prompt Management: Interactive point placement, coordinate conversion, and prompt accumulation.
- Model Selection: Seamless switching among 2D/3D SAM variants.
- Inference and Visualization: Automatic resizing, real-time masking, uncertainty map display, and mask export.
- Automation: Dependency installation, user-oriented interface, and latent state handling.
This platform-level integration streamlines medical image segmentation, eliminates manual pre-processing, and supports in situ human-in-the-loop refinement (Shen et al., 2024).
7. Limitations, Design Trade-offs, and Open Directions
Key limitations of FastSAM and its derivatives include:
- Quality vs Speed: Marked drop in promptable mask accuracy relative to SAM, especially for small/fine-grained objects.
- Heuristic Prompt Handling: Selection logic may not fully exploit spatial or semantic context compared to learned prompt-conditioned decoding.
- Boundary Coarseness: Prototype mask mechanisms yield less smooth boundaries than transformer-based decoders.
- Specialization vs Generality: The instance segmentation paradigm is well-matched to “segment everything” and fast industrial tasks, but less flexible for open-vocabulary, language-driven, or highly prompt-adaptive workflows.
- Medical Segmentation Metrics: The FastSAM-3DSlicer integration reports mainly on speed and qualitative performance; comprehensive metrics (Dice, IoU, HD95) are not detailed in the design papers (Shen et al., 2024).
Enhancements in calibration, prompt recommendation, backbone refinement, and multi-modal prompt support represent active research directions. Subsequent efficient SAM variants have further optimized the speed–accuracy trade-off through hybrid architectures, attention reparameterization, and model compression (Sun et al., 2024).
FastSAM stands as a speed-centric alternative to transformer-based segmentation foundation models, demonstrating that the “segment anything” capability can be captured through fast, detector-driven instance mask pipelines coupled with flexible prompt-guided mask selection. In medical imaging, FastSAM-3D extends these principles to volumetric data with real-time interaction and efficient uncertainty mechanisms, advancing practical segmentation deployment in both research and clinical environments.