EfficientViT-SAM: Accelerated SAM via EfficientViT

Updated 17 April 2026

EfficientViT-SAM is a segmentation model that replaces SAM's heavy ViT image encoder with an EfficientViT backbone, drastically reducing compute and memory while maintaining high accuracy.
It leverages innovations like ReLU-linear attention and multi-scale token aggregation to achieve up to 48.9× speedup and competitive metrics on datasets such as COCO and LVIS.
The plug-and-play integration with SAM’s prompt and mask decoders, along with a two-phase training regime using knowledge distillation, ensures efficient yet robust performance.

EfficientViT-SAM refers to a class of segment anything models (SAMs) that accelerate and compress the original SAM architecture by replacing the heavy Vision Transformer (ViT) image encoder with an EfficientViT backbone, while preserving prompt encoding and mask decoding pipelines. Through architectural innovations and knowledge distillation, EfficientViT-SAM models achieve substantial inference speedups, dramatic reductions in memory and compute requirements, and in many scenarios, equivalent or superior segmentation accuracy compared to traditional SAMs.

1. Background: Segment Anything Models and Motivation for Efficiency

The Segment Anything Model (SAM) combines a powerful ViT-based image encoder, prompt encoder, and lightweight mask decoder to enable zero-shot and interactive segmentation of arbitrary objects and classes. The default image encoder, a ViT-H backbone, is expensive—requiring 2973 GMACs per 1024×1024 image and hundreds of millions of parameters. This computational burden hinders deployment in real-time, mobile, or resource-constrained scenarios (Zhang et al., 2024).

EfficientViT-SAM addresses this limitation by leveraging the EfficientViT family of backbone architectures—transformers specifically engineered for dense prediction with strict compute and latency requirements—enabling up to 48.9× measured speedup on an A100 GPU over baseline SAM-ViT-H with no drop in segmentation accuracy (Zhang et al., 2024, Cai et al., 2022).

2. EfficientViT Backbone Architecture

EfficientViT is a transformer model designed for high efficiency on high-resolution dense prediction tasks (Cai et al., 2022). The key architectural features are:

ReLU-Linear Attention (LRA):

Classical attention involves computing $A(Q,K,V) = \mathrm{softmax}(QK^\top/\sqrt d)V$, which scales quadratically with token count. EfficientViT replaces softmax attention with ReLU-linear attention:

$A_{\mathrm{LRA}}(Q,K,V) = (\mathrm{ReLU}(Q))\big(\mathrm{ReLU}(K)^\top V\big)$

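allowing associative reordering (the $d \times d$ product $\mathrm{ReLU}(K)^\top V$ is computed first) and reducing the cost from $O(N^2)$ to $O(N)$ in the number of tokens $N$. The full formulation in Cai et al. (2022) additionally divides each output row by the normalizer $\mathrm{ReLU}(Q_i)\sum_j \mathrm{ReLU}(K_j)^\top$, which can be reordered in the same way.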
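The reordering can be checked numerically. The following is a minimal NumPy sketch of the unnormalized formula as written above, not the authors' implementation; the function names, the sizes `N` and `d`, and the random inputs are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_attention_quadratic(q, k, v):
    """Reference form: builds the full N x N similarity matrix first."""
    scores = relu(q) @ relu(k).T      # shape (N, N), cost grows as N^2
    return scores @ v                 # shape (N, d)

def relu_attention_linear(q, k, v):
    """Reordered form: computes the d x d summary ReLU(K)^T V first."""
    kv = relu(k).T @ v                # shape (d, d), independent of N
    return relu(q) @ kv               # shape (N, d), cost grows as N

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
N, d = 256, 16
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))

out_quadratic = relu_attention_quadratic(q, k, v)
out_linear = relu_attention_linear(q, k, v)
max_abs_diff = float(np.max(np.abs(out_quadratic - out_linear)))
```

Both functions return the same values up to floating-point rounding; only the order of the matrix products, and therefore the scaling with `N`, differs.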

Multi-Scale Token Aggregation:

Each attention head aggregates features both locally and globally via depthwise-separable convolutions at multiple scales (e.g., $1\times1$ and $5\times5$ kernels), followed by concatenation and fusion.
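As a rough illustration of the aggregate, concatenate, and fuse pattern described above, the sketch below applies depthwise filters of two kernel sizes to a feature map, concatenates the branches along the channel axis, and mixes them with a 1×1 projection. All weights are random placeholders, and the function names and tensor layout are assumptions made for this example rather than the EfficientViT code.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Per-channel 2D filtering with zero 'same' padding.

    x: (C, H, W) feature map; kernels: (C, s, s) with odd s.
    """
    c, h, w = x.shape
    s = kernels.shape[-1]
    pad = s // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w), dtype=float)
    for i in range(s):
        for j in range(s):
            # Each channel is filtered only with its own kernel (depthwise).
            out += kernels[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out

def multi_scale_aggregate(x, kernel_sizes=(1, 5), seed=0):
    """Filter at several scales, concatenate, then fuse with a 1x1 projection."""
    rng = np.random.default_rng(seed)
    c = x.shape[0]
    branches = []
    for s in kernel_sizes:
        kernels = rng.standard_normal((c, s, s)) / (s * s)  # placeholder weights
        branches.append(depthwise_conv2d(x, kernels))
    stacked = np.concatenate(branches, axis=0)               # (C * scales, H, W)
    fuse = rng.standard_normal((c, stacked.shape[0])) / stacked.shape[0]
    return np.einsum("oc,chw->ohw", fuse, stacked)           # back to (C, H, W)

x = np.random.default_rng(1).standard_normal((8, 16, 16))
fused = multi_scale_aggregate(x)
```

With a 1×1 kernel of ones the depthwise branch reduces to the identity, so that branch carries unmixed per-token features while the 5×5 branch adds local context.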

FFN with Local Token Mixing:

The MLP of each transformer block interleaves ReLU activation and a small depthwise convolution for spatial context.

Stage-wise Macro-architecture:

Early stages utilize convolutional ResBlocks for rapid spatial reduction, while later stages use EfficientViT modules at lower resolution but higher channel width.

The computational savings are reflected in operation counts. With $N$ tokens, channel width $d$, and depthwise kernel size $s$, an EfficientViT block costs roughly $O(Nd^2 + Nds^2)$, compared with $O(N^2 d)$ for softmax attention in a standard transformer, enabling speedups of up to $10$ to $30\times$ for large $N$ (Cai et al., 2022).
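To see how the two leading-order terms compare as resolution grows, the short sketch below evaluates their ratio, which simplifies to $N/(d + s^2)$. The values of `d` and `s` and the token-grid sizes are hypothetical, and constant factors are ignored, so the numbers indicate a trend rather than a measured speedup.

```python
def softmax_attention_cost(n, d):
    """Leading-order cost of softmax attention: N^2 * d."""
    return n * n * d

def linear_block_cost(n, d, s):
    """Leading-order cost of the linear block: N*d^2 + N*d*s^2."""
    return n * d * d + n * d * s * s

def cost_ratio(n, d, s):
    """Ratio of the two terms; algebraically equal to n / (d + s^2)."""
    return softmax_attention_cost(n, d) / linear_block_cost(n, d, s)

# Hypothetical settings: 64 channels, 5x5 depthwise kernel.
d, s = 64, 5
# Token grids of 16x16, 32x32, and 64x64 (N = side * side).
ratios = {side: cost_ratio(side * side, d, s) for side in (16, 32, 64)}
```

The ratio grows linearly with the token count, which is why the benefit is most visible at high resolution.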

3. Integration with SAM: EfficientViT-SAM Pipeline

EfficientViT-SAM retains the original SAM prompt encoder and mask decoder. The image encoder is replaced with an EfficientViT model (e.g., the L0, L1, L2, XL0, or XL1 variant), whose output is dimensionally matched to the mask decoder's expected input (in SAM, a $64\times64$ grid of 256-channel embeddings for a $1024\times1024$ image). Integration is plug-and-play: only the backbone swap is necessary, leaving SAM-specific modules unchanged (Zhang et al., 2024, Cai et al., 2022). The forward pass remains:

Image $\rightarrow$ EfficientViT $\rightarrow$ feature map
Prompts $\rightarrow$ prompt encoder $\rightarrow$ prompt tokens
Features + prompts $\rightarrow$ mask decoder $\rightarrow$ predicted masks

The design allows full compatibility with existing SAM inference and training code.
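The backbone swap can be pictured as replacing one callable in a three-part pipeline. The sketch below uses stub functions in place of real networks; the class `SamLikePipeline` and every function in it are hypothetical names for illustration and do not correspond to the SAM or EfficientViT code bases.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable

@dataclass
class SamLikePipeline:
    """Three-stage pipeline: image encoder, prompt encoder, mask decoder."""
    image_encoder: Callable[[Any], Any]
    prompt_encoder: Callable[[Any], Any]
    mask_decoder: Callable[[Any, Any], Any]

    def __call__(self, image, prompts):
        features = self.image_encoder(image)          # image -> feature map
        prompt_tokens = self.prompt_encoder(prompts)  # prompts -> prompt tokens
        return self.mask_decoder(features, prompt_tokens)

# Stub components standing in for real networks (assumptions for the demo).
def vit_h_encoder(image):
    return ("features", "vit_h", image)

def efficientvit_encoder(image):
    return ("features", "efficientvit", image)

def prompt_encoder(prompts):
    return ("tokens", tuple(prompts))

def mask_decoder(features, tokens):
    return {"backbone": features[1], "num_prompts": len(tokens[1])}

baseline = SamLikePipeline(vit_h_encoder, prompt_encoder, mask_decoder)
# Only the image encoder changes; the other two components are reused as-is.
fast = replace(baseline, image_encoder=efficientvit_encoder)
result = fast("image.png", [(10, 20)])
```

Because the prompt encoder and mask decoder objects are shared between `baseline` and `fast`, any code that drives the pipeline through `__call__` works unchanged after the swap.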

4. Training Protocol and Knowledge Distillation

EfficientViT-SAM is trained in a two-phase fashion:

Phase 1: Knowledge Distillation:

The EfficientViT backbone (student) is initialized by minimizing L2 distance to the output features of a pretrained SAM-ViT-H encoder (teacher) on the input images:

$\mathcal{L}_{\mathrm{distill}} = \big\| f_{\mathrm{student}}(x) - f_{\mathrm{teacher}}(x) \big\|_2^2$

Optionally, auxiliary losses on intermediate features or softened logits may be included.
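A minimal sketch of the L2 feature-matching objective is shown below, assuming the student and teacher feature maps already have the same shape. It averages the squared difference over all elements, which differs from the squared norm only by a constant factor; the function name and the toy feature shapes are assumptions.

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher feature maps."""
    if student_feats.shape != teacher_feats.shape:
        raise ValueError("student and teacher features must have the same shape")
    diff = student_feats - teacher_feats
    return float(np.mean(diff ** 2))

# Toy feature maps (channels, height, width); shapes are illustrative only.
teacher = np.zeros((4, 8, 8))
student_perfect = teacher.copy()
student_offset = teacher + 2.0

loss_perfect = feature_distillation_loss(student_perfect, teacher)
loss_offset = feature_distillation_loss(student_offset, teacher)
```

In practice the teacher features would be computed without gradient tracking, so that only the student encoder is updated.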

Phase 2: End-to-End Fine-Tuning:

The entire SAM pipeline (EfficientViT image encoder, frozen prompt encoder, mask decoder) is then fine-tuned on the full SA-1B segmentation dataset with the standard mask prediction loss (20:1 blend of focal and dice losses):

$\mathcal{L}_{\mathrm{mask}} = 20\,\mathcal{L}_{\mathrm{focal}} + \mathcal{L}_{\mathrm{dice}}$
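The 20:1 blend of focal and dice losses can be sketched for a single binary mask as follows. The focal parameters `gamma` and `alpha` and the dice smoothing term `eps` are common defaults assumed here, not values stated in the source, and the function names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, target, gamma=2.0, alpha=0.25):
    """Binary focal loss averaged over pixels (gamma, alpha are assumed defaults)."""
    p = sigmoid(logits)
    p_t = np.where(target == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    log_p_t = np.log(np.clip(p_t, 1e-12, 1.0))
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * log_p_t))

def dice_loss(logits, target, eps=1.0):
    """Soft dice loss over the whole mask (eps is an assumed smoothing term)."""
    p = sigmoid(logits)
    intersection = np.sum(p * target)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(target) + eps))

def mask_loss(logits, target, focal_weight=20.0, dice_weight=1.0):
    """Weighted sum of focal and dice terms in a 20:1 ratio."""
    return focal_weight * focal_loss(logits, target) + dice_weight * dice_loss(logits, target)

target = np.array([[1.0, 0.0], [0.0, 1.0]])
confident_correct = (2.0 * target - 1.0) * 10.0   # large logits with the right sign
confident_wrong = -confident_correct

loss_good = mask_loss(confident_correct, target)
loss_bad = mask_loss(confident_wrong, target)
```

The focal term emphasizes hard, misclassified pixels, while the dice term measures overlap of the mask as a whole, which is why the two are usually combined.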

Freeze/unfreeze schedules are used to stabilize training. Hyperparameters include AdamW optimizer, large batch sizes (up to 256), cosine learning rate decay, and simple augmentations (Zhang et al., 2024).
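For reference, cosine learning rate decay follows half a cosine period from the base rate down to a floor. The sketch below shows the standard formula; the base learning rate, step count, and absence of warmup are placeholder assumptions, since the source does not state them.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder values for illustration only.
base_lr, total_steps = 1e-3, 100
schedule = [cosine_lr(step, total_steps, base_lr) for step in range(total_steps + 1)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end of training.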

5. Empirical Results and Efficiency Metrics

EfficientViT-SAM substantially increases throughput and reduces parameter count and MACs relative to SAM-ViT-Huge:

Model	Params	MACs	Throughput (A100)	COCO mAP (box prompt)	LVIS mAP
SAM-ViT-Huge	641 M	2973 G	11 img/s	46.5	44.2
EfficientViT-SAM-L0	35 M	35 G	762 img/s	45.7	41.8
EfficientViT-SAM-L2	61 M	69 G	538 img/s	46.6	42.7
EfficientViT-SAM-XL1	203 M	322 G	182 img/s	47.8	44.4

Zero-shot accuracy is measured by COCO mAP and LVIS mAP. The largest variant in the table, EfficientViT-SAM-XL1, exceeds SAM-ViT-Huge on both (47.8 vs. 46.5 on COCO, 44.4 vs. 44.2 on LVIS), while the smaller L0 and L2 variants trade up to 2.4 LVIS mAP points for much higher throughput. Reported speedups reach up to $48.9\times$ (Zhang et al., 2024, Cai et al., 2022).

Point- and box-prompted mIoU are also improved or preserved, e.g., 1-click COCO mIoU 59.8 (EfficientViT-SAM-XL1) vs. 58.4 (SAM-ViT-H) (Zhang et al., 2024).
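The ratios implied by the table can be computed directly. The sketch below uses only the rounded table entries, so the resulting throughput ratios need not match the $48.9\times$ headline figure, which may have been measured under a different setup; the dictionary layout and function name are assumptions.

```python
# Values copied from the table above (params in millions, MACs in G, images/s on A100).
models = {
    "SAM-ViT-Huge":        {"params_m": 641, "macs_g": 2973, "throughput": 11},
    "EfficientViT-SAM-L0": {"params_m": 35,  "macs_g": 35,   "throughput": 762},
    "EfficientViT-SAM-L2": {"params_m": 61,  "macs_g": 69,   "throughput": 538},
    "EfficientViT-SAM-XL1": {"params_m": 203, "macs_g": 322,  "throughput": 182},
}

def relative_to_baseline(name, baseline="SAM-ViT-Huge"):
    """Throughput gain and compute/size reduction factors versus the baseline."""
    m, b = models[name], models[baseline]
    return {
        "throughput_gain": m["throughput"] / b["throughput"],
        "macs_reduction": b["macs_g"] / m["macs_g"],
        "params_reduction": b["params_m"] / m["params_m"],
    }

summary = {name: relative_to_baseline(name) for name in models if name != "SAM-ViT-Huge"}
```

Reading `summary` side by side with the accuracy columns makes the trade-off explicit: the smallest variant gains the most throughput, while XL1 gives up some of that gain for higher mAP.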

6. Variants, Applications, and Extensions

EfficientViT-SAM has enabled new efficient SAM variants and inspired further research:

Resource-Constrained Deployment:

Edge and mobile device applications benefit from the reduced parameter count and runtime, enabling real-time segmentation for inputs up to $1024\times1024$ pixels (Zhang et al., 2024).

Depth-Aware Fusion:

Extensions augment EfficientViT-SAM with depth priors, fusing mid-level RGB and depth features to improve segmentation with limited data, at the cost of increased parameter count and halved throughput. For example, point-prompted COCO mIoU rises from 67.2% (EfficientViT-SAM-L2) to 71.6% with depth fusion (Zhou et al., 12 Feb 2026).

One-Shot Architecture Search and Pruning:

Methods such as SuperSAM apply structured pruning and unstructured parameter prioritization (Wanda scoring and MLP slicing) to produce customized EfficientViT-SAM subnetworks, achieving up to 70% size reductions with maintained or improved mIoU (Abebe et al., 15 Jan 2025).
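Wanda scoring ranks each weight by its magnitude multiplied by the L2 norm of the corresponding input feature over a calibration batch. The sketch below illustrates that scoring rule with per-row pruning; it shows the general Wanda idea rather than SuperSAM's specific procedure, and the function names and toy values are assumptions.

```python
import numpy as np

def wanda_scores(weight, activations):
    """Score each weight as |W_ij| * ||X_j||_2.

    weight: (out_features, in_features); activations: (tokens, in_features).
    """
    feature_norms = np.linalg.norm(activations, axis=0)   # one norm per input feature
    return np.abs(weight) * feature_norms[None, :]

def prune_per_row(weight, activations, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights within each output row."""
    scores = wanda_scores(weight, activations)
    n_prune = int(weight.shape[1] * sparsity)
    pruned = weight.copy()
    if n_prune > 0:
        lowest = np.argsort(scores, axis=1)[:, :n_prune]
        np.put_along_axis(pruned, lowest, 0.0, axis=1)
    return pruned

# Toy example: the first input feature has a much larger activation norm.
weight = np.array([[1.0, -2.0, 3.0, -4.0]])
activations = np.array([[10.0, 1.0, 1.0, 1.0]])
pruned = prune_per_row(weight, activations, sparsity=0.5)
```

In this toy case plain magnitude pruning would remove the weights 1.0 and -2.0, whereas the activation-aware score keeps 1.0 because its input feature is large, and removes -2.0 and 3.0 instead.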

Medical Imaging and 3D Volumetric Segmentation:

Specialized EfficientViT-SAM designs (sometimes referred to as “ViT-Tiny” or “FastSAM3D”) adapt the encoder with fewer layers and 3D sparse flash attention, achieving 527× inference speedup over naive 2D SAM approaches with minimal performance loss for interactive 3D medical segmentation (Shen et al., 2024).

7. Limitations and Prospects

EfficientViT-SAM preserves the mask decoder and prompt encoder from the baseline SAM, focusing all efficiency gains on the image encoder. While this results in substantial improvements, further reductions could be obtained by elasticizing or pruning the prompt and mask components (Abebe et al., 15 Jan 2025).

Depth-aware variants increase model size and decrease throughput, and their reliance on frozen, older depth estimators may limit robustness (Zhou et al., 12 Feb 2026). Knowledge distillation performance is highly sensitive to the choice of loss; direct logit-level matching may fail to converge (Shen et al., 2024). At extreme parameter reductions (below ~30M), substantial mIoU degradation is observed (Abebe et al., 15 Jan 2025).

Future research directions include:

End-to-end NAS with learned pruning masks,
Dynamic window sizes for MLP slicing,
Joint multimodal training of RGB and depth cues,
Lightweight prompt and mask decoder designs,
Hardware-specific kernel fusion for deployment.

References

"EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss" (Zhang et al., 2024)
"EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction" (Cai et al., 2022)
"Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data" (Zhou et al., 12 Feb 2026)
"SuperSAM: Crafting a SAM Supernetwork via Structured Pruning and Unstructured Parameter Prioritization" (Abebe et al., 15 Jan 2025)
"FastSAM3D: An Efficient Segment Anything Model for 3D Volumetric Medical Images" (Shen et al., 2024)