SAMannot: Efficient SAM2 Annotation Framework
- SAMannot is a modular annotation framework that integrates SAM and SAM2 for promptable, zero-shot segmentation in video, medical imaging, and scientific analysis.
- It employs a memory-conscious processing pipeline by partitioning frames into blocks and maintaining only the nearest 16 frames to optimize VRAM usage and computational efficiency.
- The framework supports a human-in-the-loop 'lock-and-refine' workflow combined with automated skeleton-based prompt propagation for consistent instance labeling and efficient export of annotations.
SAMannot refers to a suite of recent open-source frameworks and plugins that integrate the Segment Anything Model (SAM) and its successor, SAM2, into interactive, automated, and memory-efficient annotation workflows for video instance segmentation, medical image annotation, and scientific image analysis. Originating as distinct projects but sharing foundational principles, SAMannot tools address the bottleneck of high-fidelity mask annotation by blending promptable, zero-shot segmentation with efficient client-side computation, privacy preservation, and export-ready dataset construction. These frameworks enable both human-in-the-loop and fully automated annotation modes, leveraging the power of SAM2 while overcoming resource constraints typical of foundation model deployment.
1. System Architecture and Computational Principles
SAMannot, as presented in (Dinya et al., 16 Jan 2026), is architected around a modular, memory-conscious processing pipeline. The front end is implemented in Tkinter, while the back end orchestrates memory management, prompt scheduling, mask propagation, identity management, and data export. Upon loading a video or image sequence, frames are partitioned into blocks of size $B$; only frames and masks within the current block reside in RAM/VRAM, keeping memory pressure bounded.
Crucial modifications to the SAM2 dependency target runtime state retention:
- Only latent tensors associated with the nearest 16 frames to the inference cursor are kept.
- SAM2 inference state is reinitialized for each block, preventing indefinite accumulation of attention states.
- Asynchronous, discard-on-finish frame loading further caps compute and memory usage.
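The 16-frame retention policy described above can be sketched as a simple eviction cache (an illustrative reconstruction, not the actual SAM2 patch):

```python
from collections import OrderedDict

WINDOW = 16  # number of frames whose latents are retained

class LatentWindow:
    """Keeps latent tensors only for the WINDOW frames nearest the cursor."""
    def __init__(self, window=WINDOW):
        self.window = window
        self.latents = OrderedDict()  # frame index -> latent tensor

    def put(self, frame_idx, latent):
        self.latents[frame_idx] = latent
        self.latents.move_to_end(frame_idx)
        while len(self.latents) > self.window:
            # discard the oldest latent to cap VRAM usage
            self.latents.popitem(last=False)

    def __contains__(self, frame_idx):
        return frame_idx in self.latents
```

Combined with per-block reinitialization, such a cache keeps the attention-state footprint constant regardless of sequence length.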
For $V_{\text{model}}$ (static model VRAM), $v_{\text{frame}}$ (per-frame overhead), and active block size $B$, peak VRAM is given by

$$
V_{\text{peak}} \approx V_{\text{model}} + B \cdot v_{\text{frame}},
$$

with empirically measured memory use below 3 GB for $B = 16$ on an RTX 4090, establishing suitability for commodity GPUs (Dinya et al., 16 Jan 2026).
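A quick numeric check of the relation between block size and peak VRAM, with illustrative constants (not measured values from the paper):

```python
def peak_vram_gb(model_gb, per_frame_gb, block_size):
    """Estimate peak VRAM: static model footprint plus per-frame
    state for the frames in the active block."""
    return model_gb + per_frame_gb * block_size

# Illustrative numbers only: a 2.0 GB model plus 16 frames
# at 60 MB of latent state each stays under a 3 GB budget.
estimate = peak_vram_gb(2.0, 0.06, 16)  # ~2.96 GB
```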
2. Interactive Human-in-the-Loop Workflow
SAMannot emphasizes a “lock-and-refine” human-in-the-loop workflow. Annotators interactively define segmentation prompts (point, box, or both) on frames. Once satisfied with a frame’s segmentation, they set a checkpoint (barrier frame), after which subsequent propagation cannot overwrite that mask.
Propagation is governed by the following principle: within any propagation interval, if $t_b$ is a barrier frame, mask updates at $t_b$ and beyond it in that direction are forbidden. The propagation algorithm operates in unidirectional or bidirectional mode; for each label $\ell$, the latest prompt at or before the current frame is re-used to propagate the mask via the SAM2 inference function.
```
function PROPAGATE(direction, frames[1..N], labels, barriers)
    if direction == "forward":
        t_range ← current_frame+1 … N
    else:
        t_range ← current_frame−1 … 1
    for t in t_range:
        if t in barriers:
            break
        for each label ℓ:
            s ← max{ k ≤ t | prompts[ℓ, k] ≠ ∅ }
            if s undefined: continue
            P* ← prompts[ℓ, s]
            M[ℓ, t] ← SAM2_INFER(frame[t], P*)
    return masks M
```
This construction ensures user control over mask correction, reduces manual labor by over 80%, and imposes deterministic, sequential mask consistency across frames.
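A runnable sketch of this barrier-respecting propagation loop, with a stub in place of SAM2 inference (the data layout and function names are assumptions for illustration):

```python
def propagate(direction, n_frames, current, prompts, barriers, infer):
    """Propagate masks from `current` until a barrier frame is hit.

    prompts: dict mapping (label, frame) -> prompt.
    infer:   callable standing in for SAM2 inference.
    Returns a dict mapping (label, frame) -> mask.
    """
    if direction == "forward":
        t_range = range(current + 1, n_frames)
    else:
        t_range = range(current - 1, -1, -1)

    labels = {lbl for (lbl, _) in prompts}
    masks = {}
    for t in t_range:
        if t in barriers:  # locked frame: stop, never overwrite
            break
        for lbl in labels:
            # most recent prompt for this label at or before frame t
            cands = [k for (l, k) in prompts if l == lbl and k <= t]
            if not cands:
                continue
            masks[(lbl, t)] = infer(t, prompts[(lbl, max(cands))])
    return masks
```

With a barrier at frame 5 and a single prompt at frame 0, forward propagation fills frames 1 through 4 and stops, leaving the locked frame untouched.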
3. Memory Optimization and Computational Throughput
Efficient annotation on commodity hardware is realized through multiple strategies:
- Limiting live memory to the smallest sufficient window (16 frames).
- Rewriting the SAM2 state at every block switch.
- Using asynchronous frame reads and guaranteeing minimal VRAM deltas (under 1.4 GB across tested sequences).
On DAVIS and LVOS datasets, per-frame compute throughput with all-label propagation reaches 5–10 fps, with end-to-end annotation for complex clips (tens to hundreds of frames and up to ten instances) requiring only 1–8 minutes per video (Dinya et al., 16 Jan 2026). The per-frame time budget is dominated by SAM2 inference, with negligible overhead for prompt lookup and mask postprocessing.
4. Instance Identity, Prompt Automation, and Export
Persistent instance identity is handled by fixed label IDs, mapped to unique colors (e.g., PASCAL VOC colormap) to avoid merge or drift across time. No explicit tracking algorithm is used; consistency is achieved by propagating from the most recent prompt per label, never altering identity mid-propagation.
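The fixed-ID-to-color mapping can use the standard PASCAL VOC colormap generator, sketched below (this bit-distribution scheme is the conventional VOC recipe, not code from the paper):

```python
def voc_color(label_id):
    """PASCAL VOC colormap: derive an (R, G, B) triple from a label ID
    by distributing its bits across the channels' high bits."""
    r = g = b = 0
    c = label_id
    for shift in range(8):
        r |= ((c >> 0) & 1) << (7 - shift)
        g |= ((c >> 1) & 1) << (7 - shift)
        b |= ((c >> 2) & 1) << (7 - shift)
        c >>= 3
    return (r, g, b)
```

Because the mapping is a pure function of the label ID, an instance keeps the same color in every frame and every export, which is exactly what fixed-ID identity requires.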
To facilitate seamless annotation across temporal blocks, the framework performs skeletonization-based prompt automation: the skeleton $\mathcal{S}(M_\ell)$ of each mask $M_\ell$ is computed using iterative morphological thinning. Graph-based detection identifies endpoints ($\deg(p) = 1$) and junctions ($\deg(p) \ge 3$) within the skeleton; the coordinates of these points are re-used as auto-prompts to initialize the first frame of each new annotation block.
$$
P_{\mathrm{auto},\ell} = \{\, p \in \mathcal{S}(M_\ell) \;|\; \deg(p) = 1 \;\lor\; \deg(p) \ge 3 \,\}
$$
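A sketch of the endpoint/junction extraction under the 8-neighborhood degree convention (the helper name and plain-NumPy neighbor counting are illustrative; the paper's implementation may differ):

```python
import numpy as np

def skeleton_prompts(skel):
    """Return (row, col) coordinates of endpoints (1 neighbor) and
    junctions (>= 3 neighbors) in a binary skeleton array."""
    s = np.pad(skel.astype(int), 1)
    # degree = number of 8-connected "on" neighbors of each pixel
    deg = sum(np.roll(np.roll(s, dr, 0), dc, 1)
              for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0))[1:-1, 1:-1]
    on = skel.astype(bool)
    pts = on & ((deg == 1) | (deg >= 3))
    return [(int(r), int(c)) for r, c in zip(*np.nonzero(pts))]
```

For a straight-line skeleton, only the two tips qualify; any pixel where branches meet is picked up through the junction condition.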
Export functionality includes:
- PNG-format per-frame masks (ID-encoded).
- YOLO-format bounding boxes.
- Session logs containing all interactions with timestamps, label, coordinates, and prompt type.
- Full session pickles for restoration.
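For the YOLO-format export, converting one instance of an ID-encoded mask into a normalized box line can be sketched as follows (function name and layout are illustrative; YOLO boxes are normalized center-x, center-y, width, height):

```python
import numpy as np

def mask_to_yolo(mask, label_id, class_idx):
    """Convert one instance in an ID-encoded mask to a YOLO bbox line:
    'class cx cy w h', all coordinates normalized to [0, 1]."""
    rows, cols = np.nonzero(mask == label_id)
    if rows.size == 0:
        return None  # instance absent in this frame
    h, w = mask.shape
    x0, x1 = cols.min(), cols.max() + 1
    y0, y1 = rows.min(), rows.max() + 1
    cx, cy = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h
    return f"{class_idx} {cx:.6f} {cy:.6f} {(x1 - x0) / w:.6f} {(y1 - y0) / h:.6f}"
```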
5. Benchmarking: Accuracy, Speed, and Resource Profile
Comprehensive evaluation on DAVIS 2017 and LVOS (multi-instance video benchmarks) demonstrates the following:
- Average mean IoU: 0.9185 on DAVIS subsets; 0.8645 on LVOS subsets.
- Average mean Dice: 0.9512 (DAVIS); 0.9044 (LVOS).
- Pixel accuracy: consistently above 0.989 (Dinya et al., 16 Jan 2026).
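The reported IoU and Dice values follow the standard overlap definitions, which can be computed per mask pair as:

```python
import numpy as np

def iou_dice(pred, gt):
    """Intersection-over-Union and Dice coefficient for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice
```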
Aggregate annotation times per video (1–8 minutes) and stable VRAM usage (under 3 GB) fulfill the practical requirements of research environments handling privacy-sensitive data or large-scale annotation campaigns.
| Sequence | Frames | Instances | Mean IoU | Mean Dice | Pixel Accuracy |
|---|---|---|---|---|---|
| Rhino (DAVIS) | 90 | 1 | 0.9807 | 0.9902 | 0.9968 |
| Schoolgirls | 80 | 7 | 0.7473 | 0.8214 | 0.9896 |
| Average | 76 | — | 0.9185 | 0.9512 | 0.9893 |
[Adapted from (Dinya et al., 16 Jan 2026) Table 1]
6. Context: Comparison to Other SAM-based Annotation Tools
Parallel developments have adapted SAM and SAM2 for related domains:
- SAMJ (“SAMannot” on Fiji/ImageJ) integrates SAM into a Java-native plugin with single-click install, promptable segmentation (point/rectangle), and batch workflows for large scientific images. It achieves per-object annotation at about 0.1 s/object with high IoU on well-scaled objects, supports export to standard ROI, JSON, and XML formats, and is roughly 100× faster than manual methods (Garcia-Lopez-de-Haro et al., 3 Jun 2025).
- SegmentWithSAM (3D Slicer extension) incorporates both SAM and SAM2 for 2D and 3D medical image annotation. Users place prompts on one or more slices; masks propagate across entire volumes using SAM2’s video predictor. Achieves Dice scores above 0.85 with fewer than five prompts per volume, with outputs exportable as NRRD or DICOM-SEG (Yildiz et al., 2024).
- SAM2Auto fuses automatic, open-vocabulary detection (SMART-OD) with memory-based video instance segmentation and tracking (FLASH) for unsupervised, large-scale video annotation, supporting metric-driven error minimization and no human supervision (Rocky et al., 9 Jun 2025).
A plausible implication is that the SAMannot toolkit family represents a versatile set of privacy-conscious, client-side, and domain-agnostic solutions able to serve biomedical, behavioral, and computer vision annotation needs.
7. Limitations and Future Directions
Key limitations include:
- SAMannot and related tools depend on the intrinsic segmentation capacity of SAM2 and may struggle with visually ambiguous, crowded, or low-contrast scenes.
- Fine structural accuracy in vascular and low-contrast regions remains challenging, necessitating manual correction (Yildiz et al., 2024).
- No explicit tracking is implemented; future variants may incorporate tracklet-based identity confirmation (e.g., via mask IoU thresholding).
Future directions indicated in the literature include:
- Extending propagation to support box and mask prompts in 3D sequences.
- Integrating domain-adaptive fine-tuning (e.g., low-rank adaptation of SAM2) to specialized datasets.
- Formal benchmarking across public datasets and further reduction in user prompt requirements.
In summary, SAMannot frameworks demonstrate that integrating SAM2-based promptable instance segmentation into memory-aware, interactive, and automated pipelines yields efficient, accurate, and reproducible annotation for a wide spectrum of research verticals, substantially reducing annotation time while maintaining local privacy and export flexibility (Dinya et al., 16 Jan 2026, Garcia-Lopez-de-Haro et al., 3 Jun 2025, Yildiz et al., 2024).