Segment Anything Video (SA-V) Dataset
- The dataset introduces promptable open-world video segmentation with masklets, combining human-in-the-loop verification and rapid SAM 2-assisted annotation.
- SA-V comprises 50,900 diverse videos with extensive spatio-temporal mask annotations, enabling detailed analysis across indoor and outdoor scenes.
- The underlying streaming-memory transformer, SAM 2, combines a hierarchical ViT image encoder with memory attention for efficient, temporally consistent mask propagation and robust tracking.
The Segment Anything Video (SA-V) Dataset is the largest publicly available resource for promptable open-world video segmentation, designed to advance research in visual segmentation tasks across diverse domains. Developed via a human-in-the-loop data engine built around the streaming-memory transformer SAM 2, SA-V provides spatio-temporal mask tracks ("masklets") for arbitrary objects and object parts, annotated and quality-assured at scale. The dataset and model are fully open, distributed under permissive licenses to catalyze innovation in video understanding.
1. Composition and Structure
SA-V comprises 50,900 in-the-wild videos (54% indoor, 46% outdoor) sourced by crowd workers from 47 countries. The videos total 196 hours, average about 14 seconds per clip at 24 FPS, and span resolutions from 240p to 4K (mean 1401×1037 pixels). Segmentation coverage includes 642,600 masklets (spatio-temporal mask tracks): 190,900 are manually annotated, while 451,700 are generated automatically and then verified by human annotators. Masks are provided at 6 FPS, covering approximately 4.2 million annotated frames and yielding 35.5 million per-frame masks in total (over 10 million from manual annotation). Object coverage is unconstrained: annotators may select any object or part with a clear, visible spatial boundary.
The mask size distribution reveals that over 88% of masks occupy less than 10% of the frame area. Disappearance and reappearance events, which are significant for temporally consistent tracking, are observed in 42.5% of manual masklets and 27.7% overall.
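Both statistics can be derived directly from a masklet's per-frame masks. The following minimal sketch (Python) assumes masks are provided as boolean arrays, with None marking frames where the object is absent; it is illustrative, not part of the official tooling.

```python
# Minimal sketch (not official tooling): relative mask area and
# disappearance/reappearance detection for a masklet given as a list of
# per-frame boolean masks (None = object absent in that frame).
import numpy as np

def mask_area_fraction(mask: np.ndarray) -> float:
    """Fraction of the frame covered by a binary mask."""
    return float(mask.sum()) / mask.size

def has_disappearance(masklet: list) -> bool:
    """True if the object vanishes and later reappears within the track."""
    visible = [m is not None and m.any() for m in masklet]
    seen, gone = False, False
    for v in visible:
        if v and gone and seen:
            return True                      # reappeared after a gap
        seen = seen or v
        gone = gone or (seen and not v)      # gap starts only after first appearance
    return False

# Toy example: object visible, hidden for one frame, visible again.
h, w = 480, 854
m = np.zeros((h, w), dtype=bool)
m[100:150, 200:260] = True
masklet = [m, None, m]
print(f"area fraction: {mask_area_fraction(m):.4f}")        # ~0.0073, i.e. a "small" mask
print("disappearance event:", has_disappearance(masklet))   # True
```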
2. Annotation Protocols and Quality Control
Annotations are organized as masklets, each represented by per-frame binary masks stored as run-length-encoded JSON or lossless PNG. The annotation process is rooted in promptable segmentation: annotators supply positive/negative clicks, bounding boxes, or masks as interaction cues. On average, 2.68 clicks are needed per edited frame, and corrections or new prompts occur on roughly 19% of frames within each masklet, reflecting the semi-automatic nature of mask propagation.
A three-phase annotation protocol increases speed and quality:
- Phase 1: Manual brushing using SAM on every frame (37.8 sec/frame).
- Phase 2: Annotate the initial frame, propagate masks with SAM 2 Mask, then correct as needed (7.4 sec/frame).
- Phase 3: Full SAM 2 with memory enables annotation of the first frame via click/box/mask and efficient propagation with minimal edits (4.5 sec/frame, 19% edited frames).
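The per-frame timings above imply the speedups reported for the data engine in Section 3; a quick arithmetic check:

```python
# Per-frame annotation times quoted above (seconds per edited frame).
phase_times = {"phase1_sam_manual": 37.8, "phase2_sam2_mask": 7.4, "phase3_sam2_full": 4.5}

baseline = phase_times["phase1_sam_manual"]
for name, t in phase_times.items():
    print(f"{name}: {t:.1f} sec/frame, speedup vs. phase 1: {baseline / t:.1f}x")
# Phase 2 -> ~5.1x, phase 3 -> 8.4x, matching the speedups quoted in Section 3.
```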
Quality control mandates dual-team verification: masklets are marked "satisfactory" only if the target object is tracked consistently and completely; unsatisfactory tracks are refined or discarded. Targets may be whole objects or clearly delineated parts. A per-masklet consistency measure, the "Mask Alignment Score" (MAS), quantifies how well the propagated masks stay aligned with the target across frames.
3. Data Engine Workflow and Streaming-Memory Model
SA-V’s collection pipeline uses a model-in-the-loop workflow:
- Interactive Phases: Phase 1 (manual+SAM annotation) yields 16,000 masklets on 1,400 videos; Phase 2 (SAM+SAM 2 Mask) produces 63,500 masklets (5× speedup); Phase 3 (fully interactive SAM 2 with memory) delivers 197,000 masklets (8.4× speedup).
- Automatic Masklets: SAM 2 is prompted with regular grids of clicks (32×32, plus 16×16 and 4×4 grids on crops) to generate candidate masklets, which undergo human verification before inclusion.
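The grid-of-clicks initialization can be pictured as a regular lattice of positive point prompts over a frame or crop. The sketch below generates such a grid; the prompt representation as (x, y, label) tuples is an illustrative assumption, not the official SAM 2 prompt format.

```python
# Sketch of grid-of-clicks initialization: a regular n x n grid of positive
# point prompts spread over a frame (or crop), used to seed automatic masklets.
import numpy as np

def grid_prompts(width: int, height: int, n: int = 32):
    """Return n*n positive click prompts spread evenly over a width x height frame."""
    xs = (np.arange(n) + 0.5) / n * width
    ys = (np.arange(n) + 0.5) / n * height
    return [(float(x), float(y), 1) for y in ys for x in xs]   # label 1 = positive click

prompts = grid_prompts(1280, 720, n=32)
print(len(prompts), prompts[0])   # 1024 prompts; the first lies near the top-left corner
```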
The underlying model is a streaming-memory transformer:
- Per-frame image features are extracted by a hierarchical ViT image encoder (Hiera, pretrained with MAE), processing the video as a stream.
- Memory attention conditions each frame's features on spatial memories of the N most recent frames (a FIFO queue), object pointers from up to M prompted frames, and the current prompts via cross-attention.
- The mask decoder yields masks, occlusion confidence, and predicted IoU for each frame. The highest-IoU mask propagates forward. Object pointers are 32–64 dimensional tokens from decoder heads.
- Prompts support direct click, box, or mask embeddings; occlusion prediction addresses visibility across temporal gaps.
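The propagation logic can be summarized by the schematic loop below. All components (encode_frame, memory_attention, mask_decoder) are lightweight stand-ins rather than the real SAM 2 modules; only the control flow (a FIFO bank of N recent memories, retained prompted-frame memories, and per-frame selection of the highest-predicted-IoU mask) mirrors the description above.

```python
# Schematic propagation loop for a streaming-memory model. All components are
# placeholders standing in for the real SAM 2 modules; only the control flow
# reflects the description above.
from collections import deque
import numpy as np

N_RECENT, M_PROMPTED = 6, 2          # assumed sizes of the two memory pools

def encode_frame(frame):             # stand-in for the Hiera image encoder
    return frame.mean(axis=-1)

def memory_attention(feat, memories, pointers):   # stand-in for memory attention
    return feat if not memories else feat + 0.0 * sum(memories) / len(memories)

def mask_decoder(cond_feat):         # stand-in: candidate masks, predicted IoUs, occlusion
    masks = [cond_feat > cond_feat.mean(), cond_feat > np.percentile(cond_feat, 75)]
    pred_ious = [0.7, 0.9]
    occluded = False
    return masks, pred_ious, occluded

def propagate(frames, prompted_memory):
    recent = deque(maxlen=N_RECENT)                  # FIFO of recent-frame memories
    pointers = list(prompted_memory)[:M_PROMPTED]    # memories of prompted frames
    outputs = []
    for frame in frames:
        feat = encode_frame(frame)
        cond = memory_attention(feat, list(recent), pointers)
        masks, ious, occluded = mask_decoder(cond)
        best = masks[int(np.argmax(ious))]           # keep the highest-predicted-IoU mask
        outputs.append(None if occluded else best)
        recent.append(cond)                          # push into the FIFO memory bank
    return outputs

video = [np.random.rand(64, 64, 3) for _ in range(8)]
masks = propagate(video, prompted_memory=[encode_frame(video[0])])
print(len(masks), masks[0].shape)
```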
Key scoring formulas mirror standard segmentation metrics:
- Mean IoU for images: IoU(P, G) = |P ∩ G| / |P ∪ G| between a predicted mask P and its ground truth G, averaged over prompted objects.
- Mask alignment per masklet, as described in Section 2.
Automatic masks undergo filtering (components <200 pixels removed, holes <200 pixels filled) for spatial coherence.
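This filter can be implemented with standard connected-component analysis. The sketch below applies the 200-pixel thresholds quoted above and is an illustration rather than the exact production code.

```python
# Sketch of the spatial-coherence filter: drop connected components smaller
# than 200 pixels and fill holes smaller than 200 pixels. The thresholds come
# from the text; the implementation itself is illustrative.
import numpy as np
from scipy import ndimage

MIN_PIXELS = 200

def clean_mask(mask: np.ndarray, min_pixels: int = MIN_PIXELS) -> np.ndarray:
    mask = mask.astype(bool)

    # 1) Remove foreground components smaller than the threshold.
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.isin(labels, np.flatnonzero(sizes >= min_pixels) + 1)

    # 2) Fill small background holes: components of the complement that do not
    #    touch the image border and are smaller than the threshold.
    bg_labels, n_bg = ndimage.label(~keep)
    border = np.unique(np.concatenate([bg_labels[0], bg_labels[-1],
                                       bg_labels[:, 0], bg_labels[:, -1]]))
    for lab in range(1, n_bg + 1):
        if lab in border:
            continue                       # touches the border: not a hole
        hole = bg_labels == lab
        if hole.sum() < min_pixels:
            keep |= hole
    return keep

m = np.zeros((100, 100), dtype=bool)
m[10:60, 10:60] = True                     # large component
m[30:33, 30:33] = False                    # 9-pixel hole -> filled
m[80:83, 80:83] = True                     # 9-pixel speck -> removed
print(clean_mask(m).sum())                 # 2500: hole filled, speck dropped
```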
4. Dataset Splits and Demographics
SA-V is partitioned as follows:
- Training: ~50,600 videos, all masklets minus val/test samples.
- Validation: 155 challenging videos, 293 manually verified masklets (sampled at 6 FPS).
- Test: 150 videos, 278 masklets.
Splits are determined by video author and geography to prevent near-duplicate content across splits. Annotator demographics (self-reported): 274 male, 236 female; ages 18–24 (109), 25–40 (305), 41–64 (88). A fairness evaluation on Ego-Exo4D shows less than a 1-point gap in segmentation accuracy across gender and age groups for 3-click and mask prompts. Object categories are open-world; annotation is not constrained to a predefined taxonomy.
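The exact splitting procedure is not specified beyond respecting author and geography. One minimal way to realize such group-wise splitting is to hash a grouping key so that every video from the same author and country lands in the same split; the field names and split fractions below are hypothetical.

```python
# Minimal sketch of group-wise split assignment: all videos sharing the same
# (author, country) key receive the same split. Field names and fractions are
# hypothetical; the text only states that splits respect author and geography.
import hashlib

def assign_split(author_id: str, country: str,
                 val_frac: float = 0.003, test_frac: float = 0.003) -> str:
    key = f"{author_id}:{country}".encode()
    # Stable pseudo-random value in [0, 1), identical for every video in the group.
    u = int(hashlib.sha256(key).hexdigest(), 16) % 10**6 / 10**6
    if u < val_frac:
        return "val"
    if u < val_frac + test_frac:
        return "test"
    return "train"

print(assign_split("author_0421", "BR"))   # e.g. 'train'
```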
5. File Formats and Organization
SA-V maintains a consistent directory and file structure:
```
Root/
  videos/
    {video_id}/
      video.mp4
      frames/{000001.jpg, ...}
  annotations/
    masklets_train.json
    masklets_val.json
    masklets_test.json
    masklet_schema.json
  metadata/
    videos.csv       # video_id, fps, resolution, length_sec, location
    annot_stats.csv  # masklet_id, type, num_frames, avg_mask_area
```
Each masklets_*.json is COCO-style, with the following fields:
| Field | Type/Format | Description |
|---|---|---|
| masklet_id | string/int | Unique masklet track identifier |
| video_id | string/int | Video key |
| frames | list | Frame indices or names |
| masks | list (RLE/PNG) | Dense per-frame binary masks |
| prompts | list | Each: frame, list of clicks (x,y,±), type |
| quality | string | 'verified' or 'unsatisfactory' |
Supporting metadata links masklets to video sources and annotator statistics.
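A typical reading pattern is sketched below, assuming each masklets_*.json holds a list of records with the fields above and COCO-compatible RLE masks; both assumptions follow the stated format rather than official loader code.

```python
# Sketch of reading one annotation file using the schema above. Field names
# follow the table; decoding COCO-style RLE via pycocotools is an assumption
# based on the "run-length encoded JSON" description.
import json
from pycocotools import mask as mask_utils

with open("annotations/masklets_val.json") as f:
    masklets = json.load(f)                       # assumed: a list of masklet records

verified = [m for m in masklets if m.get("quality") == "verified"]
print(f"{len(verified)} / {len(masklets)} masklets are verified")

sample = verified[0]
frame_masks = []
for rle in sample["masks"]:                       # one RLE dict per annotated frame
    frame_masks.append(mask_utils.decode(rle).astype(bool))   # (H, W) binary mask
print(sample["masklet_id"], sample["video_id"], len(frame_masks), frame_masks[0].shape)
```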
6. Licensing, Release, and Benchmarking
SA-V is distributed under the Creative Commons Attribution 4.0 (CC BY 4.0) license. All model code and weights for SAM 2 are released under Apache 2.0. The dataset, codebase, and interactive demo are available at:
- Dataset: https://ai.meta.com/datasets/segment-anything-video
- Model/code: https://github.com/facebookresearch/segment-anything-2
- Demo: https://sam2.metademolab.com
Benchmarking is comprehensive:
- Zero-shot promptable video segmentation: 9 video datasets (e.g., EndoVis, VIPSeg) under a 3-click protocol, reporting region similarity J (an IoU measure) and contour accuracy F per DAVIS conventions.
- Semi-supervised video object segmentation (VOS): 17 datasets, prompted by click, box, or mask on the first frame; J, F, and their mean J&F are reported.
- DAVIS interactive: scribble/click prompts; area under the J&F-vs-time curve (AUC) and J&F at 60 seconds.
- Zero-shot static images: 37 datasets, mIoU after 1 and 5 clicks.
- Metrics follow standard definitions: J = |M ∩ G| / |M ∪ G| for a predicted mask M and ground truth G, F is the boundary F-measure, and J&F = (J + F) / 2.
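For reference, simplified implementations of J and F in the spirit of the DAVIS metrics are sketched below; the boundary-tolerance handling approximates, rather than reproduces, the official evaluation code.

```python
# Simplified J (region IoU) and F (boundary F-measure) in the spirit of the
# DAVIS metrics referenced above; an approximation, not the official code.
import numpy as np
from scipy import ndimage

def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union of predicted and ground-truth masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union else 1.0

def _boundary(mask: np.ndarray) -> np.ndarray:
    """One-pixel-wide boundary of a binary mask."""
    return mask & ~ndimage.binary_erosion(mask)

def contour_accuracy(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F: F-measure between mask boundaries, with a small pixel tolerance."""
    pb, gb = _boundary(pred), _boundary(gt)
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pb & ndimage.binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & ndimage.binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gt = np.zeros((128, 128), dtype=bool); gt[32:96, 32:96] = True
pred = np.zeros_like(gt); pred[34:98, 30:94] = True        # slightly shifted prediction
J, F = region_similarity(pred, gt), contour_accuracy(pred, gt)
print(f"J={J:.3f}  F={F:.3f}  J&F={(J + F) / 2:.3f}")
```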
SA-V thus establishes a modular framework for open-world video segmentation at massive scale, with verified per-frame masks and transparent quality control. The prompt-driven model and dataset are directly applicable to research in interactive segmentation, open-world detection, and downstream tasks such as video understanding, object tracking, and scene analysis.