
Segment Anything Video (SA-V) Dataset

Updated 11 November 2025
  • The dataset supports promptable open-world video segmentation with masklets, collected by combining human-in-the-loop verification with rapid SAM 2-assisted annotation.
  • SA-V comprises 50,900 diverse videos with extensive spatio-temporal mask annotations, enabling detailed analysis across indoor and outdoor scenes.
  • The underlying streaming-memory transformer leverages ViT features and memory attention to ensure efficient, consistent mask propagation and robust tracking.

The Segment Anything Video (SA-V) Dataset is the largest publicly available resource for promptable open-world video segmentation, designed to advance research in visual segmentation tasks across diverse domains. Developed via a human-in-the-loop data engine built around the streaming-memory transformer SAM 2, SA-V provides spatio-temporal mask tracks ("masklets") for arbitrary objects and object parts, annotated and quality-assured at scale. The dataset and model are fully open, distributed under permissive licenses to catalyze innovation in video understanding.

1. Composition and Structure

SA-V comprises 50,900 in-the-wild videos (54% indoor, 46% outdoor) sourced by crowd workers from 47 countries. These videos total 196 hours and approximately 4.2 million frames, sampled at 24 FPS, with an average length of about 14 seconds per clip. Video resolution spans from 240p to 4K, averaging 1401×1037 pixels. Segmentation coverage includes 642,600 masklets representing spatio-temporal tracks; 190,900 are manually annotated, while 451,700 are generated automatically and then verified by human annotators. Masks are subsampled at 6 FPS, yielding 35.5 million total per-frame masks (over 10 million manual-only). Object coverage is unconstrained—annotators may select any object or part with visible and clear spatial boundaries.

The mask size distribution reveals that over 88% of masks occupy less than 10% of the frame area. Disappearance and reappearance events, which are significant for temporally consistent tracking, are observed in 42.5% of manual masklets and 27.7% overall.
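
These distributional statistics can be reproduced from the released per-frame masks. The following is a minimal sketch, not part of any official tooling; it assumes an all-zero mask means the object is not visible in that frame, and reports a masklet's mean area fraction, whether it stays below 10% of the frame, and whether it contains a reappearance event.

import numpy as np

def masklet_stats(per_frame_masks):
    # per_frame_masks: list of HxW binary arrays for one masklet, one per sampled frame.
    # An all-zero mask is assumed to mean the object is not visible in that frame.
    area_fracs = [m.astype(bool).mean() for m in per_frame_masks]
    visible = [a > 0 for a in area_fracs]

    # A reappearance is any visible -> hidden -> visible transition over the track.
    reappears = any(
        (not visible[i]) and visible[i - 1] and any(visible[i + 1:])
        for i in range(1, len(visible) - 1)
    )
    visible_areas = [a for a in area_fracs if a > 0] or [0.0]
    return {
        "mean_area_fraction": float(np.mean(visible_areas)),
        "occupies_under_10_percent": all(a < 0.10 for a in visible_areas),
        "has_reappearance": reappears,
    }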

2. Annotation Protocols and Quality Control

Annotations are organized as masklets, each represented by per-frame binary masks stored in run-length-encoded JSON or lossless PNG format. The annotation process is rooted in “promptable segmentation”: annotators supply positive/negative clicks, bounding boxes, or masks as interaction cues. On average, 2.68 clicks are required per edited frame, and corrections or re-prompting occur on roughly 19% of frames within each masklet, reflecting the semi-automatic nature of mask propagation.
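
Run-length-encoded masks in this form can be decoded with standard COCO tooling. The sketch below assumes the masks follow the COCO RLE convention (a dict with "size" and "counts"); masks stored as lossless PNGs can instead be read directly as binary images.

import numpy as np
from pycocotools import mask as mask_utils  # pip install pycocotools

def decode_rle(rle):
    # Decode a COCO-style RLE dict {"size": [H, W], "counts": ...} into a boolean HxW array.
    if isinstance(rle["counts"], str):
        rle = dict(rle, counts=rle["counts"].encode("ascii"))
    return mask_utils.decode(rle).astype(bool)

def encode_mask(binary_mask):
    # Encode an HxW boolean mask back to RLE; pycocotools expects Fortran-ordered uint8.
    return mask_utils.encode(np.asfortranarray(binary_mask.astype(np.uint8)))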

A three-phase annotation protocol increases speed and quality:

  • Phase 1: Manual brushing using SAM on every frame (37.8 sec/frame).
  • Phase 2: Annotate the initial frame, propagate masks with SAM 2 Mask, then correct as needed (7.4 sec/frame).
  • Phase 3: Full SAM 2 with memory enables annotation of the first frame via click/box/mask and efficient propagation with minimal edits (4.5 sec/frame, 19% edited frames).

Quality control mandates dual-team verification. Masklets are marked "satisfactory" if objects are tracked consistently and completely; unsatisfactory tracks are refined or discarded. Object boundaries may include both full objects and parts as long as they are clearly delineated. The “Mask Alignment Score” (MAS) quantifies per-masklet consistency:

$$\text{MaskAlignment} = 100\% \cdot \frac{\left|\{\,\text{frames}: \mathrm{IoU}(M_{\text{phase X}}, M_{\text{phase 1}}) > 0.75 \,\}\right|}{\text{total frames}}$$
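
A direct implementation of this score, given frame-aligned masks of the same masklet from two annotation phases, is sketched below; treating two empty masks as a perfect match is an assumption not stated above.

import numpy as np

def iou(a, b):
    # IoU between two binary masks; two empty masks are treated as a perfect match.
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return 1.0 if union == 0 else np.logical_and(a, b).sum() / union

def mask_alignment_score(masks_phase_x, masks_phase_1, threshold=0.75):
    # Percentage of frames where the Phase X mask agrees with the Phase 1 mask at IoU > threshold.
    hits = sum(iou(mx, m1) > threshold for mx, m1 in zip(masks_phase_x, masks_phase_1))
    return 100.0 * hits / len(masks_phase_1)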

3. Data Engine Workflow and Streaming-Memory Model

SA-V’s collection pipeline uses a model-in-the-loop workflow:

  • Interactive Phases: Phase 1 (manual+SAM annotation) yields 16,000 masklets on 1,400 videos; Phase 2 (SAM+SAM 2 Mask) produces 63,500 masklets (5× speedup); Phase 3 (fully interactive SAM 2 with memory) delivers 197,000 masklets (8.4× speedup).
  • Automatic Masklets: Grid-of-clicks initialization (32×32, 16×16, 4×4 crops) prompts SAM 2 to generate masklets, which undergo verification before inclusion.
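
The grid-of-clicks initialization can be illustrated with a short sketch that places point prompts at the centers of a regular grid over a frame; handling of the 16×16 and 4×4 crop levels, and the exact spacing used in the official pipeline, are omitted here.

import numpy as np

def grid_click_prompts(height, width, points_per_side=32):
    # Return a (points_per_side**2, 2) array of (x, y) click prompts on a regular grid.
    # Points sit at cell centers so no click lands exactly on the image border.
    xs = (np.arange(points_per_side) + 0.5) / points_per_side * width
    ys = (np.arange(points_per_side) + 0.5) / points_per_side * height
    grid_x, grid_y = np.meshgrid(xs, ys)
    return np.stack([grid_x.ravel(), grid_y.ravel()], axis=1)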

The underlying model is a streaming-memory transformer:

  • For frames $t = 1 \ldots T$, image features $F_t$ are extracted by a ViT image encoder (Hiera, MAE-pretrained).
  • Memory attention produces conditioned features $E_t$ by incorporating recent spatial memories $M_{t-k}$ (a FIFO of $N = 6$ frames), object pointers (from up to $M \leq 2$ prompted frames), and cross-attention to prompts.
  • The mask decoder yields masks, occlusion confidence, and predicted IoU for each frame. The highest-IoU mask propagates forward. Object pointers are 32–64 dimensional tokens from decoder heads.
  • Prompts support direct click, box, or mask embeddings; occlusion prediction addresses visibility across temporal gaps.
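
The per-frame control flow can be summarized in pseudocode. In the sketch below, image_encoder, memory_attention, mask_decoder, and memory_encoder are placeholder callables standing in for the components described above; they are not the released SAM 2 API, and the handling of prompted-frame memories is simplified.

from collections import deque

def propagate_masklet(frames, prompts, model, num_recent=6):
    # Sketch of the streaming-memory propagation loop. prompts maps
    # frame index -> click/box/mask prompt for frames the annotator interacted with.
    recent_memories = deque(maxlen=num_recent)        # FIFO spatial memories M_{t-k}, N = 6
    prompted_memories, object_pointers = [], []
    outputs = []

    for t, frame in enumerate(frames):
        feats = model.image_encoder(frame)            # F_t (Hiera/ViT features)
        conditioned = model.memory_attention(         # E_t: cross-attend to the memory bank
            feats, list(recent_memories) + prompted_memories, object_pointers
        )
        masks, ious, occluded, pointer = model.mask_decoder(conditioned, prompt=prompts.get(t))
        best = max(range(len(ious)), key=lambda i: ious[i])   # highest predicted IoU propagates
        outputs.append({"mask": masks[best], "occluded": occluded, "iou": ious[best]})

        memory = model.memory_encoder(feats, masks[best])
        if t in prompts:
            prompted_memories.append(memory)          # memories of prompted frames are retained
        else:
            recent_memories.append(memory)            # recent unprompted frames rotate out (FIFO)
        object_pointers.append(pointer)               # lightweight per-frame object tokens

    return outputs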

Key scoring formulas mirror standard segmentation metrics:

  • Mean IoU for images: $\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i \cap G_i|}{|P_i \cup G_i|}$
  • Mask Alignment (see above)

Automatic masks undergo filtering (components <200 pixels removed, holes <200 pixels filled) for spatial coherence.
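
A minimal sketch of this filtering step, using connected-component labeling from SciPy, is shown below; treating every small background component as a fillable hole is a simplification of the actual pipeline.

import numpy as np
from scipy import ndimage

def clean_mask(mask, min_area=200):
    # Remove foreground components smaller than min_area pixels, then fill
    # background components (holes) smaller than min_area pixels.
    mask = mask.astype(bool)

    labels, num = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, num + 1))
    keep = np.isin(labels, np.arange(1, num + 1)[sizes >= min_area])

    inv_labels, inv_num = ndimage.label(~keep)
    inv_sizes = ndimage.sum(~keep, inv_labels, index=np.arange(1, inv_num + 1))
    holes = np.isin(inv_labels, np.arange(1, inv_num + 1)[inv_sizes < min_area])
    return keep | holes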

4. Dataset Splits and Demographics

SA-V is partitioned as follows:

  • Training: ~50,600 videos, all masklets minus val/test samples.
  • Validation: 155 challenging videos, 293 manually verified masklets (sampled at 6 FPS).
  • Test: 150 videos, 278 masklets.

Splits are determined by author and geography to prevent near-duplication. Annotator demographics: 274 self-reported male, 236 female; ages 18–24 (109), 25–40 (305), 41–64 (88). Fairness evaluation using Ego-Exo4D reveals a gap of less than 1 point in $\mathcal{J}$ (IoU) scores across gender and age for 3-click and mask prompts. Object categories are open-world; annotation is not constrained by a predefined taxonomy.

5. File Formats and Organization

SA-V maintains a rigorous directory and file structure:

Root/
  videos/
    {video_id}/
      video.mp4
      frames/{000001.jpg, ...}
  annotations/
    masklets_train.json
    masklets_val.json
    masklets_test.json
    masklet_schema.json
  metadata/
    videos.csv  # video_id, fps, resolution, length_sec, location
    annot_stats.csv  # masklet_id, type, num_frames, avg_mask_area
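
Assuming the layout above, the annotations for a split can be loaded and joined to their video directories with a few lines of Python; the handling of a top-level "annotations" key is a guess at the COCO-style nesting described below, so adjust to the local copy of the files.

import json
from pathlib import Path

def load_split(root, split="train"):
    # Read masklets_{split}.json and resolve each record's video and frame paths.
    root = Path(root)
    with open(root / "annotations" / f"masklets_{split}.json") as f:
        data = json.load(f)
    masklets = data["annotations"] if isinstance(data, dict) else data

    records = []
    for m in masklets:
        video_dir = root / "videos" / str(m["video_id"])
        records.append({
            "masklet_id": m["masklet_id"],
            "video_file": video_dir / "video.mp4",
            "frame_dir": video_dir / "frames",
            "annotation": m,
        })
    return records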

Each masklets_*.json is COCO-style, with the following fields:

Field        Type/Format      Description
masklet_id   string/int       Unique masklet track identifier
video_id     string/int       Video key
frames       list             Frame indices or names
masks        list (RLE/PNG)   Dense per-frame binary masks
prompts      list             Each: frame, list of clicks (x, y, ±), type
quality      string           'verified' or 'unsatisfactory'

Supporting metadata links masklets to video sources and annotator statistics.
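
For downstream code, it can help to give these records a typed view. The dataclass below simply mirrors the fields in the table; it is an illustrative convenience, not part of the released schema.

from dataclasses import dataclass
from typing import Any, List, Union

@dataclass
class Masklet:
    # One record from masklets_*.json, mirroring the field table above.
    masklet_id: Union[str, int]
    video_id: Union[str, int]
    frames: List[Any]        # frame indices or file names
    masks: List[Any]         # per-frame RLE dicts or PNG references
    prompts: List[dict]      # each: frame, list of clicks (x, y, +/-), prompt type
    quality: str             # 'verified' or 'unsatisfactory'

    @classmethod
    def from_json(cls, record: dict) -> "Masklet":
        keys = ("masklet_id", "video_id", "frames", "masks", "prompts", "quality")
        return cls(**{k: record[k] for k in keys})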

6. Licensing, Release, and Benchmarking

SA-V is distributed under the Creative Commons Attribution 4.0 (CC BY 4.0) license. All model code and weights for SAM 2 are released under Apache 2.0. The dataset, codebase, and an interactive demo are publicly available.

Benchmarking is comprehensive:

  • Zero-shot promptable segmentation: 9 video datasets (e.g., EndoVis, VIPSeg), 3-click protocol, reporting $\mathcal{J}$ (region IoU) and $\mathcal{F}$ (contour accuracy) per DAVIS standards.
  • Semi-supervised video object segmentation (VOS): 17 datasets, prompts by click/box/mask; reporting $\mathcal{J}$ and $\mathcal{F}$.
  • DAVIS interactive benchmark: scribble/click prompts; area under the curve (AUC) and $\mathcal{J}\&\mathcal{F}$ at 60 seconds.
  • Zero-shot static images: 37 datasets, 1-click and 5-click mIoU.
  • Sample metrics: $\text{mIoU} = \frac{1}{N}\sum_i |P_i \cap G_i| / |P_i \cup G_i|$, with $\mathcal{J}$ and $\mathcal{F}$ as defined by the standard benchmarks.
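
The click-based protocols can be simulated offline by placing each correction click at the most interior point of the largest error region between the prediction and the ground truth. The exact click-placement rule used in the official evaluation is not specified here, so the sketch below should be read as one common convention rather than the reference implementation.

import numpy as np
from scipy import ndimage

def next_click(pred, gt):
    # Choose the next correction click for an interactive (e.g., 3-click) protocol:
    # take the larger of the false-negative / false-positive regions and click its
    # most interior point. Returns ((x, y), is_positive), or None if pred matches gt.
    pred, gt = pred.astype(bool), gt.astype(bool)
    false_neg = gt & ~pred
    false_pos = pred & ~gt
    error = false_neg if false_neg.sum() >= false_pos.sum() else false_pos
    if not error.any():
        return None
    dist = ndimage.distance_transform_edt(error)    # distance to the error-region boundary
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(x), int(y)), bool(false_neg[y, x])  # positive click if the pixel was missed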

SA-V thus provides a modular foundation for open-world segmentation at massive scale, with verified per-frame masks and transparent quality control. The prompt-driven model and unconstrained dataset are directly applicable to research in interactive segmentation, open-world detection, and downstream tasks such as video understanding, object tracking, and scene analysis.
