
SA-Co Benchmark: Open-Vocabulary Concept Segmentation

Updated 24 November 2025
  • The benchmark introduces SA-Co, a large-scale framework that evaluates promptable concept segmentation using noun phrases, exemplars, and mixed prompts in both images and videos.
  • SA-Co employs a multi-phase data construction and annotation pipeline with rigorous protocols, delivering evaluations based on advanced metrics like cgF1, pHOTA, and IoU.
  • The benchmark advances computer vision research by enabling zero-shot, few-shot, and synthetic domain adaptation evaluations, demonstrating significant performance gains over previous baselines.

Segment Anything with Concepts (SA-Co) is a large-scale, open-vocabulary benchmark for Promptable Concept Segmentation (PCS) in images and videos, introduced in conjunction with SAM 3 ("Segment Anything Model 3"). PCS tasks require models to detect, segment, and track all instances of user-specified concepts, defined via short noun phrases, image exemplars, or their combination, assigning consistent identities to these instances across frames. SA-Co enables rigorous evaluation of open-vocabulary recognition, instance segmentation quality, and tracking accuracy, supporting noun-phrase, exemplar, and mixed prompt formats.

1. Benchmark Definition and Scope

SA-Co is designed to quantify model performance on PCS tasks, where the input prompt can be a noun phrase (e.g., "striped cat"), an image exemplar (positive/negative crop), or both. The benchmark mandates segmentation of every matching instance in the provided media and unique ID assignment for temporal consistency in video. Supported objectives comprise:

  • Open-vocabulary recognition (presence detection)
  • Instance segmentation boundary quality
  • Instance tracking accuracy in video settings

SA-Co supports three prompt types: (1) noun phrase, (2) exemplar, and (3) mixed; a minimal sketch of this query structure follows.
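To make the three prompt formats concrete, here is a minimal sketch of how a PCS query could be represented. The class and field names are illustrative only and are not part of the released benchmark schema.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ExemplarPrompt:
    box: Tuple[float, float, float, float]   # exemplar crop as (x0, y0, x1, y1) pixel coordinates
    is_positive: bool = True                 # negative exemplars mark what should NOT be segmented

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                        # e.g. "striped cat"
    exemplars: List[ExemplarPrompt] = field(default_factory=list)

    def kind(self) -> str:
        # Classify the query into the three supported prompt types
        if self.noun_phrase and self.exemplars:
            return "mixed"
        return "noun_phrase" if self.noun_phrase else "exemplar"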

2. Dataset Construction and Statistics

The SA-Co dataset is produced via a multi-phase, scalable data engine pipeline (a schematic sketch of the mine-verify-correct loop follows the phase list):

  • Phase 1: Mining image-caption-derived noun phrases → mask proposals (SAM 2 + OWLv2) → human Mask Verification (MV) and Exhaustivity Verification (EV) → manual corrections, yielding the HQ image set (4.3M image-NPs).
  • Phase 2: Replacement of human MV/EV by Llama-based AI verifiers; introduction of hard negative generator (ontology + MLLM sourcing) filtered by spurious mask triggers, adding 122M image-NPs.
  • Phase 3: Expansion to 15 visual domains, adoption of a fully AI synthetic verification mode ("SYN"), contributing 19.5M image-NPs.
  • Phase 4 (video): Scene/motion-based mining, SAM 3 pseudo-masks, and manual annotation to form 52.5K videos, 134K video-NP pairs, and 467K masklets.
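For illustration, the mine → propose → verify → correct loop underlying Phases 1–3 can be sketched as below. Every callable is a placeholder supplied by the caller, not part of the released pipeline; the real data engine additionally performs hard-negative generation, domain expansion, and video-specific mining.

from typing import Any, Callable, Dict, Iterable, List

def image_np_engine(
    images: Iterable[Any],
    mine_noun_phrases: Callable[[str], List[str]],                 # caption -> candidate NPs
    propose_masks: Callable[[Any, str], List[Any]],                # e.g. SAM 2 + OWLv2 proposals
    mask_verifier: Callable[[Any, str, Any], bool],                # MV: human (Phase 1) or Llama-based AI (Phase 2+)
    exhaustivity_verifier: Callable[[Any, str, List[Any]], bool],  # EV: are all instances covered?
    correct: Callable[[Any, str, List[Any]], List[Any]],           # manual correction step
) -> List[Dict[str, Any]]:
    """Hypothetical outline of the mine -> propose -> MV -> EV -> correct loop."""
    records = []
    for image in images:
        for phrase in mine_noun_phrases(image.caption):
            # Keep only mask proposals that pass Mask Verification
            masks = [m for m in propose_masks(image, phrase)
                     if mask_verifier(image, phrase, m)]
            # If coverage is not exhaustive, fall back to manual correction
            if not exhaustivity_verifier(image, phrase, masks):
                masks = correct(image, phrase, masks)
            records.append({"image_id": image.id, "phrase": phrase, "masks": masks})
    return records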

Final splits are:

| Split | Images/Videos | Concept Labels  | Masks |
|-------|---------------|-----------------|-------|
| HQ    | 5.2M images   | 4M unique NPs   | 52M   |
| SYN   | 39M images    | 1.7B image-NPs  | 1.4B  |
| EXT   | 9.3M images   | 15 datasets     | 70M   |
| VID   | 52.5K videos  | 24.8K NPs       | 467K  |

Evaluation subsets encompass ~207K unique NPs, 121K media items, and 3M media-NP pairs, with specific splits for images (Gold, Silver, Bronze, Bio) and videos (SA-V, YT-Temporal-1B, SmartGlasses). Ontologically, 22.4M Wikidata nodes are mapped to 17 top-level categories (e.g., animals, humans, tools), with split distributions documented in the SAM 3 paper.

3. Annotation Procedures and Evaluation Protocols

Image Annotation

  • Mask Verification (MV): Accept/reject candidate triplet (image, NP, mask).
  • Exhaustivity Verification (EV): Verify that accepted masks exhaustively cover all instances.
  • Correction: Manual refinement for exhaustive mask annotation.
  • Gold splits are labeled by three independent experts to estimate human upper-bound accuracy.

Video Annotation

  • Detection per frame, propagation through SAM 3 tracker, and matching of new detections.
  • Masklets that remain unconfirmed for a delay of T=15 frames are discarded based on the Masklet Detection Score (MDS); a sketch of this bookkeeping follows the list.
  • Masklet suppression is performed if MDS falls below zero over the lifespan.
  • Tracker re-anchoring is enforced at regular intervals (N=16 frames).
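The confirmation and suppression rules above can be sketched as per-masklet bookkeeping. Note that the exact MDS definition in SAM 3 may differ from the matched-minus-unmatched frame count assumed here; this is only an illustration of the described behavior.

from dataclasses import dataclass

@dataclass
class MaskletState:
    age: int = 0            # frames since the masklet was spawned
    mds: int = 0            # running detection-support score (assumed: matched minus unmatched frames)
    confirmed: bool = False
    alive: bool = True

    def update(self, matched_detection: bool, confirm_delay: int = 15) -> None:
        self.age += 1
        self.mds += 1 if matched_detection else -1
        self.confirmed = self.confirmed or matched_detection
        if not self.confirmed and self.age >= confirm_delay:
            self.alive = False   # discard: never confirmed within the delay window
        elif self.mds < 0:
            self.alive = False   # suppress: lifetime MDS dropped below zero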

Metrics

  • Localization: micro-F1 averaged over IoU thresholds \tau = 0.5, 0.55, \ldots, 0.95:

\text{pmF}_1 = \frac{1}{10} \sum_{\tau=0.5:0.05:0.95} \frac{2\,TP^\tau}{2\,TP^\tau + FP^\tau + FN^\tau}

\text{IoU}(m, \hat{g}) = \frac{|m \wedge \hat{g}|}{|m \vee \hat{g}|}

  • Image-level classification: Matthews correlation coefficient (IL_MCC)

\text{IL\_MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}

  • Combined: \text{cgF}_1 = 100 \times \text{pmF}_1 \times \text{IL\_MCC} (a minimal computation sketch follows this list)
  • Video: cgF_1, volume-IoU, plus pHOTA (HOTA computed over video-NP pairs) and TETA
  • Standard benchmarks: AP for box/mask detection (COCO, LVIS), MAE/accuracy for counting, J/G for VOS
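A minimal computation sketch of the image metrics (pmF_1, IL_MCC, cgF_1), assuming the per-threshold mask-matching counts and image-level presence counts have already been produced; the scripts in the SAM 3 repository are the reference implementation.

import math

def micro_f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pm_f1(counts_per_threshold) -> float:
    """counts_per_threshold: list of (TP, FP, FN) for tau = 0.5, 0.55, ..., 0.95."""
    return sum(micro_f1(*c) for c in counts_per_threshold) / len(counts_per_threshold)

def il_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cg_f1(counts_per_threshold, tp: int, tn: int, fp: int, fn: int) -> float:
    """Combined metric: 100 x pmF1 x IL_MCC."""
    return 100.0 * pm_f1(counts_per_threshold) * il_mcc(tp, tn, fp, fn)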

Split Protocols

  • Image QA splits are zero-shot (no training overlap).
  • Few-shot adaptation possible (1–10 examples) for select datasets.
  • Video evaluation is performed without fine-tuning on evaluation sets.

4. Comparative Baseline Performance and Ablation Studies

Image-Level

  • On SA-Co Gold: the best prior baseline achieves cgF_1 ≈ 24.6, while SAM 3 reaches cgF_1 = 54.1 (more than a 2× increase, roughly 74% of human performance).
  • Similar performance gains on Silver/Bronze/Bio splits.
  • Exemplar prompts on COCO/LVIS/ODinW13: the best prior method (T-Rex2) scores 58.5, while SAM 3 with exemplar prompts scores 76.8 (+18.3).

Video-Level

  • Text-prompted tracking (SA-V, YT-Temporal, SmartGlasses): SAM 3 reaches cgF_1 of 30–50% (more than 3× the best baseline).
  • On LVVIS, BURST, YTVIS, and OVIS: SAM 3 achieves 36.3 mAP vs. 20.8 for the best baseline (a ~75% relative improvement).

Other Visual Segmentation Tasks

  • VOS (MOSEv2): SAM 3 reaches J = 83.5 vs. a prior best of J = 75.3 (an improvement of 8.2 points).
  • Interactive image segmentation: 5-click mIoU = 85.1 for SAM 3 vs. 84.3 for SAM 2.1.

Counting Tasks

  • CountBench/PixMo-Count: SAM 3 with IoM-NMS achieves MAE = 0.12 and accuracy = 93.8%, vs. a prior best accuracy of 92.4%; a generic IoM-NMS sketch follows.
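A generic sketch of intersection-over-minimum non-maximum suppression for axis-aligned boxes is shown below; the threshold value is illustrative, and SAM 3's actual implementation (e.g., operating on masks rather than boxes) may differ.

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iom(a, b):
    # Intersection-over-minimum for (x0, y0, x1, y1) boxes
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    smaller = min(box_area(a), box_area(b))
    return inter / smaller if smaller > 0 else 0.0

def iom_nms(boxes, scores, thresh=0.7):
    # Keep highest-scoring boxes; drop any box that mostly overlaps an already kept one
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iom(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep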

Ablations

  • The presence head increases cgF_1 by +1.5 (IL_MCC: 0.77 → 0.82).
  • Hard negatives unlock +15 cgF_1 (IL_MCC: 0.44 → 0.68).
  • Data subset analysis: training on EXT+SYN+HQ yields cgF_1 = 47.4; without HQ, 23.7; with SYN only, 32.8.
  • AI verifier improvements: EV AI +7.2, MV AI +1.1, closing roughly half the gap to human annotation quality.

Domain Adaptation

Holding out the "Food&Drink" domain and fine-tuning on synthetic data only (SYN-Food) matches HQ-Food performance at large scale without human annotation.

5. Data Access and Benchmark Usage

Annotations and evaluation code are publicly available at https://github.com/facebookresearch/sam3. The annotation format is federated JSON with fields media_id, phrase, masks, and hard_negatives.
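For orientation, a single annotation record might look like the following sketch; the field names match the description above, but the value formats (e.g., polygon vs. RLE masks) are assumptions, so consult the repository documentation for the exact schema.

import json

record = {
    "media_id": "sa_co_000123",          # illustrative identifier
    "phrase": "striped cat",             # noun-phrase concept label
    "masks": [
        {"segmentation": [[412.0, 118.5, 430.2, 120.0, 428.7, 139.2]], "area": 5321}
    ],
    "hard_negatives": ["tabby dog", "striped towel"],
}
print(json.dumps(record, indent=2))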

Benchmarking Process:

  1. Run model on each (media, phrase) pair, outputting mask polygons and confidence scores.
  2. Pack results into the SAM 3 prediction JSON format (a hypothetical packing sketch follows this list).
  3. Evaluate results using provided scripts:
    python eval_image.py --pred sampreds.json --gt saco_gold.json --metric cgF1 --threshold 0.5
    python eval_video.py --pred vidpreds.json --gt sacovideo.json --metrics cgF1 pHOTA TETA --iou_thresh 0.5
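A hypothetical sketch of step 2, packing raw model outputs into a prediction JSON; the field names are illustrative, and the exact prediction schema consumed by eval_image.py is defined in the repository.

import json

def pack_predictions(results, out_path="sampreds.json"):
    # results: iterable of (media_id, phrase, polygon, score) tuples produced in step 1
    preds = [
        {"media_id": media_id, "phrase": phrase, "segmentation": polygon, "score": float(score)}
        for media_id, phrase, polygon, score in results
    ]
    with open(out_path, "w") as f:
        json.dump(preds, f)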

Video tracking pseudocode:

prev_masklets = []
for t, frame in enumerate(video):
    dets = Detector(frame, prompt)                    # per-frame detections for the prompted concept
    props = Tracker.propagate(prev_masklets)          # propagate existing masklets with the SAM 3 tracker
    matched, unmatched = match(props, dets, iou_thresh=0.3)
    masklets = suppress_unconfirmed(matched) + new_masklets(unmatched)
    if t % 16 == 0:                                   # re-anchor the tracker every N=16 frames
        masklets = re_prompt(masklets, dets)
    output_after_delay(masklets, delay=15)            # emit only masklets confirmed after T=15 frames
    prev_masklets = masklets

6. Category Coverage and Ontological Structure

SA-Co employs an ontology comprising 22.4M Wikidata nodes, collapsed to 17 top-level categories, ensuring broad concept coverage. The training distribution spans all domains, including fine-grained and long-tail NPs, and evaluation subsets span external domains and medical data.

A plausible implication is that categorical breadth and fine-grained prompt coverage synergistically support robust open-vocabulary evaluation and realistic assessment of semantic generalization across domains.

7. Significance and Prospects

SA-Co establishes a standardized, rigorous, and scalable framework for evaluating promptable concept segmentation in both images and videos, with extensive coverage of object categories and ontological granularity. By supporting zero-shot, few-shot, and synthetic domain adaptation evaluation, the benchmark enables assessment of PCS model generality and performance ceiling relative to human-level segmentation and recognition. The methodological advances in dataset generation, annotation protocols, metric design, and open-source tooling collectively position SA-Co as a central resource for advancing open-vocabulary segmentation research in computer vision (Carion et al., 20 Nov 2025).
