
SA-Co Benchmark: Open-Vocabulary Concept Segmentation

Updated 24 November 2025
  • SA-Co is a large-scale benchmark that evaluates promptable concept segmentation with noun-phrase, exemplar, and mixed prompts in both images and videos.
  • It is built with a multi-phase data construction and annotation pipeline and is scored with metrics such as cgF1, pHOTA, and IoU.
  • It supports zero-shot, few-shot, and synthetic domain adaptation evaluations, on which SAM 3 shows large gains over previous baselines.

Segment Anything with Concepts (SA-Co) is a large-scale, open-vocabulary benchmark for Promptable Concept Segmentation (PCS) in images and videos, introduced in conjunction with SAM 3 ("Segment Anything Model 3"). PCS tasks require models to detect, segment, and track all instances of user-specified concepts, defined via short noun phrases, image exemplars, or their combination, assigning consistent identities to these instances across frames. SA-Co enables rigorous evaluation of open-vocabulary recognition, instance segmentation quality, and tracking accuracy, supporting noun-phrase, exemplar, and mixed prompt formats.

1. Benchmark Definition and Scope

SA-Co is designed to quantify model performance on PCS tasks, where the input prompt can be a noun phrase (e.g., "striped cat"), an image exemplar (positive/negative crop), or both. The benchmark mandates segmentation of every matching instance in the provided media and unique ID assignment for temporal consistency in video. Supported objectives comprise:

  • Open-vocabulary recognition (presence detection)
  • Instance segmentation boundary quality
  • Instance tracking accuracy in video settings

SA-Co supports three prompt types: (1) noun phrase, (2) exemplar, and (3) mixed (noun phrase plus exemplars).
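The three prompt types can be pictured as a small data structure. The sketch below is illustrative only: the class names `Exemplar` and `ConceptPrompt`, the box representation of an exemplar crop, and the `kind()` helper are assumptions made for this example, not part of the SA-Co or SAM 3 code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration; not the SA-Co / SAM 3 API.
Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) crop in pixels


@dataclass
class Exemplar:
    box: Box
    positive: bool = True  # False marks a negative exemplar


@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None
    exemplars: List[Exemplar] = field(default_factory=list)

    def kind(self) -> str:
        """Classify the prompt as noun_phrase, exemplar, or mixed."""
        has_text = bool(self.noun_phrase)
        has_exemplars = bool(self.exemplars)
        if has_text and has_exemplars:
            return "mixed"
        if has_text:
            return "noun_phrase"
        if has_exemplars:
            return "exemplar"
        raise ValueError("a prompt needs a noun phrase, an exemplar, or both")


prompt = ConceptPrompt(noun_phrase="striped cat",
                       exemplars=[Exemplar((10, 20, 80, 120), positive=False)])
print(prompt.kind())
```

Keeping the noun phrase optional and the exemplar list possibly empty lets one type cover all three cases, with an explicit error for an empty prompt.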

2. Dataset Construction and Statistics

The SA-Co dataset is produced via a multi-phase, scalable data engine pipeline:

  • Phase 1: Mining image-caption-derived noun phrases → mask proposals (SAM 2 + OWLv2) → human Mask Verification (MV) and Exhaustivity Verification (EV) → manual corrections, yielding the HQ image set (4.3M image-NPs).
  • Phase 2: Replacement of human MV/EV by Llama-based AI verifiers; introduction of hard negative generator (ontology + MLLM sourcing) filtered by spurious mask triggers, adding 122M image-NPs.
  • Phase 3: Expansion to 15 visual domains, adoption of a fully AI synthetic verification mode ("SYN"), contributing 19.5M image-NPs.
  • Phase 4 (video): Scene/motion-based mining, SAM 3 pseudo-masks, and manual annotation to form 52.5K videos, 134K video-NP pairs, and 467K masklets.

Final splits are:

Split | Images/Videos | Concept Labels | Masks
HQ    | 5.2M images   | 4M unique NPs  | 52M
SYN   | 39M images    | 1.7B image-NPs | 1.4B
EXT   | 9.3M images   | 15 datasets    | 70M
VID   | 52.5K videos  | 24.8K NPs      | 467K

Evaluation subsets encompass ~207K unique NPs, 121K media, and 3M media-NPs, with specific splits for images (Gold, Silver, Bronze, Bio) and videos (SA-V, YT-Temporal-1B, SmartGlasses). Ontologically, 22.4M Wikidata nodes are mapped to 17 top-level categories (e.g., animals, human, tools), with split distributions documented in SAM 3.

3. Annotation Procedures and Evaluation Protocols

Image Annotation

  • Mask Verification (MV): Accept/reject candidate triplet (image, NP, mask).
  • Exhaustivity Verification (EV): Verify that accepted masks exhaustively cover all instances.
  • Correction: Manual refinement for exhaustive mask annotation.
  • Gold splits are labeled by three independent experts to estimate human upper-bound accuracy.

Video Annotation

  • Detection per frame, propagation through SAM 3 tracker, and matching of new detections.
  • Masklets that remain unconfirmed after a delay of T = 15 frames are discarded, based on their Masklet Detection Score (MDS).
  • A masklet is suppressed if its MDS falls below zero over its lifespan.
  • Tracker re-anchoring is enforced at regular intervals (N=16 frames).
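The masklet rules above can be sketched as follows. The text does not define how MDS is computed, so this example assumes a running score of +1 for each frame where the masklet matches a detection and -1 otherwise; that definition, the function names, and the exact reading of the confirmation rule are assumptions for illustration, not the SAM 3 implementation.

```python
from typing import List

CONFIRM_DELAY = 15    # T in the text
REANCHOR_EVERY = 16   # N in the text


def masklet_detection_score(matched: List[bool]) -> int:
    """Assumed MDS: +1 per frame matched to a detection, -1 per unmatched frame."""
    return sum(1 if m else -1 for m in matched)


def keep_masklet(matched: List[bool]) -> bool:
    """Apply both filtering rules to one masklet's per-frame match history."""
    # Assumed reading of the delay rule: once T frames have passed, a masklet
    # whose score over those first T frames is negative counts as unconfirmed.
    if len(matched) >= CONFIRM_DELAY:
        if masklet_detection_score(matched[:CONFIRM_DELAY]) < 0:
            return False
    # Suppression rule: drop the masklet if its score over the lifespan is below zero.
    return masklet_detection_score(matched) >= 0


def reanchor_frames(num_frames: int) -> List[int]:
    """Frame indices at which the tracker would be re-anchored (every N frames)."""
    return list(range(REANCHOR_EVERY, num_frames, REANCHOR_EVERY))


print(keep_masklet([True] * 20))
print(reanchor_frames(40))
```

Under these assumptions, a masklet that is mostly unmatched early on is dropped at the confirmation check, while one that starts well but later loses its detections is removed by the lifespan rule.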

Metrics

  • Localization: Micro-F1 averaged across IoU thresholds $\tau = 0.5, 0.55, \dots, 0.95$

$$\text{pmF}_1 = \frac{1}{10} \sum_{\tau \in \{0.5, 0.55, \dots, 0.95\}} \frac{2\,TP^\tau}{2\,TP^\tau + FP^\tau + FN^\tau}$$

$$\text{IoU}(m, \hat{g}) = \frac{|m \wedge \hat{g}|}{|m \vee \hat{g}|}$$

  • Image-level Classification: Matthews Correlation Coefficient (IL_MCC)

$$\text{IL\_MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

  • Combined: $\text{cgF}_1 = 100 \times \text{pmF}_1 \times \text{IL\_MCC}$
  • Video: cgF₁, volume-IoU, plus pHOTA (HOTA on video-NP pairs), TETA
  • Standard benchmarks: AP for box/mask detection (COCO, LVIS), MAE/Accuracy for counting, J/G for VOS.
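The formulas above translate directly into code. The following is a minimal sketch using only the standard library; the function names, the representation of a mask as a set of pixel coordinates, and the convention of returning 0 when a denominator is zero are assumptions for this example and may differ from the official evaluation scripts.

```python
import math
from typing import Sequence, Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(m: Set[Pixel], g: Set[Pixel]) -> float:
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(m | g)
    return len(m & g) / union if union else 0.0


def pm_f1(counts_per_threshold: Sequence[Tuple[int, int, int]]) -> float:
    """Mean micro-F1; each entry is (TP, FP, FN) at one IoU threshold tau."""
    scores = []
    for tp, fp, fn in counts_per_threshold:
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def il_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Image-level Matthews correlation coefficient from presence decisions."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cg_f1(pm: float, mcc: float) -> float:
    """Combined score: 100 * pmF1 * IL_MCC."""
    return 100.0 * pm * mcc


# Toy numbers: the same (TP, FP, FN) at all ten thresholds, for simplicity.
pm = pm_f1([(8, 2, 2)] * 10)
mcc = il_mcc(tp=50, tn=40, fp=5, fn=5)
print(round(cg_f1(pm, mcc), 2))
```

Because cgF1 is a product, a model with precise masks but poor presence decisions (low IL_MCC) is penalized as much as one with good presence decisions and poor localization.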

Split Protocols

  • Image QA splits are zero-shot (no training overlap).
  • Few-shot adaptation possible (1–10 examples) for select datasets.
  • Video evaluation is performed without fine-tuning on evaluation sets.

4. Comparative Baseline Performance and Ablation Studies

Image-Level

  • On SA-Co Gold: the best prior baseline achieves $\text{cgF}_1 \approx 24.6$, while SAM 3 achieves $\text{cgF}_1 = 54.1$ (more than a 2× increase, $\sim$74% of human performance).
  • Similar performance gains on Silver/Bronze/Bio splits.
  • Exemplar prompts on COCO/LVIS/ODinW13: T-Rex2 (best prior) scores 58.5; SAM 3 with exemplars scores 76.8 (+18.3).
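As a quick arithmetic check on the image-level figures quoted above (24.6 and 54.1 on Gold, 58.5 and 76.8 for exemplar prompts), the ratios work out as follows. The implied human score is only an estimate derived here from the rounded 74% figure; it is not a value stated in the text.

$$\frac{54.1}{24.6} \approx 2.20, \qquad \frac{54.1}{0.74} \approx 73.1, \qquad 76.8 - 58.5 = 18.3$$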

Video-Level

  • Text-prompted tracking (SA-V, YT-Temporal, SmartGlasses): SAM 3 […] (> […] baseline).
  • On LVVIS, BURST, YTVIS, OVIS: SAM 3 […] mAP vs. […] for baselines (+75% relative).

Other Visual Segmentation Tasks

  • VOS (MOSEv2): SAM 3 […] vs. prior best […] ([…] improvement).
  • Interactive image segmentation: 5-click […] (SAM 3) vs. […] (SAM 2.1).

Counting Tasks

  • CountBench/PixMo-Count: SAM 3 with IoM-NMS, MAE […], accuracy […]; prior best accuracy […].

Ablations

  • Presence head increases […] by […] ([…]).
  • Hard negatives unlock […] (IL […]).
  • Data subset analysis: EXT+SYN+HQ […]; without HQ […]; with SYN only […].
  • AI verifier improvements: EV AI […], MV AI […], closing half the gap to human annotation.

Domain Adaptation

Holding out the "Food&Drink" domain and fine-tuning only on synthetic data (SYN-Food) matches HQ-Food performance at large scale, without any human annotation.

5. Data Access and Benchmark Usage

Annotations and evaluation code are publicly available at https://github.com/facebookresearch/sam3. Annotation format is federated JSON containing media_id, phrase, masks, and hard_negatives.

Benchmarking Process:

  1. Run model on each (media, phrase) pair, outputting mask polygons and confidence scores.
  2. Pack results into SAM 3 JSON format.
  3. Evaluate results using the provided scripts: […]
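Steps 1 and 2 can be sketched as below. The keys `media_id`, `phrase`, and `masks` come from the annotation fields named above; the nested `polygon` and `score` keys, the helper names, and the overall layout of a prediction record are assumptions for illustration. The actual prediction schema is the one defined in the repository.

```python
import json
from typing import Any, Dict, List, Sequence

Polygon = List[List[float]]  # assumed: list of [x, y] vertices


def make_prediction(media_id: str, phrase: str,
                    polygons: Sequence[Polygon],
                    scores: Sequence[float]) -> Dict[str, Any]:
    """Bundle the model output for one (media, phrase) pair (hypothetical layout)."""
    if len(polygons) != len(scores):
        raise ValueError("need exactly one confidence score per mask polygon")
    return {
        "media_id": media_id,
        "phrase": phrase,
        "masks": [{"polygon": p, "score": s} for p, s in zip(polygons, scores)],
    }


def dump_predictions(records: List[Dict[str, Any]]) -> str:
    """Serialize all records to a JSON string."""
    return json.dumps(records, indent=2)


record = make_prediction(
    "img_0001", "striped cat",
    polygons=[[[0.0, 0.0], [4.0, 0.0], [4.0, 3.0]]],
    scores=[0.91],
)
print(dump_predictions([record]))
```

A pair with no predicted instances would simply carry an empty `masks` list, which is how a model would express "concept not present" under this assumed layout.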

Video tracking pseudocode: […]

6. Category Coverage and Ontological Structure

SA-Co employs an ontology comprising 22.4M Wikidata nodes, collapsed to 17 top-level categories, ensuring broad concept coverage. The training distribution spans all domains, including fine-grained and long-tail NPs, and evaluation subsets span external domains and medical data.

A plausible implication is that broad category coverage, combined with fine-grained prompts, supports robust open-vocabulary evaluation and a realistic assessment of semantic generalization across domains.

7. Significance and Prospects

SA-Co establishes a standardized, rigorous, and scalable framework for evaluating promptable concept segmentation in both images and videos, with extensive coverage of object categories and ontological granularity. By supporting zero-shot, few-shot, and synthetic domain adaptation evaluation, the benchmark enables assessment of PCS model generality and performance ceiling relative to human-level segmentation and recognition. The methodological advances in dataset generation, annotation protocols, metric design, and open-source tooling collectively position SA-Co as a central resource for advancing open-vocabulary segmentation research in computer vision (Carion et al., 20 Nov 2025).
