SA-Co Benchmark: Open-Vocabulary Concept Segmentation
- SA-Co is a large-scale benchmark for promptable concept segmentation, evaluating models prompted with noun phrases, image exemplars, and mixed prompts in both images and videos.
- The dataset is built with a multi-phase data construction and annotation pipeline under rigorous protocols, and evaluation rests on metrics such as cgF1, pHOTA, and IoU.
- The benchmark supports zero-shot, few-shot, and synthetic domain adaptation evaluations, on which SAM 3 demonstrates large gains over prior baselines.
Segment Anything with Concepts (SA-Co) is a large-scale, open-vocabulary benchmark for Promptable Concept Segmentation (PCS) in images and videos, introduced in conjunction with SAM 3 ("Segment Anything Model 3"). PCS tasks require models to detect, segment, and track all instances of user-specified concepts, defined via short noun phrases, image exemplars, or their combination, assigning consistent identities to these instances across frames. SA-Co enables rigorous evaluation of open-vocabulary recognition, instance segmentation quality, and tracking accuracy, supporting noun-phrase, exemplar, and mixed prompt formats.
1. Benchmark Definition and Scope
SA-Co is designed to quantify model performance on PCS tasks, where the input prompt can be a noun phrase (e.g., "striped cat"), an image exemplar (positive/negative crop), or both. The benchmark mandates segmentation of every matching instance in the provided media and unique ID assignment for temporal consistency in video. Supported objectives comprise:
- Open-vocabulary recognition (presence detection)
- Instance segmentation boundary quality
- Instance tracking accuracy in video settings
SA-Co supports prompt types: (1) noun phrase, (2) exemplar, (3) mixed.
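As a concrete illustration, the three prompt types can be written as simple records. The field names below are hypothetical and only sketch the idea; they are not the official SA-Co schema.

```python
# Hypothetical prompt records for the three PCS prompt types (not the official SA-Co schema).
noun_phrase_prompt = {"type": "noun_phrase", "phrase": "striped cat"}

exemplar_prompt = {
    "type": "exemplar",
    "positives": [{"image_id": "img_001", "box": [120, 80, 310, 260]}],  # crops showing the concept
    "negatives": [{"image_id": "img_001", "box": [400, 50, 520, 180]}],  # crops that must NOT match
}

mixed_prompt = {
    "type": "mixed",
    "phrase": "striped cat",
    "positives": [{"image_id": "img_001", "box": [120, 80, 310, 260]}],
}
```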
2. Dataset Construction and Statistics
The SA-Co dataset is produced via a multi-phase, scalable data engine pipeline:
- Phase 1: Mining image-caption-derived noun phrases → mask proposals (SAM 2 + OWLv2) → human Mask Verification (MV) and Exhaustivity Verification (EV) → manual corrections, yielding the HQ image set (4.3M image-NPs).
- Phase 2: Replacement of human MV/EV by Llama-based AI verifiers; introduction of hard negative generator (ontology + MLLM sourcing) filtered by spurious mask triggers, adding 122M image-NPs.
- Phase 3: Expansion to 15 visual domains, adoption of a fully AI synthetic verification mode ("SYN"), contributing 19.5M image-NPs.
- Phase 4 (video): Scene/motion-based mining, SAM 3 pseudo-masks, and manual annotation to form 52.5K videos, 134K video-NP pairs, and 467K masklets.
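The Phase 2 loop described above can be sketched as follows. All callables are hypothetical stand-ins for the components named in the text (caption mining, SAM 2 + OWLv2 proposals, Llama-based MV/EV verification, ontology/MLLM hard-negative sourcing), so this illustrates the control flow rather than the actual data engine.

```python
def build_image_np_pairs(images, captions, mine_phrases, propose_masks,
                         ai_mask_verifier, ai_exhaustivity_verifier, hard_negative_gen):
    """Sketch of a Phase-2-style loop in which AI verifiers stand in for human MV/EV."""
    accepted = []
    for image, caption in zip(images, captions):
        for phrase in mine_phrases(caption):                        # caption -> candidate noun phrases
            masks = propose_masks(image, phrase)                    # e.g. OWLv2 boxes refined by SAM 2
            if not masks:
                continue
            if not ai_mask_verifier(image, phrase, masks):          # MV: are the masks correct for the phrase?
                continue
            if not ai_exhaustivity_verifier(image, phrase, masks):  # EV: do they cover every instance?
                continue
            accepted.append({
                "image": image,
                "phrase": phrase,
                "masks": masks,
                "hard_negatives": hard_negative_gen(phrase),        # distractor phrases absent from the image
            })
    return accepted
```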
Final splits are:
| Split | Images/Videos | Concept Labels | Masks |
|---|---|---|---|
| HQ | 5.2M images | 4M unique NPs | 52M |
| SYN | 39M images | 1.7B image-NPs | 1.4B |
| EXT | 9.3M images | 15 datasets | 70M |
| VID | 52.5K videos | 24.8K NPs | 467K |
Evaluation subsets encompass ~207K unique NPs, 121K media, and 3M media-NPs, with specific splits for images (Gold, Silver, Bronze, Bio) and videos (SA-V, YT-Temporal-1B, SmartGlasses). Ontologically, 22.4M Wikidata nodes are mapped to 17 top-level categories (e.g., animals, human, tools), with split distributions documented in SAM 3.
3. Annotation Procedures and Evaluation Protocols
Image Annotation
- Mask Verification (MV): Accept/reject candidate triplet (image, NP, mask).
- Exhaustivity Verification (EV): Verify that accepted masks exhaustively cover all instances.
- Correction: Manual refinement for exhaustive mask annotation.
- Gold splits are labeled by three independent experts to estimate human upper-bound accuracy.
Video Annotation
- Detection per frame, propagation through SAM 3 tracker, and matching of new detections.
- Masklets that remain unconfirmed after a delay of T = 15 frames are discarded, based on the Masklet Detection Score (MDS).
- A masklet is suppressed if its MDS falls below zero over its lifespan.
- Tracker re-anchoring is enforced at regular intervals (N=16 frames).
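One way to read the MDS rules above is as a running confirmation score per masklet. The scoring scheme below (+1 when the propagated masklet matches a detection in a frame, -1 otherwise) is an illustrative assumption, not the exact definition used by SAM 3.

```python
class Masklet:
    """Minimal masklet record with an illustrative running MDS."""
    def __init__(self, first_frame):
        self.first_frame = first_frame
        self.mds = 0            # running Masklet Detection Score (illustrative definition)
        self.confirmed = False  # becomes True once any detection matches the masklet

def update_masklet(masklet, t, matched_detection, delay=15):
    """Return the masklet if it should be kept at frame t, or None to drop it (illustrative rules)."""
    masklet.mds += 1 if matched_detection else -1
    if matched_detection:
        masklet.confirmed = True
    # Discard masklets never confirmed within the delay window (T = 15 frames).
    if not masklet.confirmed and t - masklet.first_frame >= delay:
        return None
    # Suppress masklets whose MDS falls below zero over their lifespan.
    if masklet.mds < 0:
        return None
    return masklet
```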
Metrics
- Localization: Micro-F1 across IoU thresholds
- Image-level Classification: Matthews Corr. Coeff. (IL_MCC)
- Combined: cgF1 (classification-gated F1), which combines the localization F1 with image-level classification (IL_MCC)
- Video: cgF1, volume-IoU, plus pHOTA (HOTA over video-NP pairs), TETA
- Standard benchmarks: AP for box/mask detection (COCO, LVIS), MAE/accuracy for counting, J&F for VOS.
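Since cgF1 gates localization quality by image-level classification, a toy computation might look like the following. The exact aggregation (micro vs. macro F1, IoU thresholds, scaling) used by the official evaluation scripts may differ; this is only a sketch of how the two components combine.

```python
import math

def f1(tp, fp, fn):
    """Localization F1 over matched instances at a given IoU threshold."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def il_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for image-level presence classification (IL_MCC)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def cg_f1(loc_counts, il_counts):
    """Illustrative classification-gated F1: localization F1 scaled by IL_MCC.

    The 0-100 scaling is a convention assumed here, not taken from the paper.
    """
    loc_tp, loc_fp, loc_fn = loc_counts
    tp, fp, fn, tn = il_counts
    return 100.0 * f1(loc_tp, loc_fp, loc_fn) * il_mcc(tp, fp, fn, tn)

# Example: perfect localization on positives but weak presence classification drags cgF1 down.
print(cg_f1(loc_counts=(10, 0, 0), il_counts=(8, 6, 2, 4)))
```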
Split Protocols
- Image evaluation splits are zero-shot (no overlap with training data).
- Few-shot adaptation possible (1–10 examples) for select datasets.
- Video evaluation is performed without fine-tuning on evaluation sets.
4. Comparative Baseline Performance and Ablation Studies
Image-Level
- On SA-Co Gold: SAM 3 reaches cgF1 = 54.1, roughly a 2× improvement over the best prior baseline and about 74% of human performance.
- Similar performance gains on Silver/Bronze/Bio splits.
- Exemplar prompts on COCO/LVIS/ODinW13: T-Rex2 is the strongest prior baseline; SAM 3 with exemplar prompts improves on it by +18.3.
Video-Level
- Text-prompted tracking (SA-V, YT-Temporal, SmartGlasses): SAM 3 outperforms prior baselines.
- On LVVIS, BURST, YTVIS, and OVIS: SAM 3 reaches 36.3 mAP vs. 20.8 for baselines (roughly +75% relative).
Other Visual Segmentation Tasks
- VOS (MOSEv2): SAM 3 improves over the prior best.
- Interactive image segmentation (5-click): SAM 3 outperforms SAM 2.1 (84.3).
Counting Tasks
- CountBench/PixMo-Count: SAM 3 with IoM-NMS improves both MAE and accuracy over the prior best.
Ablations
- The presence head increases cgF1.
- Hard negatives unlock gains in image-level classification (IL_MCC).
- Data subset analysis: training on EXT+SYN+HQ yields cgF1 = 47.4; without HQ, 23.7; with SYN only, 32.8.
- AI verifiers: the AI-based EV and MV checks close roughly half of the gap to human annotation quality.
Domain Adaptation
Holding out the "Food&Drink" domain and fine-tuning only on its synthetic data (SYN-Food) matches HQ-Food performance at large scale, without human annotation.
5. Data Access and Benchmark Usage
Annotations and evaluation code are publicly available at https://github.com/facebookresearch/sam3. The annotation format is a federated JSON containing media_id, phrase, masks, and hard_negatives fields.
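Illustratively, a single federated annotation record might look like the dictionary below; the exact mask encoding and ID conventions should be taken from the repository rather than from this sketch.

```python
# Hypothetical shape of one federated-JSON annotation record (check the repository for the exact schema).
annotation_record = {
    "media_id": "saco_img_000123",
    "phrase": "striped cat",
    "masks": [
        {"instance_id": 0, "segmentation": "<RLE or polygon>"},
        {"instance_id": 1, "segmentation": "<RLE or polygon>"},
    ],
    "hard_negatives": ["tabby dog", "plain cat"],  # phrases that must NOT be segmented in this image
}
```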
Benchmarking Process:
- Run model on each (media, phrase) pair, outputting mask polygons and confidence scores.
- Pack results into the SAM 3 JSON format (a packing sketch follows the evaluation commands below).
- Evaluate results using provided scripts:
```bash
python eval_image.py --pred sampreds.json --gt saco_gold.json --metric cgF1 --threshold 0.5
python eval_video.py --pred vidpreds.json --gt sacovideo.json --metrics cgF1 pHOTA TETA --iou_thresh 0.5
```
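Returning to step 2 above, a minimal packing sketch could look like the following; the output field names mirror the annotation record shown earlier and are assumptions about the expected prediction format, not a verified schema.

```python
import json

def pack_predictions(results, out_path="sampreds.json"):
    """Pack per-(media, phrase) model outputs into one JSON file (hypothetical schema)."""
    records = []
    for media_id, phrase, instances in results:  # instances: list of (polygon, confidence) pairs
        records.append({
            "media_id": media_id,
            "phrase": phrase,
            "predictions": [
                {"segmentation": polygon, "score": float(score)}
                for polygon, score in instances
            ],
        })
    with open(out_path, "w") as f:
        json.dump(records, f)
```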
Video tracking pseudocode:
```python
prev_masklets = []
for t, frame in enumerate(video):
    dets = Detector(frame, prompt)                     # per-frame detections for the prompt
    props = Tracker.propagate(prev_masklets)           # propagate existing masklets to the current frame
    matched, unmatched = match(props, dets, iou_thresh=0.3)
    masklets = suppress_unconfirmed(matched) + new_masklets(unmatched)
    if t % 16 == 0:                                    # periodic re-anchoring (N = 16 frames)
        masklets = re_prompt(masklets, dets)
    output_after_delay(masklets, delay=15)             # emit only masklets confirmed within T = 15 frames
    prev_masklets = masklets
```
6. Category Coverage and Ontological Structure
SA-Co employs an ontology comprising 22.4M Wikidata nodes, collapsed to 17 top-level categories, ensuring broad concept coverage. The training distribution spans all domains, including fine-grained and long-tail NPs, and evaluation subsets span external domains and medical data.
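To make the collapse from fine-grained Wikidata nodes to top-level categories concrete, a sketch might walk each node's parent chain until it reaches one of the 17 designated categories. The traversal and the category names below are illustrative assumptions, not the actual mapping procedure.

```python
# Illustrative only: the 17 real top-level categories and the parent mapping come from the SA-Co ontology.
TOP_LEVEL = {"animals", "human", "tools"}  # ...17 categories in total

def top_level_category(node, parent_of):
    """Walk a node's (assumed) parent chain until a top-level category is reached."""
    seen = set()
    while node is not None and node not in seen:
        if node in TOP_LEVEL:
            return node
        seen.add(node)
        node = parent_of.get(node)  # parent_of: child -> parent mapping over the ontology
    return "other"                  # fallback when no top-level ancestor is found

# Example with a toy parent mapping.
parents = {"striped cat": "cat", "cat": "animals"}
print(top_level_category("striped cat", parents))  # -> "animals"
```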
A plausible implication is that categorical breadth and fine-grained prompt coverage synergistically support robust open-vocabulary evaluation and realistic assessment of semantic generalization across domains.
7. Significance and Prospects
SA-Co establishes a standardized, rigorous, and scalable framework for evaluating promptable concept segmentation in both images and videos, with extensive coverage of object categories and ontological granularity. By supporting zero-shot, few-shot, and synthetic domain adaptation evaluation, the benchmark enables assessment of PCS model generality and performance ceiling relative to human-level segmentation and recognition. The methodological advances in dataset generation, annotation protocols, metric design, and open-source tooling collectively position SA-Co as a central resource for advancing open-vocabulary segmentation research in computer vision (Carion et al., 20 Nov 2025).