SA-Co Benchmark: Open-Vocabulary Concept Segmentation
- SA-Co is a large-scale benchmark for promptable concept segmentation, evaluating models prompted with noun phrases, image exemplars, and mixed prompts in both images and videos.
- The dataset is built with a multi-phase data construction and annotation pipeline under rigorous protocols, and evaluation rests on metrics such as cgF1, pHOTA, and IoU.
- The benchmark supports zero-shot, few-shot, and synthetic domain adaptation evaluations, on which SAM 3 demonstrates large gains over prior baselines.
Segment Anything with Concepts (SA-Co) is a large-scale, open-vocabulary benchmark for Promptable Concept Segmentation (PCS) in images and videos, introduced in conjunction with SAM 3 ("Segment Anything Model 3"). PCS tasks require models to detect, segment, and track all instances of user-specified concepts, defined via short noun phrases, image exemplars, or their combination, assigning consistent identities to these instances across frames. SA-Co enables rigorous evaluation of open-vocabulary recognition, instance segmentation quality, and tracking accuracy, supporting noun-phrase, exemplar, and mixed prompt formats.
1. Benchmark Definition and Scope
SA-Co is designed to quantify model performance on PCS tasks, where the input prompt can be a noun phrase (e.g., "striped cat"), an image exemplar (positive/negative crop), or both. The benchmark mandates segmentation of every matching instance in the provided media and unique ID assignment for temporal consistency in video. Supported objectives comprise:
- Open-vocabulary recognition (presence detection)
- Instance segmentation boundary quality
- Instance tracking accuracy in video settings
SA-Co supports prompt types: (1) noun phrase, (2) exemplar, (3) mixed.
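As a concrete illustration, the three prompt types can be written as simple records. The field names below are hypothetical and only sketch the idea; they are not the official SA-Co schema.

```python
# Hypothetical prompt records for the three PCS prompt types (not the official SA-Co schema).
noun_phrase_prompt = {"type": "noun_phrase", "phrase": "striped cat"}

exemplar_prompt = {
    "type": "exemplar",
    "positives": [{"image_id": "img_001", "box": [120, 80, 310, 260]}],  # crops showing the concept
    "negatives": [{"image_id": "img_001", "box": [400, 50, 520, 180]}],  # crops that must NOT match
}

mixed_prompt = {
    "type": "mixed",
    "phrase": "striped cat",
    "positives": [{"image_id": "img_001", "box": [120, 80, 310, 260]}],
}
```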
2. Dataset Construction and Statistics
The SA-Co dataset is produced via a multi-phase, scalable data engine pipeline:
- Phase 1: Mining image-caption-derived noun phrases → mask proposals (SAM 2 + OWLv2) → human Mask Verification (MV) and Exhaustivity Verification (EV) → manual corrections, yielding the HQ image set (4.3M image-NPs).
- Phase 2: Replacement of human MV/EV by Llama-based AI verifiers; introduction of hard negative generator (ontology + MLLM sourcing) filtered by spurious mask triggers, adding 122M image-NPs.
- Phase 3: Expansion to 15 visual domains, adoption of a fully AI synthetic verification mode ("SYN"), contributing 19.5M image-NPs.
- Phase 4 (video): Scene/motion-based mining, SAM 3 pseudo-masks, and manual annotation to form 52.5K videos, 134K video-NP pairs, and 467K masklets.
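The Phase 2 loop described above can be sketched as follows. All callables are hypothetical stand-ins for the components named in the text (caption mining, SAM 2 + OWLv2 proposals, Llama-based MV/EV verification, ontology/MLLM hard-negative sourcing), so this illustrates the control flow rather than the actual data engine.

```python
def build_image_np_pairs(images, captions, mine_phrases, propose_masks,
                         ai_mask_verifier, ai_exhaustivity_verifier, hard_negative_gen):
    """Sketch of a Phase-2-style loop in which AI verifiers stand in for human MV/EV."""
    accepted = []
    for image, caption in zip(images, captions):
        for phrase in mine_phrases(caption):                        # caption -> candidate noun phrases
            masks = propose_masks(image, phrase)                    # e.g. OWLv2 boxes refined by SAM 2
            if not masks:
                continue
            if not ai_mask_verifier(image, phrase, masks):          # MV: are the masks correct for the phrase?
                continue
            if not ai_exhaustivity_verifier(image, phrase, masks):  # EV: do they cover every instance?
                continue
            accepted.append({
                "image": image,
                "phrase": phrase,
                "masks": masks,
                "hard_negatives": hard_negative_gen(phrase),        # distractor phrases absent from the image
            })
    return accepted
```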
Final splits are:
| Split | Images/Videos | Concept Labels | Masks |
|---|---|---|---|
| HQ | 5.2M images | 4M unique NPs | 52M |
| SYN | 39M images | 1.7B image-NPs | 1.4B |
| EXT | 9.3M images | 15 datasets | 70M |
| VID | 52.5K videos | 24.8K NPs | 467K |
Evaluation subsets encompass ~207K unique NPs, 121K media, and 3M media-NPs, with specific splits for images (Gold, Silver, Bronze, Bio) and videos (SA-V, YT-Temporal-1B, SmartGlasses). Ontologically, 22.4M Wikidata nodes are mapped to 17 top-level categories (e.g., animals, human, tools), with split distributions documented in SAM 3.
3. Annotation Procedures and Evaluation Protocols
Image Annotation
- Mask Verification (MV): Accept/reject candidate triplet (image, NP, mask).
- Exhaustivity Verification (EV): Verify that accepted masks exhaustively cover all instances.
- Correction: Manual refinement for exhaustive mask annotation.
- Gold splits are labeled by three independent experts to estimate human upper-bound accuracy.
Video Annotation
- Detection per frame, propagation through SAM 3 tracker, and matching of new detections.
- Masklets that remain unconfirmed after a delay of T = 15 frames are discarded, based on the Masklet Detection Score (MDS).
- A masklet is suppressed if its MDS falls below zero over its lifespan.
- Tracker re-anchoring is enforced at regular intervals (N=16 frames).
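One way to read the MDS rules above is as a running confirmation score per masklet. The scoring scheme below (+1 when the propagated masklet matches a detection in a frame, -1 otherwise) is an illustrative assumption, not the exact definition used by SAM 3.

```python
class Masklet:
    """Minimal masklet record with an illustrative running MDS."""
    def __init__(self, first_frame):
        self.first_frame = first_frame
        self.mds = 0            # running Masklet Detection Score (illustrative definition)
        self.confirmed = False  # becomes True once any detection matches the masklet

def update_masklet(masklet, t, matched_detection, delay=15):
    """Return the masklet if it should be kept at frame t, or None to drop it (illustrative rules)."""
    masklet.mds += 1 if matched_detection else -1
    if matched_detection:
        masklet.confirmed = True
    # Discard masklets never confirmed within the delay window (T = 15 frames).
    if not masklet.confirmed and t - masklet.first_frame >= delay:
        return None
    # Suppress masklets whose MDS falls below zero over their lifespan.
    if masklet.mds < 0:
        return None
    return masklet
```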
Metrics
- Localization: Micro-F1 across IoU thresholds
- Image-level Classification: Matthews Corr. Coeff. (IL_MCC)
- Combined: cgF1 (classification-gated F1), which combines the localization F1 with image-level classification (IL_MCC)
- Video: cgF1, volume-IoU, plus pHOTA (HOTA over video-NP pairs), TETA
- Standard benchmarks: AP for box/mask detection (COCO, LVIS), MAE/accuracy for counting, J&F for VOS.
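Since cgF1 gates localization quality by image-level classification, a toy computation might look like the following. The exact aggregation (micro vs. macro F1, IoU thresholds, scaling) used by the official evaluation scripts may differ; this is only a sketch of how the two components combine.

```python
import math

def f1(tp, fp, fn):
    """Localization F1 over matched instances at a given IoU threshold."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def il_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for image-level presence classification (IL_MCC)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def cg_f1(loc_counts, il_counts):
    """Illustrative classification-gated F1: localization F1 scaled by IL_MCC.

    The 0-100 scaling is a convention assumed here, not taken from the paper.
    """
    loc_tp, loc_fp, loc_fn = loc_counts
    tp, fp, fn, tn = il_counts
    return 100.0 * f1(loc_tp, loc_fp, loc_fn) * il_mcc(tp, fp, fn, tn)

# Example: perfect localization on positives but weak presence classification drags cgF1 down.
print(cg_f1(loc_counts=(10, 0, 0), il_counts=(8, 6, 2, 4)))
```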
Split Protocols
- Image evaluation splits are zero-shot (no overlap with training data).
- Few-shot adaptation possible (1–10 examples) for select datasets.
- Video evaluation is performed without fine-tuning on evaluation sets.
4. Comparative Baseline Performance and Ablation Studies
Image-Level
- On SA-Co Gold: SAM 3 reaches cgF1 = 54.1, roughly a 2× improvement over the best prior baseline and about 74% of human performance.
- Similar performance gains on Silver/Bronze/Bio splits.
- Exemplar prompts on COCO/LVIS/ODinW13: T-Rex2 is the strongest prior baseline; SAM 3 with exemplar prompts improves on it by +18.3.
Video-Level
- Text-prompted tracking (SA-V, YT-Temporal, SmartGlasses): SAM 3 outperforms prior baselines.
- On LVVIS, BURST, YTVIS, and OVIS: SAM 3 reaches 36.3 mAP vs. 20.8 for baselines (roughly +75% relative).
Other Visual Segmentation Tasks
- VOS (MOSEv2): SAM 3 improves over the prior best.
- Interactive image segmentation (5-click): SAM 3 outperforms SAM 2.1 (84.3).
Counting Tasks
- CountBench/PixMo-Count: SAM 3 with IoM-NMS improves both MAE and accuracy over the prior best.
Ablations
- The presence head increases cgF1.
- Hard negatives unlock gains in image-level classification (IL_MCC).
- Data subset analysis: training on EXT+SYN+HQ yields cgF1 = 47.4; without HQ, 23.7; with SYN only, 32.8.
- AI verifiers: the AI-based EV and MV checks close roughly half of the gap to human annotation quality.
Domain Adaptation
Holding out the "Food&Drink" domain and fine-tuning only on its synthetic data (SYN-Food) matches HQ-Food performance at large scale, without human annotation.
5. Data Access and Benchmark Usage
Annotations and evaluation code are publicly available at https://github.com/facebookresearch/sam3. The annotation format is a federated JSON containing media_id, phrase, masks, and hard_negatives fields.
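Illustratively, a single federated annotation record might look like the dictionary below; the exact mask encoding and ID conventions should be taken from the repository rather than from this sketch.

```python
# Hypothetical shape of one federated-JSON annotation record (check the repository for the exact schema).
annotation_record = {
    "media_id": "saco_img_000123",
    "phrase": "striped cat",
    "masks": [
        {"instance_id": 0, "segmentation": "<RLE or polygon>"},
        {"instance_id": 1, "segmentation": "<RLE or polygon>"},
    ],
    "hard_negatives": ["tabby dog", "plain cat"],  # phrases that must NOT be segmented in this image
}
```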
Benchmarking Process:
- Run model on each (media, phrase) pair, outputting mask polygons and confidence scores.
- Pack results into the SAM 3 JSON format (a packing sketch follows the evaluation commands below).
- Evaluate results using provided scripts:
```bash
python eval_image.py --pred sampreds.json --gt saco_gold.json --metric cgF1 --threshold 0.5
python eval_video.py --pred vidpreds.json --gt sacovideo.json --metrics cgF1 pHOTA TETA --iou_thresh 0.5
```
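Returning to step 2 above, a minimal packing sketch could look like the following; the output field names mirror the annotation record shown earlier and are assumptions about the expected prediction format, not a verified schema.

```python
import json

def pack_predictions(results, out_path="sampreds.json"):
    """Pack per-(media, phrase) model outputs into one JSON file (hypothetical schema)."""
    records = []
    for media_id, phrase, instances in results:  # instances: list of (polygon, confidence) pairs
        records.append({
            "media_id": media_id,
            "phrase": phrase,
            "predictions": [
                {"segmentation": polygon, "score": float(score)}
                for polygon, score in instances
            ],
        })
    with open(out_path, "w") as f:
        json.dump(records, f)
```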
Video tracking pseudocode:
```python
prev_masklets = []
for t, frame in enumerate(video):
    dets = Detector(frame, prompt)                     # per-frame detections for the prompt
    props = Tracker.propagate(prev_masklets)           # propagate existing masklets to the current frame
    matched, unmatched = match(props, dets, iou_thresh=0.3)
    masklets = suppress_unconfirmed(matched) + new_masklets(unmatched)
    if t % 16 == 0:                                    # periodic re-anchoring (N = 16 frames)
        masklets = re_prompt(masklets, dets)
    output_after_delay(masklets, delay=15)             # emit only masklets confirmed within T = 15 frames
    prev_masklets = masklets
```
6. Category Coverage and Ontological Structure
SA-Co employs an ontology comprising 22.4M Wikidata nodes, collapsed to 17 top-level categories, ensuring broad concept coverage. The training distribution spans all domains, including fine-grained and long-tail NPs, and evaluation subsets span external domains and medical data.
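To make the collapse from fine-grained Wikidata nodes to top-level categories concrete, a sketch might walk each node's parent chain until it reaches one of the 17 designated categories. The traversal and the category names below are illustrative assumptions, not the actual mapping procedure.

```python
# Illustrative only: the 17 real top-level categories and the parent mapping come from the SA-Co ontology.
TOP_LEVEL = {"animals", "human", "tools"}  # ...17 categories in total

def top_level_category(node, parent_of):
    """Walk a node's (assumed) parent chain until a top-level category is reached."""
    seen = set()
    while node is not None and node not in seen:
        if node in TOP_LEVEL:
            return node
        seen.add(node)
        node = parent_of.get(node)  # parent_of: child -> parent mapping over the ontology
    return "other"                  # fallback when no top-level ancestor is found

# Example with a toy parent mapping.
parents = {"striped cat": "cat", "cat": "animals"}
print(top_level_category("striped cat", parents))  # -> "animals"
```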
A plausible implication is that categorical breadth and fine-grained prompt coverage synergistically support robust open-vocabulary evaluation and realistic assessment of semantic generalization across domains.
7. Significance and Prospects
SA-Co establishes a standardized, rigorous, and scalable framework for evaluating promptable concept segmentation in both images and videos, with extensive coverage of object categories and ontological granularity. By supporting zero-shot, few-shot, and synthetic domain adaptation evaluation, the benchmark enables assessment of PCS model generality and performance ceiling relative to human-level segmentation and recognition. The methodological advances in dataset generation, annotation protocols, metric design, and open-source tooling collectively position SA-Co as a central resource for advancing open-vocabulary segmentation research in computer vision (Carion et al., 20 Nov 2025).