SA-1B Dataset: Segmentation Benchmark

Updated 20 November 2025
  • SA-1B dataset is a large-scale computer vision benchmark featuring 11 million images and 1.1 billion segmentation masks designed for promptable segmentation.
  • It employs a three-stage annotation pipeline—assisted manual, semi-automatic, and fully automatic—to ensure high-quality and efficient mask generation.
  • The dataset emphasizes responsible AI, with rigorous privacy enforcement, fairness auditing, research-only licensing, and attention to global representation.

The SA-1B dataset is a large-scale computer vision benchmark comprising 1.1 billion segmentation masks annotated on 11 million privacy-protected images. Introduced by Kirillov et al. in "Segment Anything" (Kirillov et al., 2023), SA-1B constitutes the data foundation for the Segment Anything Model (SAM), enabling research into promptable, zero-shot segmentation and foundation model pretraining. Extensive data engine methodology, rigorous statistical auditing, and responsible AI analysis characterize its construction.

1. Dataset Scale, Sources, and Composition

SA-1B consists of the following key elements:

  • Images: 11,000,000 photographs licensed from a professional photo provider that works directly with photographers. Original resolution averages ~3300×4950 px; released images are downsampled so the shorter side is 1500 px for distribution efficiency.
  • Segmentation Masks: 1,100,000,000 binary masks (average ≈100 masks/image). No semantic taxonomy is enforced; any object or stuff is eligible for annotation.
  • Licensing and Privacy: Images are distributed under a research-only Terms of Use, with mandatory agreement by users. All faces and license plates are blurred automatically (face detection via RetinaFace). Released data contains no captions, photographer names, or other identifying metadata.
  • Ethical Measures: Provider-level objectionable content filtering, user removal request mechanisms, and regional distribution goals (>200 countries represented).
  • Global Representation: While all major regions are covered, Africa and low-income regions remain underrepresented compared to middle/high-income regions, yet still account for tens of millions of masks.

2. Data Collection and Annotation Pipeline

SA-1B’s annotation—termed the "data engine"—follows a staged, iterative approach combining human and automatic curation:

  • Stage 1 (Assisted-Manual): 120,000 images annotated with 4.3M masks using a browser-based tool backed by an early version of SAM. The tool supports foreground/background clicks, bounding boxes, and brush/eraser editing. Annotators labeled any object they could name or describe, moving on once a mask took more than 30 s. As the model was retrained on newly collected masks, annotation time fell from 34 s/mask to 14 s/mask (6.5× faster than COCO polygon-based mask annotation, ≈2× slower than extreme-point bounding-box labeling).
  • Stage 2 (Semi-Automatic): 180,000 images. A class-agnostic box detector trained on Stage 1 data generated confident masks that were prefilled automatically; annotators then added masks for the remaining objects. Because annotators focused on more challenging cases, time per mask rose back to 34 s/mask, while the total mask count (manual plus automatic) continued to grow.
  • Stage 3 (Fully Automatic): All 11M images. The ambiguity-aware SAM is prompted with a 32×32 regular grid of foreground points per image and returns up to 3 masks per point (~3,072 mask candidates per image). The selection pipeline applies successive filters: predicted IoU ≥ 0.88, mask stability (IoU ≥ 0.95 between masks thresholded at 0.5 ± δ), non-maximum suppression (IoU > 0.7 within crops), and post-processing that removes spurious components and holes smaller than 100 px and discards masks covering more than 95% of the image. Small objects are handled with additional runs on 2×2 and 4×4 grids of overlapping image crops. After filtering, ≈100 masks/image are retained (a minimal sketch of the per-mask filters follows this list).
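
The per-mask confidence and stability filters can be illustrated with a short sketch. The two thresholds (0.88 predicted IoU, 0.95 stability) come from the description above; the function names, the δ value, and the assumption that a per-mask probability map is available are illustrative and do not reproduce the released pipeline.

import numpy as np

def stability_score(prob_map: np.ndarray, delta: float = 0.05) -> float:
    # IoU between the masks obtained by thresholding the probability map
    # at 0.5 - delta and 0.5 + delta; a stable mask barely changes.
    hi = prob_map > (0.5 + delta)
    lo = prob_map > (0.5 - delta)
    union = lo.sum()  # hi is a subset of lo by construction
    return hi.sum() / max(union, 1)

def keep_mask(prob_map: np.ndarray, predicted_iou: float) -> bool:
    # Per-mask filters applied before NMS, hole filling, and size checks.
    return predicted_iou >= 0.88 and stability_score(prob_map) >= 0.95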

A human audit of 500 images (~50,000 masks), in which annotators corrected the automatically generated masks, found that 94% of auto/corrected pairs had IoU > 0.90 and 97% had IoU > 0.75. These values are consistent with or exceed prior inter-annotator agreement on COCO/LVIS/OpenImages (0.85–0.91 IoU).

3. Dataset Statistics and Structural Properties

SA-1B’s statistical and geometric properties reflect its scale and object diversity:

| Statistic | Value / Description | Notes |
|---|---|---|
| Mean masks per image | ≈100 | Distribution: <50 (6%), 50–200 (57%), >500 (2%) |
| Mask relative size | $\sqrt{\text{mask area}/\text{image area}}$ | Heavier tails at small/medium sizes than COCO/OpenImages |
| Mask complexity | Concavity $= 1 - \frac{\text{mask area}}{\text{convex hull area}}$ | Matches LVIS and ADE20K when stratified by relative size |
| Centroid distribution | Mask centers normalized to image coordinates | Less center bias than COCO/OpenImages |
| National coverage | >200 countries; higher share from Asia, Europe/Oceania, and middle-income countries | Africa and low-income countries underrepresented in total share |
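
The relative-size and concavity measures in the table can be reproduced for any decoded binary mask. The sketch below uses OpenCV's convex hull and simple pixel counts; it approximates the statistics above and is not the authors' measurement code.

import numpy as np
import cv2

def mask_statistics(binary_mask: np.ndarray) -> dict:
    # binary_mask: (H, W) array of 0/1 values, e.g. from pycocotools decode.
    h, w = binary_mask.shape
    mask_area = float(binary_mask.sum())

    # Relative size: sqrt(mask area / image area).
    relative_size = np.sqrt(mask_area / (h * w))

    # Concavity: 1 - mask area / convex hull area (0 for convex shapes).
    points = cv2.findNonZero(binary_mask.astype(np.uint8))
    if points is None:
        return {"relative_size": 0.0, "concavity": 0.0}
    hull_area = cv2.contourArea(cv2.convexHull(points))
    concavity = max(0.0, 1.0 - mask_area / hull_area) if hull_area > 0 else 0.0

    return {"relative_size": relative_size, "concavity": concavity}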

Relative to prior segmentation datasets, SA-1B contains 11× the images and 400× the masks of OpenImages V5 (1.74M images, 2.74M masks) (Kirillov et al., 2023).

Fairness characteristics are measured via zero-shot mask accuracy on MIAP (perceived gender presentation and age group) and Fitzpatrick skin-tone subsets: with a single-point prompt, mIoU is ≈53–57% across groups with overlapping confidence intervals; three-point prompts reach ≈90–92% mIoU (a sketch of this point-prompt evaluation follows).
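
This point-prompt protocol can be approximated with the open-source SAM predictor. The sketch below assumes a downloaded SAM checkpoint, an RGB image, a ground-truth mask, and pre-sampled foreground points; the checkpoint path and the choice of the highest-scoring candidate mask are assumptions, not necessarily the paper's exact evaluation setup.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

# Checkpoint path is a placeholder for any released SAM checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def point_prompt_iou(image_rgb, gt_mask, points):
    # points: (N, 2) array of (x, y) foreground clicks; N = 1 or 3 here.
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(
        point_coords=np.asarray(points, dtype=np.float32),
        point_labels=np.ones(len(points), dtype=np.int32),
        multimask_output=True,  # SAM returns up to three candidate masks
    )
    best = masks[np.argmax(scores)]  # keep the highest-confidence candidate
    return mask_iou(best, gt_mask)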

4. Data Format, Organization, and Access

SA-1B is distributed in a directory structure with image and annotation files; masks are stored in the COCO run-length encoding (RLE) format:

SA1B/
  images/
    00000001.jpg
    ...
  annotations/
    sa1b_images.jsonl      # image metadata (id, filename, width, height)
    sa1b_masks.jsonl       # masks per image (segmentation, bbox, area, confidence, stability)

Each "mask" object contains:

  • "segmentation" — COCO-style RLE: { "size": [h, w], "counts": "<binary>" }
  • "bbox" — Absolute [x, y, width, height] in pixels
  • "area" — Pixel count of the mask
  • "confidence" — Model-predicted IoU (float, 0–1)
  • "stability" — Boolean mask stability post-filtering

Data loading can be accomplished using pycocotools. Example (Python):

from pycocotools import mask as maskUtils
import json
import cv2

# Read the mask record for one image and load the corresponding JPEG.
with open("annotations/sa1b_masks.jsonl") as f:
    line = json.loads(f.readline())
img = cv2.imread(f"images/{line['image_id']:08d}.jpg")

# Decode each COCO-style RLE mask into a binary (H, W) array.
for m in line["masks"]:
    rle = {"size": m["segmentation"]["size"],
           "counts": m["segmentation"]["counts"].encode("utf-8")}
    binary_mask = maskUtils.decode(rle)
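
As a quick sanity check, a decoded mask can be blended onto the image; the lines below continue from the loop above and are purely illustrative.

import numpy as np

# Blend the most recently decoded mask onto the image in green.
m = binary_mask.astype(bool)
overlay = img.astype(np.float32)
overlay[m] = 0.5 * overlay[m] + 0.5 * np.array([0, 255, 0], dtype=np.float32)
cv2.imwrite("overlay.png", overlay.astype(np.uint8))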

The dataset (~5 TB downsampled images; ~20 GB JSONL annotations) is available for download by following procedures at segment-anything.com. Terms require compliance with research-use restrictions.

5. Quality Control, Fairness, and Responsible AI

Quality and ethical oversight are present at multiple stages:

  • Human correction audit: Random sampling followed by manual mask correction; matches or exceeds prior benchmarks on IoU consistency.
  • Automated mask filtering: Model-predicted IoU, stability criteria, and redundant mask suppression ensure mask fidelity.
  • Privacy enforcement: RetinaFace blurring, exclusion of captions or PII, ban on re-identification.
  • Geographic/fairness reporting: mIoU is reported across perceived gender, age, and skin-tone subgroups; all subgroup confidence intervals overlap (no large disparities).
  • Usage restrictions: Explicit prohibition on de-anonymization, commercial use, or re-distribution of images.

6. Research Usage and Significance

SA-1B enables wide-ranging research in promptable segmentation models, foundation model pretraining, and large-scale evaluation of fairness and global representation. Its release with promptable SAM (open-source under Apache 2.0) allows methodological replication and extension for both academic and industrial purposes, subject to research-only dataset terms.

Key dataset parameters—mask quantity, annotation pipeline efficiency, automatic quality controls, and rich object granularity—establish SA-1B as a scale and diversity benchmark, exceeding existing resources by wide margins and informing best practices in dataset construction, privacy engineering, and large-scale annotation validation (Kirillov et al., 2023).

References
1. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. arXiv:2304.02643.