
Segment Anything 1B Dataset Overview

Updated 24 February 2026
  • SA-1B is a large-scale dataset containing over 1.1 billion binary masks on 11 million natural images, created via a model-assisted annotation pipeline.
  • It employs a three-stage process—assisted manual, semi-automatic, and fully automatic—to enhance accuracy and efficiency in segmentation annotation.
  • Designed for zero-shot segmentation, SA-1B supports promptable models like SAM and offers scalable training options for diverse computer vision tasks.

The Segment Anything 1 Billion (SA-1B) dataset is a large-scale resource for image segmentation, containing over 1.1 billion binary masks annotated on 11 million licensed and privacy-respecting natural images. Developed in conjunction with the Segment Anything Model (SAM), SA-1B is intended to support promptable foundation models for computer vision, with explicit design for zero-shot transfer to new image distributions and segmentation tasks. The dataset is constructed through an iterative, model-assisted data engine pipeline and released for research use with extensive quality control and validation procedures (Kirillov et al., 2023).

1. Dataset Construction Pipeline

The creation of SA-1B follows a three-stage data engine loop combining model-in-the-loop annotation tools and automated filtering mechanisms. Each stage employs progressively less human oversight as model performance and coverage improve.

Stage 1: Assisted-manual Annotation

Annotators utilized an interactive tool powered by an early version of SAM. Annotation involved foreground/background clicks and brush/eraser strokes to delineate both "things" and "stuff," without semantic constraints. The model was retrained six times as data accumulated, and the image encoder was scaled from ViT-B to ViT-H. Annotation efficiency improved correspondingly, with mean per-mask time falling from 34 seconds to 14 seconds. This stage produced roughly 4.3 million masks across 120,000 images.

Stage 2: Semi-automatic Annotation

A generic box detector, trained on stage-1 masks, pre-filled each image with high-confidence object proposals, and annotators segmented only the remaining, unannotated objects. The model was retrained five times during this stage; per-mask annotation time returned to roughly 34 seconds as annotators concentrated on more challenging objects. This stage contributed an additional 5.9 million masks on 180,000 new images, for a cumulative total of 10.2 million masks on 300,000 images.

Stage 3: Fully Automatic Mask Generation

The final ambiguity-aware SAM model generated masks fully automatically, proposing masks at dense grid points over images and multi-scale crops. No human annotation was involved in this stage. A series of quality filters and deduplication steps were applied, yielding the final dataset of 1.1 billion masks over 11 million images.

Automatic mask generation in Stage 3 follows these steps:

  • Apply a 32×32 grid over the image, plus finer 16×16 and 8×8 grids over 2×2 and 4×4 crops, respectively.
  • For each grid point, SAM predicts up to 3 mask proposals with corresponding IoU confidences.
  • Retain masks with predicted IoU exceeding $t_\text{iou\_pred} = 0.88$ that cover less than a $t_\text{area\_max} = 0.95$ fraction of the image area and pass the stability test ($t_\text{stability} = 0.95$).
  • Apply non-maximum suppression with an IoU threshold of 0.7, remove connected components smaller than 100 pixels, and fill holes smaller than 100 pixels.
  • The stability criterion checks whether a mask is robust to small changes in the binarization threshold, accepting a mask $m$ if $\operatorname{IoU}(B_-, B_+) \geq t_\text{stability}$, where $B_-$ and $B_+$ are the masks obtained by shifting the threshold by $-\Delta$ and $+\Delta$, respectively.
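The dense prompting grids described above can be sketched as follows; the helper name and structure are illustrative, not taken from the SAM codebase:

```python
import numpy as np

def point_grid(n_per_side: int) -> np.ndarray:
    """Return an (n*n, 2) array of normalized (x, y) prompt points,
    placed at cell centers of an n x n grid over the unit square."""
    offset = 1.0 / (2 * n_per_side)
    coords = np.linspace(offset, 1.0 - offset, n_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

# Full image: 32x32 points; each 2x2 crop: 16x16; each 4x4 crop: 8x8.
grids = {crops: point_grid(n) for crops, n in [(1, 32), (2, 16), (4, 8)]}
```

Each grid point is fed to the model as a single foreground-point prompt; crop-level grids are mapped back into full-image coordinates before filtering.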

2. Annotation Format and Mask Properties

Each mask in SA-1B is stored as a binary per-pixel bitmask $M(x, y)$, either as a PNG image or as run-length encoding (RLE), consistent with COCO and LVIS conventions. Masks are accompanied by metadata specifying image and mask identifiers, predicted IoU scores, source grid or crop, and confidence rankings.
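To illustrate the RLE convention, here is a minimal decoder for uncompressed COCO-style run-length counts (runs alternate background/foreground and are laid out column-major); this is a sketch of the format, not the pycocotools implementation:

```python
import numpy as np

def decode_rle(counts: list[int], height: int, width: int) -> np.ndarray:
    """Decode uncompressed COCO-style RLE into a binary mask.
    Runs alternate 0s then 1s and fill the image column-major."""
    flat = np.zeros(height * width, dtype=np.uint8)
    pos, value = 0, 0
    for run in counts:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value  # toggle background/foreground
    return flat.reshape((height, width), order="F")  # column-major

# 2 background pixels, 3 foreground, 4 background on a 3x3 canvas.
mask = decode_rle([2, 3, 4], height=3, width=3)
```

For the compressed-string variant actually shipped in COCO-format JSON, pycocotools provides the reference codec.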

Mathematical properties of each mask include:

  • Mask area: $A(m) = \sum_{x,y} M(x, y)$
  • Relative mask size: $s(m) = \sqrt{A(m) / A(\text{image})}$
  • Concavity (complexity): $\text{concavity}(m) = 1 - A(m) / A(\text{convex hull}(m))$
  • Approximate boundary length: $L(m) = \sum_{x,y} \left|\partial M(x, y)\right|$
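The area, relative size, and boundary-length quantities can be computed directly from the bitmask; concavity is omitted here because it additionally needs a convex-hull routine (e.g. `scipy.spatial.ConvexHull`):

```python
import numpy as np

def mask_area(M: np.ndarray) -> int:
    """A(m): number of foreground pixels."""
    return int(M.sum())

def relative_size(M: np.ndarray) -> float:
    """s(m) = sqrt(A(m) / A(image))."""
    return float(np.sqrt(M.sum() / M.size))

def boundary_length(M: np.ndarray) -> int:
    """Approximate boundary length: count 0/1 transitions between
    4-connected neighbors, plus mask pixels on the image border."""
    Mi = M.astype(int)
    inner = np.abs(np.diff(Mi, axis=0)).sum() + np.abs(np.diff(Mi, axis=1)).sum()
    border = Mi[0].sum() + Mi[-1].sum() + Mi[:, 0].sum() + Mi[:, -1].sum()
    return int(inner + border)

M = np.zeros((8, 8), dtype=np.uint8)
M[2:6, 2:6] = 1  # a 4x4 square: area 16, perimeter 16
```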

During the annotation pipeline, SAM acts as a promptable segmentation model, supporting point, box, and previous-mask prompts. Each prompt yields up to $K = 3$ output masks, and for training, the mask with minimum loss is selected.
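The ambiguity-aware training rule, backpropagating only through the lowest-loss of the $K$ outputs, can be sketched as follows; the pixelwise error rate below is a stand-in for SAM's actual focal-plus-dice loss, and the names are illustrative:

```python
import numpy as np

def min_loss_mask(pred_masks: np.ndarray, gt: np.ndarray) -> int:
    """Return the index of the prediction (out of K) with the lowest
    per-mask loss against ground truth; loss here is pixel error rate."""
    losses = [np.abs(p.astype(int) - gt.astype(int)).mean() for p in pred_masks]
    return int(np.argmin(losses))

gt = np.array([[1, 1], [0, 0]], dtype=np.uint8)
preds = np.stack([
    np.zeros((2, 2), dtype=np.uint8),            # misses object: loss 0.5
    np.array([[1, 1], [0, 0]], dtype=np.uint8),  # exact match:  loss 0.0
    np.ones((2, 2), dtype=np.uint8),             # over-segments: loss 0.5
])
best = min_loss_mask(preds, gt)
```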

3. Dataset Statistics and Geographic Coverage

The SA-1B dataset comprises:

  • Images: 11,000,000, each downsampled so the shortest side is 1500 pixels.
  • Masks: 1,100,000,000, with 99.1% produced fully automatically.
  • Average masks per image: approximately 100.

Images are licensed from a large pool of photographers, with privacy measures ensuring that faces and license plates are blurred. Country assignment is inferred from captions using named entity recognition, and most countries are represented by more than 1,000 images. The dataset's economic distribution is balanced in middle and high-income bands, but low-income regions are under-represented.

Analysis of spatial distribution demonstrates reduced photographer center bias compared to COCO or Open Images, with better coverage of image corners. Relative mask size histograms indicate a larger fraction of small-to-medium segmentation masks, supporting robust training at diverse scales. Concavity distributions align broadly with ADE20K and LVIS when stratified by size.

SA-1B is available for research under a custom terms-of-use agreement. Annotation code and the SAM model are released under the Apache 2.0 license.

4. Splitting Strategies and Research Usage

No predefined train/validation/test split is provided, except for 2,000 images withheld for internal testing. The recommendation for default research use is to employ the full set of 11 million images for model training.

For computationally restricted scenarios, random sampling of 1 million (yielding roughly 100 million masks) or 100,000 images (about 10 million masks) achieves nearly full training performance. Specifically, experiments show that using 1 million images (10% of the data) recovers approximately 99.5% of the model’s transfer performance for point-prompted segmentation.

SA-1B supports a variety of research use cases:

  • Zero-shot interactive segmentation using point and box prompts.
  • Automatic annotation of other datasets.
  • Compositional use with detectors for instance segmentation.
  • Zero-shot edge detection and object proposal generation.
  • Text-to-mask prompting and downstream finetuning of SAM or decoding its embeddings.

5. Quality Assurance and Evaluation Metrics

A sequence of filters governs the acceptance of automatically generated masks:

  • Predicted IoU threshold: $t_\text{iou\_pred} = 0.88$
  • Stability threshold: $t_\text{stability} = 0.95$
  • Maximum area: Reject masks covering at least 95% of the image.
  • Non-maximum suppression: IoU threshold of 0.7, applied both within and across crops.
  • Postprocessing: Remove connected components smaller than 100 pixels; fill holes smaller than 100 pixels.
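Assuming the model exposes per-mask logits alongside a predicted IoU score, the first three acceptance filters can be sketched as follows (the logit offset `DELTA` is an illustrative value, not the paper's exact setting):

```python
import numpy as np

# Thresholds from the SA-1B filtering pipeline.
T_IOU, T_STAB, T_AREA = 0.88, 0.95, 0.95
DELTA = 1.0  # logit offset for the stability test (assumed value)

def stability_score(logits: np.ndarray, delta: float = DELTA) -> float:
    """IoU between the tightened mask (logits > +delta) and the
    loosened mask (logits > -delta); stable masks barely change."""
    tight = logits > delta
    loose = logits > -delta
    union = np.logical_or(tight, loose).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(tight, loose).sum() / union)

def accept_mask(logits: np.ndarray, pred_iou: float) -> bool:
    """Apply the predicted-IoU, stability, and area-fraction filters."""
    area_frac = (logits > 0).mean()  # fraction of the image covered
    return bool(pred_iou >= T_IOU
                and stability_score(logits) >= T_STAB
                and area_frac < T_AREA)

logits = np.full((10, 10), -5.0)
logits[2:8, 2:8] = 5.0  # one confidently segmented 6x6 object
```

Masks passing these tests then go through non-maximum suppression and the component/hole postprocessing before release.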

Human validation involved professional annotators correcting approximately 50,000 masks on a sample of 500 images. For 94% of the resulting auto-vs-corrected mask pairs, the IoU exceeded 90%, and for 97% it exceeded 75%. For comparison, previous work reports inter-annotator consistency in the range of 85–91% IoU.

Several evaluation metrics are used for zero-shot segmentation transfers:

  • Mean Intersection-over-Union (mIoU): Predicted versus ground truth mask after N prompts.
  • Human perceptual rating: Quality of segmentation on a 1–10 scale.
  • Edge detection metrics on BSDS500: ODS, OIS, AP, R50.
  • Object proposal metrics on LVIS: Recall at 1,000 proposals.
  • Instance segmentation average precision (AP): Measured on COCO/LVIS boxes.
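The mIoU metric from the list above can be computed with a generic sketch like the following (this is not SAM's evaluation harness):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union else 1.0  # both empty: match

def mean_iou(pairs) -> float:
    """pairs: iterable of (predicted mask, ground-truth mask)."""
    return float(np.mean([iou(p, g) for p, g in pairs]))

a = np.array([[1, 1], [0, 0]], dtype=bool)
b = np.array([[1, 0], [0, 0]], dtype=bool)
score = mean_iou([(a, a), (a, b)])  # individual IoUs: 1.0 and 0.5
```

In the zero-shot protocol this is averaged over examples after a fixed number N of point prompts.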

6. Limitations and Prospects for Extension

Several limitations and dataset biases are documented:

  • Missed fine structures such as eyelashes or wires.
  • Occasional hallucination of small disconnected mask regions.
  • Real-time design prioritizes speed over maximum IoU in interactive use; alternative interactive methods can surpass SAM when abundant clicks are provided.
  • Domain gaps remain for highly stylized or uncommon visual contexts.
  • Under-representation of low-income regions and certain subjects.
  • Minor fairness disparities exist for clothing segmentation by perceived gender (with single-point prompts).

Proposed future extensions include:

  • Annotating SA-1B masks with semantic class tags.
  • Investigating panoptic or amodal mask prediction using specialized prompts.
  • Incorporating language supervision through joint text-image pre-training.
  • Iterative improvement via continual learning with user corrections.
  • Building smaller, faster models for resource-constrained applications.
  • Temporal extension to video by linking per-frame SAM predictions.

SA-1B constitutes the largest released segmentation dataset to date, produced through a scalable model-in-the-loop process, with robust validation to power general-purpose, promptable segmentation models and facilitate broader research in zero-shot transfer and segmentation foundations (Kirillov et al., 2023).

References
1. Kirillov et al., "Segment Anything" (2023).
