COCO-Stuff: Semantic Segmentation Dataset

Updated 17 June 2026

COCO-Stuff is a large-scale dataset with pixel-wise annotations for 171 semantic categories, covering both countable objects and amorphous background regions.
It employs a hierarchical taxonomy and a superpixel-based annotation protocol that accelerates labeling while ensuring high boundary accuracy.
The dataset supports closed-set and open-vocabulary segmentation benchmarks as well as quantitative analyses of scene context.

COCO-Stuff is a large-scale semantic segmentation dataset that augments the COCO 2017 corpus with dense pixel-wise annotations for “stuff” categories (amorphous background regions such as grass, sky, or pavement) alongside its preexisting “thing” object annotations. It is a resource for studying scene-level context, spatial relationships, and both closed-set and open-vocabulary segmentation. COCO-Stuff introduces a rigorous labeling protocol, a curated and hierarchically organized class taxonomy, and benchmarks suited to both conventional and open-set semantic segmentation research (Caesar et al., 2016; Chen et al., 21 Apr 2026).

1. Dataset Composition and Semantic Taxonomy

COCO-Stuff comprises 164,000 images (COCO 2017) annotated at the pixel level for 171 semantic categories:

Thing classes: 80 (identical to COCO’s object categories; countable objects such as “car” or “person” with well-defined shapes and part structure).
Stuff classes: 91 (e.g., “grass,” “sky,” “marble floor”), covering fine-grained amorphous regions. Combined with the 80 thing classes, they make up the 171 categories of v1.1; an additional “unlabeled” class marks pixels that belong to none of them.

Stuff classes are defined as amorphous, texture-rich image regions without clearly delineated instances, in contrast to “things.” To enforce mutual exclusivity and label granularity, labels are manually curated based on a four-level hierarchy:

Level 1: {stuff, thing}
Level 2: for stuff, a binary indoor/outdoor distinction
Level 3: super-categories (e.g., floor, textile)
Level 4: fine-grained leaf categories (e.g., “moss,” “wood floor”)
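The four-level hierarchy can be represented as a simple mapping from each leaf class to its ancestors. The sketch below is illustrative only: the handful of entries, the super-category names assigned to them, and the helper names `ancestors` and `leaves_under` are assumptions made for the example, not the dataset's official label file.

```python
# Illustrative subset of a four-level label hierarchy.
# Each leaf (level 4) maps to (level 1, level 2, level 3).
# The specific paths are assumptions for demonstration purposes.
HIERARCHY = {
    "moss": ("stuff", "outdoor", "plant"),
    "grass": ("stuff", "outdoor", "plant"),
    "wood floor": ("stuff", "indoor", "floor"),
    "marble floor": ("stuff", "indoor", "floor"),
    # The text only describes the level-2 split for stuff, so it is
    # left unset for this thing class.
    "car": ("thing", None, "vehicle"),
}


def ancestors(leaf):
    """Return the path from level 1 down to the leaf itself (level 4)."""
    level1, level2, level3 = HIERARCHY[leaf]
    return [level1, level2, level3, leaf]


def leaves_under(level, name):
    """List leaf classes whose ancestor at `level` (1, 2, or 3) is `name`."""
    return sorted(
        leaf for leaf, path in HIERARCHY.items() if path[level - 1] == name
    )


if __name__ == "__main__":
    print(ancestors("moss"))
    print(leaves_under(3, "floor"))
```

Grouping leaves by an upper level in this way is one option for reporting results per super-category rather than per leaf class.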

Class frequency statistics (train+val, 123K images) highlight that stuff covers 69.1% of pixels, while things cover 30.9%. The dataset exhibits a long-tailed per-class pixel-frequency distribution for both stuff and thing categories. In textual captions, “stuff” accounts for 38.2% of noun mentions, underscoring the semantic importance of background regions in scene descriptions (Caesar et al., 2016). In v1.1 and later, 5,000 images are densely annotated for all 171 stuff categories, supporting open-vocabulary segmentation research (Chen et al., 21 Apr 2026).

2. Annotation Protocol and Quality Analysis

Annotation relies on a superpixel-based pipeline that substantially accelerates labeling without significant loss in boundary accuracy:

~1,000 SLICO superpixels are precomputed per image for boundary-aware, uniform superpixel segmentation.
Existing COCO “thing” masks are overlaid, locking object pixels and constraining stuff annotations to background.
Annotators assign stuff labels by “painting” superpixels using adjustable brushes, rapidly covering amorphous regions.
The union of preserved thing outlines and newly labeled stuff superpixels yields the final map.
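The final composition step can be sketched with NumPy as follows. This is a minimal illustration, not the authors' tool: it assumes the superpixel map is already available as an integer array, that painted labels are stored in a dictionary keyed by superpixel id, and that 255 marks unlabeled pixels. The function name `compose_label_map` and the class ids in the example are hypothetical.

```python
import numpy as np

UNLABELED = 255  # assumed id for pixels that received no label


def compose_label_map(segments, painted, thing_map, no_thing=0):
    """Merge locked thing pixels with stuff labels painted on superpixels.

    segments  : (H, W) int array of superpixel ids, starting at 0
    painted   : dict {superpixel id: stuff label} chosen by the annotator
    thing_map : (H, W) int array, `no_thing` where no object mask exists
    """
    # Lookup table from superpixel id to stuff label (default: unlabeled).
    lut = np.full(segments.max() + 1, UNLABELED, dtype=np.int64)
    for sp_id, label in painted.items():
        lut[sp_id] = label
    stuff_map = lut[segments]
    # Thing pixels are locked and always override the painted stuff label.
    return np.where(thing_map != no_thing, thing_map, stuff_map)


if __name__ == "__main__":
    segments = np.array([[0, 0, 1, 1],
                         [0, 0, 1, 1],
                         [2, 2, 3, 3],
                         [2, 2, 3, 3]])
    thing_map = np.zeros((4, 4), dtype=np.int64)
    thing_map[1:3, 1:3] = 3              # hypothetical thing id
    painted = {0: 100, 1: 100, 2: 101}   # hypothetical stuff ids
    print(compose_label_map(segments, painted, thing_map))
```

Superpixel 3 is never painted in the example, so its pixels outside the thing mask stay unlabeled.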

Benchmarking annotation methods (10 images, single annotator):

Freedraw (pixel-accurate, baseline, 1.0× speed): 100% accuracy (reference).
Polygonal annotation: 1.5× faster, 97.3% agreement with pixel-accurate reference.
Superpixel annotation: 2.8× faster, 96.1% agreement.
Human consistency (self-agreement): ~96–98% across modalities; superpixel errors are within this range.

Annotation time scales linearly with image boundary complexity (C, fraction of pixels bordering a different class). Superpixel mode reduces the annotation time slope by factors of ~2–3.4 over polygons/freedraw. Mask overlays further reduce boundary pixels (~46.8%) and accelerate annotation. Multi-annotator agreement (30 images, 3 labelers) is 73.6%, substantially higher than the 66.8% reported for ADE20K (Caesar et al., 2016).
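The boundary complexity C described above can be computed directly from a label map. The sketch below assumes a 4-connected neighborhood when deciding whether a pixel borders a different class; the original definition may use a different neighborhood, and the function name `boundary_complexity` is hypothetical.

```python
import numpy as np


def boundary_complexity(labels):
    """Fraction of pixels with at least one 4-neighbor of a different class."""
    labels = np.asarray(labels)
    boundary = np.zeros(labels.shape, dtype=bool)

    # Compare each pixel with its right neighbor; mark both sides of a change.
    diff_h = labels[:, 1:] != labels[:, :-1]
    boundary[:, 1:] |= diff_h
    boundary[:, :-1] |= diff_h

    # Compare each pixel with the neighbor below; mark both sides of a change.
    diff_v = labels[1:, :] != labels[:-1, :]
    boundary[1:, :] |= diff_v
    boundary[:-1, :] |= diff_v

    return float(boundary.mean())


if __name__ == "__main__":
    half_split = np.array([[0, 0, 1, 1]] * 4)
    print(boundary_complexity(half_split))
    print(boundary_complexity(np.zeros((4, 4), dtype=int)))
```

In the half-split example only the two columns adjacent to the class change count as boundary pixels, while a uniform image has complexity zero.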

3. Contextual Relationship Analysis

COCO-Stuff's dense labels make it possible to quantify spatial and semantic context. The analysis proceeds as follows:

For each semantic class, connected regions are aggregated.
Non-class pixels are analyzed in polar coordinates (distance, angle) relative to region centroids.
3D histograms (r, θ, l) index spatial label co-occurrence and frequency.
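The histogram construction above can be sketched as follows. This is a simplified illustration under several assumptions: the caller supplies one region as a boolean mask (connected-component extraction is omitted), every pixel outside that mask counts as context, distances are binned uniformly up to the largest observed radius, and the image is assumed to contain at least one pixel outside the region. The function name `context_histogram` and the bin counts are hypothetical.

```python
import numpy as np


def context_histogram(labels, region_mask, num_labels, r_bins=4, theta_bins=8):
    """Count labels around a region in polar bins (r, theta, label).

    labels      : (H, W) int array of class ids in [0, num_labels)
    region_mask : (H, W) bool array selecting one region of the class
    """
    ys, xs = np.nonzero(region_mask)
    cy, cx = ys.mean(), xs.mean()          # region centroid

    oy, ox = np.nonzero(~region_mask)      # context pixels
    dy, dx = oy - cy, ox - cx
    r = np.hypot(dy, dx)
    # Image rows grow downward, so negate dy: positive angles mean "above".
    theta = np.arctan2(-dy, dx)

    r_idx = np.minimum((r / r.max() * r_bins).astype(int), r_bins - 1)
    t_idx = ((theta + np.pi) / (2 * np.pi) * theta_bins).astype(int) % theta_bins

    hist = np.zeros((r_bins, theta_bins, num_labels), dtype=np.int64)
    # np.add.at accumulates correctly when several pixels hit the same bin.
    np.add.at(hist, (r_idx, t_idx, labels[oy, ox]), 1)
    return hist


if __name__ == "__main__":
    # Hypothetical ids: 0 = car, 1 = sky, 2 = road.
    labels = np.array([[1, 1, 1],
                       [2, 0, 2],
                       [2, 2, 2]])
    hist = context_histogram(labels, labels == 0, num_labels=3,
                             r_bins=1, theta_bins=4)
    print(hist.sum(axis=0))  # rows: angle bins, columns: labels
```

With four angle bins, the two bins covering angles from 0 to π collect everything above the centroid, so the "sky" counts in the toy example fall entirely in the upper half.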

Principal findings:

Strong vertical support: e.g., “sky” above “car,” “road” below “car,” “tiled wall” above “tiled floor.”
Lateral thing–thing adjacency: e.g., “person” in front of “TV.”
Classes differ in contextual entropy (defined as $-\sum_l P(l \mid r, \theta) \log_2 P(l \mid r, \theta)$): high entropy for “person” (3.40 bits) and “wood,” low entropy for “snowboard” surrounded by “snow.”
On average, stuff classes have higher contextual entropy (3.40 bits) than thing classes (3.02 bits), indicating greater spatial context diversity.
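The entropy definition above can be evaluated on any count histogram indexed by (r, θ, l). The sketch below normalizes counts within each (r, θ) bin to obtain P(l | r, θ) and computes the entropy per bin. How per-bin values are combined into a single per-class number is not specified in this text, so the count-weighted mean returned here is an assumption, as is the function name `contextual_entropy`.

```python
import numpy as np


def contextual_entropy(hist):
    """Per-bin entropy (bits) of P(l | r, theta) and its count-weighted mean.

    hist : array of shape (r_bins, theta_bins, num_labels) holding counts
    """
    hist = np.asarray(hist, dtype=float)
    totals = hist.sum(axis=-1, keepdims=True)

    # P(l | r, theta); bins without any pixels keep probability zero.
    p = np.divide(hist, totals, out=np.zeros_like(hist), where=totals > 0)

    # log2 only where p > 0, so that 0 * log(0) contributes zero.
    logp = np.zeros_like(p)
    np.log2(p, out=logp, where=p > 0)
    per_bin = -(p * logp).sum(axis=-1)

    weights = totals[..., 0]
    total = weights.sum()
    mean = float((per_bin * weights).sum() / total) if total > 0 else 0.0
    return per_bin, mean


if __name__ == "__main__":
    # Two distance bins, one angle bin, four labels.
    hist = np.array([[[2, 2, 0, 0]],    # two equally likely labels: 1 bit
                     [[4, 0, 0, 0]]])   # a single label: 0 bits
    per_bin, mean = contextual_entropy(hist)
    print(per_bin, mean)
```

A bin split evenly between two labels yields exactly one bit, while a bin dominated by a single label (as for “snowboard” surrounded by “snow”) yields zero.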

This systematic quantification of contextual priors is unique to COCO-Stuff and provides strong support for context-aware segmentation and reasoning models (Caesar et al., 2016).

4. Semantic Segmentation Benchmarks

COCO-Stuff supports both conventional closed-set and open-vocabulary segmentation challenges.

Closed-Set Experiments:

DeepLab V2 (VGG-16 backbone, pretrained on ILSVRC) evaluated on COCO-Stuff shows improvement with scale:

As the training set grows from 1,000 to 118,000 images, mean IoU increases from 15.9% to 33.2%.
Things are significantly easier to segment than stuff (mean IoU: 43.6% for things versus 24.0% for stuff, full training set).
Performance metrics include pixel accuracy, class accuracy, mean IoU, and frequency-weighted IoU.

Metric	1K imgs	5K imgs	Full (118K)
Pixel Acc.	46.1%	52.7%	63.6%
Mean IoU	15.9%	23.1%	33.2%
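The four metrics named above are all derived from a confusion matrix. The sketch below shows one common way to compute them; averaging only over classes that appear in the ground truth is a convention assumed here, and the function names are hypothetical rather than taken from an official evaluation script.

```python
import numpy as np


def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    idx = gt * num_classes + pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def segmentation_metrics(conf):
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                  # correctly labeled pixels per class
    gt_count = conf.sum(axis=1)         # ground-truth pixels per class
    pred_count = conf.sum(axis=0)       # predicted pixels per class
    union = gt_count + pred_count - tp

    present = gt_count > 0              # skip classes absent from ground truth
    iou = tp[present] / union[present]
    freq = gt_count[present] / conf.sum()

    return {
        "pixel_acc": tp.sum() / conf.sum(),
        "class_acc": (tp[present] / gt_count[present]).mean(),
        "mean_iou": iou.mean(),
        "fw_iou": (freq * iou).sum(),
    }


if __name__ == "__main__":
    gt = np.array([0, 0, 1, 1])
    pred = np.array([0, 1, 1, 1])
    metrics = segmentation_metrics(confusion_matrix(gt, pred, num_classes=2))
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

In practice the confusion matrix would be accumulated over all evaluation images before the metrics are computed, since averaging per-image IoU gives a different number.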

Open-Vocabulary (COCO-S) Experiments:

COCO-Stuff v1.1 (171 classes, 5,000 images, no background) enables benchmarking for contemporary prompt-based segmentation models:

The “COCO-S” protocol evaluates all 171 categories.
CoCo-SAM3, building on the SAM3 model, applies synonym aggregation for intra-class enhancement and unified inter-class competition, outperforming previous methods without any extra training.
CoCo-SAM3 achieves a mean IoU of 43.6% on COCO-S, compared to 33.3% for vanilla SAM3 (Chen et al., 21 Apr 2026).

Method	COCO-S mIoU (%)
CorrCLIP	34.0
ReME	33.3
SAM3	33.3
CoCo-SAM3	43.6

5. Open-Vocabulary and Concept Conflict Segmentation

COCO-Stuff exposes challenges central to open-vocabulary segmentation research:

Fine-grained inter-class confusion arises from a dense, amorphous category space (e.g., “grass” vs. “vegetation”).
No fixed class list: synonymy and dataset-specific naming conventions cause intra-class drift.
Achieving pixel-level mutual exclusion among hundreds of open-vocabulary categories is difficult without explicit inter-class calibration.

CoCo-SAM3 addresses these by:

Aggregating synonym evidence using dense Perception Encoder features, temperature-scaled LogSumExp pooling, and pixel-level cosine similarity.
Enforcing inter-class competition via a fused logit-based score:

$S_c(x) = \ell_c(x) + \lambda_{\mathrm{prior}} \log(\pi_c(x)) + z_c$

where $\ell_c(x)$ is structural evidence, $\pi_c(x)$ is the semantic prior from synonym aggregation, $z_c$ is the global presence logit, and $\lambda_{\mathrm{prior}} = 0.7$.

Final segmentation is obtained via pixel-wise argmax over fused scores.
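The aggregation and fusion steps can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the CoCo-SAM3 implementation: the array shapes, the temperature value, the small `eps` added inside the logarithm, and in particular the softmax used to turn pooled similarities into the prior $\pi_c(x)$ are all assumptions, since this text does not specify them. All function names are hypothetical.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))


def cosine_similarity(features, text):
    """features: (H, W, D) pixel features; text: (K, D) synonym embeddings."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text, axis=-1, keepdims=True)
    return np.einsum("hwd,kd->khw", f, t)          # (K, H, W)


def pool_synonyms(sim, tau=0.1):
    """Temperature-scaled LogSumExp over the K synonyms of one class."""
    return tau * logsumexp(sim / tau, axis=0)      # (H, W)


def semantic_prior(pooled, tau=0.1):
    """Softmax over classes of pooled scores (C, H, W); an assumed choice."""
    logits = pooled / tau
    return np.exp(logits - logsumexp(logits, axis=0)[None])


def fuse_and_segment(evidence, prior, presence, lam=0.7, eps=1e-8):
    """S_c(x) = l_c(x) + lam * log(pi_c(x)) + z_c, then pixel-wise argmax.

    evidence : (C, H, W) structural evidence l_c(x)
    prior    : (C, H, W) semantic prior pi_c(x)
    presence : (C,) global presence logits z_c
    """
    scores = evidence + lam * np.log(prior + eps) + presence[:, None, None]
    return scores.argmax(axis=0), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(2, 3, 8))              # toy 2x3 feature map
    synonyms = [rng.normal(size=(2, 8)), rng.normal(size=(3, 8))]
    pooled = np.stack(
        [pool_synonyms(cosine_similarity(features, s)) for s in synonyms]
    )
    prior = semantic_prior(pooled)
    labels, _ = fuse_and_segment(np.zeros_like(prior), prior, np.zeros(2))
    print(labels)
```

LogSumExp pooling behaves like a soft maximum: the pooled score is never below the best synonym's similarity and exceeds it by at most τ·log K, so one strongly matching synonym dominates without the others being discarded. Because the final label is a single argmax over classes, each pixel receives exactly one class, which is what enforces mutual exclusion.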

Qualitative improvements include suppressing overlapping masks for semantically adjacent stuff classes and stabilizing segmentation in the presence of synonymic prompt variation (Chen et al., 21 Apr 2026).

6. Practical Implications and Usage Guidelines

COCO-Stuff’s dense, hierarchical annotations enable:

Context-aware object detection, leveraging stuff priors for refining proposals.
Scene layout estimation and material/geometric reasoning, based on well-characterized stuff labels.
Vision–language research exploiting aligned captions and pixel maps.
Comparative evaluation of annotation protocols: superpixel-based strategies deliver near-pixel fidelity with ~3× speedup.

Recommended usage guidelines include leveraging the label hierarchy for analysis, accounting for the 6% “unlabeled” pixels in metrics, and using large-scale training to optimize generalization. Annotation protocol insights are applicable to the design of new semantic segmentation datasets.
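Accounting for unlabeled pixels usually means excluding them from both the numerator and the denominator of a metric. The sketch below shows this for pixel accuracy; the ignore value 255 and the function name `pixel_accuracy` are assumptions for illustration, so the value should be checked against the label files actually in use.

```python
import numpy as np

UNLABELED = 255  # assumed id of the "unlabeled" class in the label maps


def pixel_accuracy(gt, pred, ignore_label=UNLABELED):
    """Pixel accuracy computed only over pixels with a ground-truth label."""
    gt = np.asarray(gt)
    pred = np.asarray(pred)
    valid = gt != ignore_label
    if not valid.any():
        return float("nan")  # nothing to evaluate in this image
    return float((gt[valid] == pred[valid]).mean())


if __name__ == "__main__":
    gt = np.array([0, 1, UNLABELED, 1])
    pred = np.array([0, 1, 0, 0])
    print(pixel_accuracy(gt, pred))                     # unlabeled pixel skipped
    print(pixel_accuracy(gt, pred, ignore_label=-1))    # unlabeled pixel counted
```

Counting unlabeled pixels as errors would lower the reported accuracy even for a perfect model, which is why the mask is applied before comparing predictions.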

7. Significance within the Vision Research Community

COCO-Stuff integrates breadth (comprehensive “thing” and “stuff” labeling), hierarchical taxonomy, efficient annotation methodology, detailed contextual priors, and open-set evaluation infrastructure. It has established new standards for studying dense semantic segmentation, scene context, and open-vocabulary vision-language grounding, and continues to be an authoritative benchmark for both modeling and dataset development efforts in computer vision (Caesar et al., 2016; Chen et al., 21 Apr 2026).


References

Caesar et al. (2016). COCO-Stuff: Thing and Stuff Classes in Context.

Chen et al. (2026). CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation.
