Microsoft COCO Captions Dataset Overview

Updated 9 February 2026
  • Microsoft COCO Captions Dataset is a large-scale, extensively annotated collection with over 1 million human-generated captions for benchmarking image captioning and vision-language models.
  • The dataset employs dual annotation regimes (c5 and c40) via rigorous crowdsourcing protocols, ensuring high-quality, diverse, and consistent captions across train, val, and test splits.
  • Benchmarking leverages multiple evaluation metrics (BLEU, METEOR, ROUGE-L, and CIDEr-D) and has fostered extensions like CxC and ECCV Caption to address limitations in traditional image-text matching.

The Microsoft COCO Captions dataset is an extensively annotated, large-scale resource for benchmarking image captioning and vision-language models, comprising over one million human-generated captions for more than 330,000 images. Developed as an extension of the original Microsoft COCO (Common Objects in Context) image dataset, it has become the de facto standard for end-to-end evaluation of image captioning algorithms, vision-language pretraining, and image-text retrieval systems. The dataset’s richness, annotation methodology, evaluation protocols, and subsequent extensions have spurred substantial progress and innovation in multimodal research domains (Chen et al., 2015, Parekh et al., 2020, Chun et al., 2022).

1. Dataset Structure and Annotation Protocol

The COCO Captions dataset is organized around the original COCO image splits:

  • Training set: 82,783 images
  • Validation set: 40,504 images
  • Test set: 40,775 images

Two main annotation regimes are employed:

  • c5: Five independent, crowd-sourced captions per image for all train, val, and test images.
  • c40: Forty captions for a random subset of 5,000 test images, primarily for metric calibration and human correlation studies.

Caption collection employed a rigorous protocol via Amazon Mechanical Turk (AMT). Instructions required workers to describe all salient aspects of the scene using a minimum of eight words, avoiding trivialities, speculative statements, or use of proper names. Post-collection, tokenization was performed using the Stanford PTBTokenizer without additional normalization such as stemming or lemmatization. The total number of captions exceeds 1,026,000, with all data available in standardized JSON formats (Chen et al., 2015).

Split        #Images   Captions per image   Total captions
Train        82,783    5                    413,915
Val          40,504    5                    202,520
Test (c5)    40,775    5                    179,189
Test (c40)   5,000     40                   200,060
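The standardized JSON layout mentioned above can be illustrated with a minimal sketch. The ids, file name, and captions below are invented for illustration, but the "images"/"annotations" key structure matches the standard COCO captions format:

```python
import json
from collections import defaultdict

# Minimal illustration of the COCO captions JSON layout.
# The ids, file name, and caption text are invented; real files
# (e.g. captions_train2014.json) follow the same key structure.
coco = {
    "images": [
        {"id": 42, "file_name": "COCO_train2014_000000000042.jpg"},
    ],
    "annotations": [
        {"id": 1, "image_id": 42,
         "caption": "A man riding a bicycle down a city street."},
        {"id": 2, "image_id": 42,
         "caption": "A cyclist travels along a road past parked cars."},
    ],
}

# Round-trip through JSON, as a loader would do with the real files.
coco = json.loads(json.dumps(coco))

# Group captions by image id, the usual first step in caption loaders.
captions_by_image = defaultdict(list)
for ann in coco["annotations"]:
    captions_by_image[ann["image_id"]].append(ann["caption"])

print(captions_by_image[42])
```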

2. Evaluation Methodology and Metrics

The dataset is accompanied by a public evaluation server, hosted on CodaLab, that ensures standardization and comparability across models. The server accepts prediction files in a specified JSON format and computes corpus-level metrics using the following algorithms:

  • BLEU (Bilingual Evaluation Understudy): n-gram precision with brevity penalty, up to BLEU-4.
  • METEOR: Alignment via exact, synonym, stem, and paraphrase matches with chunk and fragmentation penalties.
  • ROUGE-L: Longest common subsequence-based F-score.
  • CIDEr-D: TF-IDF-weighted n-gram similarity penalizing length deviations and favoring consensus with references.

All automatic metrics use precise, documented equations as implemented in the official COCO Caption codebase. The evaluation protocol enforces use of held-out reference captions for the test split to avoid contamination and preserve the benchmark’s integrity (Chen et al., 2015).
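As a sketch of the first metric, a toy single-sentence BLEU with clipped n-gram precision and a brevity penalty can be written as follows. This is a simplified illustration, not the official corpus-level implementation from the COCO Caption codebase:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy single-sentence BLEU: clipped n-gram precision + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("a man rides a bike".split(), ["a man rides a bike".split()]))  # 1.0
```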

3. Known Limitations and Motivations for Extension

The original MS-COCO Captions corpus defines positive image-caption pairings rigidly: each image is associated only with its five reference captions, and each caption with a single image. All other pairings are treated as negatives, even when a caption could plausibly describe several visually similar images, or when an image is well described by near-synonymous captions written for related images. This practice leads to substantial “false negatives” in image-text matching evaluation: an estimated 3.6× more true image-to-caption links and 8.5× more caption-to-image links exist than were originally annotated. The sparsity of positives per query has pushed the field toward Recall@K metrics, which fail to reward models for retrieving plausible but unannotated matches and disregard the full ranked ordering of relevant results (Chun et al., 2022, Parekh et al., 2020).
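A minimal sketch makes the Recall@K limitation concrete; the ids below are hypothetical:

```python
def recall_at_k(ranked_ids, positive_ids, k):
    """Recall@K for a single query: 1.0 if any annotated positive
    appears in the top-k retrieved items, else 0.0."""
    return 1.0 if any(pid in ranked_ids[:k] for pid in positive_ids) else 0.0

# With only one annotated positive per caption, a model whose top hit is a
# plausible-but-unannotated match is scored exactly like a model whose top
# hit is nonsense (all ids here are invented for illustration).
ranking = ["img_plausible", "img_gt", "img_other"]
print(recall_at_k(ranking, {"img_gt"}, 1))  # 0.0 despite a sensible top hit
print(recall_at_k(ranking, {"img_gt"}, 5))  # 1.0
```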

4. Notable Extensions and Augmented Benchmarks

Several prominent extensions have addressed these limitations:

a. Crisscrossed Captions (CxC):

CxC augments the MS-COCO validation and test splits with dense human semantic similarity judgments for 267,095 image-image, caption-caption, and caption-image pairs. Pairs are rated on a continuous [0.0–5.0] scale by multiple annotators using an extended Semantic Textual Similarity (STS) protocol, enabling nuanced evaluation of (i) intra-modal and (ii) inter-modal semantic alignments. This allows models to be trained and evaluated for retrieval and similarity correlation tasks in a unified, multi-task setup, exposing hidden positives and supporting vision-only and text-only retrieval benchmarks (Parekh et al., 2020).
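Evaluating against such graded judgments typically reduces to correlating model similarity scores with the human ratings. A minimal sketch with invented ratings, using Pearson correlation for simplicity:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical pairs: human similarity ratings in [0.0, 5.0] versus
# a model's cosine-similarity scores (both invented for illustration).
human = [4.8, 3.1, 0.5, 2.2]
model = [0.92, 0.61, 0.08, 0.40]
print(round(pearson(human, model), 3))
```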

b. ECCV Caption (Extended COCO Validation Caption):

To systematically correct false negatives, ECCV Caption employs a machine-in-the-loop procedure: five pretrained image-text matching (ITM) models propose candidate pairings, which are then validated by crowd workers on Amazon Mechanical Turk. The resulting test split contains, on average, 17.9 validated captions per image and 8.5 images per caption, a dramatic increase over the baseline (5 and 1, respectively). ECCV Caption is distributed as a drop-in replacement for the test set, facilitating richer and more accurate evaluation. It introduces mean Average Precision at R (mAP@R), a retrieval metric that rewards correctly ranking all true positives and, in a user study, corresponded better with human judgments than Recall@K. The release includes annotation splits, association files, and scripts for mAP@R, Recall@K, and R-Precision (Chun et al., 2022).

Benchmark       Avg. captions per image   Avg. images per caption
Original COCO   5.0                       1.0
ECCV Caption    17.9                      8.5
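A per-query sketch of mAP@R, computing average precision over the first R retrieved items where R is the number of validated positives (the ids below are hypothetical):

```python
def map_at_r(ranked_ids, positive_ids):
    """mAP@R for one query: average precision over the first
    R = |positives| retrieved items."""
    r = len(positive_ids)
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked_ids[:r], start=1):
        if item in positive_ids:
            hits += 1
            precision_sum += hits / i  # precision at rank i, counted on hits
    return precision_sum / r

# Hypothetical query with R = 3 validated positives.
positives = {"a", "b", "c"}
print(map_at_r(["a", "b", "c", "x"], positives))  # perfect ranking -> 1.0
print(map_at_r(["x", "a", "b", "c"], positives))  # positives pushed down
```

Unlike Recall@K, a single mis-ranked positive lowers the score, so the metric reflects the full ordering of relevant results rather than only the best hit.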

5. Advances in Modeling and Empirical Findings

The proliferation of dense and graded ground-truth associations has enabled more rigorous empirical study. Recent reevaluations of 25 vision-language models, including VSE-style (VSE0, VSE++, PVSE, PCME), region-based (VSRN, CVSE), large-scale vision-language pretraining (CLIP, VinVL, BLIP), and triplet-mining variants, reveal that existing ranking metrics (Recall@1/5/10) are highly correlated across standard and extended benchmarks (Kendall τ > 0.87). However, percentile-based metrics such as mAP@R from ECCV Caption can produce substantially different model orderings and expose overfitting to hard-negative mining strategies: some models that lead by Recall@1 fall behind under mAP@R, and vice versa. The use of multiple diverse ITM models in the ECCV Caption annotation loop mitigates bias toward any single model architecture (Chun et al., 2022).

6. Additional Modalities and Multimodal Applications

The original dataset has been further extended into new modalities to foster research in spoken-language and cross-lingual multimodal learning. For example, SPEECH-COCO synthesizes over 600 hours of spoken captions corresponding exactly to the existing training and validation splits, introducing controlled inter- and intra-speaker variability, disfluency simulation, and precise word/syllable/phoneme alignment metadata. This enables investigation of visually grounded spoken term discovery, cross-modal representation learning with speech, and simulating low-resource scenarios. Such augmentations preserve the core dataset's structure while dramatically increasing its applicability (Havard et al., 2017).

7. Access, Licensing, and Usage Guidance

All core COCO Captions data and major extensions are maintained as open resources. The images retain their original Creative Commons licenses, while captions are released under cc-by or cc-by-sa (for speech synthesis, cf. SPEECH-COCO). Reference and submission JSON schemas are public, and evaluation scripts are extensively documented. Dataset downloads, extensions (CxC, ECCV Caption, SPEECH-COCO), and evaluation utilities are available from their respective repositories.

Adherence to original test-server data restrictions remains mandatory for all derivative evaluations (Chen et al., 2015, Havard et al., 2017, Parekh et al., 2020, Chun et al., 2022).
