
COCO Data Archive Overview

Updated 30 January 2026
  • COCO Data Archive is a comprehensive collection of datasets for object detection, segmentation, captioning, 3D reconstruction, and optimization tasks.
  • It integrates MS-COCO, 3D-COCO, and bbob-biobj with rigorous annotation protocols, standardized schemas, and evaluation metrics.
  • The archive enables reproducible research by providing programmatic access, consistent benchmarks, and support for multimodal AI methodologies.

The COCO Data Archive comprises a set of distinct but related resources widely utilized in computer vision and optimization, each tailored to specific benchmarking, annotation, or multimodal evaluation tasks. Three principal datasets constitute this archive: the Microsoft COCO (Common Objects in Context) object detection and segmentation dataset, the 3D-COCO extension for multimodal 2D–3D research, and the COCO Platform for bi-objective black-box optimization benchmarking. Each resource exhibits rigorous data organization, precise evaluation practices, and standardized interfaces, enabling reproducible research across detection, captioning, 3D understanding, and optimization.

1. Dataset Organization and Content

The primary MS-COCO dataset contains over 330,000 images with detailed instance-level object annotations covering 80 everyday categories. Subsets include train2014 (82,783 images), val2014 (40,504 images), and test2015 (40,775 images). Annotations are available as per-image segmentation masks, bounding boxes, and image-level metadata. The MS-COCO Captions corpus augments this with approximately 1 million human-annotated captions, yielding 5 captions per image for all splits, and 40 captions each for a 5,000-image held-out subset (“c40”) (Chen et al., 2015). Annotation files follow a uniform JSON schema with explicit linking via image_id and annotation_id fields.

3D-COCO extends MS-COCO by providing 27,760 aligned 3D CAD models spanning all 80 COCO classes, sourced primarily from ShapeNet (26,254 models, 22 classes) and Objaverse (1,506 models, 58 classes). Each object annotation in MS-COCO is matched to the top-3 3D models of its class through a viewpoint-maximized Intersection-over-Union (IoU) criterion, recorded in annotations_3d_train2017.json or annotations_3d_val2017.json (Bideaux et al., 2024).

The COCO “bbob-biobj” platform targets benchmarking of bi-objective black-box optimizers, storing the full trajectory of optimization runs. Its data archive is hierarchically structured by objective functions, dimensions, instances, and algorithms. Standard file types include raw evaluation logs (*.dat), improvement records (*.rdata), and performance logs for target achievement (*.info) (Brockhoff et al., 2016).

2. Data Access, Formats, and Schemas

Each COCO dataset prescribes concrete download locations and directory layouts for images, annotations, and metadata. Example MS-COCO annotation directory after archive extraction:

/train2014/…jpg
/val2014/…jpg
/annotations/
  captions_train2014.json
  captions_val2014.json
  image_info_test2015.json

3D-COCO preserves MS-COCO’s image structure under <3D-COCO_ROOT>/images/, while introducing <3D-COCO_ROOT>/models/ (CAD meshes, .obj; voxel .binvox; point cloud .ply; multi-view renders as PNG) and <3D-COCO_ROOT>/annotations/ for 2D–3D alignment (Bideaux et al., 2024). Its JSON annotation schema for an aligned object instance includes:

{
  "image_id": 123456,
  "annotation_id": 7890,
  "category_id": 3,
  "bbox": [x, y, width, height],
  "model_matches": [
    {"model_id": "...", "source": "ShapeNet", "best_view": 17, "iou_score": 0.78},
    ...
  ],
  "flags": {
      "is_small": false, "is_crowd": false,
      "is_truncated": false, "is_occluded": true, "is_divided": false
  },
  "camera_intrinsics": {"fx": 1500.0, "fy": 1498.6, "cx": 640.5, "cy": 480.5},
  "scale_factor": 1.0
}

The “bbob-biobj” archive organizes results as follows:

data_bbob-biobj/
├── f01/
│   ├── DIM2/
│   │   ├── i01/
│   │   │   ├── alg__MyAlgo/
│   │   │   │   ├── f01_i01_DIM2_MyAlgo.dat
│   │   │   │   ├── f01_i01_DIM2_MyAlgo.info
│   │   │   │   └── f01_i01_DIM2_MyAlgo.rdata

Key columns and fields for each file type are explicitly specified (Brockhoff et al., 2016).

3. Annotation, Alignment, and Matching Methodology

3D-COCO employs an IoU-based model selection protocol for 2D–3D alignment. For each COCO annotation, the binary instance mask $M_{2D}$ is computed at $224 \times 224$ resolution. Each candidate CAD model is rendered from 62 uniformly sampled viewpoints, yielding silhouettes $\{M^{i}_{3D}\}$. The IoU is given by:

$$\text{IoU}(M_{2D}, M_{3D}) = \frac{\mathrm{area}(M_{2D} \cap M_{3D})}{\mathrm{area}(M_{2D} \cup M_{3D})}$$

The model score is the maximum IoU over viewpoints, $s_\text{model} = \max_{i=1..62} \text{IoU}(M_{2D}, M^{i}_{3D})$. The top 3 models per annotation are retained, along with their best alignments and IoU values. Additional flags identify small, truncated, crowd, occluded, and divided instances based on bounding box, mask, and neighborhood heuristics (Bideaux et al., 2024).
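The selection step can be sketched as follows, assuming the 62 silhouettes per candidate model have already been rendered; the function names and the candidates dictionary are illustrative and not part of the released 3D-COCO code.

import numpy as np

def mask_iou(mask_2d, mask_3d):
    # IoU of two binary masks at identical resolution (here 224x224)
    inter = np.logical_and(mask_2d, mask_3d).sum()
    union = np.logical_or(mask_2d, mask_3d).sum()
    return float(inter) / float(union) if union else 0.0

def score_model(mask_2d, silhouettes):
    # Best-view score of one CAD model: max IoU over its 62 rendered silhouettes
    scores = [mask_iou(mask_2d, sil) for sil in silhouettes]
    best_view = int(np.argmax(scores))
    return scores[best_view], best_view

def top3_models(mask_2d, candidates):
    # candidates: dict mapping model_id -> list of 62 binary silhouettes (np.ndarray)
    ranked = sorted(
        ((model_id,) + score_model(mask_2d, sils) for model_id, sils in candidates.items()),
        key=lambda t: t[1],
        reverse=True,
    )
    return [{"model_id": m, "iou_score": s, "best_view": v} for m, s, v in ranked[:3]]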

Caption annotations are collected via Amazon Mechanical Turk under strict protocols (minimum length, vocabulary, no proper names), with 5–40 captions per image (Chen et al., 2015).

The COCO bbob-biobj platform uses exact logging of all non-dominated points and the associated evaluation steps for reproducible computation of Pareto fronts and hypervolume indicators (Brockhoff et al., 2016).
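Conceptually, such logging reduces to a dominance filter over incoming objective vectors, as in the minimal sketch below (illustrative only, not the platform's own implementation):

def dominates(a, b):
    # True if objective vector a weakly dominates b (minimization) and is strictly better in some objective
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def update_archive(archive, candidate, evaluation_count):
    # archive: list of (evaluation_count, objective_vector) pairs of mutually non-dominated points
    if any(dominates(f, candidate) for _, f in archive):
        return archive  # candidate is dominated: no improvement to log
    kept = [(t, f) for t, f in archive if not dominates(candidate, f)]
    kept.append((evaluation_count, candidate))
    return kept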

4. Evaluation Metrics and Benchmarks

COCO object detection and segmentation evaluation is not detailed in the referenced archive content, but MS-COCO Captions defines corpus-level evaluation with BLEU, METEOR, ROUGE-L, and CIDEr-D. These metrics operate as follows:

  • BLEU: n-gram clipped precision with brevity penalty.
  • METEOR: F-score of matches via exact, stem, synonym, and paraphrase, penalized by fragmentation.
  • ROUGE-L: Longest common subsequence-based recall/precision F-β score.
  • CIDEr-D: TF-IDF weighted n-gram similarity, with Gaussian length penalty and scaling.

COCO Caption evaluation is centralized on a public server (CodaLab), where official scores are computed and compared to human performance baselines (Chen et al., 2015).
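For local scoring, the same metrics can be computed with the coco-caption toolkit (pycocoevalcap); a rough sketch, assuming the package is installed and using placeholder file names:

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Ground-truth captions plus a results file in the standard COCO results format,
# i.e. a list of {"image_id": ..., "caption": "..."} entries (paths are placeholders).
coco = COCO("captions_val2014.json")
coco_res = coco.loadRes("captions_val2014_results.json")

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # score only images with generated captions
coco_eval.evaluate()

for metric, score in coco_eval.eval.items():  # e.g. Bleu_4, METEOR, ROUGE_L, CIDEr
    print(f"{metric}: {score:.3f}")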

3D-COCO provides data and benchmarks for use cases such as open-vocabulary detection, 3D-guided retrieval, single- and multi-view 3D reconstruction, and synthetic-to-real adaptation, but does not define standard evaluation leaderboards or reported baseline results (Bideaux et al., 2024).

The COCO bbob-biobj archive evaluates bi-objective optimizer performance by measuring hypervolume indicator achievement versus a normalized reference Pareto set,

$$H(P) = -\,\mathrm{VOL}\left(\bigcup_{x\in P}[f(x), r]\right)$$

across a fixed sequence of target precisions, where $P$ is the set of non-dominated solutions found so far and $r$ the reference point. Key metrics are:

  • Runtime $T$, counted in function evaluations to reach a target hypervolume value.
  • Expected Running Time (ERT), averaging $T$ over multiple runs (a minimal computation is sketched after this list).
  • Empirical CDF (ECDF), the fraction of (target, run) pairs solved versus the number of function evaluations (Brockhoff et al., 2016).
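A minimal sketch of the ERT computation, assuming each run is summarized by its evaluation count and a success flag (illustrative, not the official post-processing code):

def expected_running_time(runs):
    # runs: list of (evaluations_used, target_reached) pairs, one per run on a
    # fixed (function, dimension, instance, target-precision) combination.
    # ERT = total evaluations spent over all runs / number of successful runs.
    total_evals = sum(evals for evals, _ in runs)
    successes = sum(1 for _, reached in runs if reached)
    return float("inf") if successes == 0 else total_evals / successes

# Three runs, two of which reach the target hypervolume precision:
print(expected_running_time([(1200, True), (5000, False), (800, True)]))  # 3500.0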

5. Programmatic Access and Integration

The COCO datasets maintain strong support for programmatic access. Python APIs (pycocotools) are the standard method for loading image, instance, and caption annotations:

from pycocotools.coco import COCO

# 2D instance annotations and 3D-COCO alignment annotations share the same image ids
coco2d = COCO("instances_train2017.json")
coco3d = COCO("annotations_3d_train2017.json")

img_ids = coco2d.getImgIds()
for img_id in img_ids:
    anns2d = coco2d.loadAnns(coco2d.getAnnIds(imgIds=[img_id]))
    anns3d = coco3d.loadAnns(coco3d.getAnnIds(imgIds=[img_id]))
    # pair 2D and 3D entries via their shared annotation ids, then read model_matches

For 3D-COCO, integration with COCO-compatible toolchains such as Detectron2 or MMDetection is possible by registering the 3D annotation JSONs identically to MS-COCO (Bideaux et al., 2024). Model files, voxelizations, and renderings are structured for batch loading in deep learning pipelines.
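With Detectron2, for example, this registration reduces to the standard COCO helper; the dataset name and paths below are placeholders:

from detectron2.data.datasets import register_coco_instances

# Register the 3D-COCO annotation JSON exactly like a standard COCO instances file;
# name and paths are placeholders for illustration.
register_coco_instances(
    "coco3d_train",                    # name referenced later in cfg.DATASETS.TRAIN
    {},                                # extra metadata (none required here)
    "annotations_3d_train2017.json",   # 3D-COCO annotation file
    "train2017/",                      # image root, unchanged from MS-COCO
)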

The bbob-biobj platform provides direct download and parsing utilities via the cocoex Python package. Hypervolume computations can be performed using the pygmo library:

from pygmo import hypervolume

obj = [[0.3, 0.9], [0.5, 0.6], [0.8, 0.2]]  # non-dominated objective vectors (normalized)
hv = hypervolume(obj).compute([1.1, 1.1])   # reference point bounding the region of interest
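To generate such archives in the first place, a minimal cocoex benchmarking loop can be sketched as follows; the random-search solver and result folder name are stand-ins for an actual optimizer:

import numpy as np
import cocoex

# Run a placeholder random search on the bi-objective suite; the observer writes
# the run logs (e.g. *.dat, *.info) into the given result folder.
suite = cocoex.Suite("bbob-biobj", "", "")
observer = cocoex.Observer("bbob-biobj", "result_folder: random_search_demo")

for problem in suite:
    problem.observe_with(observer)  # attach logging of evaluations and non-dominated points
    for _ in range(100 * problem.dimension):
        x = problem.lower_bounds + np.random.rand(problem.dimension) * (
            problem.upper_bounds - problem.lower_bounds
        )
        problem(x)  # each evaluation is logged by the observer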

6. Applications and Research Impact

The COCO Data Archive underpins research in diverse scenarios:

  • Detection and instance segmentation on natural images
  • Image caption generation and evaluation under linguistically realistic protocols
  • Multimodal retrieval and 2D–3D grounding, enabling connections between textual, visual, and 3D shape representations
  • Evaluation of bi-objective black-box optimizers with rigorous, reproducible metrics

3D-COCO uniquely enables research at the intersection of detection, 3D reconstruction, and multimodal query, including benchmark protocols for text-to-3D, synthetic-to-real transfer, and pose-aware detection (Bideaux et al., 2024). The explicit alignment of images and 3D models supports tasks such as single-view reconstruction and 6D pose estimation.

Captions evaluation with COCO-standard metrics remains a primary baseline for captioning models and generative vision-language architectures (Chen et al., 2015). The bbob-biobj archive has established itself as the canonical testbed for benchmarking evolutionary and stochastic multi-objective optimization methods (Brockhoff et al., 2016).

7. Data Provenance, Reproducibility, and Extensibility

The entire COCO Data Archive is organized to facilitate transparent benchmarking and reproducibility. All annotation files, model matching procedures, and evaluation metrics are openly documented. Where reference solutions (e.g., Pareto sets) are improved or errors corrected, performance indicators can be re-computed without retraining due to permanent logging of raw archives and improvement histories (Brockhoff et al., 2016). The open-source release of code and data for 3D-COCO and the bbob-biobj platforms allows researchers to extend annotation schemas, substitute evaluation logging, or incorporate additional data modalities as needed (Bideaux et al., 2024).

The design of each archive segment emphasizes forward compatibility and integration with community-standard pipelines, ensuring ongoing relevance to both benchmarking and methodological development.

