
3D-COCO: Aligned 2D–3D Dataset

Updated 20 February 2026
  • 3D-COCO is a large-scale dataset that integrates MS-COCO images with 28K curated 3D CAD models across 80 categories for direct 2D–3D alignment.
  • The dataset employs an IoU-based silhouette matching protocol using 62 canonical renders per model to select the top-3 best-fit CAD representations.
  • It supports research in single-view 3D reconstruction, cross-modal retrieval, and hybrid 2D–3D detection by combining image cues with explicit 3D shape information.

3D-COCO is a large-scale extension of the MS-COCO dataset that augments 2D object detection imagery with aligned 3D CAD models, providing explicit 2D–3D correspondences. This resource is designed to support research in 3D reconstruction from single images, 2D–3D detection, cross-modal retrieval, and related computer vision benchmarks involving textual, image, and 3D CAD model queries. 3D-COCO comprises 28,000 curated 3D object models corresponding to the 80 semantic categories in MS-COCO, with each COCO detection annotation linked to a small set of best-fit CAD models based on an intersection-over-union (IoU) matching procedure. The dataset is released under open-source terms, with code and downloads provided for all assets (Bideaux et al., 2024).

1. Dataset Construction and Content

The 3D-COCO dataset was constructed by augmenting the full MS-COCO image collection—164,000 images and 897,000 detection annotations—with explicit 3D object representations:

  • ShapeNet Integration: For 22 MS-COCO categories with well-defined ShapeNet synsets (e.g., airplane, car, chair), all available ShapeNetCore models (26,254 meshes) were retrieved.
  • Objaverse Integration: For the 58 remaining categories, visually suitable models were manually identified on Objaverse, resulting in 1,506 additional meshes.
  • Preprocessing: All CAD models were converted to OBJ format for consistency and centered at the weighted mean of their vertices. Multiple derived representations are supplied per model:
    • 32³ voxel grids (binvox)
    • 10,000-point point clouds (Open3D)
    • 62 rendered images per model (including RGB, gray, depth, silhouette), covering viewpoints sampled from an icosidodecahedron using Blender.
  • Statistics: On average, each class is represented by ~350 models, with some classes having several thousand and others as few as six. All 80 categories are covered.
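The centering step of the preprocessing above can be sketched as follows (a minimal illustration with NumPy, not the released preprocessing code; `center_mesh` is a hypothetical helper, and uniform vertex weights are assumed for the "weighted mean"):

```python
import numpy as np

def center_mesh(vertices: np.ndarray) -> np.ndarray:
    """Translate a mesh so its vertex centroid sits at the origin.

    vertices: (N, 3) array of XYZ coordinates parsed from an OBJ file.
    """
    centroid = vertices.mean(axis=0)  # mean of vertices with uniform weights
    return vertices - centroid

# Toy example: a unit square offset from the origin.
verts = np.array([[1.0, 1.0, 0.0],
                  [2.0, 1.0, 0.0],
                  [2.0, 2.0, 0.0],
                  [1.0, 2.0, 0.0]])
centered = center_mesh(verts)
print(centered.mean(axis=0))  # → [0. 0. 0.]
```

The same centered vertices would then feed the voxelization, point-cloud sampling, and rendering steps listed above.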

A summary table of coverage:

Source      Categories   Models Collected
ShapeNet    22           26,254
Objaverse   58           1,506
Total       80           27,760 (~28,000)

These data support a diverse set of object scales and geometries, making 3D-COCO the most comprehensive 2D–3D-aligned dataset based on MS-COCO imagery to date (Bideaux et al., 2024).

2. 2D–3D Alignment Protocol

To associate each detection in MS-COCO with representative CAD models:

  • Silhouette Matching: For each annotation, the segmentation mask is rasterized into a 224×224 binary image, normalized so that its silhouette touches the image frame borders.
  • Viewpoint Sampling and IoU Computation: Each candidate 3D model in the same category is pre-rendered from 62 canonical viewpoints. For each annotation mask, the IoU against every candidate silhouette is computed, and the top three models by maximal IoU are retained as matches:

IoU(A, B) = |A ∩ B| / |A ∪ B|

  • Normalization: Both the instance mask and the candidate silhouettes are scale-normalized within the 224×224 frame, so the IoU compares silhouette shape rather than absolute object size.
  • Annotation Flags: Each annotation records additional properties, e.g., is_small (area <1% of image), is_truncated (box-to-border <2%), is_crowd (from COCO), is_occluded (mask overlaps another), and is_divided (mask has >1 component).
  • Camera Parameters: All renders share a fixed camera intrinsics/extrinsics configuration in Blender, and viewpoint indices are logged with each match for reproducibility.
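The matching loop above can be sketched as follows (a simplified illustration with random masks standing in for real silhouettes; `top3_matches` is a hypothetical helper, and the border-normalization step is omitted):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean silhouette masks of equal shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def top3_matches(instance_mask, model_silhouettes):
    """Select the three best-fitting CAD models for one instance mask.

    model_silhouettes: {model_id: (62, H, W) boolean renders}.
    Returns [(model_id, best_view_index, iou)] sorted by descending IoU.
    """
    scored = []
    for model_id, views in model_silhouettes.items():
        ious = [mask_iou(instance_mask, v) for v in views]
        best_view = int(np.argmax(ious))          # best of 62 viewpoints
        scored.append((ious[best_view], best_view, model_id))
    scored.sort(reverse=True)
    return [(mid, view, iou) for iou, view, mid in scored[:3]]

# Toy run with random silhouettes for five candidate models.
rng = np.random.default_rng(0)
mask = rng.random((224, 224)) > 0.5
sils = {f"model_{i}": rng.random((62, 224, 224)) > 0.5 for i in range(5)}
matches = top3_matches(mask, sils)
```

In practice the per-model best-view IoUs would be computed against the pre-rendered silhouette bank rather than random masks.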

This approach optimizes for silhouette similarity, but does not attempt to estimate the precise camera pose of the original image instance. A plausible implication is that pose-accurate matching would require additional inference or manual annotation for articulated and highly deformable classes.
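The annotation flags listed above can be derived mechanically from each mask and box. The sketch below assumes the 2% truncation threshold is measured against the shorter image side (the text does not pin this down) and uses a plain BFS for connected components; `annotation_flags` is a hypothetical helper:

```python
import numpy as np
from collections import deque

def count_components(mask: np.ndarray) -> int:
    """4-connected components of a boolean mask via BFS (pure NumPy/stdlib)."""
    seen = np.zeros_like(mask, dtype=bool)
    H, W = mask.shape
    n = 0
    for i in range(H):
        for j in range(W):
            if mask[i, j] and not seen[i, j]:
                n += 1
                q = deque([(i, j)])
                seen[i, j] = True
                while q:
                    a, b = q.popleft()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        na, nb = a + da, b + db
                        if 0 <= na < H and 0 <= nb < W and mask[na, nb] and not seen[na, nb]:
                            seen[na, nb] = True
                            q.append((na, nb))
    return n

def annotation_flags(mask: np.ndarray, bbox, image_wh):
    """Derive is_small / is_truncated / is_divided for one instance.

    mask: (H, W) boolean segmentation; bbox: (x, y, w, h); image_wh: (W, H).
    """
    img_w, img_h = image_wh
    x, y, w, h = bbox
    is_small = mask.sum() < 0.01 * img_w * img_h            # area < 1% of image
    margin = min(x, y, img_w - (x + w), img_h - (y + h))    # box-to-border gap
    is_truncated = margin < 0.02 * min(img_w, img_h)        # assumed reference side
    is_divided = count_components(mask) > 1                 # disconnected mask
    return {"is_small": int(is_small), "is_truncated": int(is_truncated),
            "is_divided": int(is_divided)}

# Toy instance: two small disconnected blobs well inside a 100x100 image.
mask = np.zeros((100, 100), dtype=bool)
mask[2:6, 2:6] = True
mask[50:54, 50:54] = True
flags = annotation_flags(mask, bbox=(2, 2, 52, 52), image_wh=(100, 100))
```

The `is_crowd` and `is_occluded` flags are inherited from COCO and from mask-overlap tests against other annotations, so they are not derivable from a single instance alone.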

3. Annotation Schema and Data Access

The 3D-COCO annotation file adopts the native COCO JSON schema but adds a matched_models field to each detection instance. Each entry in that field includes:

  • model_id (ShapeNet synset+instance or Objaverse UID)
  • source (ShapeNet/Objaverse)
  • best_view (index 1–62 of the silhouette yielding maximum IoU)
  • iou (achieved maximum IoU with instance mask)

Exemplar JSON entry:

{
  "id": ...,
  "image_id": ...,
  "category_id": ...,
  "bbox": [x, y, w, h],
  "segmentation": [...],
  "is_crowd": 0,
  "is_small": 0,
  "is_truncated": 0,
  "is_occluded": 0,
  "is_divided": 0,
  "matched_models": [
    {"model_id":"02958343_0001", "source":"ShapeNet", "best_view":12, "iou":0.82},
    {"model_id":"02958343_0047", "source":"ShapeNet", "best_view":37, "iou":0.80},
    {"model_id":"02958343_0102", "source":"ShapeNet", "best_view":5,  "iou":0.78}
  ]
}
Full dataset access (images, models, annotation JSON) is available via the project website (Bideaux et al., 2024).
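A consumer of the annotation file can read the matched models with the standard library alone. The sketch below parses an entry shaped like the exemplar above (field names taken from it; `best_match` is a hypothetical helper):

```python
import json

# Entry mirroring the schema above (values illustrative, trimmed to the
# fields a retrieval consumer needs).
entry_json = '''
{
  "id": 1, "image_id": 42, "category_id": 3,
  "bbox": [10, 20, 100, 80],
  "matched_models": [
    {"model_id": "02958343_0001", "source": "ShapeNet", "best_view": 12, "iou": 0.82},
    {"model_id": "02958343_0047", "source": "ShapeNet", "best_view": 37, "iou": 0.80},
    {"model_id": "02958343_0102", "source": "ShapeNet", "best_view": 5,  "iou": 0.78}
  ]
}
'''

def best_match(annotation: dict) -> dict:
    """Return the highest-IoU CAD match for one detection annotation."""
    return max(annotation["matched_models"], key=lambda m: m["iou"])

entry = json.loads(entry_json)
top = best_match(entry)
print(top["model_id"], top["best_view"])  # → 02958343_0001 12
```

The `best_view` index then selects one of the 62 pre-rendered silhouettes of that model, and the logged camera parameters make the render reproducible.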

4. Supported Research Tasks and Benchmarks

3D-COCO is designed to support a spectrum of vision tasks that require explicit 2D–3D grounding:

  • Configurable Detection: Joint text, image, or 3D CAD queries. Example modalities:
    • Text: Detect standard COCO categories (classic detection).
    • Image Patch: Query with a segmentation mask or silhouette to retrieve similar object instances.
    • 3D Model: Query with a 3D model to localize matching objects via silhouette or deep-shape feature comparison.
  • Single-View 3D Reconstruction: Given a real image crop, regress CAD model, voxel grid, or point cloud, using the attached matched models as supervision.
  • Multi-View/Synthetic Supervision: Use the 62 canonical renders per CAD model as controlled multi-view input for training or evaluation of aggregation/fusion networks.
  • Evaluation Protocols: Detection can be measured using COCO's standard AP metrics; 3D reconstruction using voxel IoU, Chamfer-L1, surface F-score (e.g., at 1–2 cm), averaged over classes or instances, following ShapeNet conventions.

No official baselines are reported to date; integration with existing COCO detectors or ShapeNet reconstructors is straightforward, given the provided data format (Bideaux et al., 2024).
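The 3D evaluation metrics mentioned above can be sketched in a few lines of NumPy. Note that "Chamfer-L1" conventions vary across papers (L1 vs. L2 pointwise norm); this sketch uses the L1 norm between points and brute-force pairwise distances, which is fine at small scale but should be replaced by a KD-tree for the full 10,000-point clouds:

```python
import numpy as np

def voxel_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean occupancy grids, e.g. the 32^3 voxelizations."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def chamfer_l1(p: np.ndarray, q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds p (N,3) and q (M,3),
    using the L1 norm between point pairs."""
    d = np.abs(p[:, None, :] - q[None, :, :]).sum(-1)  # (N, M) pairwise L1
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Sanity check: a cloud matched against itself has zero Chamfer distance.
p = np.random.default_rng(0).random((100, 3))
print(chamfer_l1(p, p))  # → 0.0
```

Averaging these per-instance scores over classes or instances, as stated above, follows the usual ShapeNet evaluation conventions.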

5. Potential Applications and Research Directions

Recommended usage scenarios include:

  • Detection-with-3D: Multimodal detectors combining 2D image features and 3D shape descriptors (e.g., via PointNet or volumetric CNNs) using the 2D–3D linkage as explicit supervisory signal.
  • Single-View 3D Reconstruction: Shape regression supervised by mask and CAD matches, supporting research into geometry prediction from natural images.
  • Synthetic Multi-View Training: Use of precomputed multi-view renders to train view-fusion models, in a controlled setting.
  • Extensions and Open Directions:
    • Improved articulated object alignment (e.g., humans, animals) via pose estimation or multi-part matching.
    • Deep feature-based retrieval superseding silhouette IoU, for robust cross-modal embedding learning.
    • Enhanced class balance by expanding Objaverse coverage for under-represented categories.
    • Incorporation of photometric cues (surface normals, textures) in matching/reconstruction pipelines.

A plausible implication is that 3D-COCO's open-source, multi-modal structure will accelerate advances in 2D–3D perception, particularly in open-vocabulary detection, shape retrieval, and learning with weak 3D supervision.

6. Relationship to Other 2D–3D Datasets

3D-COCO is positioned as a bridge between real-image 2D detection benchmarks (MS-COCO) and large-scale synthetic 3D shape corpora (ShapeNet, Objaverse), providing coverage for all 80 COCO classes with explicit linkages. Unlike CO3D (Reizenstein et al., 2021), which supplies multi-view RGB videos with camera pose and dense point cloud annotations for 50 categories, 3D-COCO offers broader category coverage and tight alignment with standard detection tasks, but its object–model matches are based primarily on silhouette similarity rather than direct 3D scan or multi-view observation. This suggests 3D-COCO is particularly well-suited for methods evaluating cross-modal retrieval, 3D-aware object detection, and scalable 3D supervision pipelines that are grounded in real-world imagery (Bideaux et al., 2024, Reizenstein et al., 2021).
