ILSVRC15 Dataset: Benchmarking Object Recognition

Updated 30 March 2026

The ILSVRC15 dataset is a comprehensive resource for large-scale image classification, localization, and detection, spanning 1,000 to 200 synsets.
It utilizes rigorous annotation protocols via Amazon Mechanical Turk to achieve high precision (≈99.7%) and exhaustive bounding box coverage.
Statistical challenges such as class imbalance and small-object detection are addressed through refined IoU metrics and standardized evaluation methods.

The ILSVRC15 dataset, which underpins the 2015 ImageNet Large Scale Visual Recognition Challenge, serves as a canonical benchmark for large-scale object category classification, localization, and detection within the field of computer vision. Its structure, annotation rigor, and breadth have established it as a critical resource for empirical algorithm evaluation and development, particularly in deep learning and object detection research (Russakovsky et al., 2014).

1. Dataset Composition

ILSVRC15 is composed of three core subdatasets targeting distinct yet interlinked visual recognition tasks: image classification, image localization, and object detection. The classification and localization tasks encompass 1,281,167 training images drawn from 1,000 WordNet synsets, augmented by 50,000 validation images (50 per class) and 100,000 test images (100 per class, annotations withheld). For the detection task, the dataset comprises 200 "basic-level" synsets and includes approximately 456,567 training images (63% from localization set, 24% verified negatives, 13% Flickr scenes), 21,121 validation images, and 40,152 test images. All detection images are exhaustively annotated for object instances, resulting in 478,807 bounding box (bbox) annotations in train (mean ≈2,394 per class; range 502–74,517) and 55,501 bboxes in validation (mean ≈278 per class; range 31–12,824).

2. Data Splits and Class Distribution

The dataset maintains consistent train, validation, and test splits across the 2012–2015 iterations, enabling direct year-to-year benchmarking. For classification/localization, per-class training counts span 732–1,300 (mean μ≈1,281, σ≈160), with no explicit class-wise rebalancing policies. In detection, the class imbalance is more pronounced, with per-class train positives in the range 461–67,513 (mean μ≈2,283). The "positives vs. negatives" ratio varies widely by class in detection, and images in detection splits may contain zero or multiple labels up to the 200-class ceiling.

3. Annotation Protocols and Quality Assurance

Classification and localization labels are crowdsourced via a dynamic-consensus voting protocol on Amazon Mechanical Turk (AMT), pairing the WordNet synset gloss and Wikipedia references for worker disambiguation. The majority threshold is automatically adjusted per synset through an adaptive calibration phase, yielding empirically estimated precision ≈99.7% on held-out manual checks. Bounding box annotations, used in localization and detection, follow a three-stage AMT pipeline: (1) bbox drawing (one per HIT), (2) independent flagging as "good/bad," and (3) holistic coverage verification to ensure all instances are labeled; gold-standard examples are embedded for ongoing quality checks. For detection, label presence/absence is determined through a hierarchical, coarse-to-fine question graph (Deng et al. 2014), exploiting semantic correlations and sparsity to drastically reduce annotation queries to O(log K) per image. Empirical bbox metrics on pilot classes yield 97.9% fully covered images and 99.2% tight bboxes (IoU ≥ 0.5).

4. Dataset Construction and Curation Workflow

Synset selection for classification/localization involves 1,000 non-ancestor synsets from ImageNet, with 639 classes stable from 2010–2014 and periodic adjustments to remove ambiguous or overlapping categories (notably, elimination of "beach" scene classes in 2011 and the addition of 90 dog breeds in 2012). Detection comprises 200 basic-level synsets, chosen for minimal annotation ambiguity and strong overlap with PASCAL VOC classes. Image collection uses multilingual, synonym-augmented queries against Flickr and other search engines, with further manually curated scene-level queries to boost diversity. Crowd instructions emphasize selection diversity (occlusion, instance count, clutter agnostic) and precision ("smallest upright box including only visible parts"). Quality is maintained through gold-standard question interleaving and multi-phase review.

5. 2015 Release Modifications

The ILSVRC15 dataset core splits and class lists remain unchanged versus the 2014 edition. Notable retained augmentations include the expansion of detection training sets in 2014 by 60,658 additional scene and negative images. For detection evaluation, the Intersection-over-Union (IoU) criterion was refined for small objects to account for pixel-level human annotation jitter, introducing

$\mathrm{thr}(B) = \min\left(0.5, \frac{w\,h}{(w+10)(h+10)}\right)$

where $w$ and $h$ are the bbox width and height. Annotation protocols were further clarified to minimize ambiguous-class confusion (e.g., cello vs. violin) and k = 0.6% duplicate bbox error rates.

6. Statistical Characteristics and Evaluation Metrics

ILSVRC15 exhibits both moderate and extreme class imbalance, particularly in detection. Localization challenge is assessed via object scale (mean 35.8% image area in ILSVRC vs. 24.1% in PASCAL VOC), Chance Performance of Localization (CPL, mean 20.8% vs. 8.7%), and clutter metrics (3.59 in ILSVRC vs. 5.90 in PASCAL; the hardest ILSVRC classes approaching VOC difficulty). Top-5 classification error is computed as

$\mathrm{err} = \frac{1}{N}\sum_{i=1}^N \min_j [\,c_{ij} \ne C_i\,]$

where $C_i$ is the ground-truth for image $i$ and $c_{ij}$ are predicted classes. Detection assessments employ per-class precision-recall at score threshold $t$ , matching PASCAL VOC conventions, with average precision derived from PR curves: $\mathrm{Recall}(t) = \frac{\sum_{ij} \mathbf{1}[s_{ij}\ge t]\,z_{ij}}{N}$

$\mathrm{Prec}(t) = \frac{\sum_{ij}\mathbf{1}[s_{ij}\ge t]\,z_{ij}}{\sum_{ij}\mathbf{1}[s_{ij}\ge t]}$

with $z_{ij}=1$ if detection $j$ on image $i$ matches a ground truth box under greedy match (Russakovsky et al., 2014).

7. Recommended Dataset Usage Practices

Researchers are advised to use the top-5 classification error, as hierarchical error provides minimal additional utility. For robust small-object detection, the relaxed IoU threshold is recommended. Annotation or search costs in detection can be optimized by leveraging the presence/absence label hierarchy. Multi-scale and aspect-ratio augmentation are standard and effective strategies, as implemented in top-performing entries (e.g., Krizhevsky et al. 2012; Sermanet et al. 2013; He et al. 2014; Szegedy et al. 2014). Caution is urged when integrating non-provided or weakly supervised data to prevent track contamination.

ILSVRC15 remains the largest and most meticulously annotated public dataset for benchmarking large-scale classification, localization, and detection tasks. Its protocols, scale, and resulting benchmark scores have catalyzed substantial algorithmic progress in convolutional neural networks, region-proposal systems, and bounding box regression methodologies (Russakovsky et al., 2014).

Markdown Report Issue Upgrade to Chat

References (1)

ImageNet Large Scale Visual Recognition Challenge (2014)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ILSVRC15 Dataset.