iNat2021: Fine-Grained Biodiversity Dataset

Updated 21 March 2026
  • The iNat2021 dataset is a large-scale, fine-grained benchmark designed for species identification and biodiversity research, comprising millions of field images across thousands of taxa.
  • It offers detailed species metadata and taxonomic hierarchy annotations with balanced splits to mitigate bias and support robust learning in long-tailed distributions.
  • The dataset facilitates evaluation of supervised, semi-supervised, and vision-language models, providing critical insights into real-world ecological and computer vision challenges.

iNaturalist-2021 (iNat2021) is a large-scale, fine-grained image classification dataset specifically constructed to advance research in species identification, biodiversity monitoring, and representation learning in real-world, long-tailed domains. Derived from crowd-sourced observations via the iNaturalist citizen science platform, iNat2021 includes millions of images spanning thousands of taxa, accompanied by rich metadata. Its scale, taxonomic breadth, and inherent class imbalance make it a critical benchmark for evaluating supervised, semi-supervised, self-supervised, and zero-shot learning algorithms under realistic open-world conditions (Horn et al., 2021, Nakkab et al., 2023, Su et al., 2021).

1. Composition and Taxonomy

The iNat2021 dataset encompasses two principal variants referenced in leading studies:

  • Full iNat2021: Consists of approximately 2.7 million images across 10,000 species-level classes. Taxonomic annotation is provided per image, supporting the full Linnaean hierarchy: Kingdom → Phylum → Class → Order → Family → Genus → Species. Metadata fields include common and scientific names, geolocation, timestamp, and observer identity (Nakkab et al., 2023, Horn et al., 2021).
  • Semi-iNat-2021 (Editor's term): A subset and reorganization for semi-supervised/open-set protocols, comprising 2439 species (810 "in-class", 1629 "out-of-class") with 330,000 images, structured to simulate high-ambiguity and domain-shifted conditions (Su et al., 2021).

Each image in the full dataset is in JPEG format, typically ranging from 800×600 to 3000×2000 pixels, reflecting in-the-wild field photography. In some semi-supervised settings, images are resized to a maximum of 300 pixels per side (Su et al., 2021).
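The resizing rule can be illustrated with a small sketch. It assumes the aspect ratio is preserved and that smaller images are never upscaled; the source only states the 300-pixel maximum, and the helper name `fit_within` is hypothetical.

```python
def fit_within(width: int, height: int, max_side: int = 300) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most max_side.

    Assumption: aspect ratio is kept and small images are left unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit
    scale = max_side / longest
    # Guard against a side collapsing to zero on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


# The two typical field-photo sizes quoted above.
print(fit_within(800, 600))
print(fit_within(3000, 2000))
```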

2. Data Splits, Sampling Strategy, and Annotation Protocol

Data splits and sampling are explicitly designed to mitigate observer bias and temporal leakage:

  • Train/Validation/Test splits: For the 10k-class full dataset, the training set contains 2.7 million images, the validation set 100,000, and the test set 500,000. A balanced "mini-train" subset holds 50 images per species (500,000 in total) (Horn et al., 2021).
  • Temporal partitioning: Training/validation images are drawn from pre-September 25, 2019 observations; test images from the subsequent year (Horn et al., 2021).
  • Observer balancing: Images are selected so that no single observer dominates any species' split; per-species quotas (e.g., test=50, val=10, train≤300 per species) further reduce sample bias (Horn et al., 2021).
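A minimal sketch of how a temporal cutoff and a per-species quota with observer balancing could be combined is given below. The round-robin selection is an assumption chosen for illustration, not the documented procedure, and the record fields (`observer`, `date`) and function names are hypothetical. Whether the cutoff day itself belongs to the later partition is also assumed.

```python
from collections import defaultdict
from datetime import date
from itertools import zip_longest

CUTOFF = date(2019, 9, 25)  # cutoff date quoted in the text


def temporal_split(observations):
    """Split into (before cutoff, on or after cutoff).

    Assumption: the cutoff day itself goes to the later partition.
    """
    early = [o for o in observations if o["date"] < CUTOFF]
    late = [o for o in observations if o["date"] >= CUTOFF]
    return early, late


def round_robin_by_observer(observations, quota):
    """Pick up to `quota` observations of one species, cycling over observers.

    Taking one image per observer per pass keeps a prolific observer
    from filling the quota alone.
    """
    by_observer = defaultdict(list)
    for o in observations:
        by_observer[o["observer"]].append(o)
    picked = []
    for layer in zip_longest(*by_observer.values()):
        for o in layer:
            if o is not None and len(picked) < quota:
                picked.append(o)
    return picked
```

Applied per species with the quotas quoted above (for example 50 for test and 10 for validation), this would cap each split while spreading images across observers.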

Annotations originate from iNaturalist's "research-grade" identifications, requiring concordance by at least two independent identifiers and sufficient photographic evidence. Estimated noise rates fall in the ~5–15% range. A single species label is assigned per image, regardless of content such as indirect evidence or developmental stage (Horn et al., 2021). In Semi-iNat, some images are labeled at the species level, while others receive only coarse labels (kingdom, phylum) or no label, supporting open-set and domain-shift scenarios (Su et al., 2021).

3. Taxonomic Distribution and Class Balance

The class distribution is heavily imbalanced ("long-tailed"), mirroring the frequency of real-world biodiversity observations:

  • Full iNat2021: The number of images per class $f_i$ ranges from <100 (rare species) to >10,000 ("charismatic" species), giving an imbalance ratio $IR = \max_i f_i / \min_i f_i$ on the order of $10^2$ to $10^3$ (Nakkab et al., 2023). For the main supervised splits, per-class counts are capped (300 max, 152 min, avg. 267) to prevent severe domination by the most common species, but substantial skew remains across the long tail (Horn et al., 2021).
  • Semi-iNat-2021: The distribution over "in-class" and "out-of-class" species follows $N(r) \simeq C \cdot r^{-\alpha}$ for the class population $N(r)$ at rank $r$, and $p(n) = C n^{-\beta},\ \beta > 1$ for the class size density. Labeled images per class in the "in-class" set run from 5–80; the unlabeled pool includes species with up to 400 images, particularly among "out-of-class" taxa (Su et al., 2021).
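Both quantities are straightforward to compute from a list of per-class image counts. The sketch below is illustrative only: the exponent is estimated with an ordinary least-squares fit in log-log space, which is a simple stand-in and not necessarily the estimator used in the cited papers, and the function names are hypothetical.

```python
import math


def imbalance_ratio(counts):
    """IR = max_i f_i / min_i f_i over per-class image counts."""
    return max(counts) / min(counts)


def rank_size_exponent(counts):
    """Estimate alpha in N(r) ~ C * r^(-alpha) by least squares on logs.

    Classes are ranked from largest to smallest; the fitted slope of
    log N(r) against log r is -alpha.
    """
    sizes = sorted(counts, reverse=True)
    xs = [math.log(r) for r in range(1, len(sizes) + 1)]
    ys = [math.log(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var


# With the capped supervised counts quoted above (300 max, 152 min),
# the ratio is just under 2.
print(round(imbalance_ratio([300, 267, 152]), 2))
```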

A breakdown by iconic taxonomic groups reveals marked differences in both sample size and model accuracy per group:

Iconic group | Species | Train images | Top-1 accuracy (full train)
-------------|---------|--------------|----------------------------
Insects      |   2,526 |      663,682 | 81.3%
Plants       |   4,271 |    1,148,702 | 80.0%
Fungi        |     341 |       90,048 | 78.6%
Mollusks     |     169 |       44,670 | 75.6%
Birds        |   1,486 |      414,847 | 66.2%
Mammals      |     246 |       68,917 | 59.0%
Reptiles     |     313 |       86,830 | 55.4%
Amphibians   |     170 |       46,252 | 52.6%

(Horn et al., 2021)
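As a quick worked check on the table, the average number of training images per species can be derived for each group. The figures are copied from the table; the dictionary layout and function name are illustrative.

```python
# (species, train images, top-1 accuracy in %) copied from the table above.
GROUPS = {
    "Insects": (2526, 663682, 81.3),
    "Plants": (4271, 1148702, 80.0),
    "Fungi": (341, 90048, 78.6),
    "Mollusks": (169, 44670, 75.6),
    "Birds": (1486, 414847, 66.2),
    "Mammals": (246, 68917, 59.0),
    "Reptiles": (313, 86830, 55.4),
    "Amphibians": (170, 46252, 52.6),
}


def images_per_species(groups):
    """Mean number of training images per species for each iconic group."""
    return {name: images / species
            for name, (species, images, _acc) in groups.items()}


for name, ratio in images_per_species(GROUPS).items():
    print(f"{name:<11}{ratio:7.1f} images/species, top-1 {GROUPS[name][2]:.1f}%")
```

With these figures every group averages between roughly 260 and 280 training images per species, so within this table the spread in accuracy is not explained by per-species sample size alone.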

4. Supervised, Semi-Supervised, and Vision-Language Protocols

The iNat2021 dataset serves as a testbed for diverse learning paradigms:

  • Supervised learning: Standard approaches use ResNet50 architectures with ImageNet-style augmentations (224×224 crops, flips, color jitter) and per-channel normalization. In supervised benchmarks, ResNet50 (ImageNet-initialized, full train) achieves up to 76.0% top-1 and 91.4% top-5 test accuracy. Using only the mini-train subset, accuracy drops to 65.4% top-1 (Horn et al., 2021).
  • Semi-supervised/open-set/fine-grained learning (Semi-iNat): The absence of out-of-class annotations, presence of coarse taxonomic labels, and explicit domain shift between the labeled and unlabeled pools challenge standard algorithms. Baseline ResNet50 accuracy (trained only on labeled in-class): 41.0% top-1 (ImageNet-pretraining), 19.2% top-1 (random init). The oracle model (with access to all hidden in-class labels) achieves 93.3–94.3% top-1, suggesting the full potential is limited by annotation coverage and effective open-set recognition (Su et al., 2021).
  • Vision-language models and zero-shot: Using the "locked-image text (LiT) tuning" protocol, every image is associated with a caption derived from its metadata (e.g., "A photo of the [common name] [taxonomic ranks]"). Evaluation involves computing cosine similarity between frozen image embeddings and caption-derived text embeddings. The LiT-tuned ViT-Large model achieves 63.28% top-1 and 87.48% top-5 zero-shot accuracy (Nakkab et al., 2023).
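The zero-shot evaluation step described above can be sketched as follows. The image and text encoders are not shown: embeddings are assumed to be available as plain lists of floats, and the way taxonomic ranks are joined into the caption is an assumption, since the source gives only the bracketed template. All function names are hypothetical.

```python
import math


def make_caption(common_name, ranks):
    """Fill the template "A photo of the [common name] [taxonomic ranks]".

    Assumption: ranks are joined with single spaces.
    """
    return f"A photo of the {common_name} {' '.join(ranks)}"


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_rank(image_embedding, caption_embeddings):
    """Return class indices ordered from most to least similar caption."""
    sims = [cosine(image_embedding, t) for t in caption_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


# Toy 2-D embeddings standing in for real encoder outputs.
ranking = zero_shot_rank([1.0, 0.0], [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])
print(ranking[0])   # top-1 prediction
print(ranking[:2])  # top-2 candidates
```

The first entry of the ranking is the top-1 prediction, and the first five entries would be used for a top-5 check.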

5. Evaluation Metrics and Analysis

Performance on iNat2021 and its semi-supervised variant is assessed using fine-grained classification metrics:

  • Top-1 accuracy: $\mathrm{Top\text{-}1Accuracy} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}(\hat{y}_i = y_i)$
  • Top-$k$ accuracy: $\mathrm{Top\text{-}kAccuracy} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}(y_i \in \mathrm{TopK}(\hat{\mathbf{p}}_i))$
  • Balanced accuracy: $\mathrm{Acc}_{\mathrm{bal}} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_c}\sum_{i:y_i=c}\mathbf{1}\{\hat y_i = y_i\}$ (Su et al., 2021, Horn et al., 2021, Nakkab et al., 2023)
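The three metrics translate directly into code. The following is a plain-Python sketch with hypothetical function names; it assumes integer class labels and, for top-$k$, one row of per-class scores per example.

```python
from collections import defaultdict


def top_k_accuracy(y_true, scores, k=1):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for label, row in zip(y_true, scores):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top
    return hits / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, pred in zip(y_true, y_pred):
        total[label] += 1
        correct[label] += label == pred
    return sum(correct[c] / total[c] for c in total) / len(total)


# Three examples of a common class, one of a rare class; the model
# always predicts the common class.
y_true = [0, 0, 0, 1]
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
y_pred = [0, 0, 0, 0]
print(top_k_accuracy(y_true, scores, k=1))  # 0.75
print(balanced_accuracy(y_true, y_pred))    # 0.5
```

The toy example shows why balanced accuracy matters on long-tailed data: a model that ignores the rare class still scores 0.75 top-1 but only 0.5 balanced accuracy.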

In some cases, confusion analysis reveals most errors are intra-kingdom, with per-phylum performance correlating with sample abundance (Su et al., 2021). The inclusion of geolocation and temporal metadata enables multi-modal evaluation and fusion strategies, although their effect is context-dependent (Horn et al., 2021).

6. Preprocessing, Cleaning, and Augmentation

Preprocessing protocols are standardized:

  • Image cleaning: Corrupt or unreadable images are filtered out, but no additional manual relabeling is performed. Metadata is retained for all valid entries (Nakkab et al., 2023, Horn et al., 2021).
  • Augmentation (supervised/semi-supervised): Random resized cropping, horizontal/vertical flip, color jittering. Self-supervised pipelines (SimCLR, MoCo v2, SwAV) include multi-crop, color distortion, blur. For open-set/semi-supervised benchmarks, standard protocols are adhered to rather than introducing dataset-specific transforms (Horn et al., 2021, Nakkab et al., 2023).
  • Vision-language tuning: Captions are synthesized automatically from metadata; no manual curation is introduced (Nakkab et al., 2023).
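The geometric augmentations listed above can be illustrated on a toy image stored as a list of pixel rows. This is a simplified sketch: it uses a fixed-size random crop rather than a random resized crop, omits color jitter, and the function names and the 0.5 flip probability are assumptions. Real pipelines would normally rely on an image library instead.

```python
import random


def hflip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Reverse the order of the rows (top to bottom)."""
    return img[::-1]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a uniformly random position."""
    height, width = len(img), len(img[0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


def augment(img, crop_h, crop_w, rng, p_flip=0.5):
    """Random crop followed by independent horizontal and vertical flips."""
    out = random_crop(img, crop_h, crop_w, rng)
    if rng.random() < p_flip:
        out = hflip(out)
    if rng.random() < p_flip:
        out = vflip(out)
    return out


rng = random.Random(0)  # seeded for reproducibility
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(augment(image, 2, 2, rng))
```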

7. Applications, Benchmark Impact, and Future Directions

iNaturalist-2021 is a foundational benchmark for biologically relevant machine learning tasks. Its defining features—fine-grained species labels, extensive taxonomic hierarchy, rich metadata, and a realistic, long-tailed distribution—make it indispensable for advancing:

  • Representation learning (supervised/self-supervised): Tracking advances in feature extractor quality, especially for fine-grained categorization (Horn et al., 2021).
  • Semi-supervised and domain adaptation: Evaluating robustness to open-set and shift conditions through curated splits (Semi-iNat-2021) (Su et al., 2021).
  • Multimodal and zero-shot classification: Enabling vision-language pretraining and evaluation for ecologically and agriculturally pertinent tasks (Nakkab et al., 2023).
  • Community and ecological study: Data provenance and diversity are preserved via observer metadata, supporting research in distribution, behavior, and spatiotemporal dynamics.

The dataset’s metadata-rich structure and scale have established it as a key resource for both method development and benchmarking, particularly as representation learning shifts towards greater generality and robustness in real-world, fine-grained, and imbalanced datasets (Horn et al., 2021, Nakkab et al., 2023, Su et al., 2021).
