WildChecklists Dataset: Wildlife AI Benchmark
- The WildChecklists Dataset is a comprehensive multimodal benchmark combining camera trap, curated, synthetic, and remote sensing images for wildlife monitoring.
- It enforces challenging real-world data splits to test algorithmic robustness in domain adaptation, handling diverse species and environmental conditions.
- The dataset underpins scalable ecological research, supporting automated species classification with practical insights for biodiversity assessment and conservation.
The WildChecklists Dataset, as exemplified by the iWildCam 2019 and 2020 Competition Datasets, is a comprehensive benchmark suite designed for the development and evaluation of automated species classification systems in the context of wildlife monitoring. These datasets provide diverse image sources, substantial geographic variation, and deliberately challenging settings, all aiming to foster robust generalization, drive advances in multimodal learning, and support ecological applications such as biodiversity assessment and conservation.
1. Dataset Constituents and Structure
The WildChecklists Dataset is assembled from multiple imaging sources, each providing distinct challenges and opportunities for algorithmic development:
- Camera Trap Imagery (Primary Modality):
- iWildCam 2019: Training data derives from the Caltech Camera Traps (CCT) dataset ($292,732$ images, $143$ locations, $14$ classes) in the American Southwest; test data from Idaho ($153,730$ images, $100$ locations, $8$ classes) provided by the Idaho Department of Fish and Game. There is only partial overlap in species between the two regions.
- iWildCam 2020: Contains $217,959$ training and $62,894$ test images, sourced from $441$ training and $111$ test locations, respectively, spanning $12$ countries and supporting a global diversity of $276$ species. No camera location is shared between the train and test partitions.
- Supplemental Domains:
- Human-curated Images (iNaturalist): High-quality, annotated photographs serve as auxiliary data for underrepresented species and as a potential bridge over domain gaps.
- Synthetic Images (TrapCam-AirSim): Procedurally generated camera trap scenes provide further population and environmental variation.
- Remote Sensing (Landsat 8): In iWildCam 2020, for each camera location, a temporal sequence of $6 \text{ km} \times 6 \text{ km}$ satellite patches is provided, including 9 multispectral bands and 2 quality assessment bands.
The datasets enforce data splits that reflect real-world deployment: geographic, camera location, and environmental context are non-overlapping between train and test, simulating conditions encountered in practice.
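To make this concrete, the following minimal sketch shows how a camera-location-disjoint split can be built with off-the-shelf tooling; the arrays are hypothetical placeholders, and this is not the organizers' actual pipeline.

```python
# Minimal sketch (not the organizers' pipeline): build a train/test split in
# which no camera location appears in both partitions. All arrays here are
# hypothetical placeholders standing in for the real metadata.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_images = 1000
labels = rng.integers(0, 14, size=n_images)        # species class per image
location_ids = rng.integers(0, 50, size=n_images)  # camera location per image

# GroupShuffleSplit keeps every group (camera location) entirely on one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(np.arange(n_images), labels,
                                          groups=location_ids))

# Sanity check: the two partitions share no camera location.
assert set(location_ids[train_idx]).isdisjoint(location_ids[test_idx])
print(len(train_idx), len(test_idx))
```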
| Modality | Coverage | Key Features |
|---|---|---|
| Camera Trap Images | Regional, global | Uncurated, long-tailed, challenging cases |
| iNaturalist | High-quality, curated | Citizen science, taxonomically mapped |
| Synthetic | Simulated environments | Controllable, rare species augmentation |
| Remote Sensing | Environment/context | Multispectral, atmospheric QA bands |
2. Classification Challenges and Domain Shift
The datasets are engineered to pose significant open-set and domain adaptation problems:
- Inter-Region Domain Shift: The American Southwest (training) and American Northwest (test) in iWildCam 2019, and the globally disjoint train/test camera locations in 2020, introduce different species compositions, environmental cues, and imaging artifacts.
- Class Overlap and Exclusivity: Only a subset of animal categories are present in both training and testing, forcing algorithms to handle missing and novel classes.
- Real-World Image Variability: Camera trap images manifest non-standard poses, poor lighting, severe occlusion, and animal absence (“empty” triggers), replicating the full spectrum of field conditions.
A central question addressed is the generalization capacity of classification models to unseen regions and unencountered species distributions.
3. Methodological Provisions and Baseline Models
Baseline implementations and evaluation methodology are designed to provide rigorous, reproducible benchmarking:
- Evaluation Metric:
- The macro-averaged F1 score is used:

$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_c, \qquad \mathrm{F1}_c = \frac{2\, P_c R_c}{P_c + R_c}$$

- This metric weights all classes equally and emphasizes performance on rare categories; $C$ is the number of classes, and $P_c$ and $R_c$ denote per-class precision and recall (a worked sketch follows this list).
- Baseline Models:
- iWildCam 2019: Inception-ResNet-V2, pretrained on ImageNet, trained with the RMSProp optimizer (momentum $0.9$), an initial learning rate of $0.0045$, and standard data augmentation, achieving a macro F1 of $0.125$ on the test set.
- iWildCam 2020: Inception-v3 with a class-balanced loss to address class imbalance:

$$\mathcal{L}_{\mathrm{CB}}(\mathbf{p}, y) = \frac{1 - \beta}{1 - \beta^{n_y}}\, \mathcal{L}_{\mathrm{CE}}(\mathbf{p}, y)$$

where $\mathbf{p}$ are the softmax probabilities, $n_y$ is the per-class support (number of training images) of the ground-truth class $y$, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss, and $\beta \in [0, 1)$ controls the strength of re-balancing.
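For concreteness, the first sketch below transcribes the macro F1 definition directly and checks it against scikit-learn's reference implementation; the toy labels are invented for illustration.

```python
# Direct transcription of the macro-averaged F1 formula, cross-checked
# against scikit-learn; the labels below are toy values for illustration.
import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true, y_pred, n_classes):
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return float(np.mean(f1s))  # each class counts equally, rare or common

y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 3])
print(macro_f1(y_true, y_pred, n_classes=4))      # 0.775
print(f1_score(y_true, y_pred, average="macro"))  # matches
```

The loss above matches the widely used class-balanced formulation of Cui et al. (2019); the following PyTorch sketch is written under that assumption, with $\beta$, the class counts, and the tensor shapes chosen purely for illustration rather than taken from the baseline.

```python
# Class-balanced cross-entropy in the style of Cui et al. (2019): classes are
# re-weighted by the inverse "effective number" of samples. beta, the counts,
# and the tensor shapes are illustrative assumptions, not baseline settings.
import torch
import torch.nn.functional as F

def class_balanced_weights(per_class_counts, beta=0.999):
    """w_c = (1 - beta) / (1 - beta ** n_c), normalized to sum to n_classes."""
    counts = torch.as_tensor(per_class_counts, dtype=torch.float32)
    weights = (1.0 - beta) / (1.0 - torch.pow(beta, counts))
    return weights * len(counts) / weights.sum()

counts = [5000, 1200, 40, 3]              # long-tailed per-class support
weights = class_balanced_weights(counts)  # rare classes get larger weight

logits = torch.randn(8, 4)                # softmax inputs for a batch
targets = torch.randint(0, 4, (8,))
loss = F.cross_entropy(logits, targets, weight=weights)
print(loss.item())
```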
These baselines formally illustrate the performance gap and the difficulty of robust species recognition under domain shift.
4. Leveraging Transfer Learning and Multimodal Data
Transfer learning and data fusion are explicitly encouraged and supported:
- iNaturalist Integration: Taxonomic alignment and curated subsets (e.g., “iNat-Idaho”) enable the use of citizen science images for underrepresented or test-only species; this serves both as a domain adaptation technique and as a way to expand effective class coverage (a toy alignment sketch follows this list).
- Synthetic Data Augmentation: Synthetic images generated by simulation environments (TrapCam-AirSim) allow controlled experiments on rare species and confounding backgrounds.
- Remote Sensing Context: For the 2020 dataset, multispectral satellite imagery provides environmental priors; integrating this information can help models learn associations between habitat and appearance, and may improve generalization when environmental features are indicative of particular species.
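As a toy illustration of the taxonomic alignment mentioned in the iNaturalist bullet above, the sketch below maps iNaturalist scientific names onto hypothetical camera-trap class labels; both the names and the mapping are invented for illustration, not the actual iNat-Idaho correspondence.

```python
# Toy illustration of taxonomic alignment: iNaturalist scientific names are
# mapped onto camera-trap class labels. Names and mapping are hypothetical,
# not the actual iNat-Idaho correspondence.
INAT_TO_WILDCAM = {
    "Odocoileus hemionus": "mule_deer",
    "Canis latrans": "coyote",
    "Puma concolor": "mountain_lion",
}

def align_inat_records(records):
    """Keep only iNat records whose taxon maps onto a camera-trap class."""
    return [(path, INAT_TO_WILDCAM[taxon])
            for path, taxon in records if taxon in INAT_TO_WILDCAM]

extra = align_inat_records([("inat/001.jpg", "Canis latrans"),
                            ("inat/002.jpg", "Ursus arctos")])  # 2nd dropped
print(extra)  # [('inat/001.jpg', 'coyote')]
```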
A plausible implication is that multimodal fusion, combining the camera trap, synthetic, human-curated, and remote-sensing domains, could enable context-aware classifiers that are less sensitive to data sparsity and domain shift.
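One way such fusion might look in practice, assuming pooled image features from a CNN backbone and per-location summaries of the satellite bands as inputs, is a simple late-fusion head; every layer size and dimension below is an illustrative assumption, not a benchmark-provided architecture.

```python
# Illustrative late-fusion head: image features and an encoded environmental
# summary are concatenated before classification. All dimensions (2048-d
# image features, 11 band statistics, 276 classes) are assumptions.
import torch
import torch.nn as nn

class ContextFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, env_dim=11, n_classes=276):
        super().__init__()
        # encode a per-location environmental summary (e.g., band means)
        self.env_encoder = nn.Sequential(
            nn.Linear(env_dim, 64), nn.ReLU(), nn.Linear(64, 64))
        # classify from concatenated image + environment features
        self.head = nn.Linear(img_dim + 64, n_classes)

    def forward(self, img_feats, env_feats):
        env = self.env_encoder(env_feats)
        return self.head(torch.cat([img_feats, env], dim=1))

model = ContextFusionClassifier()
img_feats = torch.randn(4, 2048)  # e.g., pooled CNN features per image
env_feats = torch.randn(4, 11)    # e.g., 9 spectral + 2 QA band summaries
print(model(img_feats, env_feats).shape)  # torch.Size([4, 276])
```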
5. Regional and Environmental Considerations
The deliberate geographic partitioning emphasizes substantive regional and environmental differences:
- Ecological Variation: The American Southwest training and Northwest test sets (2019) reflect distinct faunal assemblages and ecosystems. Global stratification in 2020 increases representational diversity and ecological realism.
- Environmental Context Cues: Satellite data contextualizes each observation within landcover, vegetation, and climatic patterns—potentially allowing models to infer species presence based on habitat suitability.
- Empty Image Prevalence: Both datasets contain a high fraction of empty triggers, reflecting the operation of real camera traps, necessitating classifiers that can distinguish between valid species presence and non-target activation.
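One minimal sketch of the inference rule this implies, in which a binary animal-presence score gates the species classifier, is shown below; the threshold and both probability sources are placeholders rather than benchmark components.

```python
# Placeholder two-stage inference: a binary animal-presence probability gates
# the species classifier; below a threshold, the image is labeled "empty".
# Threshold and probabilities are illustrative, not benchmark components.
import numpy as np

def classify_trigger(presence_prob, species_probs, class_names,
                     presence_threshold=0.5):
    """Return 'empty' if the presence score is low, else the argmax species."""
    if presence_prob < presence_threshold:
        return "empty"
    return class_names[int(np.argmax(species_probs))]

classes = ["mule_deer", "coyote", "mountain_lion"]
print(classify_trigger(0.12, np.array([0.5, 0.3, 0.2]), classes))  # empty
print(classify_trigger(0.94, np.array([0.1, 0.7, 0.2]), classes))  # coyote
```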
This design aligns the dataset with field conditions encountered by biodiversity practitioners and wildlife ecologists.
6. End-User Applications and Impact
The dataset suite is directly targeted at ecological and conservation objectives:
- Automated Annotation: By substantially reducing the effort required to label large volumes of camera trap data, these benchmarks enable scalable biodiversity monitoring.
- Conservation and Wildlife Management: Reliable cross-region classification models support assessments of population size, distribution, and movement patterns—critical for managing species at risk from anthropogenic pressures.
- Scalability for Global Monitoring: Since the partitioning strategy encourages algorithmic robustness to novel locations, these benchmarks are germane to questions of real-time or near-real-time identification at continental scales.
From a computer vision standpoint, the datasets push species classification research towards robust generalization, cross-domain adaptation, and the principled use of context cues.
7. Challenges, Limitations, and Research Directions
Key limitations and research frontiers are shaped by the deliberate complexity of the datasets:
- Class Imbalance and Rarity: Extremely skewed class distributions and limited examples for many species stress both learning and evaluation.
- Unseen Classes: The test set may contain species not present in training, presenting open-set recognition and zero-shot learning challenges.
- Multimodal Fusion Complexity: Integrating disparate data types (visual, synthetic, environmental) raises methodological questions about effective representation, alignment, and joint modeling.
- Environmental Noise: Remote sensing data, while valuable, can be affected by factors like cloud cover and sensor artifacts, necessitating robust preprocessing and quality control.
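As a hedged example of such quality control, a bit-packed QA band can be used to mask flagged pixels; the bit position for clouds differs across Landsat products and collections, so `CLOUD_BIT` below is an assumption to verify against the band documentation for the released patches.

```python
# Hedged sketch of QA-based cloud masking. CLOUD_BIT is an assumption: the
# bit that flags clouds differs across Landsat products/collections, so it
# must be checked against the documentation for the released patches.
import numpy as np

CLOUD_BIT = 3  # placeholder bit index; verify against the QA band spec

def cloud_mask(qa_band, bit=CLOUD_BIT):
    """Boolean mask that is True wherever the chosen QA bit is set."""
    return ((qa_band >> bit) & 1).astype(bool)

qa = np.array([[0b0000, 0b1000],
               [0b1000, 0b0000]], dtype=np.uint16)
reflectance = np.ones((2, 2))
clean = np.where(cloud_mask(qa), np.nan, reflectance)  # NaN-out cloudy pixels
print(clean)
```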
A plausible implication is that progress in these areas may drive methodological advances transferable beyond biodiversity assessment to domains involving distributional shift and multimodal data integration.
In summary, the WildChecklists Dataset represents a rigorously designed, multi-source benchmark for cross-region, multimodal, and robust animal species classification. It serves as a critical test-bed for algorithmic innovation in domain adaptation, transfer learning, and ecological informatics, linking computer vision advances directly to pressing scientific and conservation needs (Beery et al., 2019; Beery et al., 2020).