PlantVillage Apple Leaf Disease Dataset
- The PlantVillage Apple Leaf Disease Dataset is a curated collection of expert-annotated apple leaf images showcasing both healthy and diseased conditions for benchmarking machine learning models.
- Its standardized image acquisition, preprocessing, and expert curation enable reliable evaluations, with benchmark studies reporting accuracies above 99% on controlled test sets.
- Despite its value, the dataset exhibits limitations such as capture bias and class imbalance, prompting ongoing efforts to extend real-world applicability and robust evaluations.
The PlantVillage Apple Leaf Disease Dataset is a curated collection of labeled digital images of apple leaves exhibiting various disease symptoms and healthy conditions, designed as a benchmark for machine learning-based plant pathology. Originating from the PlantVillage initiative, with subsequent extensions and rigorous academic analyses, this dataset has become the reference standard for evaluating disease recognition algorithms in data-driven precision agriculture.
1. Dataset Composition and Class Taxonomy
The canonical PlantVillage Apple Leaf Disease dataset consists of expert-verified RGB images of isolated apple leaves, annotated for visually detectable disease symptoms. The four principal categories are:
- Apple scab (Venturia inaequalis)
- Apple black rot (Botryosphaeria obtusa)
- Cedar apple rust (Gymnosporangium juniperi-virginianae)
- Healthy (no visible symptoms)
Reported class distributions vary slightly across publications due to subset selection:
| Source | Apple Scab | Black Rot | Cedar Rust | Healthy | Total |
|---|---|---|---|---|---|
| (Hughes et al., 2015) (original PV release) | 630 | 621 | 276 | 1,645 | 3,172 |
| (Mahamood et al., 29 Jan 2026) (Mam-App) | 630 | 621 | 275 | 1,645 | 3,171 |
| (Roumeliotis et al., 29 Apr 2025) (GPT-4o/ResNet-50 subset) | 631 | 622 | 276 | 1,646 | 3,175 |
| (Ashmafee et al., 2023) (EfficientNetV2S) | ~630 | ~620 | 275 | 1,645 | 3,171 |
The "PV-ALE" extension adds Alternaria leaf spot (85 images) and powdery mildew (127 images) with more complex backgrounds, raising the total to 3,383 images and broadening the taxonomy to six classes (Akinyemi et al., 2024).
2. Image Acquisition Protocols and Preprocessing
Most images were acquired using controlled laboratory setups—detached leaves photographed under varied natural light against uniform gray or black paper, at high native resolution (e.g., Sony DSC-RX100, ~20 MP). For the original PlantVillage images, 4–7 orientations per leaf were captured to document morphological variability (Hughes et al., 2015). Later studies typically downsampled all images to 224–256 px squares to accommodate deep learning backbones (Mohanty et al., 2016, Mahamood et al., 29 Jan 2026, Ashmafee et al., 2023, Akinyemi et al., 2024).
Key preprocessing procedures include:
- Manual cropping to minimize background content and standardize orientation (tip up)
- Pixel-value normalization ([0,1] or channel-wise zero-mean/unit-variance)
- Consistent train/validation/test splits, commonly stratified at 60/20/20 or 70/15/15 ratios, with patient-level grouping to avoid data leakage (Mahamood et al., 29 Jan 2026, Ashmafee et al., 2023)
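The grouped splitting step above can be sketched in plain Python (a minimal illustration, not code from the cited studies; `leaf_ids` is an assumed per-image specimen identifier):

```python
import random

def grouped_split(leaf_ids, seed=42, frac=(0.6, 0.2, 0.2)):
    """Assign image indices to train/val/test (default 60/20/20) by leaf
    specimen, so that multiple views of the same leaf never cross splits."""
    groups = sorted(set(leaf_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = int(frac[0] * len(groups))
    n_val = int(frac[1] * len(groups))
    train_g = set(groups[:n_train])
    val_g = set(groups[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for idx, g in enumerate(leaf_ids):
        if g in train_g:
            splits["train"].append(idx)
        elif g in val_g:
            splits["val"].append(idx)
        else:
            splits["test"].append(idx)
    return splits
```

Because grouping is applied at the specimen level, exact per-class stratification ratios can only be approximated.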
Data augmentation is selectively applied during training. Techniques span random flips, rotations (±10° to ±30°), brightness adjustment, and geometric shear/shifts—parameters are explicitly tabulated in works such as (Mahamood et al., 29 Jan 2026) and (Ashmafee et al., 2023).
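A flip-and-brightness augmentation of this kind can be sketched without any framework dependency (rotation and shear are left to libraries such as torchvision; the `augment` helper and its nested-list image format are illustrative assumptions, not part of any cited pipeline):

```python
import random

def augment(img, rng):
    """Randomly flip an image horizontally and jitter its brightness.
    `img` is a nested list [H][W][3] of floats in [0, 1] (an illustrative
    stand-in for a tensor)."""
    if rng.random() < 0.5:
        img = [row[::-1] for row in img]            # horizontal flip
    scale = rng.uniform(0.8, 1.2)                   # brightness factor
    return [[[min(1.0, c * scale) for c in px] for px in row] for row in img]
```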
3. Annotation and Curation Procedures
All PlantVillage apple images are annotated by plant pathology experts, either via field diagnosis during deliberate inoculation trials or from sentinel plots. Only specimens with diagnostic consensus are admitted (Hughes et al., 2015). For PV-ALE, newly sourced disease classes undergo strict human verification, URL/source filtering, and, where possible, further cropping or background minimization (Akinyemi et al., 2024).
Leaf-level labeling is enforced to prevent multiple views of the same specimen from propagating between training and evaluation splits, mitigating information leakage (Mohanty et al., 2016).
4. Benchmarking, Evaluation Protocols, and Model Performance
Standard performance metrics include accuracy, precision, recall, and F₁-score, computed per class from true/false positive and negative counts (using the notation in (Ashmafee et al., 2023, Akinyemi et al., 2024)):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
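These metrics can be computed directly from a raw confusion matrix (row = true class, column = predicted class); the helper below is a sketch, not code from any of the cited studies:

```python
def metrics_from_confusion(confusion):
    """Accuracy plus per-class precision/recall/F1 from a confusion matrix
    given as confusion[true][pred] counts."""
    k = len(confusion)
    per_class = []
    for c in range(k):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(k)) - tp   # column minus TP
        fn = sum(confusion[c]) - tp                        # row minus TP
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append({"precision": prec, "recall": rec, "f1": f1})
    accuracy = sum(confusion[c][c] for c in range(k)) / sum(map(sum, confusion))
    return accuracy, per_class
```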
Recent studies have achieved the following peak test-set performance on the canonical 4-class dataset:
- Mam-App (Mamba, 0.051M params): 99.58% accuracy, 99.30% precision, 99.14% recall, 99.22% F₁-score (Mahamood et al., 29 Jan 2026)
- EfficientNetV2S (transfer learning, augmented): 99.21% accuracy, 99.15% precision, 99.10% recall, 99.12% F₁-score (Ashmafee et al., 2023)
- ResNet-50 (fine-tuned): 96.88% accuracy at 256 px input, F₁-score 0.969 (Roumeliotis et al., 29 Apr 2025)
- ResNet-34 (Fastai): 93.8% accuracy (Zhang et al., 2021)
Extensions such as PV-ALE show a reduction in F₁-score from 99.63% (original) to 97.87% (extended, 6 classes, ResNet-50) due to class imbalance, low-resolution additions, and complex backgrounds (Akinyemi et al., 2024).
5. Dataset Biases, Artifacts, and Limitations
Capture bias—inadvertent correlation between label and image background or acquisition parameters—is explicitly exposed in (Noyan, 2022), where a random forest trained on the RGB values of just 8 background pixels achieves 49.0% accuracy (random chance for 38 classes is 2.6%). This demonstrates that consistent background, lighting, and equipment contribute dataset-specific signal exploitable by classifiers, even in the absence of leaf content.
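The probe in (Noyan, 2022) used a random forest; the sketch below substitutes a toy nearest-centroid classifier to demonstrate the same idea, namely that 8 background pixels alone can separate classes when backgrounds correlate with labels (the helper names and nested-list image format are illustrative assumptions):

```python
import statistics

def corner_features(img):
    """RGB values of 8 'background' pixels (corners and edge midpoints) of a
    nested-list [H][W][3] image -- deliberately excluding all leaf content."""
    h, w = len(img), len(img[0])
    coords = [(0, 0), (0, w - 1), (h - 1, 0), (h - 1, w - 1),
              (0, w // 2), (h - 1, w // 2), (h // 2, 0), (h // 2, w - 1)]
    return [c for (r, col) in coords for c in img[r][col]]

def nearest_centroid_predict(train_feats, train_labels, feat):
    """Toy stand-in for the random-forest probe: pick the class whose mean
    background-feature vector is closest in squared distance."""
    centroids = {}
    for lab in set(train_labels):
        rows = [f for f, l in zip(train_feats, train_labels) if l == lab]
        centroids[lab] = [statistics.fmean(col) for col in zip(*rows)]
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(centroids[lab], feat)))
```

If such a probe performs far above chance on a held-out split, label information is leaking through the background rather than the leaf.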
Attempts to mitigate such bias via background removal are only partially effective; capture bias persists through foreground-only image features. The lack of in situ images (field backgrounds, attached leaves) further limits real-world model transferability (Hughes et al., 2015, Mohanty et al., 2016, Akinyemi et al., 2024).
Class imbalance is intrinsic, with healthy samples often dominating (ratio up to 6:1 over minority diseases), especially before augmentation (Mahamood et al., 29 Jan 2026, Akinyemi et al., 2024). While augmentation and label smoothing partially address this, the scarcity of genuine rare disease samples (e.g., Alternaria, powdery mildew) and the limited per-class test-set cardinality restrict the statistical robustness of evaluation.
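One common mitigation for such imbalance, not prescribed by the dataset itself, is inverse-frequency weighting of the per-class loss; a minimal sketch:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights w_c = N / (K * n_c), upweighting rare classes
    (e.g., minority diseases against the dominant healthy class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```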
6. Extensions, Comparative Studies, and Use in Advanced Pipelines
The PlantVillage Apple dataset remains the prototypical reference for classical and modern computer vision approaches in plant pathology. Transfer learning (GoogLeNet, ResNet, EfficientNetV2S, Mamba) consistently outperforms models trained from scratch (Mohanty et al., 2016, Ashmafee et al., 2023, Mahamood et al., 29 Jan 2026). Recent work also demonstrates the viability of multimodal LLMs such as GPT-4o for disease classification, with fine-tuning achieving 98.12% accuracy on balanced 4-class subsets (Roumeliotis et al., 29 Apr 2025). However, zero-shot LLM performance lags well behind, highlighting the importance of even minimal supervised adaptation.
The PV-ALE variant provides a more challenging evaluation due to background/lighting complexity and the inclusion of additional diseases. Table: PV-AL and PV-ALE benchmark (ResNet-50):
| Dataset | Classes | Accuracy (%) | F₁-score (%) |
|---|---|---|---|
| PV-AL | 4 | 99.58 | 99.63 |
| PV-ALE | 6 | 99.11 | 97.87 |
Open access is provided via www.plantvillage.org and linked repositories (Kaggle). The data are distributed under a CC BY-SA 3.0 license, which requires attribution and that derivative works be shared under the same terms (Hughes et al., 2015, Akinyemi et al., 2024).
7. Emerging Best Practices and Future Directions
The PlantVillage Apple dataset should not be viewed as a proxy for real-world, in-field plant disease detection. Robust benchmarking necessitates:
- Quantitative bias testing (e.g., pure-background inputs, random devices) prior to model evaluation (Noyan, 2022)
- Consistently reporting per-class metrics, not only accuracy, to capture rare class performance (Ashmafee et al., 2023, Mohanty et al., 2016)
- Augmenting datasets with realistic, cluttered backgrounds; extending taxonomies to capture broader disease spectra (Akinyemi et al., 2024)
- Partitioning splits at the specimen (leaf) level, not by individual image, to prevent information leakage (Mohanty et al., 2016)
- Exploring advanced augmentation, synthetic data generation, and domain adaptation for field transferability (Akinyemi et al., 2024)
The definitive limitations of the PlantVillage Apple Leaf Disease dataset are its controlled acquisition, narrow disease spectrum, class imbalance, limited background/lighting variation, and the strong capture bias inherent in uniform-background imaging. Nevertheless, it remains a cornerstone for empirical validation in plant disease computer vision, supporting the development and comparison of increasingly advanced, efficient, and generalizable models.