
Sorghum Leaf Imagery Dataset Overview

Updated 7 February 2026
  • Sorghum leaf imagery datasets are curated collections of multi-scale, high-resolution images with detailed annotations for tasks like classification, segmentation, and detection.
  • They enable robust computer vision applications in crop phenotyping, pest monitoring, and cellular-level analysis, leveraging techniques from semantic to instance segmentation.
  • These datasets drive advancements in precision agriculture and plant breeding while also highlighting challenges such as class imbalance, limited geographic coverage, and annotation scope.

Sorghum leaf imagery datasets constitute a foundational resource for computer vision–driven research in crop phenotyping, pest monitoring, and cellular-level plant physiology. These datasets support key machine learning tasks, including classification, semantic and instance segmentation, and small-object detection in real-world agricultural contexts. The datasets described here provide image data and rich labels across several spatial scales, crop attributes, and biological structures. Below, major public sorghum leaf imagery datasets are summarized, with technical details, benchmarking protocols, and coverage of current limitations and challenges.

1. Landscape of Sorghum Leaf Imagery Datasets

Three principal datasets are widely referenced for sorghum leaf imagery tasks:

  • Sorghum Aphid Cluster Dataset (Rahman et al., 2024): Multi-scale in-field imagery with densely annotated aphid clusters, designed for aphid infestation detection and segmentation.
  • Sorghum-100 Dataset (Ren et al., 2021): High-resolution plot-level imagery spanning 100 cultivars, optimized for classification under natural field variability.
  • StomataSeg Dataset (Huang et al., 31 Jan 2026): Microscopy images of sorghum leaf surfaces with instance-level segmentation of stomatal components, supporting high-throughput physiological trait analysis.

These resources address the need for controlled, labeled data representing real agricultural variability—including environmental conditions, plant heterogeneity, and pest spatial distributions.

2. Dataset Composition and Acquisition Protocols

Aphid Cluster Dataset

  • Original high-resolution images: 5,447 (3,647 × 2,736 px, GoPro RGB, JPEG/PNG).
  • Patch extraction: 54,742 total patches (10% overlap), stratified by:
    • Viewpoint (top/middle/bottom camera, covering variable canopy heights).
    • Spatial scale: patches at 13.2%, 26.3%, and 52.5% of original image dimensions.
  • Lighting and motion: Strong environmental variation, including bright midday sun, intra-canopy shadows, specular highlights, and occasional motion blur.
  • Field sites: Kansas State University (northern/southern Kansas), sampled during aphid peak infestation.

Sorghum-100

  • Image count: 48,106 RGB plot images.
  • Cultivars: 100 genetically distinct lines from the Sorghum Bioenergy Association Panel.
  • Capture system: ARPA-E TERRA-REF gantry, stereo-RGB camera under natural daylight only; native pixel pitch on the order of 0.5–1 mm at canopy height.
  • Temporal and plot sampling: Daily imaging over June, two spatially separated plots per cultivar (exclusive train/test).
  • No visual segmentation: Images cropped to plots; no per-leaf or cluster labels.

StomataSeg

  • High-resolution micrographs: 318 full-frame images (2,592 × 1,944 px, Dino-Lite Edge digital microscope).
  • Patch-based extraction: 341 × 341 px patches (10 px overlap; coverage-driven inclusion for manual annotations).
  • Human-annotated subset: 11,060 patches (7,662 train, 2,238 val, 1,160 test).
  • Pseudo-labeled expansion: 56,428 machine-labeled patches (Mask R-CNN–driven semi-supervised process).
  • Genotypic and leaf-surface stratification: Five genotypes, adaxial/abaxial leaf surfaces, three longitudinal regions per surface.
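The overlapping patch extraction described above can be sketched as a sliding window. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the stride is assumed to be the patch size minus the overlap, and clamping a final patch to the image border is an assumption the sources do not state.

```python
def tile_origins(length: int, patch: int, overlap_px: int) -> list[int]:
    """Start offsets of patches of size `patch` along an axis of size `length`."""
    if patch >= length:
        return [0]
    stride = max(1, patch - overlap_px)
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        # Assumption: add one last patch flush with the border so no pixels are dropped.
        origins.append(length - patch)
    return origins


def tile_image(width: int, height: int, scale: float, overlap_frac: float = 0.10):
    """Return (x, y, w, h) boxes for patches sized as a fraction of the image."""
    pw, ph = round(width * scale), round(height * scale)
    xs = tile_origins(width, pw, round(pw * overlap_frac))
    ys = tile_origins(height, ph, round(ph * overlap_frac))
    return [(x, y, pw, ph) for y in ys for x in xs]


# Field-image style: patches at 52.5% of a 3,647 x 2,736 px frame, 10% overlap.
field_tiles = tile_image(3647, 2736, 0.525)

# Micrograph style: fixed 341 px patches with a 10 px overlap along the 2,592 px axis.
micro_origins = tile_origins(2592, 341, 10)
```

The same helper covers both conventions in this section: a fractional overlap for the multi-scale field patches and a fixed pixel overlap for the microscopy patches.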

3. Annotation Strategies and Data Organization

Aphid Cluster Dataset

  • Annotation scope: Pixel-level binary masks (background vs. aphid cluster), with clusters defined as visually aggregated groups of ≥6 aphids (an entomological threshold for economic damage).
  • Bounding boxes: Derived from the connected components of the cluster masks; overlapping boxes are merged for object detection benchmarks.
  • Distributional insight: Mean mask coverage per image is 2.45%; in over 80% of image patches the mask covers <10% of the frame, and in most patches it covers <1%.
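As an illustration of how boxes can be derived from a binary mask, the sketch below labels connected components, merges boxes that overlap, and computes mask coverage. It is a plain-Python toy under stated assumptions: 4-connectivity and the "merge any two boxes that intersect" rule are choices made here, not details confirmed by the dataset paper, and all function names are hypothetical.

```python
from collections import deque


def component_boxes(mask):
    """Inclusive (x0, y0, x1, y1) boxes of 4-connected foreground components."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill over one component
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(boxes):
    """Repeatedly fuse intersecting boxes until none overlap."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes


def coverage(mask):
    """Fraction of pixels labelled as foreground."""
    return sum(map(sum, mask)) / sum(len(row) for row in mask)


toy_mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
raw_boxes = component_boxes(toy_mask)        # three components
merged_boxes = merge_overlapping(raw_boxes)  # the isolated pixel falls inside the L-shape's box
```

In the toy mask, the single pixel at (2, 2) is its own component, but its box lies inside the box of the L-shaped component, so the two are fused into one detection target.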

Sorghum-100

  • Label attributes: Cultivar class (1–100), days after planting, split assignment.
  • No segmentation: Labels apply at plot level only; no leaf or object labels.
  • Metadata structure: JPEG image files, CSV label sheets (filename, class, DAP, split), possible supplementary JSON (camera pose, timestamp).
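A label sheet of this shape can be read with the standard library alone. The sketch below is illustrative only: the column names follow the list above (filename, class, DAP, split), but the exact header spelling, the file names, and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; a real label sheet would be opened from disk instead.
SAMPLE_CSV = """filename,class,DAP,split
plot_0001.jpg,17,34,train
plot_0002.jpg,17,35,test
plot_0003.jpg,58,34,train
"""


def load_labels(handle):
    """Group (filename, cultivar class, days after planting) records by split."""
    by_split = defaultdict(list)
    for row in csv.DictReader(handle):
        record = (row["filename"], int(row["class"]), int(row["DAP"]))
        by_split[row["split"]].append(record)
    return dict(by_split)


labels = load_labels(io.StringIO(SAMPLE_CSV))
```

Grouping by the split column up front makes it harder to mix training and test plots by accident later in a pipeline.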

StomataSeg

  • Manual instance masks: Three stomatal classes (complex, guard cell, pore), annotated as polygonal COCO-style instances by trained biologists and then consensus- and expert-reviewed (90% first-pass acceptance).
  • Quality statistics: 40,750 total stomatal instances; mean annotation time per mask ~15 s.
  • Pseudo-label filtering: Class-specific confidence thresholds for inclusion: pore ≥ 0.50, guard cell/complex ≥ 0.70.
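The class-specific thresholds above amount to a simple filter over model predictions. The following is a minimal sketch; the dictionary keys and the prediction record layout are hypothetical, and only the threshold values come from the text.

```python
# Thresholds from the text: pore >= 0.50, guard cell and complex >= 0.70.
THRESHOLDS = {"pore": 0.50, "guard_cell": 0.70, "complex": 0.70}


def filter_pseudo_labels(predictions, thresholds=THRESHOLDS):
    """Keep predictions whose confidence meets the threshold of their class."""
    return [p for p in predictions if p["score"] >= thresholds[p["label"]]]


predictions = [
    {"label": "pore", "score": 0.55},
    {"label": "pore", "score": 0.45},
    {"label": "guard_cell", "score": 0.65},
    {"label": "complex", "score": 0.91},
]
kept = filter_pseudo_labels(predictions)
```

The lower bar for pores is consistent with pores being the smallest and hardest class, where a stricter threshold would discard most candidates.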

Table 1. Patch and Split Counts in Major Datasets

| Dataset | Train | Validation | Test | Pseudo-labelled (extra) |
|---|---|---|---|---|
| StomataSeg (Huang et al., 31 Jan 2026) | 7,662 | 2,238 | 1,160 | 56,428 |
| Aphid Cluster (Rahman et al., 2024) | ~49,020 | - | ~5,470 | - |

4. Evaluation Protocols and Benchmarking Metrics

  • Detection (Aphid Cluster):

$$P = \frac{TP}{TP + FP},\quad R = \frac{TP}{TP + FN},\quad \mathrm{mAP} = \frac{1}{|C|}\sum_{c\in C} \mathrm{AP}_c$$

  • Segmentation (all):

$$\mathrm{IoU} = \frac{|\mathrm{Prediction} \cap \mathrm{GT}|}{|\mathrm{Prediction} \cup \mathrm{GT}|},\quad \mathrm{Dice} = \frac{2\,|\mathrm{Prediction} \cap \mathrm{GT}|}{|\mathrm{Prediction}| + |\mathrm{GT}|}$$
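A small worked example makes the two overlap measures concrete. This sketch operates on nested 0/1 lists; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here rather than one stated in the sources.

```python
def iou_and_dice(pred, gt):
    """Compute IoU and Dice for two binary masks given as nested 0/1 lists."""
    inter = pred_area = gt_area = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            pred_area += 1 if p else 0
            gt_area += 1 if g else 0
    union = pred_area + gt_area - inter
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks agree perfectly
    return inter / union, 2 * inter / (pred_area + gt_area)


pred_mask = [[1, 1, 0],
             [0, 1, 0]]
gt_mask = [[1, 0, 0],
           [0, 1, 1]]
iou, dice = iou_and_dice(pred_mask, gt_mask)  # intersection 2, union 4
```

Here the intersection is 2 pixels and the union is 4, so IoU is 0.5 and Dice is 4/6. The two are linked by Dice = 2·IoU / (1 + IoU), so they rank any pair of predictions on the same mask identically.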

  • Instance Segmentation (StomataSeg):

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^N \mathrm{AP}_i,\quad \mathrm{AP}_i = \int_0^1 p_i(r)\,dr$$
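In practice the integral over the precision-recall curve is approximated from a ranked list of detections. The sketch below uses one common convention (all-point interpolation with a monotone precision envelope); the benchmarks cited here may use a different interpolation or IoU matching rule, and the function names and toy data are hypothetical.

```python
def average_precision(scored_hits, n_gt):
    """Area under the precision-recall curve.

    scored_hits: list of (confidence, is_true_positive) for one class.
    n_gt: number of ground-truth instances of that class.
    """
    if n_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the interpolation step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(ap_values):
    """Unweighted mean of per-class (or per-item) AP values."""
    return sum(ap_values) / len(ap_values)


detections = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
ap = average_precision(detections, n_gt=2)
```

For the toy detections, the first half of the recall range is covered at precision 1.0 and the second half at precision 2/3, giving an AP of 5/6.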

  • Splitting protocols:
    • Aphid Cluster: 10-fold cross-validation with folds assigned at the full-resolution image level; patches are extracted only after fold allocation, which avoids leakage between folds.
    • Sorghum-100: spatial plot-based split (one plot per cultivar for training, the other for testing); leave-one-cultivar-out or leave-one-plot-out cross-validation is recommended.
    • StomataSeg: train/validation/test splits at the image level; pseudo-labels incorporated only for training.
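The leakage-avoidance idea in these protocols is that the split is decided per source image, and patches only inherit the assignment of their parent. A minimal sketch follows; the identifiers, the round-robin fold assignment, and the fixed seed are illustrative assumptions.

```python
import random


def assign_folds(image_ids, k=10, seed=0):
    """Map each source image to one of k folds (shuffled round-robin)."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    return {image_id: index % k for index, image_id in enumerate(ids)}


def split_patches(patches, folds, held_out):
    """Split (patch_id, parent_image_id) pairs by the fold of the parent image."""
    train = [pid for pid, parent in patches if folds[parent] != held_out]
    test = [pid for pid, parent in patches if folds[parent] == held_out]
    return train, test


# Hypothetical data: 20 images with 3 patches each.
images = [f"img_{i:03d}" for i in range(20)]
patches = [(f"{img}_p{j}", img) for img in images for j in range(3)]

folds = assign_folds(images, k=10)
train_ids, test_ids = split_patches(patches, folds, held_out=0)
```

Splitting at the patch level instead would let overlapping patches from one image land on both sides of the split, which inflates test scores.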

5. Performance Baselines and Practical Recommendations

Aphid Cluster Dataset

Recommendation: Semantic segmentation is preferred for severity/area quantification; object detection is adequate for “alert” style applications.

StomataSeg

  • Semantic (SegFormer MIT-B1): mIoU increases from 65.93% (full-frame) to 70.35% (patch-based).
  • Instance (Mask R-CNN + ConvNeXt-V2): AP rises from 28.30% (full-frame) to 46.10% (patches), and further to 49.20% after semi-supervised expansion, with AP_pore = 39.10%.

Sorghum-100

  • Architectural guidance: Multi-resolution/Dynamic Outlier Pooling captures both canopy and fine leaf structure; strict color normalization and plot-based splits mitigate confounding.
  • Task suitability: Cultivar classification exhibits low inter-class variance, so it benefits from the dataset's large scale, strict spatial isolation of train and test plots, and high intra-plot homogeneity.

6. Applicability, Limitations, and Known Biases

Use Cases

  • Precision agricultural robotics: Triggers for localized smart-spraying based on cluster masks or infestation quantification (Rahman et al., 2024).
  • Phenotyping and breeding: Cultivar identity confirmation, leaf trait variance, and stress response quantification (Ren et al., 2021).
  • Physiological research: Automated, scalable stomatal trait measurement for water-use efficiency and drought tolerance studies (Huang et al., 31 Jan 2026).

Limitations

  • Class imbalance: Positive mask area for pest/symptom is typically <3%, which calls for class-weighted losses or oversampling of positive regions.
  • Annotation scope: Aphid and stomatal datasets label only overt clusters or sufficiently large individual structures; small infestations, single aphids, and sub-threshold components are excluded.
  • Geographic and environmental coverage: Main datasets cover limited geographies (Kansas for aphids, Arizona for Sorghum-100, greenhouse-grown for stomata); do not capture all global canopy or pest variability, extreme weather, or night conditions.
  • Exclusion of certain annotations: No comprehensive fine-grained life-stage, physiological, or developmental labels in aphid/stomatal datasets; Sorghum-100 does not annotate individual leaves.
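To make the class-imbalance point concrete, the sketch below derives a positive-class weight from the foreground fraction and applies it in a weighted binary cross-entropy. It is an illustration only: the inverse-frequency weighting rule is one common heuristic assumed here, not a recommendation taken from the dataset papers, and the function names are hypothetical.

```python
import math


def positive_weight(pos_fraction):
    """Ratio of negative to positive pixels, used to up-weight the rare class."""
    return (1.0 - pos_fraction) / pos_fraction


def weighted_bce(probs, targets, pos_weight):
    """Mean binary cross-entropy with positives scaled by `pos_weight`."""
    eps = 1e-7
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


# With the 2.45% mean mask coverage reported for the aphid data:
w = positive_weight(0.0245)  # roughly 39.8

missed_positive = weighted_bce([0.1], [1], w)
unweighted_miss = weighted_bce([0.1], [1], 1.0)
```

With a weight near 40, a missed aphid pixel costs about forty times more than a false alarm of equal confidence. This is the same idea as the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`.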

A plausible implication is that while these datasets provide robust platforms for task benchmarking, adaptation to different field environments or broader pathosystem monitoring may require supplemental data or transfer learning.

7. Data Accessibility and Organization

All three datasets provide standardized access conventions:

  • Directory schemas: Human/machine-annotated PNG or COCO-JSON masks (StomataSeg), JPEGs with linked CSVs for plot/metadata (Sorghum-100), organized by train/test splits.
  • License: Sorghum-100 and StomataSeg to be released under CC0 / CC BY-NC-ND 4.0; Aphid Cluster availability as described in (Rahman et al., 2024).
  • Supplementary code/scripts: Cropping, normalization, and loader scripts accompany dataset distributions, with detailed split and preprocessing protocols.

Standardized evaluation metrics, file naming, and preprocessing (especially ImageNet normalization and patch-based extraction) support reproducibility across tasks. For each dataset, recommended practices include strict adherence to the published train/test protocols, multi-scale representations, and robust augmentation to mitigate dataset-specific biases.
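ImageNet normalization refers to scaling pixel values to [0, 1] and then standardizing each channel with the widely used ImageNet statistics. A minimal per-pixel sketch follows; the function name is hypothetical, and real pipelines would apply the same arithmetic to whole image tensors.

```python
# Commonly used ImageNet channel statistics (RGB order, values in [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


near_mean = normalize_pixel((124, 116, 104))  # close to the ImageNet mean colour
black = normalize_pixel((0, 0, 0))
```

A pixel near the ImageNet mean colour maps to values close to zero in every channel, which is the input range that ImageNet-pretrained backbones expect.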


References:

  • "A New Dataset and Comparative Study for Aphid Cluster Detection and Segmentation in Sorghum Fields" (Rahman et al., 2024)
  • "Multi-resolution Outlier Pooling for Sorghum Classification" (Ren et al., 2021)
  • "StomataSeg: Semi-Supervised Instance Segmentation for Sorghum Stomatal Components" (Huang et al., 31 Jan 2026)
