Sorghum-100 Dataset Benchmark

Updated 15 March 2026
  • Sorghum-100 is a large-scale benchmark of 48,106 RGB images capturing 100 visually similar sorghum cultivars from diverse field plots.
  • The dataset supports a dual-branch ResNet-50 architecture that fuses global and local image features using Dynamic Outlier Pooling for enhanced classification.
  • Empirical evaluations show that Dynamic Outlier Pooling outperforms GAP and GMP, achieving state-of-the-art accuracy of 78.79% on fine-grained cultivar identification.

The Sorghum-100 dataset is a large-scale, fine-grained benchmark specifically constructed for cultivar-level classification of sorghum from high-resolution RGB field imagery. It was designed to support the development and evaluation of visual recognition models under real-world field conditions, focusing on distinguishing between 100 visually similar sorghum cultivars. The dataset is closely linked to new methodological advances, notably a multi-resolution ResNet-50-based architecture and the Dynamic Outlier Pooling strategy, which together set a state-of-the-art baseline in the domain of agricultural image classification (Ren et al., 2021).

1. Dataset Composition and Image Acquisition

Sorghum-100 comprises a total of 48,106 RGB images corresponding to 100 unique sorghum cultivars, resulting in an average of approximately 481 images per cultivar. Each cultivar is represented by crops planted in two geographically distinct field plots, enabling assessments of classifier robustness to soil and micro-environmental heterogeneity.

Images were collected in June 2017 using the TERRA-REF gantry system—a field-automated phenotyping platform equipped with high-resolution stereo RGB cameras. Only RGB channels were utilized for this dataset, though the system also supports thermal, hyperspectral, and 3D modalities. Imaging was performed from a predominantly nadir angle (top-down or slightly oblique) on a daily basis during mid-season growth and before lodging events.

Native image resolutions exceeded 2K × 2K pixels. For model training, images were resized such that the shorter side was either 512 px (for global input) or 1,024 px (for the local branch), followed by 512 × 512 pixel cropping. Data augmentation included random horizontal and vertical flips and per-channel mean and standard deviation normalization. Plot-centric cropping was guided by exact camera pose metadata and ground-truth polygons to exclude adjacent cultivar interference.

| Property | Value / Procedure | Notes |
| --- | --- | --- |
| Number of images | 48,106 | ~481 images per cultivar across 100 cultivars |
| Image resolution | Native >2K × 2K; resized and cropped to 512 × 512 | Separate resize for each network branch |
| Sensor setup | TERRA-REF gantry, stereo RGB cameras | Thermal/hyperspectral channels unused in baseline |
| Data split | By plot: one plot per cultivar for training, the other for testing | No cross-validation |
| Annotation | Cultivar ID (100 classes), days after planting | No bounding-box or pixel-level annotation |

Training and test splits are strictly plot-based: for each cultivar, one plot's images are assigned to training and the other's to testing, eliminating overlap in environmental conditions between these sets. Each image is annotated with its cultivar class and relative planting date. No finer-grained supervision, such as bounding boxes or segmentation, is provided; labels correspond to the entire crop plot captured in each image.
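
The plot-based split can be sketched in a few lines; the `records` list of `(image_path, cultivar_id, plot_id)` tuples is a hypothetical representation of the dataset index, not an API from the dataset release.

```python
def plot_based_split(records):
    """Plot-based split sketch: for each cultivar, all images from one plot
    go to training and all images from the other plot go to test, so no
    plot (and its micro-environment) appears in both sets."""
    train, test = [], []
    train_plot = {}  # cultivar_id -> plot_id assigned to training
    for path, cultivar, plot in records:
        assigned = train_plot.setdefault(cultivar, plot)
        (train if plot == assigned else test).append((path, cultivar))
    return train, test
```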

2. Multi-Resolution Network Architecture

The standard baseline for Sorghum-100 is a dual-branch Convolutional Neural Network model, adopting ResNet-50 trunks pre-trained on ImageNet as backbones. The global branch receives a full plot image at reduced resolution (shortest side 512 px), while the local branch ingests four random 512 × 512 crops from a higher-resolution resize (shortest side 1,024 px). All four local crops are processed using shared ResNet-50 weights, enabling feature consistency across local samples.

After individual feature extraction, each branch applies the same global pooling strategy (Dynamic Outlier Pooling; see Section 3) to yield a 2,048-D representation. The outputs from both branches are concatenated into a single 4,096-D vector, which is then passed through a fully connected classification layer. This two-branch network explicitly integrates coarse global cues, such as canopy structure and spacing, with fine local details like panicle and floral morphology.

Layer specifications of each ResNet-50 are unmodified from the canonical architecture:

  • conv1: 7×7 kernel, stride 2, 64 channels → BatchNorm → ReLU → 3×3 max-pool, stride 2
  • conv2_x through conv5_x: standard bottleneck blocks with filter dimensions 256, 512, 1,024, and 2,048 channels respectively, batch normalization, ReLU activations, and residual connections.

During feature aggregation, cross-crop pooling (pooling applied jointly to the four local crops) enables the preservation of multiple spatially distinct cues such as repeated inflorescences, which are critical for cultivar discrimination.

3. Dynamic Outlier Pooling: Definition and Properties

The Dynamic Outlier Pooling (DOP) strategy is central to Sorghum-100's methodological contribution. Unlike Global Average Pooling (GAP) or Global Max Pooling (GMP), DOP adaptively focuses on spatially sparse but highly informative activation patterns within deep feature maps.

Given a per-channel activation map $A \in \mathbb{R}^{H \times W}$:

  • The mean $\mu_j = \frac{1}{HW}\sum_x A_x$ and standard deviation $\sigma_j = \sqrt{\frac{1}{HW}\sum_x (A_x - \mu_j)^2}$ are computed per channel $j$.
  • A threshold $t_j = \mu_j + \lambda \sigma_j$ is applied, with $\lambda = 2.0$ (chosen through grid search).
  • Outlier activations are selected via the indicator $I_{x,j} = 1$ if $A_x \geq t_j$, else $I_{x,j} = 0$.

“Static” Outlier Pooling averages only above-threshold activations. Dynamic Outlier Pooling interpolates between standard spatial average pooling and outlier pooling based on learning epoch:

  • Interpolation weights $w_1 = 1 + \frac{e}{E}$ and $w_2 = 1 - \frac{e}{E}$, where $e$ is the current epoch and $E$ the total number of epochs.
  • Final pooled output:

$$y_j = \frac{w_1 \sum_x I_{x,j} A_x + w_2 \sum_x (1 - I_{x,j}) A_x}{w_1 \sum_x I_{x,j} + w_2 \sum_x (1 - I_{x,j})}$$

At test time ($e = E$, hence $w_2 = 0$), pooling reduces to pure outlier pooling over the above-threshold activations.
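
The pooling rule above can be written directly in PyTorch. This is a sketch over a `(B, C, H, W)` feature map; the small epsilon clamp on the denominator is a numerical-safety assumption, not part of the published formulation.

```python
import torch

def dynamic_outlier_pool(feat, epoch, total_epochs, lam=2.0):
    """Dynamic Outlier Pooling sketch over a (B, C, H, W) feature map.

    Per channel, activations at least `lam` standard deviations above the
    channel mean are outliers (I_{x,j} = 1). Their weight w1 grows from 1
    to 2 over training while the weight w2 on the rest decays from 1 to 0,
    moving the pool from plain averaging toward pure outlier averaging.
    """
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)
    mu = x.mean(dim=2, keepdim=True)
    sigma = x.std(dim=2, unbiased=False, keepdim=True)
    outlier = (x >= mu + lam * sigma).float()            # I_{x,j}
    w1 = 1.0 + epoch / total_epochs                      # weight on outliers
    w2 = 1.0 - epoch / total_epochs                      # weight on the rest
    num = w1 * (outlier * x).sum(dim=2) + w2 * ((1 - outlier) * x).sum(dim=2)
    den = w1 * outlier.sum(dim=2) + w2 * (1 - outlier).sum(dim=2)
    return num / den.clamp(min=1e-6)                     # (B, C)
```

At epoch 0 the two weights are equal and the expression reduces to global average pooling, which matches the stated interpolation behavior.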

Relative to GAP, which averages over all spatial activations, and GMP, which selects only the maximum value, DOP preserves signal from all spatial locations with high activations (e.g., capturing all panicles in an image rather than a single largest one or diluting their signal via averaging). This design provides stable gradients when features are initially diffuse and increasingly emphasizes salient local features as learning progresses. Computationally, per-channel statistics are computed per forward pass with negligible overhead.

4. Empirical Evaluation and Ablation

The benchmarking of the Sorghum-100 dataset is anchored by direct performance comparisons between single-branch and multi-branch architectures, as well as between pooling strategies. Key experimental results include:

  • On low-resolution whole-image input:
    • GAP: 72.14% accuracy
    • GMP: 64.62%
    • Dynamic Outlier Pooling: 73.83%
  • On high-resolution four-crop input:
    • GAP: 74.30%
    • GMP: 65.18%
    • Dynamic Outlier Pooling: 76.92%
  • On the full multi-resolution model:
    • GAP: 76.33%
    • GMP: 65.49%
    • Dynamic Outlier Pooling: 78.79%

Ablation studies demonstrate that the multi-resolution approach consistently outperforms single-scale models, and Dynamic Outlier Pooling outperforms both GAP and GMP in all resolution regimes. Training with dynamic (vs. static) outlier pooling enables faster convergence. Pooling visualizations using synthetic “pink flower” filter maps validate that GAP dilutes multiple salient features, GMP captures only the strongest, while outlier pooling accurately selects all super-threshold activations across space.

Error analysis indicates that classification errors are dominantly due to confusion between visually similar cultivars, reflecting the fine-grained nature and low inter-class variance of the dataset. No formal p-values, F1 scores, or confusion matrices are published.

5. Practical Aspects and Implementation

Model training utilizes ResNet-50 backbones and NVIDIA V100 GPUs (16 GB RAM). The batch size is selected based on total GPU memory to accommodate both the global and four local crop branches. Typical training extends over 20 epochs, with Dynamic Outlier Pooling reducing the total epochs required for convergence.

At inference, the single-branch model (global view) processes one image per forward pass at standard ResNet-50 speed (~50 ms/image on NVIDIA V100), while the full multi-resolution model requires five passes (one global, four local), totaling approximately 250 ms/image. This runtime is compatible with near real-time field phenotyping, but the architecture may require adaptation for edge devices or for cases with higher sample throughput.

Image pre-cropping by plot is automated via GPS/RTK-aligned gantry metadata. The dataset's lack of cross-validation implies possible impact from unaccounted variation in field conditions between training and test plots. Pooling thresholds are determined per image, raising prospects for future research into per-batch or learned per-channel thresholds. Potential extensions include deeper feature pyramids and the integration of multi-sensor modalities (e.g., hyperspectral or thermal data).

6. Significance, Limitations, and Prospects

Sorghum-100 is a comprehensive, challenge-ready dataset for fine-grained, plot-level cultivar recognition under unconstrained field conditions, featuring controlled annotation and well-defined evaluation protocols. The accompanying multi-resolution architecture and DOP strategy collectively achieve a test-set accuracy of 78.79% on the 100-cultivar classification task, substantially surpassing prior pooling strategies. Limitations include the absence of cross-validation, reliance on RGB alone, and focus on global/local scale fusion rather than a full resolution pyramid.

Sorghum-100 is thus positioned as a reference benchmark for assessing novel aggregation, representation-learning, and scale-attentive architectures in agricultural machine vision. It also provides a foundation for method development that generalizes to other densely planted, low inter-class variance, high-throughput phenotyping problems (Ren et al., 2021).
