Crop Yield Prediction Integrating Genotype and Weather Variables Using Deep Learning

Published 24 Jun 2020 in cs.LG and stat.ML | (2006.13847v1)

Abstract: Accurate prediction of crop yield, supported by scientific and domain-relevant insights, can help improve agricultural breeding, provide monitoring across diverse climatic conditions and thereby protect against climatic challenges to crop production including erratic rainfall and temperature variations. We used historical performance records from Uniform Soybean Tests (UST) in North America spanning 13 years of data to build a Long Short-Term Memory (LSTM) recurrent neural network model to dissect and predict genotype response in multiple environments by leveraging pedigree relatedness measures along with weekly weather parameters. Additionally, for providing explainability of the important time-windows in the growing season, we developed a model based on a temporal attention mechanism. The combination of these two models outperformed random forest (RF), LASSO regression and the data-driven USDA model for yield prediction. We deployed this deep learning framework as a 'hypotheses generation tool' to unravel GxExM relationships. Attention-based time series models provide a significant advancement in interpretability of yield prediction models. The insights provided by explainable models are applicable in understanding how plant breeding programs can adapt their approaches for global climate change, for example, identification of superior varieties for commercial release, intelligent sampling of testing environments in variety development, and integrating weather parameters for a targeted breeding approach. Using DL models as hypothesis generation tools will enable development of varieties with plasticity response in variable climatic conditions. We envision broad applicability of this approach (via conducting sensitivity analysis and "what-if" scenarios) for soybean and other crop species under different climatic conditions.

Summary

The paper introduces hybrid LSTM and attention models to combine genotype, weather, and maturity data for crop yield prediction.
It reports a test RMSE of 7.130 and an R² of 0.802, outperforming the random forest, LASSO, and USDA baselines.
The use of temporal attention clarifies the impact of late-season weather variables on yield, supporting targeted breeding strategies.

Deep Learning Integration of Genotype and Weather Variables for Crop Yield Prediction

Introduction

This paper addresses the challenge of predicting soybean yield across diverse multi-environment trials by integrating genotype information and multivariate weather time series using deep learning (DL) frameworks. Yield prediction is central to modern plant breeding and agricultural management, especially under increasing climate variability. While genotype-by-environment (GxE) interactions have conventionally been handled via process-based or low-capacity statistical models, the authors propose a hybrid framework that combines LSTM-based sequence models with an attention mechanism, targeting both prediction accuracy and interpretability.

Dataset and Feature Engineering

The study leverages an extensive dataset comprising 13 years (2003–2015) of National Uniform Soybean Tests (UST) across 150 locations in the US and Canada (Figure 1), generating 103,365 performance records linked to 5,839 unique genotypes. These records are matched to weekly weather aggregates over 30 weeks of the growing season. Genotype structure is captured via pedigree-derived K-means clustering (five clusters), and maturity group (MG) information is included to better capture known adaptation zones.
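The weekly aggregation step can be illustrated with a minimal sketch. It assumes daily weather is stored as an array of shape (days, variables) and that the weekly aggregate is a simple mean; the summary does not state the aggregation function, and the function name `weekly_aggregate` and the synthetic data are hypothetical.

```python
import numpy as np

def weekly_aggregate(daily, days_per_week=7):
    """Average a (n_days, n_vars) daily array into (n_weeks, n_vars).

    Assumption: a plain mean per week; trailing days that do not fill
    a complete week are dropped.
    """
    n_days, n_vars = daily.shape
    n_weeks = n_days // days_per_week
    trimmed = daily[: n_weeks * days_per_week]
    return trimmed.reshape(n_weeks, days_per_week, n_vars).mean(axis=1)

# Synthetic stand-in: 30 weeks x 7 days of 7 weather variables.
rng = np.random.default_rng(0)
daily = rng.normal(size=(210, 7))
weekly = weekly_aggregate(daily)
print(weekly.shape)
```

Each trial record would then be paired with one such 30-step sequence for its location and year.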

Figure 1: Geographic distribution and density of soybean variety trial locations in the USA and Canada, illustrating data volume and maturity group partitioning.

Model Architecture

The authors develop two model variants:

Stacked LSTM Model: Encodes the temporal dynamics of the seven weather variables, with optional augmentation by genotype cluster and MG at each time step and/or prediction layer (Figure 2).
Temporal Attention Model: Incorporates a temporal attention mechanism over LSTM-encoded hidden states, learning to weight the contribution of each growth stage to yield prediction (Figure 3).

Both models are trained for regression (seed yield, many-to-one setup) and compared to Random Forest (RF), LASSO, and the USDA's state-level linear regression model.

Figure 2: Architecture of the stacked LSTM model for encoding weather and genotype/management features across time.
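One way to read "augmentation at each time step" is that the static genotype cluster and MG features are repeated along the time axis and concatenated to the weekly weather vector. The sketch below assumes one-hot encodings and a hypothetical count of 10 maturity groups (`n_mg`); neither detail is given in this summary.

```python
import numpy as np

def one_hot(index, n_classes):
    """Return a length-n_classes vector with a 1.0 at position index."""
    out = np.zeros(n_classes)
    out[index] = 1.0
    return out

def augment_sequence(weather, cluster_id, mg_id, n_clusters=5, n_mg=10):
    """Append static features to every time step.

    weather: (T, n_vars) weekly weather sequence.
    Returns an array of shape (T, n_vars + n_clusters + n_mg).
    n_mg=10 is a placeholder, not a value taken from the paper.
    """
    static = np.concatenate([one_hot(cluster_id, n_clusters),
                             one_hot(mg_id, n_mg)])
    tiled = np.tile(static, (weather.shape[0], 1))  # repeat across time
    return np.concatenate([weather, tiled], axis=1)

rng = np.random.default_rng(0)
weather = rng.normal(size=(30, 7))  # 30 weeks, 7 weather variables
x = augment_sequence(weather, cluster_id=2, mg_id=3)
print(x.shape)
```

The alternative mentioned in the text, adding the static features only at the prediction layer, would instead concatenate `static` to the final LSTM output.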

Figure 3: Temporal attention model schematic, illustrating attention weights over LSTM-encoded time series for interpretability.
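A temporal attention layer of this kind can be sketched in NumPy. The example uses additive (tanh) scoring, which is a common choice but an assumption here, since the summary does not specify the scoring function; the random `hidden` array stands in for LSTM outputs, and `W` and `v` are hypothetical learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(hidden, W, v):
    """Weight hidden states over time and pool them into one vector.

    hidden: (batch, T, d) hidden states, one per time step.
    W: (d, a) projection, v: (a,) scoring vector.
    Returns context (batch, d) and attention weights (batch, T).
    """
    scores = np.tanh(hidden @ W) @ v                     # (batch, T)
    weights = softmax(scores, axis=1)                    # sums to 1 over T
    context = np.einsum("bt,btd->bd", weights, hidden)   # weighted sum
    return context, weights

rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 30, 16))  # stand-in for LSTM hidden states
W = rng.normal(size=(16, 8))
v = rng.normal(size=8)
context, weights = temporal_attention(hidden, W, v)
print(context.shape, weights.shape)
```

The context vector would feed a regression head for yield, while the per-week weights are what make the model inspectable.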

The models are trained using the Adam optimizer, with hyperparameters tuned for weekly weather sampling by minimizing RMSE.

Experimental Results

Model Inputs and Variable Importance

A series of ablation studies quantifies the impact of including MG, genotype cluster, and various weather variable subsets (Figure 4). Including both genotype and MG with weather data yields the highest performance (test RMSE = 7.130, $R^2 = 0.802$), outperforming all baselines. Greedy search for weather variable importance consistently identifies minimum surface temperature and average irradiance as highly informative, with variable rankings differing slightly by region.
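The greedy search over variables can be illustrated with a generic forward-selection loop. This is a sketch under stated assumptions: a linear least-squares model is swapped in for the LSTM to keep the example small, the data are synthetic, and the variable names are placeholders rather than the paper's weather variables.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def fit_predict(X_train, y_train, X_val):
    """Least squares with intercept; a stand-in for the paper's LSTM."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([X_val, np.ones(len(X_val))]) @ coef

def greedy_forward_selection(X_train, y_train, X_val, y_val, names):
    """Add, at each round, the variable that most lowers validation RMSE."""
    remaining = list(range(X_train.shape[1]))
    selected, history = [], []
    while remaining:
        scores = []
        for j in remaining:
            cols = selected + [j]
            pred = fit_predict(X_train[:, cols], y_train, X_val[:, cols])
            scores.append((rmse(y_val, pred), j))
        best_rmse, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        history.append((names[best_j], best_rmse))
    return history

# Synthetic data: only var_0 and var_2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=400)
names = ["var_0", "var_1", "var_2", "var_3"]
history = greedy_forward_selection(X[:300], y[:300], X[300:], y[300:], names)
for name, score in history:
    print(f"{name}: validation RMSE {score:.3f}")
```

The order in which variables enter gives a ranking; running the loop separately per region would allow the region-specific rankings mentioned above.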

Figure 4: Triangular diagram illustrating relative contributions of maturity group, genotype cluster, and weather variables to the LSTM model’s predictive accuracy.

Comparative Evaluation

The LSTM-based models substantially outperform the LASSO (RMSE = 12.779) and Random Forest (RMSE = 9.889) baselines. When compared with the USDA’s statewide regressor, the DL approach achieves lower absolute errors for all but one year (see supplementary Table 10 for the annual breakdown). On this dataset, these results indicate a clear quantitative advantage for LSTM frameworks in multivariate, multi-environment yield prediction.
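For reference, the two metrics quoted in this section have the following standard definitions, assuming the paper uses them in the usual form, with $y_i$ the observed yield, $\hat{y}_i$ the predicted yield, $\bar{y}$ the mean observed yield, and $n$ the number of test records:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is expressed in the units of yield, so lower is better, while $R^2$ is unitless and higher is better.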

Attention-based Interpretation

A key innovation is the use of temporal attention, which provides insight into the relative importance of different segments of the growing season. Results indicate that late-season variables (August–September, typically the reproductive phase) dominate yield prediction for high-performing genotypes (Figure 5). This matches domain expectations and supports hypothesis generation about the critical phases affecting yield plasticity in different maturity zones.

Figure 5: Distribution of temporal attention weights for two distinct maturity groups, linking critical phenological stages to yield impact.
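Attention weights like these can be summarized per group with a few lines of NumPy. The sketch assumes a weights array of shape (records, weeks) and a hypothetical definition of "late season" as the last `n_late_weeks` weeks; mapping week indices to calendar months depends on planting dates, which this summary does not give. The weights here are synthetic.

```python
import numpy as np

def late_season_share(weights, n_late_weeks):
    """Fraction of mean attention mass on the last n_late_weeks weeks."""
    mean_w = weights.mean(axis=0)
    mean_w = mean_w / mean_w.sum()
    return float(mean_w[-n_late_weeks:].sum())

def top_weeks(weights, k=5):
    """Indices of the k weeks with the highest mean attention weight."""
    mean_w = weights.mean(axis=0)
    return np.argsort(mean_w)[::-1][:k].tolist()

# Synthetic attention weights that drift toward later weeks.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 30)) + np.linspace(0.0, 3.0, 30)
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(late_season_share(weights, n_late_weeks=8))
print(top_weeks(weights, k=3))
```

Computing these summaries separately for each maturity group would give a numeric counterpart to the distributions shown in the figure.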

Discussion and Implications

The presented framework demonstrates capacity for high-resolution, explainable yield prediction leveraging both pedigree-derived genotype structure and multivariate weather time series. The authors note that the model’s predictive error is roughly 14% of the mean test-set yield, indicating robust performance even in highly variable environments.
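As a rough consistency check, and assuming the "predictive error" refers to the test RMSE of 7.130 reported earlier, the 14% figure implies a mean test-set yield of about 51 in the same units as the RMSE:

$$
\frac{\mathrm{RMSE}}{\bar{y}_{\text{test}}} \approx 0.14
\quad\Longrightarrow\quad
\bar{y}_{\text{test}} \approx \frac{7.130}{0.14} \approx 50.9
$$

Because the 14% figure is itself approximate, this is only an order-of-magnitude estimate, not a value reported in the summary.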

A salient outcome is the ability to interrogate the model for the effect of weather events at distinct phenological stages, enabling targeted hypothesis generation for future breeding strategies. Notably, the finding that minimum surface temperature (nighttime temperatures) exerts outsized importance is contrary to previous empirical heuristics and signals the need for further physiological investigation. The computational approach thus augments traditional process-based modelling with new correlative hypotheses that remain to be validated experimentally.

The methodology generalizes naturally to other crops and environments, given sufficient multivariate time series and pedigree/cluster information. However, the absence of full molecular marker data for genotypes limits the resolution available for integrating genomic prediction, which the authors note as an avenue for future work. Similarly, inclusion of additional management, soil, and phenomic data would likely enhance performance and granularity.

The authors also emphasize the enhanced utility of attention mechanisms for moving DL models beyond "black box" status, supporting their use in both operational yield forecasting and as tools for scientific exploration of GxExM relationships under climate change.

Conclusion

This study establishes that LSTM-based and attention-augmented deep learning models can accurately predict crop yield in diverse environments by integrating environmental and genotype information. The approach achieves higher accuracy and finer granularity than established statistical and data-driven USDA models, while simultaneously addressing model interpretability via temporal attention. The DL models are positioned not only as predictive tools but as engines for hypothesis generation, adapting plant breeding and management strategies for climatic variability. Future work should focus on incorporating richer genomic and management data and advancing causal inference in deep learning to more rigorously dissect GxE interactions.
