- The paper introduces a deep learning framework that integrates genotype clusters, Maturity Groups, and weekly weather data to improve soybean yield predictions.
- The study demonstrates that combining temporal weather patterns with genetic information significantly lowers RMSE compared to traditional models.
- The temporal attention mechanism identifies critical growth stages, offering actionable insights for plant breeding and agricultural management.
This paper (arXiv:2006.13847) presents a deep learning framework for predicting soybean crop yield by integrating genotype information (specifically, pedigree-based clusters and maturity groups) and time-series weather data across multiple environments and years. The research addresses the critical challenge of predicting crop performance under varying climatic conditions, which is essential for plant breeding, agricultural production, and mitigating the impacts of climate change.
Problem Addressed
Accurate crop yield prediction is complex due to the significant interaction between genotype (G), environment (E), and management (M). Traditional methods like process-based crop models are mechanistic but often struggle with complex or unforeseen conditions and require extensive parameter estimation. Linear models are simpler but fail to capture intricate biological and environmental interactions. Data-driven approaches based on geospatial or limited meteorological data have been developed, but often lack the ability to model temporal dependencies effectively, integrate detailed genotype information, or provide insights into which factors are most influential at different times. This paper aims to overcome these limitations using deep learning, specifically focusing on capturing temporal weather patterns and leveraging genetic relatedness information.
Data and Preprocessing
The paper utilizes historical performance records from the Uniform Soybean Tests (UST) in North America spanning 13 years (2003-2015). The dataset comprises 103,365 performance records from 5,839 unique genotypes across 150 locations (Figure 1). For each record, yield data and associated management information are available.
Weather data for each location and year combination were obtained from Weather.com for the nearest point on a 25 km grid. Daily weather records were downsampled to weekly aggregates over a 30-week growing season (April 1 to October 31). Seven weather variables were included: Average Direct Normal Irradiance (ADNI), Average Precipitation Previous Hour (AP), Average Relative Humidity (ARH), Maximum Direct Normal Irradiance (MDNI), Maximum Surface Temperature (MaxSur), Minimum Surface Temperature (MinSur), and Average Surface Temperature (AvgSur) (Table 1 in Supplementary Materials). The paper explored different temporal resolutions (daily, weekly, bi-weekly, monthly) and found that weekly data provided a good balance between detail and training speed (Table 3 in Supplementary Materials).
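The downsampling step can be sketched with pandas. This is a minimal illustration with made-up values and a hypothetical subset of the seven variables; the paper does not spell out the exact aggregation rule per variable, so using means for averages and max/min for the extremes is an assumption.

```python
# Sketch: aggregate daily weather records into weekly bins over the
# April 1 - October 31 growing season. Column names follow the paper's
# variable abbreviations; the aggregation choices are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2015-04-01", "2015-10-31", freq="D")
daily = pd.DataFrame({
    "AvgSur": rng.uniform(10, 30, len(days)),  # average surface temperature
    "MaxSur": rng.uniform(20, 40, len(days)),  # maximum surface temperature
    "MinSur": rng.uniform(0, 15, len(days)),   # minimum surface temperature
    "AP": rng.uniform(0, 5, len(days)),        # average precipitation
}, index=days)

# Means for the "average" variables, max/min for the extreme-value ones.
weekly = daily.resample("W").agg({
    "AvgSur": "mean", "MaxSur": "max", "MinSur": "min", "AP": "mean",
})
print(weekly.shape)  # roughly 30 weekly rows spanning the season
```

The same pattern extends directly to the full set of seven variables.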
Genotype information included Maturity Group (MG) and pedigree data. Since molecular marker data were unavailable for most genotypes, a pedigree-based correlation matrix was constructed for all lines with available parentage information. K-means clustering was applied to this correlation matrix to group genotypes into 5 clusters based on relatedness. The cluster assignment and Maturity Group were used as genotype-specific features in the models.
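The clustering step can be sketched as follows. The correlation matrix here is synthetic (the paper derives it from parentage records), and each genotype is represented by its row of correlations to all other genotypes before k-means is applied.

```python
# Sketch: k-means (k=5) on rows of a pedigree-style correlation matrix.
# The matrix below is synthetic stand-in data, not real pedigree records.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_genotypes = 200
latent = rng.normal(size=(n_genotypes, 10))   # hypothetical relatedness factors
corr = np.corrcoef(latent)                    # (200, 200) correlation matrix

# Row i = genotype i's correlation with every other genotype.
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(corr)
clusters = km.labels_  # cluster assignment used as a model input feature
print(np.bincount(clusters))  # number of genotypes per cluster
```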
All input features were scaled to the range (-1, 1) based on the training set statistics. The dataset was randomly split into 80% for training, 10% for validation, and 10% for testing.
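A minimal sketch of this preprocessing, assuming scikit-learn and random placeholder data: split first, then fit the (-1, 1) scaler on the training portion only, so validation and test statistics never leak into the scaling.

```python
# Sketch: 80/10/10 random split, then (-1, 1) scaling fit on training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((1000, 30 * 7))  # 30 weeks x 7 weather variables, flattened
y = rng.random(1000)

# 80% train, then split the remaining 20% evenly into validation and test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_tr)  # training stats only
X_tr, X_va, X_te = (scaler.transform(a) for a in (X_tr, X_va, X_te))
print(X_tr.shape, X_va.shape, X_te.shape)
```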
Methodology: Deep Learning Models
The core of the methodology lies in using Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, which are well-suited for modeling sequential data like time-series weather information. The vanishing gradient problem in standard RNNs during training with long sequences is mitigated by LSTM's internal gating mechanisms.
Two primary models were developed:
- Stacked LSTM Model (without attention): This model uses two stacked LSTM layers to process the sequence of weekly weather data. The final hidden state of the second LSTM layer, which is assumed to encode information from the entire input sequence, is then fed into a dense layer to predict the final yield.
- Temporal Attention Model: This model also uses stacked LSTM layers to process the input sequence, generating annotations (hidden states) for each time step. Instead of relying solely on the final hidden state, this model incorporates a temporal attention mechanism. The attention mechanism computes a weighted sum of all time-step annotations, where the weights indicate the importance of each time step for the final prediction. This weighted sum forms a "context vector" which is then used by a dense layer for yield prediction. This mechanism provides interpretability by highlighting critical time windows during the growing season (Figure 3).
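The two architectures can be sketched in Keras, which the paper itself uses. The layer widths and the single-dense-layer attention scoring below are illustrative assumptions; the paper's exact hyperparameters are given in its supplementary tables.

```python
# Sketch of the Temporal Attention model: two stacked LSTMs produce
# per-week annotations, a learned score is softmaxed over the 30 weeks,
# and the weighted sum (context vector) feeds the yield prediction.
import tensorflow as tf
from tensorflow.keras import Model, layers

Tx, n_feat = 30, 7                       # weeks x weather variables
inp = layers.Input(shape=(Tx, n_feat))

h = layers.LSTM(64, return_sequences=True)(inp)   # first stacked LSTM
h = layers.LSTM(64, return_sequences=True)(h)     # annotations, (None, 30, 64)

scores = layers.Dense(1)(h)                       # score per week, (None, 30, 1)
weights = layers.Softmax(axis=1)(scores)          # attention over time steps
context = layers.Dot(axes=1)([weights, h])        # weighted sum, (None, 1, 64)
context = layers.Flatten()(context)               # context vector, (None, 64)
yield_pred = layers.Dense(1)(context)             # predicted yield

model = Model(inp, yield_pred)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
print(model.output_shape)  # (None, 1)
```

The Stacked LSTM variant without attention is the same network with `return_sequences=False` on the second LSTM and the attention layers removed, so only the final hidden state reaches the dense output layer.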
For both models, the authors explored two variants based on the input information:
- Using only the 7 weekly weather variables as input sequence.
- Concatenating the Maturity Group and genotype cluster information with the weather variables at each time step and also after the second LSTM layer, before the final prediction layer. The supplementary materials describe how these additional inputs were incorporated into the model architecture (Tables 4 and 5 in Supplementary Materials).
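The per-time-step concatenation in the second variant amounts to tiling the static genotype features across the sequence. A small NumPy sketch, with hypothetical integer encodings for MG and cluster:

```python
# Sketch: repeat static genotype features (MG, cluster) at every week and
# concatenate them with the weather sequence along the feature axis.
import numpy as np

batch, Tx = 4, 30
weather = np.random.default_rng(0).random((batch, Tx, 7))  # 7 weekly variables
mg = np.array([2, 3, 5, 4]).reshape(batch, 1)       # maturity group per record
cluster = np.array([0, 4, 1, 2]).reshape(batch, 1)  # pedigree cluster per record

static = np.concatenate([mg, cluster], axis=1)          # (batch, 2)
static_seq = np.repeat(static[:, None, :], Tx, axis=1)  # (batch, 30, 2)
x = np.concatenate([weather, static_seq], axis=2)       # (batch, 30, 9)
print(x.shape)  # (4, 30, 9)
```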
The models were trained using the Adam optimizer with a learning rate of 0.001 and Mean Squared Error (MSE) as the loss function. Training was performed for 200 epochs. Hyperparameter tuning included evaluating different input sequence lengths (Tx) and exploring how to best incorporate MG and cluster information. A greedy forward selection approach was used to assess the importance of individual weather variables, both for the entire dataset and separately for northern (MG 0-4) and southern (MG 4-8) regions (Tables 6, 7, 8 in Supplementary Materials).
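The greedy forward selection over weather variables can be illustrated as follows. To keep the sketch fast, a linear model stands in for the LSTM, and the data are synthetic with a known dependence on MinSur and AP; the procedure (at each round, add the variable that most reduces validation error) is the same.

```python
# Illustrative greedy forward selection over the 7 weather variables.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
names = ["ADNI", "AP", "ARH", "MDNI", "MaxSur", "MinSur", "AvgSur"]
X = rng.normal(size=(500, 7))
# Synthetic yield driven mostly by MinSur (index 5), weakly by AP (index 1).
y = 2.0 * X[:, 5] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X_tr, X_va, y_tr, y_va = X[:400], X[400:], y[:400], y[400:]

selected, remaining = [], list(range(7))
for _ in range(3):  # select the top three variables
    errs = {}
    for j in remaining:
        cols = selected + [j]
        m = LinearRegression().fit(X_tr[:, cols], y_tr)
        errs[j] = mean_squared_error(y_va, m.predict(X_va[:, cols]))
    best = min(errs, key=errs.get)  # variable giving the lowest validation MSE
    selected.append(best)
    remaining.remove(best)
print([names[j] for j in selected])  # MinSur and AP are recovered first
```

Running the same loop with the LSTM as the evaluated model (and RMSE on held-out data) reproduces the paper's variable-importance rankings at much higher cost per round.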
Implementation was done using Keras with a TensorFlow backend, utilizing NVIDIA GPUs for training.
Results and Performance
The key findings include:
- Performance Comparison: Both the Stacked LSTM and Temporal Attention models significantly outperformed baseline models like Random Forest (RF) and LASSO regression (Table 9 in Supplementary Materials). For Tx=30 weeks, the best Stacked LSTM model achieved a Test RMSE of 7.130 bu/acre, compared to 9.889 for RF and 12.779 for LASSO.
- Impact of Inputs: The best performance was achieved when information from all sources – weekly weather variables, Maturity Group, and genotype cluster – was included (Table 10 in Supplementary Materials, Figure 4). Including only weather data resulted in higher RMSE (e.g., 8.289 for Stacked LSTM with Tx=30), while models using only MG or Cluster performed much worse (RMSE > 15). This demonstrates the value of integrating genetic and environmental factors.
- Comparison with USDA Model: The deep learning model also showed significantly lower absolute errors than the state-of-the-art, data-driven USDA model for state-wise average yield predictions over multiple years (Table 11 in Supplementary Materials).
- Explainability via Attention: The Temporal Attention model provided insights into the relative importance of different time windows during the growing season (Figure 5). It showed that weather variables later in the season (August-September), coinciding with reproductive phases, were more important for yield prediction, especially for higher-yielding genotypes.
- Weather Variable Importance: The greedy search revealed different rankings of weather variable importance depending on the region (North vs. South) and whether MG/Cluster were included initially. Minimum surface temperature was highlighted as potentially more important than previously thought, suggesting the significance of nighttime temperatures.
Practical Implications and Applications
The framework presented has several practical applications:
- Plant Breeding: The model can serve as a "hypotheses generation tool" to understand complex GxExM interactions. By conducting sensitivity analyses and "what-if" scenarios, breeders can identify key environmental factors influencing yield plasticity, select superior varieties for specific conditions, and design targeted breeding programs for climate resilience.
- Agricultural Production: The model can provide more accurate and finer-resolution (location-specific) yield predictions compared to traditional state-level models. This information can aid farmers in making marketing decisions, grain cooperatives in logistics planning, and crop insurance providers in risk assessment.
- Research: The attention mechanism provides insights into critical growth stages and environmental variables, guiding further biological and physiological research. For example, the finding about minimum surface temperature importance warrants dedicated studies.
- Data Integration: The approach demonstrates a strategy for effectively integrating diverse data sources (temporal weather, categorical genotype traits, pedigree-based relatedness) for complex agricultural prediction tasks.
Implementation Considerations
- Data Requirements: The model relies on a large volume of historical data spanning multiple years, locations, and genotypes. Access to detailed, cleaned data including yield, weather records, and genotype information (pedigree or molecular markers) is crucial.
- Feature Engineering: Preprocessing steps like downsampling weather data to appropriate time intervals (e.g., weekly) and creating genotype features (like clusters from pedigree) are necessary.
- Model Architecture: Implementing stacked LSTMs and the temporal attention mechanism requires understanding recurrent neural networks and attention layers. Standard deep learning libraries like TensorFlow or PyTorch provide the necessary components.
- Computational Resources: Training deep learning models, especially on large datasets and with sequential data, can be computationally intensive, requiring GPUs.
- Interpretability: While the attention mechanism offers insights into temporal importance, interpreting the interactions between multiple weather variables and genotype features within the LSTM layers remains challenging.
- Scalability: The approach can be scaled to include more data sources (soil, management, remote sensing) and potentially applied to other crops.
Limitations and Future Work
The paper acknowledges limitations:
- The current genotype representation (MG and K-means clusters from pedigree) is a proxy for genetic variation. Integrating high-resolution molecular marker data (e.g., SNPs) could improve genotype-specific predictions.
- The model does not currently include other important factors influencing yield, such as soil properties, management practices (irrigation, fertilization, planting date, row spacing), or disease/pest pressure. Including these factors is a key area for future improvement.
- While the attention mechanism provides temporal insights, establishing definitive causality from deep learning models is an open research problem. The authors propose using DL models for hypothesis generation to guide empirical validation.
Future work includes integrating genomic data, incorporating additional environmental and management factors, leveraging sensor and remote sensing data for physiological traits, and further exploring the use of DL models for generating testable hypotheses about GxExM interactions and climate change impacts.
In summary, this research successfully demonstrates a deep learning approach using LSTMs and attention to build accurate and somewhat interpretable models for soybean yield prediction by effectively integrating temporal weather data and genotype information derived from pedigree. The framework serves as a powerful tool for data assimilation, prediction, and hypothesis generation in plant science and breeding.