MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer (2309.09067v2)

Published 16 Sep 2023 in cs.CV

Abstract: Precise crop yield prediction provides valuable information for agricultural planning and decision-making processes. However, timely predicting crop yields remains challenging as crop growth is sensitive to growing season weather variation and climate change. In this work, we develop a deep learning-based solution, namely Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT), for predicting crop yields at the county level across the United States, by considering the effects of short-term meteorological variations during the growing season and the long-term climate change on crops. Specifically, our MMST-ViT consists of a Multi-Modal Transformer, a Spatial Transformer, and a Temporal Transformer. The Multi-Modal Transformer leverages both visual remote sensing data and short-term meteorological data for modeling the effect of growing season weather variations on crop growth. The Spatial Transformer learns the high-resolution spatial dependency among counties for accurate agricultural tracking. The Temporal Transformer captures the long-range temporal dependency for learning the impact of long-term climate change on crops. Meanwhile, we also devise a novel multi-modal contrastive learning technique to pre-train our model without extensive human supervision. Hence, our MMST-ViT captures the impacts of both short-term weather variations and long-term climate change on crops by leveraging both satellite images and meteorological data. We have conducted extensive experiments on over 200 counties in the United States, with the experimental results exhibiting that our MMST-ViT outperforms its counterparts under three performance metrics of interest.

Summary

The paper introduces MMST-ViT, a novel vision transformer that integrates multi-modal data for county-level crop yield prediction in a climate change context.
It employs a three-component architecture—multi-modal, spatial, and temporal transformers—to capture both short-term weather and long-term climate effects.
Experiments on US soybean data achieved an RMSE of 3.9, R² of 0.843, and a 0.918 correlation, demonstrating robust predictive accuracy.

MMST-ViT: A Vision Transformer Approach for Climate Change-aware Crop Yield Prediction

Introduction

Predicting crop yields accurately is vital for effective agricultural planning, economic decisions, and ensuring global food security. Such predictions are notably challenging due to the sensitivity of crop growth to weather variations and climate change. This paper introduces the Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT), a novel approach that integrates both visual remote sensing data and meteorological data to predict crop yields at the county level across the United States. The model specifically aims to capture the effects of short-term meteorological variations during the growing season and long-term climate change.

Methodology

The proposed MMST-ViT model comprises three key components: (1) a Multi-Modal Transformer that processes both visual and short-term meteorological data to model the impact of growing season weather variations; (2) a Spatial Transformer to learn high-resolution spatial dependencies for accurate agricultural tracking; and (3) a Temporal Transformer that captures the long-term temporal dependence, representing the effect of climate change on crop growth.
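The data flow through the three components can be illustrated with a shape-level sketch. This is not the authors' implementation: the single-head attention, the mean pooling, the random weights, and all sizes (`T`, `S`, `P`, `D`) are assumptions chosen only to show how a multi-modal stage, a spatial stage, and a temporal stage might be chained. How long-term climate information enters the Temporal Transformer is not described in this summary, so the sketch omits it.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the token axis, then mean-pool.

    tokens: (..., n_tokens, d)  ->  (..., d)
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return (softmax(scores) @ v).mean(axis=-2)

def make_weights(d):
    # Random stand-ins for learned query/key/value projections.
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

# Hypothetical sizes: time steps, spatial locations, image patches, embedding dim.
T, S, P, D = 6, 4, 9, 16

params = {
    "multi_modal": make_weights(D),
    "spatial": make_weights(D),
    "temporal": make_weights(D),
    "head": rng.normal(size=D),  # linear regression head
}

def forward(image_tokens, weather_tokens, params):
    # image_tokens: (T, S, P, D), weather_tokens: (T, S, 1, D)
    # Stage 1: fuse image patches with the short-term weather token.
    fused = np.concatenate([image_tokens, weather_tokens], axis=-2)  # (T, S, P+1, D)
    local = attention_pool(fused, *params["multi_modal"])            # (T, S, D)
    # Stage 2: attend across spatial locations.
    regional = attention_pool(local, *params["spatial"])             # (T, D)
    # Stage 3: attend across time steps.
    seasonal = attention_pool(regional, *params["temporal"])         # (D,)
    return float(seasonal @ params["head"])                          # scalar yield

images = rng.normal(size=(T, S, P, D))
weather = rng.normal(size=(T, S, 1, D))
print(forward(images, weather, params))
```

Each stage reduces one axis (modality tokens, then space, then time), so the final representation is a single vector per sample that a regression head maps to a yield value.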

A notable contribution of this work is the development of a novel multi-modal contrastive learning technique. This technique enables efficient pre-training of the model without extensive labeling effort, leveraging the multi-modal nature of the input data (satellite images and meteorological parameters).
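The exact objective used in the paper is not given in this summary. As a rough illustration of what multi-modal contrastive pre-training can look like, the sketch below uses a generic symmetric InfoNCE-style loss (an assumption, not the authors' formulation): the image embedding and the meteorological embedding of the same sample are treated as a positive pair, and all other pairings in the batch as negatives. The function name, the `temperature` value, and the embedding sizes are hypothetical.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def multimodal_contrastive_loss(img_emb, met_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    img_emb, met_emb: (N, d) arrays; row i of each comes from the same sample.
    """
    img = l2_normalize(img_emb)
    met = l2_normalize(met_emb)
    logits = img @ met.T / temperature          # (N, N) cosine similarities
    idx = np.arange(logits.shape[0])
    # Image-to-weather direction: each row should pick its own column.
    loss_i2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Weather-to-image direction: each column should pick its own row.
    loss_m2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2m + loss_m2i))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
noise = 0.01 * rng.normal(size=(8, 16))
aligned = multimodal_contrastive_loss(img, img + noise)
shuffled = multimodal_contrastive_loss(img, np.roll(img, 1, axis=0))
print(aligned, shuffled)
```

Correctly paired embeddings give a lower loss than mismatched ones, which is the signal that lets the model learn from unlabeled image and weather pairs.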

Results and Discussion

Extensive experiments were conducted on data from over 200 counties in the United States, showing MMST-ViT's superior performance over existing approaches. For instance, on soybean yield prediction, the MMST-ViT model achieved the lowest RMSE of 3.9, the highest R-squared value of 0.843, and the best correlation coefficient of 0.918, indicating its robust predictive capability.
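For reference, the three metrics can be computed as in the following minimal sketch. The yield values are made up for illustration and are not from the paper. Here R² is taken to be the coefficient of determination; some works instead report the squared Pearson correlation. The reported figures are consistent with the latter, since 0.918² ≈ 0.843, but the summary does not say which definition was used.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, coefficient of determination (R²), and Pearson correlation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    rmse = np.sqrt(np.mean(residual ** 2))
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    return {"rmse": float(rmse), "r2": float(r2), "corr": float(corr)}

# Hypothetical county-level yields, for illustration only.
observed = [48.0, 52.5, 55.0, 60.5, 45.0]
predicted = [50.0, 51.0, 57.5, 58.0, 47.5]
print(regression_metrics(observed, predicted))
```

Lower RMSE is better, while R² and the correlation coefficient are better when closer to 1.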

Furthermore, visualizations of prediction errors across multiple U.S. states indicate that the model's accuracy is consistent across geographical regions. Early prediction experiments, in which crop yields are predicted one year ahead, also showed promising results, suggesting that the model is robust and practically useful for agricultural planning and decision-making.

Implications and Future Developments

The integration of multi-modal data (visual remote sensing and meteorological data) presents a significant step forward in crop yield prediction. It not only enhances the accuracy of the predictions but also provides insights into the complex interplay between crop growth, weather variations, and climate change.

Looking ahead, this work opens up several avenues for future research. One potential direction is to explore the impact of additional data modalities (such as soil moisture and nutrient levels) on prediction accuracy. Another is to develop more sophisticated multi-modal pre-training techniques that further improve the model's performance and its generalization across different crop types and geographic regions.

The MMST-ViT framework offers a promising model for applying advanced machine learning techniques to the pressing challenges that climate change poses to agriculture. Continued refinement and extension of this work could have significant positive impacts on food security, agricultural sustainability, and economic planning on a global scale.
