A machine learning methodology for real-time forecasting of the 2019-2020 COVID-19 outbreak using Internet searches, news alerts, and estimates from mechanistic models (2004.04019v1)

Published 8 Apr 2020 in stat.OT, cs.LG, q-bio.PE, and stat.ML

Abstract: We present a timely and novel methodology that combines disease estimates from mechanistic models with digital traces, via interpretable machine-learning methodologies, to reliably forecast COVID-19 activity in Chinese provinces in real-time. Specifically, our method is able to produce stable and accurate forecasts 2 days ahead of current time, and uses as inputs (a) official health reports from the Chinese Center for Disease Control and Prevention (China CDC), (b) COVID-19-related internet search activity from Baidu, (c) news media activity reported by Media Cloud, and (d) daily forecasts of COVID-19 activity from GLEAM, an agent-based mechanistic model. Our machine-learning methodology uses a clustering technique that enables the exploitation of geo-spatial synchronicities of COVID-19 activity across Chinese provinces, and a data augmentation technique to deal with the small number of historical disease activity observations, characteristic of emerging outbreaks. Our model's predictive power outperforms a collection of baseline models in 27 out of the 32 Chinese provinces, and could be easily extended to other geographies currently affected by the COVID-19 outbreak to help decision makers.


Authors (8)

Dianbo Liu
Leonardo Clemente
Canelle Poirier
Xiyu Ding
Matteo Chinazzi
Jessica T Davis
Alessandro Vespignani
Mauricio Santillana


Summary

Machine Learning Methodology for Real-Time COVID-19 Forecasting Using Diverse Data Sources

The paper presents a machine-learning approach for real-time forecasting of COVID-19 activity that combines digital traces with estimates from a mechanistic model. It applies the methodology to the 2019-2020 outbreak in Chinese provinces, drawing on internet search activity, news media activity, and official epidemiological reports to produce short-term forecasts. According to the authors, integrating these disparate sources with mechanistic model output improves predictive accuracy and helps with a persistent difficulty in outbreak prediction: the small number of historical observations available early in an epidemic.

The authors' model, referred to as Augmented ARGONet, integrates four data inputs: official disease reports from the China CDC, internet search activity from Baidu, news media activity from Media Cloud, and forecasts from GLEAM, an agent-based mechanistic model. It outperforms a collection of baseline models in 27 of the 32 Chinese provinces, with performance evaluated by root mean square error (RMSE) and by the correlation between predictions and reported activity.
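To make the two evaluation metrics concrete, the following is a minimal sketch of how RMSE and Pearson correlation could be computed for a 2-day-ahead forecast. The persistence baseline, the function names, and the case counts are all assumptions chosen for illustration; the summary does not say which baselines the authors used, and the numbers are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation; returns nan if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def persistence_forecast(series, horizon=2):
    """Naive baseline (an assumption, not the paper's): the forecast for
    day t is the value observed on day t - horizon."""
    return series[:-horizon]


# Hypothetical daily case counts for one province (illustrative only).
observed = [3, 5, 8, 13, 21, 30, 38, 41, 40, 35]
horizon = 2

baseline = persistence_forecast(observed, horizon)
targets = observed[horizon:]  # the values the baseline tries to predict

baseline_rmse = rmse(targets, baseline)
baseline_corr = pearson(targets, baseline)
print(f"persistence RMSE={baseline_rmse:.2f}, correlation={baseline_corr:.2f}")
```

A candidate model would be scored against the same `targets`, and it would count as beating the baseline in a province when its RMSE is lower (and, ideally, its correlation higher).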

A central feature of the methodology is a clustering step that groups provinces by the synchronicity of their COVID-19 activity, so that models can exploit geo-spatial patterns shared across provinces. A data augmentation component compensates for the small number of historical observations that characterizes emerging outbreaks. Forecasts are produced 2 days ahead of the current time and are evaluated strictly out of sample to reflect real-time use.
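The clustering and augmentation ideas can be sketched as follows. This summary does not specify the clustering algorithm or the augmentation scheme, so the code uses simple stand-ins: provinces are linked when the Pearson correlation of their series exceeds a threshold (single linkage via union-find), and training rows are augmented by bootstrap resampling with added Gaussian noise. The province labels, threshold, noise level, and data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns nan if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def cluster_by_synchronicity(series_by_province, threshold=0.9):
    """Group provinces whose series correlate at or above `threshold`.

    Stand-in technique: single-linkage grouping with union-find.
    """
    names = sorted(series_by_province)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the root, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            corr = pearson(series_by_province[a], series_by_province[b])
            if corr >= threshold:  # a nan comparison is False, so no link
                parent[find(b)] = find(a)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())


def augment(rows, n_copies, noise_sd, rng):
    """Bootstrap: draw rows with replacement and add Gaussian noise."""
    augmented = []
    for _ in range(n_copies):
        row = rng.choice(rows)
        augmented.append([value + rng.gauss(0.0, noise_sd) for value in row])
    return augmented


# Hypothetical activity series (illustrative values only).
activity = {
    "province_a": [1, 2, 4, 8, 16, 20],
    "province_b": [2, 4, 8, 16, 32, 40],  # rises in step with province_a
    "province_c": [20, 16, 8, 4, 2, 1],   # moves in the opposite direction
}
clusters = cluster_by_synchronicity(activity, threshold=0.9)
print(clusters)

rng = random.Random(0)  # seeded so the sketch is reproducible
training_rows = [[1.0, 2.0], [3.0, 4.0]]
extra_rows = augment(training_rows, n_copies=5, noise_sd=0.1, rng=rng)
print(len(extra_rows))
```

With clusters in hand, one model could be trained per cluster on the pooled (and augmented) rows of its member provinces, which is one way to offset the short history available for any single province.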

Beyond the methodological framework, the paper discusses the socio-geographical context of the COVID-19 outbreak in China, pointing to administrative and healthcare differences that may explain why model performance varies across provinces. These observations help account for the uneven distribution of COVID-19 activity and indicate where model inputs and assumptions could be refined.

In terms of implications, the research shows how digital trace data can be used for epidemiological forecasting. Timely, accurate forecasts during the initial phase of an outbreak could improve public health responses and inform policy decisions. The framework can also be adapted to other geographies and to future emerging infectious diseases, which extends its utility beyond the immediate COVID-19 context.

Looking ahead, the integration of mechanistic models with machine-learning methods opens the way to incorporating other dynamic data sources, such as human mobility data when it becomes available. The model could also be adapted to different social, epidemiological, and healthcare settings, broadening its applicability in global health.

Overall, the paper contributes to real-time epidemic forecasting by demonstrating that combining digital traces with mechanistic model estimates can improve predictive performance, with potential benefits for how public health officials manage current and emerging infectious diseases.
