
Approaching Human-Level Forecasting with Language Models (2402.18563v1)

Published 28 Feb 2024 in cs.LG, cs.AI, cs.CL, and cs.IR

Abstract: Forecasting future events is important for policy and decision making. In this work, we study whether LLMs (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

Approaching Human-Level Forecasting with LLMs

Introduction to Automated Forecasting

In forecasting, the traditional dichotomy is between statistical forecasting and judgmental forecasting, the latter relying heavily on human expertise to factor in domain-specific knowledge, intuition, and context. This paper applies LLMs (LMs) to judgmental forecasting, harnessing their broad pre-trained knowledge and reasoning capabilities. By developing a retrieval-augmented forecasting system, the authors automate the generation, weighing, and synthesis of forecasts that traditionally required human intervention.

Methodology and System Design

The proposed system has three integral components: retrieval, reasoning, and aggregation. The retrieval component sources relevant news articles to inform the forecast, keeping the model current with events after its training-data cut-off. In the reasoning step, the system, conditioned on the retrieved articles, generates probabilistic forecasts along with their justifications. Finally, an aggregation step synthesizes these individual outputs into a single prediction. The system also incorporates a self-supervised fine-tuning approach that improves the model's forecasting accuracy and reasoning fidelity by iteratively training on real-world forecasting questions from competitive platforms.
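The retrieve-reason-aggregate loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `retrieve` and `reason` are hypothetical stand-ins for the news-search and LM components, and the median is one simple robust aggregator; the paper's actual prompts and aggregation scheme may differ.

```python
from statistics import median

def forecast(question, retrieve, reason, k=5):
    """Sketch of the three-stage pipeline: retrieve relevant
    articles, sample k independent probabilistic forecasts with
    rationales from the LM, then aggregate into one prediction.

    `retrieve` and `reason` are placeholders for the system's
    news-retrieval and reasoning components (assumptions, not
    the paper's exact interfaces).
    """
    articles = retrieve(question)                      # gather context
    probs = [reason(question, articles) for _ in range(k)]  # sample forecasts
    # Median is a robust aggregate over the sampled probabilities.
    return median(probs)
```

In practice the reasoning step would be a prompted LM call returning a probability in [0, 1]; sampling several forecasts and taking a central aggregate damps out individual noisy rationales.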

Data Collection and Evaluation

A purpose-built dataset of forecasting questions, sourced from competitive forecasting platforms, is used to validate and fine-tune the system. Importantly, the test questions were published after the knowledge cut-offs of the pre-trained models, ensuring an authentic testing ground for the system's forecasting capabilities. Evaluated with the Brier score, the system demonstrates near-human performance and, under certain conditions, surpasses the aggregated human forecast.
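For concreteness, the Brier score used for evaluation is just the mean squared error between probabilistic forecasts and binary outcomes:

```python
def brier_score(forecasts, outcomes):
    """Brier (1950) score: mean squared error between probability
    forecasts and binary outcomes. Lower is better; an uninformed
    constant 0.5 forecast scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Two confident correct calls and one hedged correct call:
brier_score([0.9, 0.2, 0.6], [1, 0, 1])  # ≈ 0.07
```

Being a strictly proper scoring rule, the Brier score rewards reporting one's true belief, which is what makes it a fair yardstick for comparing the system against the human crowd aggregate.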

Strengths, Limitations, and Future Directions

The evaluation highlights notable strengths of the system, particularly on questions with high uncertainty among human forecasters or where ample relevant information could be retrieved. Conversely, the system degrades when forced to predict without sufficient context, or on topics heavily dependent on events after its training cut-off. These findings motivate further work on iterative self-supervision, domain-adaptive training, and leveraging future LM generations for improved forecasting.

Conclusion and Implications

This paper advances the case for LMs in automating judgmental forecasting, offering a scalable and efficient complement to purely human-driven approaches. The implications for policy-making, business strategy, and decision-making more broadly are substantial: accurate automated forecasting could support informed decisions at a scale human forecasters cannot match. The research sets a promising trajectory for refining these systems, with the prospect of matching or exceeding human forecasting performance across a wider range of contexts.

Authors (4)
  1. Danny Halawi (6 papers)
  2. Fred Zhang (15 papers)
  3. Chen Yueh-Han (5 papers)
  4. Jacob Steinhardt (88 papers)
Citations (22)