ForecastBench Benchmark
- ForecastBench is a continuously updated benchmark that evaluates AI and human forecasting using forward-resolving, leakage-resistant questions.
- It integrates automated pipelines from market sources and datasets to generate diverse binary, empirical, and joint probability forecasting tasks.
- Empirical results show that expert forecasters significantly outperform both the general public and top LLMs, especially on combination tasks.
ForecastBench is a dynamic, continuously updated benchmark for evaluating the forecasting capabilities of ML systems and human experts on real-world, forward-resolving questions. Designed to avoid data leakage by exclusively featuring questions whose answers are truly unknown at forecast time, ForecastBench provides a rigorous framework for tracking absolute and relative progress in AI forecasting, with a public leaderboard spanning both model and human submissions (Karger et al., 2024).
1. Design Motivation and Core Principles
ForecastBench was established in response to two key limitations in prior forecasting benchmarks: (a) data-leakage and overfitting risks arising when question outcomes become discoverable by later-generation models or post-hoc resolution, and (b) obsolescence of static benchmarks as model knowledge cutoffs advance. The benchmark addresses these issues by generating and maintaining an always-fresh set of 1,000 open forecasting questions, sampled from streams of both judgmental and empirical data sources, and by strictly enforcing “future-only” eligibility: no participant, human or machine, can have access to ground-truth resolutions at the time of submission.
By refreshing the question set regularly and separating question pools for ML and human evaluation, ForecastBench enables repeatable, bias-resistant comparison between LLMs, superforecasters, and the general public, and serves as a dynamic test harness for fast-evolving AI forecasters (Karger et al., 2024).
2. Automated Question Generation, Updating, and Taxonomy
ForecastBench’s question bank is populated through automated pipelines that ingest from two parent source classes:
- Market sources: Real-time prediction markets and aggregation sites—Metaculus, Manifold Markets, Polymarket, RFI—provide 1,741 open binary or compositional questions, sampling diverse topics such as geopolitics, economics, AI developments, and public health.
- Dataset sources: Real-world time series repositories—including FRED, ACLED, DBnomics, Wikipedia, Yahoo Finance—allow generation of 4,207 empirical forecasting questions. These are systematically converted into discrete, forward-looking, probabilistic tasks.
Questions are refreshed nightly through a pipeline that scrapes, filters, and classifies new additions. Each question is automatically categorized by topic (e.g., “Economics”, “Environment”, “Science”), and associated with structured metadata including resolution criteria, freeze dates, and relevant context fields (Karger et al., 2024).
ForecastBench samples from the question bank every two weeks to populate both the LLM and human evaluation sets, keeping each set proportionally balanced across source types and topical domains. Any question whose outcome has already resolved is excluded, which guards against test contamination.
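The sampling step described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the field names (`resolved`, `source_type`, `topic`) and the function `sample_question_set` are hypothetical, and a simple round-robin over strata stands in for whatever proportional allocation the real system uses.

```python
import random
from collections import defaultdict


def sample_question_set(questions, n_total, seed=0):
    """Pick up to n_total unresolved questions, spread across strata.

    Each question is a dict with hypothetical keys:
    "resolved" (bool), "source_type" (str), "topic" (str).
    """
    rng = random.Random(seed)

    # Future-only rule: drop every question that has already resolved.
    open_questions = [q for q in questions if not q["resolved"]]

    # Group the remaining questions by (source type, topic).
    strata = defaultdict(list)
    for q in open_questions:
        strata[(q["source_type"], q["topic"])].append(q)
    for bucket in strata.values():
        rng.shuffle(bucket)

    # Round-robin over strata so that no single stratum dominates the sample.
    keys = sorted(strata)
    selected = []
    while len(selected) < n_total and any(strata[k] for k in keys):
        for k in keys:
            if strata[k] and len(selected) < n_total:
                selected.append(strata[k].pop())
    return selected
```

Filtering before stratifying means a resolved question can never enter the sample, regardless of how the allocation across strata is later tuned.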
Question Types
ForecastBench exposes three principal question formats:
- Binary ("market-style") questions: “Will X occur by date Y?”, requiring an explicit probabilistic prediction $p \in [0, 1]$.
- Empirical (dataset-derived) questions: Time-series threshold prediction at multiple horizons.
- Combination (joint probability) questions: Direct queries for covariances/conjunctions, e.g., $P(A \cap B)$, or more generally, full joint distributions over binary events.
Each question is structured as a JSON object with standardized fields for downstream use.
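As a rough illustration of what such a question object might look like, the sketch below builds one record and serializes it to JSON. Every field name and value here is an assumption made for illustration, based only on the metadata listed earlier (topic, resolution criteria, freeze date); the actual ForecastBench schema may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Question:
    # All field names are hypothetical, chosen to mirror the metadata
    # described in the text rather than the real ForecastBench schema.
    id: str
    source: str               # e.g. a market or dataset source name
    question_type: str        # "binary", "empirical", or "combination"
    topic: str
    text: str
    resolution_criteria: str
    freeze_date: str          # ISO 8601 date string
    resolution_dates: list    # one entry per forecast horizon


example = Question(
    id="hypothetical-0001",
    source="fred",
    question_type="empirical",
    topic="Economics",
    text="Will the series value on the resolution date exceed its value on the freeze date?",
    resolution_criteria="Resolves YES if the later value is strictly higher.",
    freeze_date="2024-07-11",
    resolution_dates=["2024-07-28", "2024-08-20"],
)

print(json.dumps(asdict(example), indent=2))
```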
3. Evaluation Protocol and Human/LLM Participation
Each new set of 1,000 questions is released for LLM-based submissions every two weeks, with a 24-hour submission window. Forecasters are required to submit probabilistic predictions for the full set, using a predefined template. A random, stratified subset of 200 questions is concurrently reserved for human evaluation. In the human protocol, “superforecasters” (distinguished experts, typically drawn from the Quorum platform) and the general public are given 9 days and 1 hour, respectively, to submit forecasts. Each human set question receives a minimum of 40 independent public forecasts and, on average, input from 8 expert forecasters.
LLMs are evaluated under a suite of prompting strategies, including zero-shot, chain-of-thought (scratchpad), and retrieval-augmented variants. Submitted forecasts may also incorporate auxiliary information such as most recent “freeze” data points or retrieved news snippets (Karger et al., 2024).
4. Scoring Metrics and Statistical Comparison Methods
ForecastBench applies strictly proper scoring rules to both binary and real-valued predictions:
- Brier Score (BS): $\mathrm{BS} = (f - o)^2$,
where $f$ is the forecasted probability, and $o \in \{0, 1\}$ is the outcome.
- Logarithmic Score (LS): $\mathrm{LS} = o \ln f + (1 - o) \ln(1 - f)$
For questions with multiple subcomponents (multi-horizon or joint-probability), ForecastBench averages the relevant scores. For still unresolved market questions, provisional scoring proxies are used until final outcomes are available.
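A minimal sketch of these scores, using the standard textbook definitions: the Brier score is the squared difference between the forecast probability and the 0/1 outcome (lower is better), and the logarithmic score is the log of the probability assigned to the realized outcome (higher is better). The function names and the clipping constant `eps` are illustrative choices, and the sign convention of the benchmark's log score is assumed here.

```python
import math


def brier_score(f, o):
    """Squared error between forecast probability f and outcome o (0 or 1)."""
    return (f - o) ** 2


def log_score(f, o, eps=1e-12):
    """Log of the probability assigned to the realized outcome.

    f is clipped away from 0 and 1 so the logarithm stays finite;
    the clipping threshold is an illustrative assumption.
    """
    f = min(max(f, eps), 1 - eps)
    return o * math.log(f) + (1 - o) * math.log(1 - f)


def mean_score(forecasts, outcomes, score=brier_score):
    """Average a per-question score, e.g. across horizons or sub-questions."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    return sum(score(f, o) for f, o in zip(forecasts, outcomes)) / len(forecasts)
```

For example, `mean_score([0.9, 0.2], [1, 0])` averages the two Brier scores 0.01 and 0.04; an uninformative forecast of 0.5 always earns a Brier score of 0.25.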
To robustly compare forecasters, ForecastBench employs percentile bootstrap over the set of questions to estimate confidence intervals and paired p-values for the difference in aggregate score (mean Brier or Log score). The protocol assumes approximate independence across questions, with risk of anti-conservative CIs deemed mitigated by the question bank's breadth (Karger et al., 2024).
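The comparison procedure can be sketched as a paired percentile bootstrap over per-question score differences. This is an illustration under stated assumptions rather than the benchmark's own code: the function name, the number of resamples, and the way the two-sided p-value is derived from the bootstrap distribution are all choices made for this example.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap for the mean difference in per-question scores.

    scores_a[i] and scores_b[i] must be the two forecasters' scores on the
    same question i. Returns (observed_mean_diff, (ci_low, ci_high), p_value).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")

    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    # Resample questions with replacement; pairing is preserved because
    # we resample the per-question differences, not each list separately.
    boot_means = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)
        boot_means.append(sum(sample) / n)
    boot_means.sort()

    # Percentile confidence interval.
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]

    # Two-sided p-value (one common convention, assumed here): twice the
    # smaller share of resampled means on either side of zero.
    frac_le = sum(m <= 0 for m in boot_means) / n_boot
    frac_ge = sum(m >= 0 for m in boot_means) / n_boot
    p_value = min(1.0, 2 * min(frac_le, frac_ge))
    return observed, (lo, hi), p_value
```

Resampling whole questions treats them as independent draws, which is exactly the assumption noted above; correlated questions would make the resulting intervals too narrow.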
5. Key Results: Human vs. LLM Performance
Empirical evaluation on the initial 200-question human set revealed the following pattern:
- Experts (superforecasters): Median BS of $0.092$.
- General public: Median BS of $0.114$ (statistically inferior to experts).
- Best LLM (Claude-3.5-Sonnet, scratchpad+freeze): BS not significantly better than the public's, but demonstrably worse than the experts'.
Retrieval augmentation yielded no clear gains over scratchpad-only LLM baselines, and LLM ensemble methods (median, geometric mean, log-odds mean) provided modest improvements but did not close the gap to expert human forecasters.
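The three ensemble rules named above can be sketched as simple aggregators over a list of model probabilities for one question. The function names and the clipping constant are illustrative, and the geometric mean is taken here as the plain, unnormalized geometric mean of the probabilities, which is an assumption about how the benchmark defines it.

```python
import math
import statistics


def _clip(p, eps=1e-6):
    """Keep probabilities away from 0 and 1 so logs and odds stay finite."""
    return min(max(p, eps), 1 - eps)


def median_ensemble(probs):
    """Median of the individual forecasts."""
    return statistics.median(probs)


def geometric_mean_ensemble(probs):
    """Unnormalized geometric mean of the individual forecasts."""
    ps = [_clip(p) for p in probs]
    return math.exp(sum(math.log(p) for p in ps) / len(ps))


def log_odds_mean_ensemble(probs):
    """Average the forecasts in log-odds space, then map back to a probability."""
    ps = [_clip(p) for p in probs]
    mean_logit = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    return 1 / (1 + math.exp(-mean_logit))
```

Averaging in log-odds space gives more weight to confident forecasts than a plain arithmetic mean does, while the median is robust to a single outlying model.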
On 1,000-question LLM leaderboards, Claude-3.5-Sonnet and GPT-4-Turbo (scratchpad+freeze) consistently lead. LLM performance distinctly degrades on combination (joint probability) questions, where LLM Brier scores are higher (worse) than those of superforecasters (Karger et al., 2024).
Scaling trends indicate that the expert-LLM performance gap remains significant: by projection, reaching parity would require Arena scores or compute orders of magnitude beyond current levels.
6. Leaderboard Infrastructure, Submission, and Reproducibility
ForecastBench maintains a public leaderboard (https://forecastbench.org) that updates scores nightly as questions resolve. The leaderboard can be filtered by source type and resolution status and displays mean and median metrics per source, along with full performance breakdowns for all submitted models and human teams.
Submission workflow is as follows:
- Register and download the most recent question set (JSON).
- Generate forecasts in required JSON schema.
- Submit forecasts via web uploader or API within the prescribed window.
The entire infrastructure—codebase (MIT), data pipeline, and question datasets (CC BY-SA 4.0)—is open-source. External teams can participate in any round starting November 2024, and new models are auto-scored on upload (Karger et al., 2024).
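The forecast-generation step of this workflow might look like the sketch below. The submission layout (`model`, `forecasts`, `id`, `forecast`) and the function `build_submission` are hypothetical and chosen only for illustration; the required schema is defined by the benchmark's own template, and the upload step is omitted because its interface is not described here.

```python
import json


def build_submission(questions, predict, model_name):
    """Assemble one probability per question into a JSON-serializable dict.

    `questions` is a list of dicts with an "id" key, and `predict` is any
    callable mapping a question dict to a probability in [0, 1].
    The output layout is an assumed, illustrative schema.
    """
    forecasts = []
    for q in questions:
        p = float(predict(q))
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"forecast for {q['id']} is outside [0, 1]: {p}")
        forecasts.append({"id": q["id"], "forecast": p})
    return {"model": model_name, "forecasts": forecasts}


# Usage with two placeholder questions and a constant baseline forecaster.
questions = [
    {"id": "q1", "text": "Hypothetical question A"},
    {"id": "q2", "text": "Hypothetical question B"},
]
submission = build_submission(questions, lambda q: 0.5, "baseline-always-0.5")
payload = json.dumps(submission, indent=2)
```

Validating every probability before serialization catches malformed forecasts locally, which matters when the submission window is short.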
7. Significance, Future Directions, and Limitations
ForecastBench provides a robust, forward-secure, and continuously updated platform for comparative evaluation of AI and human forecasting capabilities. Its key advances relative to prior benchmarks include absolute leakage resistance, dynamic scope, statistically robust score comparison, and direct expert-vs-LLM-vs-public evaluation. It has established, for the first time, a persistent gap between state-of-the-art LLMs (even with chain-of-thought and retrieval augmentation) and expert human forecasters, especially on covariance tasks.
Limitations include restricted coverage of non-binary or heavily conditional questions, and the assumption of statistical independence across the diverse question pool. As a living benchmark, ForecastBench will continually refresh its question set, add broader question types, and expand human cohort evaluation. Full code, data, and leaderboard access substantially lower the barrier to reproducible academic comparison (Karger et al., 2024).