ForecastBench: Dynamic AI Forecast Benchmark
- ForecastBench is a dynamic benchmarking framework assessing AI, human, and crowd forecasting abilities using continuously updated, contamination-free event predictions.
- It rigorously evaluates forecasts using metrics like the Brier score and log score to provide actionable insights on accuracy and calibration across populations.
- An automated pipeline scrapes and filters over 5,900 forecasting questions from diverse sources, synthesizing combination questions to probe reasoning on event dependencies.
ForecastBench
ForecastBench denotes both a specific benchmark introduced in "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities" and, more generally, a lineage of benchmarking frameworks used to assess the forecasting ability of machine learning systems on future-oriented event or time-series prediction tasks. Its core innovation is the rigorous, contamination-free, and dynamically updated evaluation of probabilistic forecasts—enabling direct comparison between AI systems, expert humans, and broader public populations in genuine real-time settings (Karger et al., 2024).
1. Motivation and Benchmarking Philosophy
Forecasts of future events inform critical decisions in finance, public health, policy, and science. Existing static benchmarks rapidly become obsolete as the knowledge cutoff of state-of-the-art models advances, and even more crucially, static resolved-question datasets introduce contamination risks—models may be exposed to training data that leaks ground-truth labels prior to test time, invalidating experimental conclusions. The ForecastBench framework addresses these limitations through three principles:
- Continuous Update: New forecasting questions are continually ingested and unresolved at the time of forecast submission.
- Contamination-Free Guarantee: All test questions concern events whose resolution is unknown when they are issued, eliminating the avenue for data leakage via pretrained weights or retrieval.
- Cross-Population Comparison: Performance is simultaneously characterized for LLMs, human generalists, and elite forecasters, providing explicit baselines for both superhuman and subhuman performance bands (Karger et al., 2024).
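The contamination-free principle reduces to a simple eligibility rule: a question may be issued only while its resolution is still unknown. A minimal sketch, assuming a hypothetical question schema with a `resolution_date` field (the real pipeline's data model is not specified here):

```python
from datetime import date

def contamination_free(questions, forecast_date):
    """Keep only questions whose resolution is still unknown on the
    forecast submission date, so no pretraining data or retrieval
    corpus can leak the ground-truth label."""
    return [
        q for q in questions
        if q["resolution_date"] > forecast_date and not q.get("resolved", False)
    ]
```

Because every issued question resolves strictly after the model's knowledge cutoff, no static snapshot of the benchmark can be memorized.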
2. Benchmark Construction and Question Sourcing
Every two weeks, ForecastBench samples 1,000 new forecasting questions from a live, curated bank of over 5,900 open problems drawn from nine heterogeneous sources:
- Prediction Markets: RFI, Manifold, Metaculus, Polymarket.
- Time-Series and Event Datasets: ACLED (conflict), DBnomics (economics), FRED (macro), Wikipedia (page-views), Yahoo! Finance (stock events) (Karger et al., 2024).
An automated pipeline performs scrape–filter–categorize operations nightly. Low-quality or invalid questions are screened out via a GPT-3.5-based filter, and each valid item is annotated with structured metadata (topic, source, resolution criteria, and, for data-derived items, a "freeze value" recording the most recent market price or metric reading at submission time). Additionally, binary combination questions are synthesized as logical conjunctions/disjunctions of base events, directly probing forecasters' ability to reason about event dependencies.
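The combination-question step can be illustrated with a short sketch. The pairing and wording rules below are illustrative assumptions, not the pipeline's actual logic:

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Question:
    qid: str
    text: str

def synthesize_combinations(base, max_pairs=100):
    """Pair base binary questions into AND/OR combination questions.

    Illustrative only: the real pipeline's sampling of pairs and its
    question phrasing are not specified here.
    """
    combos = []
    for a, b in list(combinations(base, 2))[:max_pairs]:
        combos.append(Question(f"{a.qid}&{b.qid}",
                               f"Will BOTH resolve Yes? ({a.text}) AND ({b.text})"))
        combos.append(Question(f"{a.qid}|{b.qid}",
                               f"Will AT LEAST ONE resolve Yes? ({a.text}) OR ({b.text})"))
    return combos
```

A well-calibrated forecaster must reason about dependence here: for independent events $P(A \wedge B) = P(A)\,P(B)$, but correlated events can make the conjunction far more or less likely than the product.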
A 200-question subset, sampled for balance, is administered to both superforecasters and the general public. Market-derived questions typically admit a single probabilistic forecast; time series-based questions are forecasted at eight forecast horizons spanning 7 days to 10 years.
3. Evaluation Protocol and Metrics
Each submission is a probabilistic prediction $f \in [0,1]$ representing the subjective likelihood that the event resolves positively ($o = 1$). Forecasts are compared using the Brier score $\mathrm{BS} = (f - o)^2$, where $o \in \{0,1\}$ is the realized outcome.
Aggregate system performance is the mean Brier score over all questions in the relevant split. Lower values denote better accuracy, with the uniform-ignorance baseline ($f = 0.5$) yielding $0.25$ (Karger et al., 2024, Alur et al., 10 Nov 2025).
Additional diagnostics include:
- Log Score: $-\left[\,o \log f + (1 - o) \log(1 - f)\,\right]$, which penalizes confident misses more heavily than the Brier score.
- Calibration plots: Comparing forecasted probabilities to empirical outcome frequencies via binning.
- Bootstrap Confidence Intervals: Leaderboards include CIs and pairwise p-values for all forecasters (via question-wise bootstrapping).
Unresolved market questions are provisionally scored against the current crowd-forecast, enabling early signal detection; all are re-scored against final resolution values post-event.
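The scoring protocol above admits a compact reference implementation. This is a minimal sketch of the standard formulas and a question-wise bootstrap, not the benchmark's actual scoring code; resampling parameters are illustrative:

```python
import numpy as np

def brier(f, o):
    """Mean Brier score: average squared gap between forecast f in [0,1]
    and binary outcome o in {0,1}. Lower is better; f=0.5 yields 0.25."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    return float(np.mean((f - o) ** 2))

def log_score(f, o, eps=1e-12):
    """Mean negative log-likelihood of the realized outcome (lower is better).
    Probabilities are clipped away from 0 and 1 to avoid infinite penalties."""
    f = np.clip(np.asarray(f, float), eps, 1 - eps)
    o = np.asarray(o, float)
    return float(np.mean(-(o * np.log(f) + (1 - o) * np.log(1 - f))))

def bootstrap_ci(f, o, metric=brier, n_boot=2000, alpha=0.05, seed=0):
    """Question-wise bootstrap confidence interval for a scoring metric:
    resample questions with replacement and take percentile bounds."""
    rng = np.random.default_rng(seed)
    f, o = np.asarray(f, float), np.asarray(o, float)
    idx = rng.integers(0, len(f), size=(n_boot, len(f)))
    stats = np.sort([metric(f[i], o[i]) for i in idx])
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot)]
```

Calibration plots follow the same ingredients: bin forecasts by predicted probability and compare each bin's mean forecast to its empirical outcome frequency.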
4. Human and Machine Baseline Experiments
Three populations are quantitatively benchmarked:
- General Public: Large-scale online recruitment (Prolific, Facebook), with each participant forecasting a random subset and per-question aggregation via median.
- Expert Superforecasters: Tournament-qualified forecasters who first answer independently on individually randomized question sets, then refine their forecasts collaboratively.
- LLM Systems: Seventeen SOTA foundation models (GPT-3.5, GPT-4/Turbo, Claude-2/3, Gemini-1.5, Llama-2/3, Mistral-Large, Qwen-1.5), assessed under multiple prompting regimes (zero-shot, scratchpad reasoning, retrieval-augmented, LLM-ensemble).
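For the public baseline, per-question aggregation via the median can be sketched as follows (a minimal illustration; the study's exact aggregation code is not reproduced here):

```python
from collections import defaultdict
from statistics import median

def aggregate_public(forecasts):
    """Aggregate many participants' forecasts into one per-question value.

    forecasts: iterable of (question_id, probability) pairs, where each
    participant contributes forecasts on a random subset of questions.
    Returns the per-question median, which is robust to outlier forecasts.
    """
    by_question = defaultdict(list)
    for qid, p in forecasts:
        by_question[qid].append(p)
    return {qid: median(ps) for qid, ps in by_question.items()}
```

The median is preferred over the mean here because it discounts a small number of extreme or careless submissions.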
Within a 393-problem test set (316 dataset + 77 market):
| Forecaster | Brier (mean, 95% CI) |
|---|---|
| Superforecasters | 0.092 [0.073, 0.112] |
| Public | 0.114 [0.095, 0.133] |
| Best LLM (Claude) | 0.114 [0.092, 0.136] |
Superforecasters significantly outperform both LLMs and the public. On the full 1,000-question LLM split, the gap widens further: superforecasters extrapolated to combination questions achieve a lower mean Brier score than the top LLMs (Karger et al., 2024). Bootstrapped interval estimates and significance tests accompany all results. Even when crowd-forecast information is provided, SOTA LLMs (e.g., Claude-3.5-Sonnet) still trail the superforecaster baseline by a measurable Brier margin.
5. Implications, Failure Modes, and Limitations
The persistent performance gap demonstrates that, despite superhuman language and reasoning abilities on conventional benchmarks, current LLMs remain inferior to elite human forecasters on authentic, real-time event prediction. Key challenges include:
- Reasoning about event dependency: Especially in combination questions, LLMs exhibit notable failures in covariance reasoning.
- Recency and fact updating: Without pipeline integration for continual retrieval or learning, LLMs underperform on fast-moving or emergent topics relative to human/crowd baselines.
- Calibration and aggregation: While ensembling and structured prompting bring LLMs closer to human performance, proper extremization and post-hoc calibration remain necessary for competitive sharpness (Alur et al., 10 Nov 2025).
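One common sharpening technique is odds-space extremization, which pushes an aggregated probability away from 0.5. The sketch below shows the classic power-law form with a tunable exponent; the specific post-hoc calibration used in later work (e.g., Platt scaling in AIA Forecaster) differs in detail:

```python
def extremize(p, k=2.5, eps=1e-9):
    """Sharpen an aggregated probability by exponentiating its odds.

    k > 1 pushes p away from 0.5 (sharper); k = 1 is the identity.
    The exponent k here is an illustrative default, not a value from
    the benchmark itself.
    """
    p = min(max(p, eps), 1 - eps)   # guard against division by zero
    odds = (p / (1 - p)) ** k
    return odds / (1 + odds)
```

Aggregated crowd or ensemble forecasts tend to be under-confident, so mapping them through such a transform often improves both Brier and log scores.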
Observed positive correlations of forecasting accuracy with Chatbot Arena ratings and with estimated training compute suggest that further scale and architectural improvements could eventually narrow the gap.
6. Derivative Benchmarks and Extensions
ForecastBench's methodology has catalyzed a suite of specialized or extended evaluation frameworks:
- FinDeepForecastBench (Li et al., 8 Jan 2026): Extends ForecastBench to finance, with task diversity (corporate/macro, recurrent/non-recurrent) and strict temporal isolation; leverages live multi-agent architectures and public leaderboards.
- AIA Forecaster (Alur et al., 10 Nov 2025): Achieves superforecaster-level performance on ForecastBench via agentic search, supervisor-based aggregation, and extremization (Platt scaling); demonstrates the value of active, evidence-driven LLM workflows for high-stakes prediction.
- Bench to the Future (BTF) (FutureSearch et al., 11 Jun 2025): Proposes a pastcasting protocol using historical snapshots and known resolutions to enable repeatable, offline evaluation.
- FOReCAst (Yuan et al., 27 Feb 2025): Assesses both prediction and confidence calibration in Boolean, timeframe, and quantity estimation tasks, leveraging human crowd calibration scores.
ForecastBench is often referenced as the canonical general-event forecasting benchmark for AI, and its design principles have been adopted by domain-specific environments (epidemic (Srivastava et al., 2021), finance (Li et al., 8 Jan 2026), energy, ocean, and weather).
7. Future Directions
Planned upgrades to ForecastBench involve:
- Expanding question types to include multinomial, multi-step, and cost-sensitive forms.
- Enhanced support for conditional and time-varying forecasts.
- Community-driven leaderboards, standardized APIs for streamlined model submission, and fine-tuning protocols based on accumulated rationales and historical scores.
- Refined calibration-aware prompting and advanced scorecards incorporating not only Brier/log-loss but also decision-centric and structural reliability metrics.
ForecastBench is publicly released (MIT license) and actively maintained for at least three years. All code, leaderboards, and forecasts are open for community audit and extension (https://www.forecastbench.org), with continuous integration pipelines and robust versioning to ensure the ongoing integrity and evolution of the benchmark (Karger et al., 2024).