OpenForesight: Open-ended Forecasting Dataset

Updated 7 January 2026
  • OpenForesight is a rigorously curated dataset of open-ended forecasting questions derived from global news, designed to evaluate future-resolving language models.
  • It employs automated synthesis and filtering pipelines to generate leakage-minimized, high-quality question-answer pairs for realistic prediction tasks.
  • The dataset supports probabilistic prediction, calibration, and retrieval-augmented modeling with comprehensive evaluation metrics and reproducible workflows.

OpenForesight is a large-scale, rigorously curated dataset of open-ended forecasting questions derived from global news, designed explicitly for training and evaluating LLMs in forward-looking reasoning tasks. Developed within the Qwen3 ecosystem, it employs fully automated synthesis and curation pipelines to generate high-quality, leakage-minimized question-answer pairs focused on real-world, future-resolving events, thus facilitating research into probabilistic prediction, calibration, and retrieval-augmented language modeling (Chandak et al., 31 Dec 2025).

1. Automated Data Generation and Validation Pipeline

OpenForesight’s construction leverages the CommonCrawl News (CCNews) corpus, comprising ∼248,000 de-duplicated English-language news articles published between June 2023 and April 2025, drawn from sources such as Forbes, CNN, Hindustan Times, Deutsche Welle, and the Irish Times (Table A.1). Articles are filtered for language, valid publish dates, and text availability; 744,963 raw question candidates are then created via a two-stage process:

  • Sample Creation (DeepSeek-V3): For each article, up to three forecasting samples are synthesized, each structured as:
    • question_title: Open-ended, forward-looking (“Who/What/When/Where”) question.
    • background: Concise contextualization introducing uncommon terms.
    • resolution_criteria: Structured resolution protocol specifying:
      • Source of Truth (e.g., “Official announcement from the Verkhovna Rada…”)
      • Resolution Date (e.g., “17 July 2025”)
      • Accepted Answer Format (e.g., “Full name exactly as given…”)
    • answer: 1–3 word verbatim, non-numeric answer taken from the article.
    • answer_type: e.g., “String (Name)”
    • source_link: Article URL.
  • Filtering & Validation (Llama-4-Maverick):
    1. Validation of sample tense, answer definiteness, and exclusion of late reporting.
    2. Retention of at most one valid sample per article (discarding ∼60% of candidates).
    3. Leakage removal: automatic redaction or dropping of questions whose context or criteria mention the answer string (removing 90% of such direct leaks).
    4. Strict exclusion of numeric answers; retention restricted to questions resolving after 1 January 2024.

Ultimately, the pipeline yields 52,183 rigorously constructed question-answer records suitable for model training.
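The leakage-removal step can be sketched as a simple string-containment check. This is an illustrative stand-in, not the paper's implementation (which uses Llama-4-Maverick for validation); the function name and dict layout are assumptions based on the sample fields described above.

```python
def is_leaky(sample: dict) -> bool:
    """Flag a sample whose context or resolution criteria reveal the answer.

    `sample` is assumed to carry the fields described above:
    question_title, background, resolution_criteria, answer.
    """
    answer = sample["answer"].lower()
    context = " ".join(
        sample[field].lower()
        for field in ("question_title", "background", "resolution_criteria")
    )
    return answer in context

# A candidate whose background names the answer would be redacted or dropped:
candidate = {
    "question_title": "Who will be confirmed as the new prime minister of Ukraine?",
    "background": "Yulia Svyrydenko is widely expected to be appointed.",
    "resolution_criteria": "Official announcement from the Verkhovna Rada.",
    "answer": "Yulia Svyrydenko",
}
print(is_leaky(candidate))  # True
```

A substring check of this kind catches only direct leaks, which is consistent with the pipeline removing 90% of such cases rather than all of them.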

2. Dataset Composition and Temporal Splits

OpenForesight’s data composition aligns with high standards for open-ended evaluation:

  • Training Set: 52,183 unique forecasting samples, each linked to a single, ground-truth answer.
  • Validation Set: 207 manually validated questions from 500 Guardian articles (July 2025).
  • Test Set: 302 questions, drawn from five distinct outlets (May–Aug 2025), manually verified for leakage resistance.
  • External Benchmark: The FutureX dataset (86 English, non-numeric resolved questions, July–Aug 2025), supporting further out-of-domain assessment.

Answer-type distribution (Table A.3):

| Answer Type | Count | Percent |
|---|---|---|
| String (Name) | 32,213 | 44.8 % |
| String (Location) | 14,337 | 20.0 % |
| Other (Title/Org/Team) | Remainder | 35.2 % |

Data Splits & Temporal Coverage:

  • Train: Events resolving by April 30, 2025, from Forbes, CNN, HT, DW, Irish Times.
  • Validation: July 2025 resolutions (The Guardian).
  • Test: May–August 2025 tournaments (distinct, leakage-controlled outlets).

This structuring supports both chronologically safe training and robust forward testing.
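The split logic above can be sketched as a date-and-outlet routing rule. This is a simplified, assumed reconstruction (the field names and the exact routing are illustrative; the real pipeline also enforces the post-1-Jan-2024 constraint and fixed outlet lists per split).

```python
from datetime import date

# Cutoff taken from the splits described above.
TRAIN_CUTOFF = date(2025, 4, 30)

def assign_split(resolution_date: date, outlet: str) -> str:
    """Route a sample to a chronologically safe split (simplified sketch)."""
    if resolution_date <= TRAIN_CUTOFF:
        return "train"          # events resolving by 30 April 2025
    if outlet == "The Guardian":
        return "validation"     # July 2025 Guardian questions
    return "test"               # May-August 2025, distinct outlets

print(assign_split(date(2025, 3, 15), "CNN"))           # train
print(assign_split(date(2025, 7, 10), "The Guardian"))  # validation
```

Routing on resolution date rather than publish date is what makes the training set chronologically safe: no training sample resolves inside the evaluation window.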

3. Data Format and Input/Output Schema

Each sample is encoded in a custom XML-like record, capturing all resolution metadata and facilitating retrieval-augmented inference. A canonical example:

<q1>
  <question_id>0</question_id>
  <question_title>Who will be confirmed as the new prime minister of Ukraine by 17 July 2025?</question_title>
  <background>Ukraine’s parliament is scheduled to vote to appoint a new prime minister.</background>
  <resolution_criteria>
    <ul>
      <li><b>Source of Truth</b>: Official announcement from the Verkhovna Rada…</li>
      <li><b>Resolution Date</b>: 17 July 2025.</li>
      <li><b>Accepted Answer Format</b>: Full name…</li>
    </ul>
  </resolution_criteria>
  <answer>Yulia Svyrydenko</answer>
  <answer_type>String (Name)</answer_type>
</q1>

At inference, models produce:

  • y: Free-form (short text) answer.
  • q ∈ [0, 1]: Confidence score.

Inference and evaluation utilize retrieval-augmented prompts, with retrieved CCNews passages included to ground each prediction.
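A record in this XML-like schema can be read with the standard library. The sketch below is illustrative, not part of the released tooling; the record is a flattened version of the example above (the nested resolution_criteria markup is omitted for brevity).

```python
import xml.etree.ElementTree as ET

# Flattened version of the canonical record shown above.
record = """<q1>
  <question_id>0</question_id>
  <question_title>Who will be confirmed as the new prime minister of Ukraine by 17 July 2025?</question_title>
  <answer>Yulia Svyrydenko</answer>
  <answer_type>String (Name)</answer_type>
</q1>"""

root = ET.fromstring(record)
# Collapse the flat child elements into a plain dict.
sample = {child.tag: (child.text or "").strip() for child in root}
print(sample["answer"])       # Yulia Svyrydenko
print(sample["answer_type"])  # String (Name)
```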

4. Retrieval-Augmented Modeling and Training Strategies

OpenForesight enables investigation into retrieval-augmented transformer architectures for open-ended prediction. The retrieval and prompt scheme is as follows:

  • Embedding and Retrieval: All CCNews chunks (up to 512 tokens each) are embedded using the Qwen3-8B embedding model.
  • Inference-Time Retrieval: For each question, the top k = 5 non-leaking news chunks (published at least one month before the resolution date) are retrieved to prevent information leakage.
  • Training-Time Robustness: Prompts include between 0 and 5 retrieved chunks, randomly mixed, to regularize retrieval awareness.
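The date-filtered retrieval step can be sketched as cosine-similarity search with a temporal mask. This is a minimal sketch under stated assumptions: plain NumPy vectors stand in for Qwen3-8B embeddings, and the function name is illustrative.

```python
from datetime import date, timedelta
import numpy as np

def top_k_chunks(query_emb, chunk_embs, chunk_dates, resolution_date, k=5):
    """Indices of the k most similar chunks published at least one month
    before the resolution date (leakage guard)."""
    cutoff = resolution_date - timedelta(days=30)
    valid = np.array([d <= cutoff for d in chunk_dates])
    # Cosine similarity of the query against every chunk embedding.
    sims = chunk_embs @ query_emb / (
        np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    sims[~valid] = -np.inf  # mask chunks that could reveal the outcome
    return [int(i) for i in np.argsort(-sims)[:k] if valid[i]]

# Toy example: two safe chunks and one published too close to resolution.
query = np.array([1.0, 0.0])
chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
dates = [date(2025, 5, 1), date(2025, 5, 1), date(2025, 7, 10)]
print(top_k_chunks(query, chunks, dates, date(2025, 7, 17), k=2))  # [0, 1]
```

Masking before ranking (rather than filtering afterwards) guarantees that a highly similar but too-recent chunk can never displace a temporally safe one.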

Prompt template (evaluation phase):

You will be asked a forecasting question…
Question Title: {question_title}
Background: {background}
Resolution Criteria: {resolution_criteria}
Retrieved passages:
{top_k_chunks}

Think step by step… then output:
<answer>…</answer>
<probability>…</probability>

This tight coupling between retrieval context and forecasting supports analysis of grounded reasoning, hallucination resistance, and information utilization under uncertainty.

5. Evaluation Metrics and Empirical Findings

OpenForesight specifies rigorous metrics for evaluation on both discrete and probabilistic performance axes:

  • Accuracy:

$\mathrm{Acc} = \mathds{1}_{y \equiv y^*}$

  • Brier-style score:

$S'(q, y, y^*) = \begin{cases} 1 - (q - 1)^2, & y \equiv y^* \\ -q^2, & y \neq y^* \end{cases}$

  • Reinforcement Learning (RL) Reward Variants:
    • Accuracy only: $R = \mathds{1}_{y \equiv y^*}$
    • Brier only: $R = S'(q, y, y^*)$
    • Accuracy + Brier (optimal): $R = \mathds{1}_{y \equiv y^*} + S'(q, y, y^*)$
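The scoring rule and the combined reward translate directly into code. This is a sketch of the formulas above, not the released training code; the function names are illustrative.

```python
def brier_style_score(q: float, correct: bool) -> float:
    """S'(q, y, y*): rewards confident correct answers, penalizes
    confident wrong ones; q = 0 scores 0 either way."""
    return 1.0 - (q - 1.0) ** 2 if correct else -q ** 2

def rl_reward(q: float, correct: bool) -> float:
    """Combined accuracy + Brier reward (the best-performing variant)."""
    return (1.0 if correct else 0.0) + brier_style_score(q, correct)

# A confident correct answer earns close to the maximum reward of 2;
# the same confidence on a wrong answer is penalized.
print(rl_reward(0.9, True))
print(rl_reward(0.9, False))
```

Adding the accuracy indicator to the proper scoring rule keeps the incentive to answer correctly while still pushing the confidence q toward calibrated values.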

Consistency metrics (Paleka et al., 2025) capture arbitrage and frequentist violations across logical checks. RL-trained models on OpenForesight reduce arbitrage errors by 44% and frequentist errors by 19% (Table 6).

Empirical Highlights:

  • Scaling data improves both accuracy and Brier performance, with Llama-3.1-8B achieving parity with much larger proprietary models at ∼50k samples.
  • Retrieval integration leads to a 9–18% accuracy increase (Fig 5).
  • Combined accuracy and Brier reward in RL training outperforms single-objective variants.
  • On the final held-out test and the FutureX benchmark, OpenForecaster-8B achieves the lowest Brier score among open-weight models and the highest accuracy.

6. Access, Code, and Reproducibility

The dataset, models, and training pipelines are fully open-source:

Programmatic Access Example:

from datasets import load_dataset
from openforecaster import Retriever, Forecaster

# Load the dataset; splits are accessed as ds["train"], ds["validation"], etc.
ds = load_dataset("nikhilchandak/openforecaster")

# Build the CCNews retriever and the fine-tuned forecasting model.
retriever = Retriever(corpus_path="ccnews_index")
forecaster = Forecaster(model="qwen3-8b-forecaster")

# Pull the fields of the first training sample.
sample = ds["train"][0]
query = sample["question_title"]
bg = sample["background"]
rc = sample["resolution_criteria"]

# Retrieve the top-5 grounding chunks, then predict answer and confidence.
chunks = retriever.get_top_k(query, k=5)
answer, prob = forecaster.predict(query, bg, rc, context=chunks)
print(answer, prob)

OpenForesight thus establishes a robust platform for advancing research into realistic, retrieval-powered, open-ended forecasting by LLMs, providing detailed evaluation protocols and leakage-aware pipeline design for reproducibility and benchmarking (Chandak et al., 31 Dec 2025).
