
LLMs Can Teach Themselves to Better Predict the Future (2502.05253v1)

Published 7 Feb 2025 in cs.CL and cs.AI

Abstract: We present an outcome-driven fine-tuning framework that enhances the forecasting capabilities of LLMs without relying on human-curated reasoning samples. Our method leverages model self-play to generate pairs of diverse reasoning trajectories and probabilistic forecasts for a set of diverse questions that resolve after the models' knowledge cutoff date. We then rank pairs of these reasoning traces by their distance to the actual outcomes before fine-tuning the model via Direct Preference Optimization (DPO). On a separate test set, our approach increases prediction accuracy of Phi-4 14B and DeepSeek-R1 14B by between 7-10% over a base model and a DPO fine-tuned control model with randomized labels, bringing them on par with forecasting capabilities of much larger frontier models like GPT-4o.

Summary

  • The paper introduces an outcome-driven fine-tuning framework enabling large language models to improve future forecasting accuracy using self-play and Direct Preference Optimization.
  • The method generates diverse forecasts, ranks them by proximity to actual outcomes from prediction markets like Polymarket, and fine-tunes the model on these ranked pairs.
  • This approach significantly improves forecasting accuracy (lower Brier scores) compared to base models and matches the performance of larger frontier models without requiring human-curated data.

The paper "LLMs Can Teach Themselves to Better Predict the Future" introduces an outcome-driven fine-tuning framework that enhances the forecasting accuracy of LLMs by using model self-play and Direct Preference Optimization (DPO). The method generates diverse reasoning trajectories and probabilistic forecasts for questions that resolve after the model's knowledge cutoff date. These reasoning traces are then ranked by their proximity to actual outcomes, and the model is fine-tuned using DPO.

The authors note that while LLMs have demonstrated capabilities in many areas, they have not yet surpassed human performance in judgemental forecasting. Improving LLMs' forecasting abilities could have significant impact across sectors like finance, policy, and law.

The paper addresses the limitations of existing approaches that often rely on human-curated data, which is slow and costly to procure. The proposed method sidesteps human input by enabling the model to directly learn from actual outcomes and self-play. By having the model generate reasoning and forecasts on a large number of questions, a substantial dataset is created that can be used for further training, making the approach scalable.

The paper uses a large dataset of resolved prediction-market questions from Polymarket. The model generates multiple reasoning traces and probabilistic forecasts through self-play, constrained by a historical cutoff date. These rationales are then ranked by their proximity to the resolved outcome. For instance, if an event resolves to 0 (it did not happen), a 5% prediction is ranked higher than a 10% prediction because it lies closer to the outcome. The model is then fine-tuned on these ranked pairs and evaluated on a separate test set. This ensures that the model learns from forecasts spanning the full probability range, as required for a well-calibrated forecasting model.

The methodology comprises six main steps:

  1. Collection and preprocessing of forecasting data
  2. News collection
  3. Synthetic training data generation through base model self-play
  4. Resolution-driven re-ranking
  5. DPO fine-tuning
  6. Forecasting test-set questions

Two models were used for self-play and forecasting: Phi-4 14B and DeepSeek-R1-Distill-Qwen-14B, referred to as DeepSeek-R1 14B. Both models have 14B parameters and have demonstrated strong performance on general science and coding benchmarks.

A total of 12,100 binary outcome forecasting questions were collected from Polymarket, excluding outcomes with ambiguous resolutions. The training set included 9,800 questions resolving between July 1 and December 15, 2024, and the test set included 2,300 questions resolving between December 25, 2024, and January 23, 2025. The final outcomes for all questions were recorded as 0 (did not happen) or 1 (did happen).
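The temporal split described above can be sketched with pandas, assuming the questions live in a table with a resolution date and a binary outcome column (the file and column names here are illustrative, not the authors' data format):

```python
import pandas as pd

# Hypothetical columns: "question", "resolution_date", "outcome" (0 or 1).
questions = pd.read_csv("polymarket_questions.csv", parse_dates=["resolution_date"])

# Keep only unambiguous binary resolutions.
questions = questions[questions["outcome"].isin([0, 1])]

# Temporal split reported in the paper: train on questions resolving
# 1 Jul - 15 Dec 2024, test on questions resolving 25 Dec 2024 - 23 Jan 2025.
train = questions[questions["resolution_date"].between("2024-07-01", "2024-12-15")]
test = questions[questions["resolution_date"].between("2024-12-25", "2025-01-23")]

print(len(train), len(test))  # roughly 9,800 and 2,300 questions in the paper
```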

The accuracy of the probabilistic forecasts is evaluated using Brier scores. The Brier score is defined as:

BS = \frac{1}{N}\sum_{i=1}^{N} (p_i - o_i)^2

where:

  • BS is the Brier score
  • N is the total number of forecasting questions
  • p_i is the predicted probability for question i
  • o_i is the actual outcome for question i, where o_i \in \{0,1\}

A lower Brier score indicates higher forecasting accuracy.
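As a small worked example, the mean Brier score over a set of forecasts can be computed as follows (the array names are illustrative):

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and binary outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

# Example: three questions, the first two resolved "yes" (1).
p = np.array([0.9, 0.6, 0.2])
o = np.array([1, 1, 0])
print(brier_score(p, o))  # (0.01 + 0.16 + 0.04) / 3 = 0.07
```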

News was collected via the NewsCatcher Application Programming Interface (API) 14 days prior to question resolution: search queries were generated with GPT-4o and passed to the external news retrieval service. The retrieved articles were used as additional input in subsequent steps.
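The paper gives no implementation details for this step; the sketch below is illustrative only. The endpoint URL, parameter names, the `search_news` helper, and the interpretation of the 14-day window are assumptions, not the authors' code, and the actual NewsCatcher schema should be checked against its documentation.

```python
import datetime as dt
import requests

NEWSCATCHER_URL = "https://api.newscatcherapi.com/v2/search"  # assumed endpoint
API_KEY = "YOUR_NEWSCATCHER_KEY"

def search_news(query: str, resolution_date: dt.date) -> list[dict]:
    """Fetch articles from a 14-day window before a question's resolution date.

    The window below is one interpretation of "news collected 14 days prior to
    question resolution"; parameter names are illustrative.
    """
    window_start = resolution_date - dt.timedelta(days=14)
    resp = requests.get(
        NEWSCATCHER_URL,
        headers={"x-api-key": API_KEY},
        params={
            "q": query,
            "from": window_start.isoformat(),
            "to": resolution_date.isoformat(),
            "lang": "en",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("articles", [])

# In the paper, GPT-4o generates the search queries for each question;
# here a single hand-written query stands in for that step.
articles = search_news("US Federal Reserve rate decision December 2024", dt.date(2024, 12, 18))
```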

The base models were instructed to provide reasoning and a probabilistic forecast for each question. A scratchpad prompt was used for Phi-4 14B, while a zero-shot prompt was used for DeepSeek-R1 14B. The prompt included a summary of news along with the appropriate scratchpad or zero-shot prompt depending on the model. Two reasoning traces were generated for each question. Overall, 18,854 reasoning traces were obtained for the 9,427 forecasting questions that had non-constant forecasts.
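A hedged sketch of this self-play generation step using Hugging Face transformers follows; the checkpoint name, prompt wording, and output format ("Probability: X.XX") are assumptions for illustration, not the paper's actual prompts.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/phi-4"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

def generate_traces(question: str, news_summary: str, n_traces: int = 2) -> list[str]:
    """Sample several independent reasoning traces plus probabilistic forecasts."""
    prompt = (
        f"News summary:\n{news_summary}\n\n"
        f"Question: {question}\n"
        "Think step by step in a scratchpad, then give a final probability "
        "between 0 and 1 on the last line as 'Probability: X.XX'."
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling keeps the two traces diverse
        temperature=0.8,
        max_new_tokens=1024,
        num_return_sequences=n_traces,
    )
    # Strip the prompt tokens and return only the generated continuations.
    return [tok.decode(o[inputs["input_ids"].shape[1]:], skip_special_tokens=True) for o in outputs]
```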

For each question, the two reasoning–outcome pairs were ranked by the proximity of their probabilistic forecasts to the ground truth. For a question with ground truth o \in \{0,1\}, the probabilistic forecasts from the two reasoning traces are denoted p_1 and p_2 (with p_i \in [0,1]). A ranking metric is defined as:

r(p, o) = |p - o|

where:

  • r is the ranking metric
  • p is the probabilistic forecast
  • o is the ground truth outcome

This metric measures the absolute difference between the forecast and the actual outcome. Pairs that resulted in identical forecasts were removed prior to this stage. The full set of 18,854 reasoning traces for the 9,427 forecasting questions was used for re-ranking.
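A sketch of this resolution-driven re-ranking step, producing preference pairs in the (prompt, chosen, rejected) format commonly used for DPO training; the field names, the `extract_probability` helper, and the "Probability: X.XX" output format are assumptions carried over from the generation sketch above.

```python
import re

def extract_probability(trace: str) -> float | None:
    """Pull the final 'Probability: X.XX' value out of a reasoning trace (illustrative format)."""
    matches = re.findall(r"Probability:\s*([01](?:\.\d+)?)", trace)
    return float(matches[-1]) if matches else None

def make_preference_pair(prompt: str, trace_a: str, trace_b: str, outcome: int) -> dict | None:
    """Rank the two traces by |p - o|; the trace whose forecast is closer becomes 'chosen'."""
    p_a, p_b = extract_probability(trace_a), extract_probability(trace_b)
    if p_a is None or p_b is None or p_a == p_b:
        return None  # identical (or unparseable) forecasts are dropped
    if abs(p_a - outcome) < abs(p_b - outcome):
        chosen, rejected = trace_a, trace_b
    else:
        chosen, rejected = trace_b, trace_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Example: the event resolved to 0, so the 5% forecast is preferred over the 10% forecast.
pair = make_preference_pair("Will X happen?", "...Probability: 0.05", "...Probability: 0.10", outcome=0)
```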

To control for the possibility that the information provided via news aggregation, rather than the outcome-based rankings, is responsible for any improvement, a second set of models was fine-tuned with randomized labels. These control models test whether the gains are attributable to learning from higher-accuracy forecasting rationales.
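A one-function sketch of such a randomized-label control, which keeps the same prompts and traces but assigns the "chosen" label at random (field names follow the preference-pair sketch above and are assumptions):

```python
import random

def randomize_labels(pair: dict, rng: random.Random) -> dict:
    """Control condition: ignore the outcome and pick the 'chosen' trace at random."""
    if rng.random() < 0.5:
        return pair
    return {"prompt": pair["prompt"], "chosen": pair["rejected"], "rejected": pair["chosen"]}

rng = random.Random(0)
example = {"prompt": "Will X happen?", "chosen": "...Probability: 0.05", "rejected": "...Probability: 0.10"}
control_example = randomize_labels(example, rng)
```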

Phi-4 14B and DeepSeek-R1 14B were fine-tuned using the preference pairs. DPO was used to optimize model outputs against the self-play-derived, outcome-driven preferences without the need to train a separate reward model. The DPO loss was minimized using a Low-Rank Adaptation (LoRA) adapter (rank=16, alpha=32, dropout=0.05, target_modules="all-linear", no bias) on top of the base model, which was held in 4-bit quantisation, with a batch size of 2 (with 4 gradient accumulation steps) and gradient checkpointing enabled. Training used the AdamW optimiser with a linear learning-rate scheduler (5e-5 base rate), a DPO beta of 0.1, and BF16 mixed precision, on eight H100 Graphics Processing Units (GPUs). A plateau was reached at the fifth epoch for Phi-4 14B and at the fourth epoch for DeepSeek-R1 14B.
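The paper does not name its training stack; the following is a minimal sketch of the reported configuration using Hugging Face TRL and PEFT, assuming the preference pairs are stored as (prompt, chosen, rejected) records in a JSONL file. The checkpoint and file names are assumptions, and exact constructor arguments vary across TRL versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_name = "microsoft/phi-4"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # base model held in 4-bit
    device_map="auto",
)

# LoRA adapter as reported: rank 16, alpha 32, dropout 0.05, all linear layers, no bias.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules="all-linear", bias="none", task_type="CAUSAL_LM",
)

# DPO hyperparameters as reported: beta 0.1, 5e-5 base rate with linear schedule,
# batch size 2 with 4 gradient-accumulation steps, BF16, gradient checkpointing.
args = DPOConfig(
    output_dir="dpo-forecaster",
    beta=0.1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    bf16=True,
    num_train_epochs=5,  # Phi-4 14B plateaued at epoch 5, DeepSeek-R1 14B at epoch 4
)

train_dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")  # assumed file

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
    peft_config=peft_config,
)
trainer.train()
```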

Every model was tested on a held-out test set of 2,300 questions. The test window began 10 days after the final outcome in the training set, ensuring that the fine-tuned models had not been exposed to any news that might inform outcomes in the test set. Three versions of each model were evaluated: the original base model, the model fine-tuned with correct outcomes for DPO ranking, and a control model fine-tuned with randomized outcomes for DPO ranking.

The fine-tuned models demonstrated substantial improvements in forecasting accuracy. For Phi-4 14B, the fine-tuned model achieved a mean Brier score of 0.200 (SD = 0.218; 95% CI [0.191, 0.209]), outperforming both the randomized-label control model (M = 0.214, SD = 0.186; 95% CI [0.206, 0.221]) and the base model (M = 0.221, SD = 0.189; 95% CI [0.214, 0.229]). Similarly, DeepSeek-R1 14B attained a mean Brier score of 0.197 (SD = 0.218; 95% CI [0.188, 0.206]) after fine-tuning, surpassing both its randomized-label control (M = 0.212, SD = 0.202; 95% CI [0.204, 0.220]) and base counterparts (M = 0.212, SD = 0.201; 95% CI [0.204, 0.220]).

Independent-samples t-tests comparing each fine-tuned model against its base and control counterparts, as well as against the frontier benchmark set by GPT-4o, showed that for both Phi-4 14B and DeepSeek-R1 14B the fine-tuned model was statistically significantly more accurate than both the base and control models at p < 0.05. This also held after adjusting the p-values for multiple comparisons via the Benjamini-Hochberg procedure. No statistically significant differences were observed between the fine-tuned models and the frontier benchmark set by GPT-4o (p > 0.7 for both after adjustment).
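A sketch of this significance testing using SciPy and statsmodels; the per-question Brier score arrays below are random placeholders standing in for the real data.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Per-question Brier scores for each model variant (illustrative random data).
rng = np.random.default_rng(0)
finetuned = rng.beta(1.0, 4.0, size=2300)
base = rng.beta(1.2, 4.0, size=2300)
control = rng.beta(1.2, 4.0, size=2300)
gpt4o = rng.beta(1.0, 4.0, size=2300)

# Independent-samples t-tests of the fine-tuned model against each comparator.
pvals = [
    ttest_ind(finetuned, base).pvalue,
    ttest_ind(finetuned, control).pvalue,
    ttest_ind(finetuned, gpt4o).pvalue,
]

# Benjamini-Hochberg adjustment for multiple comparisons.
reject, pvals_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(zip(["vs base", "vs control", "vs GPT-4o"], pvals_adjusted, reject)))
```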

Comparing the distributions of accuracy scores across questions for DeepSeek-R1 14B, the fine-tuned model had a Brier score above 0.5 on 8.52% of questions, slightly higher than the base (7.48%) and control (7.61%) models. However, it also had a Brier score below 0.05 on 32.78% of questions, compared to only 23.22% and 23.13% for the base and control models. This pattern was replicated for Phi-4 14B, where the fine-tuned model had a Brier score above 0.5 on 8.87% of questions and below 0.05 on 35.7%, compared to 7.26% and 21% for the base model and 6.43% and 20.39% for the control model, respectively.
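These tail fractions can be computed directly from per-question Brier scores; a minimal NumPy sketch (with illustrative random scores in place of the real data):

```python
import numpy as np

def tail_fractions(brier: np.ndarray) -> tuple[float, float]:
    """Share of questions with Brier > 0.5 (badly wrong) and < 0.05 (highly accurate)."""
    return float(np.mean(brier > 0.5)), float(np.mean(brier < 0.05))

# Illustrative usage with random per-question scores.
scores = np.random.default_rng(0).beta(1.0, 4.0, size=2300)
above, below = tail_fractions(scores)
print(f"{above:.2%} above 0.5, {below:.2%} below 0.05")
```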

The paper concludes that LLMs can enhance their forecasting capabilities through self-play, generating reasoning traces that enable outcome-based fine-tuning without relying on human-curated data. By pairing these traces and ranking them by their proximity to actual outcomes, the models learn to refine their probabilistic forecasts, outperforming base models and matching the performance of larger frontier models.
