Outcome-based Reinforcement Learning to Predict the Future (2505.17989v2)
Abstract: Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in LLMs, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into \$127 of hypothetical profit versus \$92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
Summary
- The paper adapts GRPO (removing per-question variance scaling) and ReMax (baseline-subtracted advantages) so that large forecast errors retain their learning signal.
- It trains a 14B-parameter LLM on resolved Polymarket questions augmented with 100k temporally consistent synthetic questions, delivering competitive accuracy with improved calibration.
- The ReMax Ensemble-7 model achieves better calibration and higher hypothetical trading profit than the frontier baseline o1 and the base model.
This paper, "Outcome-based Reinforcement Learning to Predict the Future" (2505.17989), explores adapting Reinforcement Learning with Verifiable Rewards (RLVR) to the complex domain of real-world forecasting, where rewards are often binary, delayed, and noisy. The authors demonstrate that a 14-billion parameter LLM can achieve forecasting accuracy comparable to frontier models and superior calibration by refining existing RL algorithms.
The core problem addressed is the brittleness of standard fine-tuning methods when applied to forecasting tasks that lack the clean, deterministic feedback common in areas like math or coding where RLVR has previously shown success. The paper proposes several key methodological adaptations to address this:
- Algorithm Modifications:
- Modified Group-Relative Policy Optimisation (GRPO): Standard GRPO normalizes rewards by their per-question standard deviation. The authors remove this scaling, using $\hat{A}_i = r_i - \mu$ instead of $\hat{A}_i = (r_i - \mu)/\sigma$. This change matters because per-question normalization dampens the impact of large forecast errors, which are vital signals for learning calibration; since Brier scores are intrinsically bounded in $[0, 1]$, removing the scaling raises fewer instability concerns than it would with unbounded rewards.
- ReMax with Baseline-Subtracted Advantages: ReMax uses a learned baseline $b_i$ to calculate advantages, $\hat{A}_i = r_i - b_i$. This approach, like Modified GRPO, avoids per-question variance scaling, better preserving the magnitude of reward signals.
- Data Augmentation: The training dataset, initially ~10,000 resolved yes/no contracts from Polymarket, was augmented with 100,000 temporally consistent synthetic forecasting questions generated by Lightning Rod Labs' Foresight Learning framework. This "hydration" expands the training data significantly.
- Stability Guard-rails: To enable stable training over 110,000 events in a single pass, lightweight guard-rails were introduced. These penalize:
- Gibberish or non-English responses.
- Missing rationales (the explanation expected inside the designated tags).
- Token-length limits and input truncation (16,000 characters) were also employed.
The reward function applies penalties proportional to the share of non-English or gibberish content, a fixed penalty for missing explanations, and a bonus for explanation quality.
The model used is DeepSeek-R1-Distill-Qwen-14B. The reward signal is the negative Brier score, $R = -(\hat{p} - y)^2$, where $\hat{p}$ is the model's predicted probability and $y$ is the binary outcome. For unparseable outputs during training, a strict Brier loss of 1.0 is applied; at evaluation, a "soft" Brier loss of 0.25 (equivalent to a random guess) is assigned to malformed outputs so that occasional formatting failures are not over-penalized.
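A minimal sketch of this reward scheme, assuming a hypothetical `parse_probability` helper that extracts the forecast from the model's text (the paper's exact parsing logic is not specified here):

```python
import re
from typing import Optional

def parse_probability(output_text: str) -> Optional[float]:
    # Illustrative parser (an assumption, not the paper's): take the last
    # number following a "Probability:" marker, if one is present.
    matches = re.findall(r"Probability:\s*([01](?:\.\d+)?)", output_text)
    return float(matches[-1]) if matches else None

def brier_reward(output_text: str, outcome: int, training: bool = True) -> float:
    """Negative Brier reward R = -(p_hat - y)^2 for a binary outcome y."""
    p_hat = parse_probability(output_text)
    if p_hat is None:
        # Unparseable output: strict Brier loss of 1.0 during training,
        # "soft" loss of 0.25 (a random guess) when scoring evaluations.
        return -1.0 if training else -0.25
    return -((p_hat - outcome) ** 2)
```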
Implementation Details for RL Algorithms (collected into a configuration sketch after this list):
- Shared: AdamW optimizer, bfloat16 precision, global gradient norm clipping at 1.0, entropy bonus of 0.001.
- GRPO/Modified GRPO: Actor learning rate $1 \times 10^{-6}$, initial KL penalty $0.005$, PPO ratio-clip $\epsilon = 0.20$, $G = 4$ rollouts per prompt.
- ReMax: Actor learning rate $2 \times 10^{-6}$, same KL schedule as GRPO, value baseline trained with learning rate $1 \times 10^{-6}$ using MSE loss (scaled by 0.5).
- DPO (baseline): 4 epochs, $\beta = 0.1$, constant learning rate $1 \times 10^{-5}$, batch size 128. All on-policy algorithms (GRPO, Mod-GRPO, ReMax) were trained strictly online, with each question encountered chronologically only once.
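For concreteness, the reported hyperparameters can be gathered into configuration objects along the following lines (a sketch; the class and field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class SharedRLConfig:
    # Settings shared across all on-policy runs, as reported above.
    optimizer: str = "AdamW"
    precision: str = "bfloat16"
    grad_norm_clip: float = 1.0
    entropy_bonus: float = 0.001

@dataclass
class GRPOConfig(SharedRLConfig):
    actor_lr: float = 1e-6
    initial_kl_penalty: float = 0.005
    ppo_clip_eps: float = 0.20
    rollouts_per_prompt: int = 4  # G

@dataclass
class ReMaxConfig(SharedRLConfig):
    actor_lr: float = 2e-6        # same KL schedule as GRPO
    baseline_lr: float = 1e-6
    baseline_mse_scale: float = 0.5
```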
Key Results:
The evaluation was performed on a hold-out set of 3,300 Polymarket questions.
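Accuracy and calibration are reported throughout as Brier score and expected calibration error (ECE); a minimal sketch of both metrics, assuming equal-width probability bins (the paper's exact binning is not restated here):

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    return float(np.mean((probs - outcomes) ** 2))

def expected_calibration_error(probs: np.ndarray, outcomes: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE over equal-width bins; the bin count here is an assumption."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo) & (probs <= hi)
        if in_bin.any():
            avg_conf = probs[in_bin].mean()     # mean predicted probability
            avg_freq = outcomes[in_bin].mean()  # empirical frequency of "yes"
            ece += in_bin.mean() * abs(avg_freq - avg_conf)
    return float(ece)
```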
- Algorithm Comparison (10k questions, no guard-rails): ReMax achieved the best accuracy (Soft-Brier: 0.197) and calibration (ECE: 0.0507), outperforming DPO, Modified GRPO, standard GRPO, and the base model.
- Scaling ReMax (100k questions, with guard-rails):
- A single ReMax model trained on 100k questions reached a Brier score of 0.195 and an ECE of 0.0438.
- An ensemble of seven ReMax predictions (ReMax Ensemble-7) further improved accuracy to a Brier score of 0.193 and achieved an ECE of 0.042.
- Comparison with Baselines (ReMax Ensemble-7):
- Accuracy (Brier): Matched OpenAI's o1 (0.193 vs. 0.197 for o1, p=0.23). Outperformed the base DeepSeek-R1 model (0.214, p<0.001). Trailed Polymarket prices (0.162, p<0.001).
- Calibration (ECE): Significantly surpassed o1 (0.042 vs. 0.0895 for o1, p<0.001) and essentially matched Polymarket's calibration (0.0425, p=0.99).
- Hypothetical Trading Evaluation:
- A simple trading strategy was simulated: bet only when the model's predicted edge over the market price exceeded its own ECE (sketched in code after this list).
- ReMax Ensemble-7 generated \$127 in hypothetical profit, significantly more than o1 (\$92, p=0.037) and the base model (\$72, p=0.001). Modified-GRPO earned \$111.
- The ReMax model's advantage was most pronounced in markets where Polymarket itself showed lower confidence (50-65% probability range).
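A rough sketch of that trading rule under simplified assumptions (fixed \$1 stake per trade and idealized prediction-market payoffs; the paper's exact backtest accounting may differ):

```python
def simulate_trading(model_probs, market_prices, outcomes, model_ece, stake=1.0):
    """Bet only when |model probability - market price| exceeds the model's ECE.
    Buys YES shares when the model is above the market, NO shares when below.
    Payoff accounting here is a simplification, not the paper's exact backtest."""
    profit = 0.0
    for p, price, y in zip(model_probs, market_prices, outcomes):
        if price <= 0.0 or price >= 1.0:
            continue  # skip degenerate prices
        edge = p - price
        if abs(edge) <= model_ece:
            continue  # perceived edge does not exceed calibration error: no trade
        if edge > 0:
            # Buy YES at `price`; each share pays 1 if the event resolves yes.
            profit += stake * (y - price) / price
        else:
            # Buy NO at `1 - price`; each share pays 1 if the event resolves no.
            profit += stake * ((1 - y) - (1 - price)) / (1 - price)
    return profit
```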
Practical Implications and Applications:
The paper demonstrates a practical pathway to enhance LLMs for real-world forecasting tasks characterized by noisy and delayed feedback.
- Improved Forecasting Tools: The refined RLVR methods can turn moderately sized LLMs (14B parameters) into economically valuable forecasting tools. This is significant because it suggests that state-of-the-art forecasting capabilities are not solely the domain of massive frontier models.
- RL Algorithm Selection and Adaptation: For forecasting, using ReMax or a Modified GRPO (without per-question reward normalization) is preferable to standard GRPO. These methods better preserve crucial learning signals from large errors, leading to improved calibration.
```python
# Conceptual pseudocode for the Modified GRPO advantage
def calculate_modified_grpo_advantage(rewards):
    # rewards: list of rewards for the G rollouts of a single question
    mean_reward = sum(rewards) / len(rewards)
    # Centre on the per-question mean; no division by the standard deviation.
    advantages = [r - mean_reward for r in rewards]
    return advantages


# Conceptual pseudocode for the ReMax advantage
def calculate_remax_advantage(rewards, baselines):
    # rewards: list of rewards for the G rollouts
    # baselines: list of learned baseline values for each rollout
    advantages = [r - b for r, b in zip(rewards, baselines)]
    return advantages
```
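As a worked illustration of why dropping the standard-deviation division matters: for per-question rewards of [-0.04, -0.09, -0.16, -0.81] (one badly miscalibrated rollout), the modified advantage of the outlier is about -0.54, preserving the size of the error, whereas standard GRPO would shrink it toward unit scale (about -1.7 with the population standard deviation), the same magnitude it would assign to an outlier on a question where all rewards differ only slightly.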
- Data Strategy: Augmenting real-world data with high-quality synthetic data (like the 100k questions from Foresight Learning) can substantially improve training, especially for online, single-pass learning regimes.
- Stability Measures: Implementing guard-rails to penalize undesirable outputs (gibberish, non-English, missing rationales) is critical for stable training at scale, particularly when dealing with large and potentially noisy datasets. These can be implemented as a post-processing step on the LLM's output before calculating the reward.
```python
# Conceptual pseudocode for the guard-rail reward shaping; helper functions
# and penalty weights are placeholders rather than the paper's exact values.
def calculate_penalized_reward(raw_reward, output_text, explanation_quality):
    penalty = 0.0
    if contains_non_english(output_text):
        penalty += NON_ENGLISH_PENALTY_WEIGHT * non_english_proportion(output_text)
    if contains_gibberish(output_text):
        penalty += GIBBERISH_PENALTY_WEIGHT * gibberish_proportion(output_text)
    if not has_explanation(output_text):
        penalty += MISSING_EXPLANATION_PENALTY
    bonus = EXPLANATION_QUALITY_WEIGHT * explanation_quality
    final_reward = raw_reward - penalty + bonus
    return final_reward
```
- Exploiting Calibration: The paper highlights that superior calibration can be directly translated into economic value. The trading strategy that only bet when the model's perceived edge exceeded its ECE was particularly effective. This suggests deploying forecasting models with an awareness of their calibration characteristics.
Limitations and Considerations:
- The model still lags behind actual market prices in raw Brier-score accuracy, suggesting that markets incorporate information the model either lacks or processes less effectively.
- The information cutoff for the model (00:00 UTC on prediction day) ensures no look-ahead bias but may disadvantage it against real-time market movements.
- While guard-rails reduce failure modes like gibberish, some ungrammatical phrases can persist.
In conclusion, the paper offers valuable insights and practical methods for applying RLVR to forecasting. It shows that by carefully selecting RL algorithms, adapting their reward processing, augmenting data, and implementing stability measures, even mid-sized LLMs can be trained to make well-calibrated and economically useful predictions directly from outcome data.
Follow-up Questions
- How do the proposed modifications to GRPO and the use of ReMax compare to other recent advances in reinforcement learning for structured output prediction tasks?
- What are the potential limitations of using synthetic forecasting questions in model training, and how might these affect the model's generalization to unforeseen real-world events?
- How do the implemented guard-rails and penalization mechanisms interact with the model's learning dynamics, and do they risk introducing bias or reducing model expressiveness?
- To what extent do calibration improvements (as measured by ECE) translate into practical gains in real-world decision-making or trading performance, especially in adversarial or competitive environments?
Related Papers
- Temporal Data Meets LLM -- Explainable Financial Time Series Forecasting (2023)
- Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament (2023)
- Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy (2024)
- ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities (2024)
- On Designing Effective RL Reward at Training Time for LLM Reasoning (2024)
- LLMs Can Teach Themselves to Better Predict the Future (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- Learning to Reason without External Rewards (2025)
- Pitfalls in Evaluating Language Model Forecasters (2025)