ReMax & GRPO: Outcome Forecasting in LLMs

Updated 13 December 2025

ReMax and GRPO are reinforcement learning algorithms designed for calibrated probabilistic forecasting using outcome-based, verifiable rewards.
They employ tailored policy gradient updates with unique advantage computations to preserve gradients proportional to true Brier loss under noisy, delayed binary outcomes.
Empirical evaluations demonstrate that these methods improve calibration accuracy and trading simulation profitability while halving expected calibration error compared to baselines.

Group-Relative Policy Optimization (GRPO) and ReMax are reinforcement learning algorithms adapted for outcome-based, verifiable reward pipelines in probabilistic forecasting with LLMs. Originating from the reinforcement learning with verifiable rewards (RLVR) paradigm, both approaches address delayed, binary, and noisy reward regimes where standard supervised or simple RL fine-tuning is brittle. Turtel et al. (2025) present formal objective functions, algorithmic modifications, and practical outcomes when applying GRPO and ReMax for learning calibrated outcome forecasts from streams of real-world questions with binary resolution, illustrating their effectiveness on a 14B parameter LLM (Turtel et al., 23 May 2025).

1. Formal Objectives for GRPO and ReMax

The GRPO and ReMax algorithms are specified in terms of policy gradient updates using per-question, per-rollout rewards. Consider a forecasting “question” $q$ , with $G$ model-generated completions $o^1, \dots, o^G$ . Each completion $o^i$ yields a forecasted probability $\hat p^i \in [0,1]$ and receives a reward $r^i = -(\hat p^i - y)^2$ , where $y \in \{0,1\}$ is the binary outcome upon question resolution. The within-group mean $\mu$ and standard deviation $\sigma$ are

$\mu = \frac{1}{G}\sum_{i=1}^{G} r^i\,,\qquad \sigma = \sqrt{\frac{1}{G}\sum_{i=1}^{G}(r^i - \mu)^2}\,.$

Original GRPO computes a standardized advantage for each rollout:

$\hat A^i = \frac{r^i - \mu}{\sigma}\,,$

and maximizes a clipped-PPO surrogate with KL penalty

$\mathcal J_{\rm GRPO}(\theta) = \mathbb{E}_{q,\{o^i\}}\left[ \frac{1}{G}\sum_{i=1}^G \sum_{t=1}^{|o^i|} \min \left( r_{i,t} \hat A^i, \mathrm{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon) \hat A^i \right) - \beta\, D_{\rm KL}(\pi_\theta \| \pi_{\rm ref}) \right]$

where $r_{i,t} = \frac{\pi_\theta(o^i_t|q,o^i_{<t})}{\pi_{\theta_{\rm old}}(o^i_t|q,o^i_{<t})}$ .

Modified GRPO omits the per-question variance $\sigma$ , instead using raw groupwise centering:

$\hat A^i = r^i - \mu\,,$

within the identical surrogate structure. This modification preserves the absolute magnitude of large Brier errors, avoiding dampening the learning signal for large miscalibration.

ReMax employs a learned scalar baseline $b^i$ (via MSE regression) for each rollout, and uses the baseline-subtracted advantage:

$\hat A^i = r^i - b^i\,,$

with the objective

$\mathcal J_{\rm ReMax}(\theta) = \mathbb{E}_{q,\{o^i\}} \left[ \frac{1}{G}\sum_{i=1}^G \hat A^i \sum_{t=1}^{|o^i|} \log \pi_\theta(o^i_t|q,o^i_{<t}) - \beta\, D_{\rm KL}(\pi_\theta \| \pi_{\rm ref}) \right].$

2. Reward Structure and Guardrails

To prevent undesired policy drift—such as generating gibberish, non-English outputs, or omitting explanatory rationales—the scalar reward $r^i$ is augmented with continuous penalties and bonuses: $r^i_{\text{total}} = -(\hat p^i - y)^2 - \lambda_{\rm lang}\, \rho_{\rm ne} - \lambda_{\rm gib}\, \rho_{\rm gib} - \lambda_{\rm miss}\, \mathbf{1}_{\rm miss} + \lambda_{\rm exp}\, \rho_{\rm explq} \,,$ where $\rho_{\rm ne}$ = non-English proportion, $\rho_{\rm gib}$ = gibberish proportion, $\rho_{\rm explq}$ = explanation quality, and $\mathbf{1}_{\rm miss}$ = indicator for a missing explanation block. The $\lambda$ parameters are positive to penalize invalid formatting and incentivize rationale quality.

3. Synthetic Data Hydration and Training Regimen

The pre-training set comprises approximately 10,000 chronological resolved yes/no questions from Polymarket. This is augmented with 100,000 synthetic forecasting prompts generated via Lightning Rod Labs’ proprietary Foresight Learning system, designed for temporal consistency by randomizing prediction timestamps between question open and close. Real and synthetic events are combined into a stream of 110,000 examples, sorted by resolution time. The model is exposed to each example in strict time order with no re-sampling or example weighting; hence, no example is presented more than once and no epoching is used.

4. Model Architecture, Optimization, and Online Curriculum

Base Model: DeepSeek-R1-Distill-Qwen-14B (14B parameters), using bfloat16.
Optimizer: AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8), global gradient norm clip at 1.0.
Entropy bonus: 0.001 per update.
GRPO: actor learning rate 1e-6, KL penalty coefficient initialized to 0.005, PPO clip ε = 0.20, $G = 4$ rollouts per question.
Modified GRPO: identical hyperparameters, with $r^i-\mu$ advantage.
ReMax: actor learning rate 2e-6, baseline learning rate 1e-6 (MSE ×0.5), matching KL penalty, $G = 4$ .
Baselines: Direct Preference Optimization (DPO) used as baseline (4 epochs, lr = 1e-5, β = 0.1, batch size 128).
Compute: Trained on a single 8×H100 node, with automatic mixed precision.
Inference Ensembling: 7 independently trained ReMax checkpoints are averaged at inference.
Curriculum: Strictly chronological, single pass; mixing of synthetic and real events occurs exclusively via time order.

5. Performance Evaluation and Metrics

Evaluation is performed on a holdout set of 3,300 time ordered Polymarket questions. The principal metrics are:

Soft-Brier Score: $\frac{1}{N}\sum (\hat p - y)^2$ with a lower bound of 0.25 for unparseable outputs.
Expected Calibration Error (ECE): Computed over 10 equal-mass bins.

Table: Main Outcomes (holdout set)

Model	Soft-Brier	ECE	p-values vs. o1
Polymarket	0.162	0.0425	–
ReMax Ensemble-7	0.193	0.0424	Brier: $p=0.23$ , ECE: $p<0.001$
OpenAI o1	0.197	0.0895	–
DeepSeek-R1 Base	0.214	0.1180	–

A simple one-share trading rule (Edge > ECE) yields: ReMax: \$127, o1: \$92, base: \$72, with ReMax’s profit$\$35.6 $higher than o1 (95% CI [\$ 2.0, \$68.6], $p=0.037$ ).

6. Implications, Findings, and Significance

By removing per-question variance normalization from GRPO and employing baseline-subtracted advantages in ReMax, the algorithms maintain gradients proportional to the true Brier loss, robustly capturing large miscalibration errors. The introduction of output guardrails further stabilizes training by penalizing non-conforming responses. These modifications enable a 14B-parameter model to match frontier LLMs in forecast accuracy while halving ECE, with improved calibration translating directly into higher hypothetical profit via trading simulation. This demonstrates that refined RLVR methods render small-scale LLMs competitive for learning well-calibrated, economically valuable probabilistic forecasts from outcome streams that are inherently noisy and delayed (Turtel et al., 23 May 2025).

This suggests a plausible path for upscaling RL-based training of LLM forecasters without requiring frontier-scale models, and highlights the importance of algorithmic details—especially in variance normalization and reward guardrails—for robust performance in outcome-only RL settings.

PDF Markdown Chat (Pro)

References (1)

Outcome-based Reinforcement Learning to Predict the Future (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to ReMax and GRPO.