ReMax with Baseline-Subtracted Advantages
- The paper introduces ReMax with baseline-subtracted advantages, a novel RL approach that leverages explicit baseline subtraction (via greedy decoding or a learned function) to reduce gradient variance without a separate value network.
- It employs a gradient estimator that subtracts a baseline from trajectory-level rewards in settings with deterministic transitions, improving sample efficiency and simplifying implementation compared to PPO.
- Empirical results indicate that ReMax matches or exceeds PPO performance in LLM alignment and event forecasting while achieving notable reductions in GPU memory usage and training time.
ReMax with baseline-subtracted advantages is a reinforcement learning (RL) methodology designed for efficient LLM alignment and outcome-based RL with verifiable or trajectory-level rewards. It builds on the REINFORCE algorithm, incorporating variance reduction through action-independent baselining. Unlike Proximal Policy Optimization (PPO), ReMax exploits the deterministic transitions and trajectory-level rewards specific to RL from human feedback (RLHF) and does not require a separate value model for baseline computation. Instead, it employs explicit baseline subtraction, either via greedy decoding in LLM alignment or via a learned, per-input function in outcome forecasting. This approach simplifies implementation, enhances sample efficiency, and offers significant reductions in computational resource requirements while matching or exceeding the empirical performance of PPO.
1. Mathematical Foundation and Gradient Estimator
The central objective in RLHF and outcome-based RL is to maximize the expected reward over prompt–response trajectories or other domain-specific outputs. For an LLM policy $\pi_\theta$ with parameters $\theta$, the expected reward is

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[r(x, a)\big],$$

where $x$ is a prompt drawn from the dataset $\mathcal{D}$ and $a$ is a sampled response.
The policy gradient, via the log-derivative trick (REINFORCE), is

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid x)}\big[r(x, a)\, \nabla_\theta \log \pi_\theta(a \mid x)\big].$$
To reduce estimator variance, introduce a baseline $b(x)$ that does not depend on the sampled action and define the baseline-subtracted advantage

$$A(x, a) = r(x, a) - b(x),$$

which yields the unbiased gradient estimator

$$\widehat{\nabla_\theta J}(\theta) = \big(r(x, a) - b(x)\big)\, \nabla_\theta \log \pi_\theta(a \mid x) = \big(r(x, a) - b(x)\big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid x, a_{1:t-1}).$$
In outcome-based RL with verifiable rewards (RLVR), as used in event forecasting, the baseline is typically parameterized as $b_\phi(x)$ and trained to minimize prediction error on observed rewards, with the policy gradient estimated as

$$\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \big(r(x, a^{(i)}) - b_\phi(x)\big)\, \nabla_\theta \log \pi_\theta(a^{(i)} \mid x) \;-\; \beta\, \nabla_\theta \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),$$

where $K$ is the number of rollouts, $\beta$ is the KL penalty coefficient, and $\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}})$ is the divergence to a reference policy (Li et al., 2023, Turtel et al., 23 May 2025).
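This estimator can be written in a few lines of PyTorch. The sketch below is purely illustrative: the tensors `logp`, `rewards`, and `kl_estimate` are hypothetical stand-ins for quantities produced by the policy and reward model, and the KL term is held constant for brevity.

```python
import torch

# Illustrative quantities only; in practice they come from the policy and reward model.
K = 4                                           # hypothetical number of rollouts per prompt
logp = torch.randn(K, requires_grad=True)       # summed log pi_theta(a^(i) | x) per rollout
rewards = torch.tensor([0.7, 0.2, 0.9, 0.4])    # r(x, a^(i)) for each rollout
baseline = rewards.mean()                       # stand-in for b_phi(x) (or the greedy-rollout reward)
kl_estimate = torch.tensor(0.05)                # stand-in for KL(pi_theta || pi_ref); constant here
beta = 0.1                                      # KL penalty coefficient

advantages = rewards - baseline                 # baseline-subtracted advantages A^(i)
pg_loss = -(advantages * logp).mean()           # negated REINFORCE objective
loss = pg_loss + beta * kl_estimate             # KL-penalized loss (the KL is differentiable in practice)
loss.backward()                                 # gradient flows only through logp in this toy example
```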
2. Baseline Choices and Computation
In ReMax for LLM alignment, the baseline is the reward of a greedily decoded response:

$$b(x) = r(x, \bar{a}), \qquad \bar{a}_t = \arg\max_{a_t}\, \pi_\theta(a_t \mid x, \bar{a}_{1:t-1}),$$

i.e., the reward assigned to the response produced by greedy (argmax) decoding under the current policy.
Practically, each minibatch prompt yields two trajectories: one sampled stochastically, another by greedy decoding. Their rewards are used for advantage computation without a separately parameterized value network. This bypasses the need for value-function training inherent in PPO-based RLHF.
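A minimal sketch of this two-decode procedure, assuming a Hugging Face-style `model.generate` interface and a hypothetical `reward_model(prompt, response)` wrapper, is:

```python
def remax_advantage(model, reward_model, tokenizer, prompt, device="cpu"):
    """Sketch of ReMax's non-parametric baseline: one stochastic and one greedy decode per
    prompt, with the greedy rollout's reward used as b(x). `model.generate` follows the
    Hugging Face interface; `reward_model(prompt, response)` is a hypothetical wrapper
    around a scalar reward model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    a_stoch = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9,
                             max_new_tokens=256)
    a_greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    r_stoch = reward_model(prompt, tokenizer.decode(a_stoch[0], skip_special_tokens=True))
    r_greedy = reward_model(prompt, tokenizer.decode(a_greedy[0], skip_special_tokens=True))
    return r_stoch - r_greedy, a_stoch      # advantage and the rollout it applies to
```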
In RLVR forecasting, the baseline $b_\phi(x)$ is either a small MLP or a per-input scalar buffer trained to minimize the mean-squared error

$$\mathcal{L}(\phi) = \mathbb{E}_{x}\left[\frac{1}{K}\sum_{i=1}^{K}\big(b_\phi(x) - r(x, a^{(i)})\big)^2\right].$$
A key distinction is that in LLM alignment the baseline is non-parametric (greedy rollout reward), whereas in forecasting with RLVR it is a parametric or tabular function learned online (Li et al., 2023, Turtel et al., 23 May 2025).
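To make the distinction concrete, the following sketch shows two hypothetical learned-baseline implementations for the RLVR setting; the class names and architecture details are illustrative and not specified in the source.

```python
import torch
import torch.nn as nn

class ScalarBaselineBuffer:
    """Hypothetical per-question scalar baseline b_phi(x): one learnable value per question
    id, nudged toward the mean observed reward by an exact MSE gradient step."""
    def __init__(self, num_questions, lr=0.1):
        self.values = torch.zeros(num_questions)
        self.lr = lr

    def __call__(self, qid):
        return self.values[qid]

    def update(self, qid, rewards):
        # d/db of 0.5 * mean_i (b - r_i)^2 is (b - mean_i r_i)
        self.values[qid] -= self.lr * (self.values[qid] - rewards.mean())


class MLPBaseline(nn.Module):
    """Hypothetical small MLP baseline b_phi(x) over a fixed-size question embedding,
    trained with an MSE loss against observed rewards."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_embed):
        return self.net(x_embed).squeeze(-1)
```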
3. Pseudocode and Workflow
The ReMax implementation for RLHF is notably concise. The canonical ReMax loop is as follows:
```python
for x in prompts:                                        # 1. take one prompt
    a_stoch = LM.sample(x, greedy=False)                 # 2. sample a stochastic response
    a_greedy = LM.sample(x, greedy=True)                 # 3. greedy response used as the baseline
    adv = RM(x, a_stoch) - RM(x, a_greedy)               # 4. baseline-subtracted advantage
    logp = LM.log_prob(x, a_stoch)                       # 5. per-token log-probs of the stochastic rollout
    loss = -(logp.sum(dim=-1) * adv).mean()              # 6. policy-gradient loss
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```
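The canonical loop omits the KL regularization toward a reference policy discussed in Sections 1 and 5. A hedged variant that folds it in, assuming a frozen reference model `LM_ref` and a coefficient `beta` (both names introduced here for illustration, not part of the original listing), looks like:

```python
for x in prompts:
    a_stoch = LM.sample(x, greedy=False)
    a_greedy = LM.sample(x, greedy=True)
    adv = RM(x, a_stoch) - RM(x, a_greedy)
    logp = LM.log_prob(x, a_stoch)                       # per-token log-probs under pi_theta
    logp_ref = LM_ref.log_prob(x, a_stoch)               # per-token log-probs under the frozen pi_ref
    kl = (logp - logp_ref).sum(dim=-1)                   # sequence-level log-ratio estimate of the KL
    loss = (-(logp.sum(dim=-1) * adv) + beta * kl).mean()
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```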
In outcome-based RLVR (forecasting), ReMax is paired with a learnable baseline and proceeds as follows (a hedged end-to-end sketch appears after this list):
- For each question $x$, sample $K$ rollouts $a^{(1)}, \dots, a^{(K)}$ from $\pi_\theta(\cdot \mid x)$.
- Compute rewards $r(x, a^{(i)})$ and the current baseline $b_\phi(x)$.
- Compute advantages $A^{(i)} = r(x, a^{(i)}) - b_\phi(x)$.
- Update $\theta$ with a gradient step on the baseline-subtracted advantages and the KL penalty.
- Update $\phi$ (the baseline) with an MSE-loss gradient step (Turtel et al., 23 May 2025).
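A hedged sketch of one such update step, reusing the hypothetical `ScalarBaselineBuffer` from Section 2 and treating all argument names as illustrative, is:

```python
def rlvr_remax_step(policy_logps, ref_logps, rewards, qid, baseline, policy_opt, beta=0.01):
    """One hypothetical ReMax update with a learned baseline for a single question.
    policy_logps: (K,) summed log-probs of K rollouts (with grad); ref_logps: (K,) summed
    log-probs under the frozen reference policy; rewards: (K,) verifiable rewards;
    baseline: e.g. the ScalarBaselineBuffer sketched in Section 2."""
    b = baseline(qid).detach()                     # current baseline b_phi(x)
    adv = rewards - b                              # baseline-subtracted advantages A^(i)
    kl = policy_logps - ref_logps                  # per-rollout log-ratio estimate of the KL
    loss = -(adv * policy_logps).mean() + beta * kl.mean()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    baseline.update(qid, rewards)                  # MSE step toward the observed rewards
```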
4. Variance Reduction and Theoretical Properties
Subtracting a baseline does not change the expected value of the policy gradient but can reduce its variance substantially. If the baseline $b(x)$ is independent of the sampled actions, then

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[b(x)\, \nabla_\theta \log \pi_\theta(a \mid x)\big] = b(x)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid x) = 0,$$

so the estimator remains unbiased, but the variance is

$$\mathrm{Var}\big[(r - b)\,\nabla_\theta \log \pi_\theta\big] = \mathrm{Var}\big[r\,\nabla_\theta \log \pi_\theta\big] + b^2\, \mathbb{E}\big[\|\nabla_\theta \log \pi_\theta\|^2\big] - 2b\, \mathbb{E}\big[r\,\|\nabla_\theta \log \pi_\theta\|^2\big].$$
Minimizing this expression by tuning $b$ maximizes the variance reduction through the negative covariance term. In ReMax, the variance of the gradient estimator is bounded on the order of $T\, r_{\max}^2 / N$, where $N$ is the minibatch size, $T$ the sequence length, and $r_{\max}$ the reward bound (Li et al., 2023).
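A toy numerical check of these two claims (unbiasedness and variance reduction), using a one-step categorical policy with made-up rewards rather than anything from the paper, can be run as follows:

```python
import torch

# Toy check: a one-step, three-action categorical policy with fixed per-action rewards.
# All numbers are illustrative, not taken from the paper.
torch.manual_seed(0)
logits = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.2, 0.6])

def sampled_gradient(baseline):
    probs = torch.softmax(logits, dim=-1)
    a = torch.multinomial(probs, 1).item()                 # sample an action
    obj = (rewards[a] - baseline) * torch.log(probs[a])    # (r - b) * log pi(a)
    (g,) = torch.autograd.grad(obj, logits)                # single-sample gradient estimate
    return g

for b in (0.0, rewards.mean().item()):
    grads = torch.stack([sampled_gradient(b) for _ in range(5000)])
    print(f"b={b:.2f}  mean grad={grads.mean(0).tolist()}  "
          f"total variance={grads.var(0).sum().item():.5f}")
# Expected: the mean gradient agrees for both baselines (unbiasedness), while the summed
# per-coordinate variance is smaller with the mean-reward baseline.
```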
5. Hyper-parameterization and Computational Efficiency
ReMax eliminates the need for a value network and its associated hyper-parameters (value learning rate, GAE $\lambda$, value clipping, off-policy epochs) required by PPO, reducing the hyper-parameter search space. The only essential hyper-parameters are the following (a hypothetical configuration sketch appears after the list):
- Learning rate (e.g., 1e-6 for Llama-2-7B)
- KL penalty coefficient (e.g., 0.1 for one-step KL, 0.01 for full-step KL)
- Sampling temperature (typically 1.0)
- Top-p cutoff (e.g., 0.9)
- Batch size
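A hypothetical configuration capturing these values might look like the following; the batch size and generation length are assumed, as the source does not specify them.

```python
# Hypothetical ReMax configuration for a 7B policy. Field names are illustrative; the
# batch size and generation length are assumptions not specified in the source.
remax_config = {
    "learning_rate": 1e-6,     # e.g., Llama-2-7B scale
    "kl_coeff": 0.1,           # 0.1 for one-step KL, 0.01 for full-step KL
    "temperature": 1.0,        # sampling temperature
    "top_p": 0.9,              # nucleus sampling cutoff
    "batch_size": 64,          # assumed value
    "max_new_tokens": 512,     # assumed value
}
```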
Comparative resource use for a 7B model:
- PPO: 319 GB GPU memory, 2.9 h/epoch
- ReMax: 172 GB (≈46% GPU savings), 1.8 h/epoch (1.6× speed-up)
6. Empirical Results and Applications
LLM Alignment (RLHF)
On the full-hh-rlhf dataset (112k prompts), ReMax matches PPO’s validation reward within one epoch, with flat, stable gradient norms. On benchmarks:
| Model | AlpacaEval Win (%) | MT-Bench Score |
|---|---|---|
| SFT | 92.78 | 7.516 |
| +PPO | 94.07 | 7.671 |
| +ReMax | 94.78 | 7.739 |
Applying ReMax to Mistral-7B-Instruct over 20k prompts yields the 94.78% AlpacaEval win rate and 7.739 MT-Bench score reported above, marking a new open-source SOTA among 7B LLMs.
Outcome-based RLVR Forecasting
In event forecasting, ReMax with a learned baseline achieves better calibration and Brier scores than Modified-GRPO and DPO, matches the OpenAI o1 baseline when ensembled, and yields the highest hypothetical trading profit:
| Method | Brier (↓, single/ensemble) | ECE (↓, single/ensemble) | Trading Profit (\$) |
|---|---|---|---|
| ReMax | 0.197 / 0.193 | 0.0507 / 0.0424 | 127 |
| Modified-GRPO | 0.206 | 0.096 | 111 |
| DPO | 0.205 | 0.084 | – |
| o1 | 0.193 | 0.042 | 92 |
Guard-rails penalizing gibberish, non-English responses, and missing rationales are implemented directly in the reward shaping and therefore propagate into the baseline-subtracted advantage, stabilizing learning against reward hacking and extreme outputs (Turtel et al., 23 May 2025).
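One plausible way such guard-rails can be folded into the reward before baseline subtraction is sketched below; the specific heuristics and penalty magnitudes are assumptions for illustration only, not the implementation described in the source.

```python
import re

def guard_railed_reward(base_reward, response, require_rationale=True):
    """Hypothetical guard-railed reward shaping: penalize gibberish, non-English text, and
    missing rationales before the baseline is subtracted."""
    reward = base_reward
    words = response.split()
    ascii_ratio = sum(c.isascii() for c in response) / max(len(response), 1)
    if ascii_ratio < 0.9:                               # crude proxy for non-English output
        reward -= 1.0
    if words and len(set(words)) < 0.3 * len(words):    # heavy repetition treated as gibberish
        reward -= 1.0
    if require_rationale and not re.search(r"(?i)\b(because|therefore|reasoning)\b", response):
        reward -= 0.5                                   # missing-rationale penalty
    return reward
```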
7. Practical Implications and Extensions
ReMax with baseline-subtracted advantages demonstrates that careful exploitation of domain-specific reward structure in RLHF and RLVR, especially deterministic transitions and trajectory-level rewards, yields simpler, more stable, and cheaper RL-based training for LLM alignment and forecasting. The plug-and-play nature of the greedy or learned baseline, the elimination of value-network training, and the reduction of gradient variance are significant for both research and production LLM alignment.
Adaptations to outcome-based RLVR (e.g., forecasting) confirm that replacing per-question variance normalization (as in GRPO) with a jointly optimized learned baseline preserves calibration while reducing gradient noise. This facilitates stable one-pass online RL over large, temporally ordered datasets and demonstrates economic viability by converting calibration gains into hypothetical profit in prediction markets (Turtel et al., 23 May 2025).
A plausible implication is that scaling these techniques with robust baseline adaptation, guard-rails, and ensembling offers a general, computationally efficient pathway for high-quality LLM alignment and real-world decision-making tasks.