ReMax with Baseline-Subtracted Advantages
- The paper introduces ReMax with baseline-subtracted advantages, a novel RL approach that leverages explicit baseline subtraction (via greedy decoding or a learned function) to reduce gradient variance without a separate value network.
- It employs a gradient estimator that subtracts a baseline from trajectory-level rewards in settings with deterministic transitions, improving sample efficiency and simplifying implementation compared to PPO.
- Empirical results indicate that ReMax matches or exceeds PPO performance in LLM alignment and event forecasting while achieving notable reductions in GPU memory usage and training time.
ReMax with baseline-subtracted advantages is a reinforcement learning (RL) methodology designed for efficient LLM alignment and outcome-based RL with verifiable or trajectory-level rewards. It builds on the REINFORCE algorithm, incorporating variance reduction through action-independent baselining. Unlike Proximal Policy Optimization (PPO), ReMax exploits the deterministic transitions and trajectory-level rewards specific to RL from human feedback (RLHF) and does not require a separate value model for baseline computation. Instead, it employs explicit baseline subtraction, either via greedy decoding in LLM alignment or via a learned, per-input function in outcome forecasting. This approach simplifies implementation, enhances sample efficiency, and offers significant reductions in computational resource requirements while matching or exceeding the empirical performance of PPO.
1. Mathematical Foundation and Gradient Estimator
The central objective in RLHF and outcome-based RL is to maximize the expected reward over prompt–response trajectories or other domain-specific outputs. For an LLM policy $\pi_\theta$ with parameters $\theta$, the expected reward is

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[r(x, a)\big],$$

where $x$ is a prompt drawn from the dataset $\mathcal{D}$ and $a$ is a sampled response.
The policy gradient, via the log-derivative trick (REINFORCE), is

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid x)}\big[r(x, a)\, \nabla_\theta \log \pi_\theta(a \mid x)\big].$$
To reduce estimator variance, introduce a baseline $b(x)$ that does not depend on the sampled action and define the baseline-subtracted advantage

$$A(x, a) = r(x, a) - b(x),$$

which yields the unbiased gradient estimator

$$\widehat{\nabla_\theta J}(\theta) = \big(r(x, a) - b(x)\big)\, \nabla_\theta \log \pi_\theta(a \mid x) = \big(r(x, a) - b(x)\big) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid x, a_{1:t-1}).$$
In outcome-based RL with verifiable rewards (RLVR), as used in event forecasting, the baseline is typically parameterized as $b_\phi(x)$ and trained to minimize prediction error on observed rewards, with the policy gradient estimated as

$$\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \big(r(x, a^{(i)}) - b_\phi(x)\big)\, \nabla_\theta \log \pi_\theta(a^{(i)} \mid x) \;-\; \beta\, \nabla_\theta \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big),$$

where $K$ is the number of rollouts, $\beta$ is the KL penalty coefficient, and $\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}})$ is the divergence to a reference policy (Li et al., 2023, Turtel et al., 23 May 2025).
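This estimator can be written in a few lines of PyTorch. The sketch below is purely illustrative: the tensors `logp`, `rewards`, and `kl_estimate` are hypothetical stand-ins for quantities produced by the policy and reward model, and the KL term is held constant for brevity.

```python
import torch

# Illustrative quantities only; in practice they come from the policy and reward model.
K = 4                                           # hypothetical number of rollouts per prompt
logp = torch.randn(K, requires_grad=True)       # summed log pi_theta(a^(i) | x) per rollout
rewards = torch.tensor([0.7, 0.2, 0.9, 0.4])    # r(x, a^(i)) for each rollout
baseline = rewards.mean()                       # stand-in for b_phi(x) (or the greedy-rollout reward)
kl_estimate = torch.tensor(0.05)                # stand-in for KL(pi_theta || pi_ref); constant here
beta = 0.1                                      # KL penalty coefficient

advantages = rewards - baseline                 # baseline-subtracted advantages A^(i)
pg_loss = -(advantages * logp).mean()           # negated REINFORCE objective
loss = pg_loss + beta * kl_estimate             # KL-penalized loss (the KL is differentiable in practice)
loss.backward()                                 # gradient flows only through logp in this toy example
```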
2. Baseline Choices and Computation
In ReMax for LLM alignment, the baseline is the reward of a greedily decoded response:

$$b(x) = r(x, \bar{a}), \qquad \bar{a}_t = \arg\max_{a_t}\, \pi_\theta(a_t \mid x, \bar{a}_{1:t-1}),$$

i.e., the reward assigned to the response produced by greedy (argmax) decoding under the current policy.
Practically, each minibatch prompt yields two trajectories: one sampled stochastically, another by greedy decoding. Their rewards are used for advantage computation without a separately parameterized value network. This bypasses the need for value-function training inherent in PPO-based RLHF.
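A minimal sketch of this two-decode procedure, assuming a Hugging Face-style `model.generate` interface and a hypothetical `reward_model(prompt, response)` wrapper, is:

```python
def remax_advantage(model, reward_model, tokenizer, prompt, device="cpu"):
    """Sketch of ReMax's non-parametric baseline: one stochastic and one greedy decode per
    prompt, with the greedy rollout's reward used as b(x). `model.generate` follows the
    Hugging Face interface; `reward_model(prompt, response)` is a hypothetical wrapper
    around a scalar reward model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    a_stoch = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.9,
                             max_new_tokens=256)
    a_greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)
    r_stoch = reward_model(prompt, tokenizer.decode(a_stoch[0], skip_special_tokens=True))
    r_greedy = reward_model(prompt, tokenizer.decode(a_greedy[0], skip_special_tokens=True))
    return r_stoch - r_greedy, a_stoch      # advantage and the rollout it applies to
```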
In RLVR forecasting, the baseline $b_\phi(x)$ is either a small MLP or a per-input scalar buffer trained to minimize the mean-squared error

$$\mathcal{L}(\phi) = \mathbb{E}_{x}\left[\frac{1}{K}\sum_{i=1}^{K}\big(b_\phi(x) - r(x, a^{(i)})\big)^2\right].$$
A key distinction is that in LLM alignment the baseline is non-parametric (greedy rollout reward), whereas in forecasting with RLVR it is a parametric or tabular function learned online (Li et al., 2023, Turtel et al., 23 May 2025).
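To make the distinction concrete, the following sketch shows two hypothetical learned-baseline implementations for the RLVR setting; the class names and architecture details are illustrative and not specified in the source.

```python
import torch
import torch.nn as nn

class ScalarBaselineBuffer:
    """Hypothetical per-question scalar baseline b_phi(x): one learnable value per question
    id, nudged toward the mean observed reward by an exact MSE gradient step."""
    def __init__(self, num_questions, lr=0.1):
        self.values = torch.zeros(num_questions)
        self.lr = lr

    def __call__(self, qid):
        return self.values[qid]

    def update(self, qid, rewards):
        # d/db of 0.5 * mean_i (b - r_i)^2 is (b - mean_i r_i)
        self.values[qid] -= self.lr * (self.values[qid] - rewards.mean())


class MLPBaseline(nn.Module):
    """Hypothetical small MLP baseline b_phi(x) over a fixed-size question embedding,
    trained with an MSE loss against observed rewards."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x_embed):
        return self.net(x_embed).squeeze(-1)
```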
3. Pseudocode and Workflow
The ReMax implementation for RLHF is notably concise. The canonical ReMax loop is as follows:
```python
for x in prompts:                                        # 1. take one prompt
    a_stoch = LM.sample(x, greedy=False)                 # 2. sample a stochastic response
    a_greedy = LM.sample(x, greedy=True)                 # 3. greedy response used as the baseline
    adv = RM(x, a_stoch) - RM(x, a_greedy)               # 4. baseline-subtracted advantage
    logp = LM.log_prob(x, a_stoch)                       # 5. per-token log-probs of the stochastic rollout
    loss = -(logp.sum(dim=-1) * adv).mean()              # 6. policy-gradient loss
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```
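The canonical loop omits the KL regularization toward a reference policy discussed in Sections 1 and 5. A hedged variant that folds it in, assuming a frozen reference model `LM_ref` and a coefficient `beta` (both names introduced here for illustration, not part of the original listing), looks like:

```python
for x in prompts:
    a_stoch = LM.sample(x, greedy=False)
    a_greedy = LM.sample(x, greedy=True)
    adv = RM(x, a_stoch) - RM(x, a_greedy)
    logp = LM.log_prob(x, a_stoch)                       # per-token log-probs under pi_theta
    logp_ref = LM_ref.log_prob(x, a_stoch)               # per-token log-probs under the frozen pi_ref
    kl = (logp - logp_ref).sum(dim=-1)                   # sequence-level log-ratio estimate of the KL
    loss = (-(logp.sum(dim=-1) * adv) + beta * kl).mean()
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```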
In outcome-based RLVR (forecasting), ReMax is paired with a learnable baseline and proceeds as follows (a hedged end-to-end sketch appears after this list):
- For each question $x$, sample $K$ rollouts $a^{(1)}, \dots, a^{(K)}$ from $\pi_\theta(\cdot \mid x)$.
- Compute rewards $r(x, a^{(i)})$ and the current baseline $b_\phi(x)$.
- Compute advantages $A^{(i)} = r(x, a^{(i)}) - b_\phi(x)$.
- Update $\theta$ with a gradient step on the baseline-subtracted advantages and the KL penalty.
- Update $\phi$ (the baseline) with an MSE-loss gradient step (Turtel et al., 23 May 2025).
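A hedged sketch of one such update step, reusing the hypothetical `ScalarBaselineBuffer` from Section 2 and treating all argument names as illustrative, is:

```python
def rlvr_remax_step(policy_logps, ref_logps, rewards, qid, baseline, policy_opt, beta=0.01):
    """One hypothetical ReMax update with a learned baseline for a single question.
    policy_logps: (K,) summed log-probs of K rollouts (with grad); ref_logps: (K,) summed
    log-probs under the frozen reference policy; rewards: (K,) verifiable rewards;
    baseline: e.g. the ScalarBaselineBuffer sketched in Section 2."""
    b = baseline(qid).detach()                     # current baseline b_phi(x)
    adv = rewards - b                              # baseline-subtracted advantages A^(i)
    kl = policy_logps - ref_logps                  # per-rollout log-ratio estimate of the KL
    loss = -(adv * policy_logps).mean() + beta * kl.mean()
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
    baseline.update(qid, rewards)                  # MSE step toward the observed rewards
```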
4. Variance Reduction and Theoretical Properties
Subtracting a baseline does not change the expected value of the policy gradient but can reduce its variance substantially. If the baseline $b(x)$ is independent of the sampled actions, then

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid x)}\big[b(x)\, \nabla_\theta \log \pi_\theta(a \mid x)\big] = b(x)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid x) = 0,$$

so the estimator remains unbiased, but the variance is

$$\mathrm{Var}\big[(r - b)\,\nabla_\theta \log \pi_\theta\big] = \mathrm{Var}\big[r\,\nabla_\theta \log \pi_\theta\big] + b^2\, \mathbb{E}\big[\|\nabla_\theta \log \pi_\theta\|^2\big] - 2b\, \mathbb{E}\big[r\,\|\nabla_\theta \log \pi_\theta\|^2\big].$$
Minimizing this expression by tuning $b$ maximizes the variance reduction through the negative covariance term. In ReMax, the variance of the gradient estimator is bounded on the order of $T\, r_{\max}^2 / N$, where $N$ is the minibatch size, $T$ the sequence length, and $r_{\max}$ the reward bound (Li et al., 2023).
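A toy numerical check of these two claims (unbiasedness and variance reduction), using a one-step categorical policy with made-up rewards rather than anything from the paper, can be run as follows:

```python
import torch

# Toy check: a one-step, three-action categorical policy with fixed per-action rewards.
# All numbers are illustrative, not taken from the paper.
torch.manual_seed(0)
logits = torch.tensor([0.5, -0.2, 0.1], requires_grad=True)
rewards = torch.tensor([1.0, 0.2, 0.6])

def sampled_gradient(baseline):
    probs = torch.softmax(logits, dim=-1)
    a = torch.multinomial(probs, 1).item()                 # sample an action
    obj = (rewards[a] - baseline) * torch.log(probs[a])    # (r - b) * log pi(a)
    (g,) = torch.autograd.grad(obj, logits)                # single-sample gradient estimate
    return g

for b in (0.0, rewards.mean().item()):
    grads = torch.stack([sampled_gradient(b) for _ in range(5000)])
    print(f"b={b:.2f}  mean grad={grads.mean(0).tolist()}  "
          f"total variance={grads.var(0).sum().item():.5f}")
# Expected: the mean gradient agrees for both baselines (unbiasedness), while the summed
# per-coordinate variance is smaller with the mean-reward baseline.
```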
5. Hyper-parameterization and Computational Efficiency
ReMax eliminates the need for a value network and its associated hyper-parameters (value learning rate, GAE $\lambda$, value clipping, off-policy epochs) required by PPO, reducing the hyper-parameter search space. The only essential hyper-parameters are the following (a hypothetical configuration sketch appears after the list):
- Learning rate (e.g., 1e-6 for Llama-2-7B)
- KL penalty coefficient (e.g., 0.1 for one-step KL, 0.01 for full-step KL)
- Sampling temperature (typically 1.0)
- Top-p cutoff (e.g., 0.9)
- Batch size
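A hypothetical configuration capturing these values might look like the following; the batch size and generation length are assumed, as the source does not specify them.

```python
# Hypothetical ReMax configuration for a 7B policy. Field names are illustrative; the
# batch size and generation length are assumptions not specified in the source.
remax_config = {
    "learning_rate": 1e-6,     # e.g., Llama-2-7B scale
    "kl_coeff": 0.1,           # 0.1 for one-step KL, 0.01 for full-step KL
    "temperature": 1.0,        # sampling temperature
    "top_p": 0.9,              # nucleus sampling cutoff
    "batch_size": 64,          # assumed value
    "max_new_tokens": 512,     # assumed value
}
```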
Comparative resource use for a 7B model:
- PPO: 319 GB GPU memory, 2.9 h/epoch
- ReMax: 172 GB (≈46% GPU savings), 1.8 h/epoch (1.6× speed-up)
6. Empirical Results and Applications
LLM Alignment (RLHF)
On the full-hh-rlhf dataset (112k prompts), ReMax matches PPO’s validation reward within one epoch, with flat, stable gradient norms. On benchmarks:
| Model | AlpacaEval Win (%) | MT-Bench Score |
|---|---|---|
| SFT | 92.78 | 7.516 |
| +PPO | 94.07 | 7.671 |
| +ReMax | 94.78 | 7.739 |
Applying ReMax to Mistral-7B-Instruct over 20k prompts yields the 94.78% AlpacaEval win rate and 7.739 MT-Bench score reported above, marking a new open-source SOTA among 7B LLMs.
Outcome-based RLVR Forecasting
In event forecasting, ReMax with a learned baseline achieves better calibration and Brier scores than Modified-GRPO and DPO, matches the OpenAI o1 baseline when ensembled, and yields the highest hypothetical trading profit:
| Method | Brier (↓, single/ensemble) | ECE (↓, single/ensemble) | Trading Profit (\$) |
|---|---|---|---|
| ReMax | 0.197 / 0.193 | 0.0507 / 0.0424 | 127 |
| Modified-GRPO | 0.206 | 0.096 | 111 |
| DPO | 0.205 | 0.084 | – |
| o1 | 0.193 | 0.042 | 92 |
Guard-rails penalizing gibberish, non-English responses, and missing rationales are implemented directly in the reward shaping and therefore propagate into the baseline-subtracted advantage, stabilizing learning against reward hacking and extreme outputs (Turtel et al., 23 May 2025).
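One plausible way such guard-rails can be folded into the reward before baseline subtraction is sketched below; the specific heuristics and penalty magnitudes are assumptions for illustration only, not the implementation described in the source.

```python
import re

def guard_railed_reward(base_reward, response, require_rationale=True):
    """Hypothetical guard-railed reward shaping: penalize gibberish, non-English text, and
    missing rationales before the baseline is subtracted."""
    reward = base_reward
    words = response.split()
    ascii_ratio = sum(c.isascii() for c in response) / max(len(response), 1)
    if ascii_ratio < 0.9:                               # crude proxy for non-English output
        reward -= 1.0
    if words and len(set(words)) < 0.3 * len(words):    # heavy repetition treated as gibberish
        reward -= 1.0
    if require_rationale and not re.search(r"(?i)\b(because|therefore|reasoning)\b", response):
        reward -= 0.5                                   # missing-rationale penalty
    return reward
```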
7. Practical Implications and Extensions
ReMax with baseline-subtracted advantages demonstrates that careful exploitation of domain-specific reward structure in RLHF and RLVR, especially deterministic transitions and trajectory-level rewards, yields simpler, more stable, and cheaper RL-based training for LLM alignment and forecasting. The plug-and-play nature of the greedy or learned baseline, the elimination of value-network training, and the reduction of gradient variance are significant for both research and production LLM alignment.
Adaptations to outcome-based RLVR (e.g., forecasting) confirm that replacing per-question variance normalization (as in GRPO) with a jointly optimized learned baseline preserves calibration while reducing gradient noise. This facilitates stable one-pass online RL over large, temporally ordered datasets and demonstrates economic viability by converting calibration gains into hypothetical profit in prediction markets (Turtel et al., 23 May 2025).
A plausible implication is that scaling these techniques with robust baseline adaptation, guard-rails, and ensembling offers a general, computationally efficient pathway for high-quality LLM alignment and real-world decision-making tasks.