Beta Normalization Policy Optimization
- BNPO is an adaptive policy gradient method that leverages Beta distributions to normalize reward signals, reducing variance in RL tasks with binary or multi-component rewards.
- It generalizes methods like REINFORCE and GRPO by dynamically adjusting Beta parameters to align with shifting reward statistics and enhance gradient stability.
- Empirical results on large language model reasoning tasks show BNPO achieves superior performance and stability, making it effective for complex benchmarks.
Beta Normalization Policy Optimization (BNPO) is an adaptive policy gradient methodology for reinforcement learning that addresses variance reduction and stability when optimizing policies with binary-valued or multi-component reward signals. BNPO adaptively normalizes advantage estimates using Beta distributions parameterized to align with the evolving policy-induced reward statistics, thereby generalizing and subsuming widely used normalization techniques such as REINFORCE with baseline and Group Relative Policy Optimization (GRPO). BNPO is particularly designed for LLM reasoning tasks with rule-based, binary rewards, as exemplified by applications in mathematical problem solving and logic reasoning benchmarks (Xiao et al., 3 Jun 2025).
1. Theoretical Foundations and Motivation
Policy gradient algorithms aim to maximize expected reward by adjusting policy parameters according to the gradient estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid q)}\big[\, R(q,o)\, \nabla_\theta \log \pi_\theta(o \mid q) \,\big].$$

Directly using the raw reward as the learning signal produces extremely high variance, impeding sample efficiency and stability. REINFORCE with baseline subtracts the empirical average reward over the group of $G$ outputs sampled for a query to reduce variance:

$$A(q, o_i) = R(q, o_i) - \frac{1}{G}\sum_{j=1}^{G} R(q, o_j).$$

GRPO further divides this residual by the empirical standard deviation to standardize per-group fluctuations:

$$A(q, o_i) = \frac{R(q, o_i) - \operatorname{mean}\{R(q, o_j)\}_{j=1}^{G}}{\operatorname{std}\{R(q, o_j)\}_{j=1}^{G}}.$$
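For concreteness, the following NumPy sketch (illustrative, not taken from the paper; the group size and reward values are made up) computes both normalizations for a single query's group of binary rewards.

```python
import numpy as np

# Binary rewards for G = 8 sampled outputs of a single query (illustrative values).
rewards = np.array([1., 0., 0., 1., 1., 0., 0., 0.])

# REINFORCE with baseline: subtract the empirical mean reward of the group.
baseline = rewards.mean()
adv_reinforce = rewards - baseline

# GRPO-style standardization: additionally divide by the empirical std
# (a small epsilon guards against zero variance when all rewards agree).
eps = 1e-8
adv_grpo = (rewards - baseline) / (rewards.std() + eps)

print("REINFORCE advantages:", adv_reinforce)
print("GRPO advantages:     ", adv_grpo)
```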
However, both REINFORCE and GRPO employ static or batch-dependent normalization that does not account for the shifting distribution of rewards as policy learning progresses. This discrepancy is pronounced in tasks where the policy’s binary reward distribution changes substantially over training, as observed in large-scale reasoning tasks (e.g., DeepSeek-R1, Kimi-k1.5). BNPO formulates adaptive normalization by endowing the reward expectation with a Beta prior, dynamically estimating Beta parameters each policy update, and performing advantage normalization accordingly (Xiao et al., 3 Jun 2025).
2. Mathematical Formulation
2.1 Reward Distribution and Modeling
Let $q$ denote a sampled query and $o \sim \pi_\theta(\cdot \mid q)$ a model-generated output, with binary reward
$$R(q,o) = \begin{cases} 1 & \text{if } o \text{ is a correct response to } q, \\ 0 & \text{otherwise.} \end{cases}$$
The expected reward $p(q) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R(q,o)]$ acts as the success probability of a Bernoulli random variable, and across queries $q$, $p(q)$ is modeled as $\mathrm{Beta}(\alpha, \beta)$-distributed.
2.2 Beta Normalization and BNPO Advantage
BNPO designs a normalization density

$$f(p; \alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)},$$

and constructs the BNPO advantage:

$$A_{\mathrm{BNPO}}(q, o) = f\big(p(q); \alpha, \beta\big)\,\big(R(q, o) - p(q)\big).$$

The policy gradient estimator is thus:

$$\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{B}|\,G} \sum_{q \in \mathcal{B}} \sum_{i=1}^{G} A_{\mathrm{BNPO}}(q, o_i)\, \nabla_\theta \log \pi_\theta(o_i \mid q).$$
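A minimal sketch of this advantage computation, assuming the reconstruction above (centered reward multiplied by the Beta density at the estimated success rate); the helper names, the example parameter values, and the clipping threshold are illustrative choices, not the paper's.

```python
import numpy as np
from math import lgamma

def beta_log_density(p, alpha, beta):
    """Log of the Beta(alpha, beta) pdf evaluated at p in (0, 1)."""
    log_B = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p) - log_B

def bnpo_advantage(rewards, alpha, beta, p_clip=1e-3):
    """BNPO advantage for one query: f(p; alpha, beta) * (R - p),
    with p estimated as the empirical success rate of the group.
    p_clip is an assumed guard against the density blowing up at {0, 1}."""
    rewards = np.asarray(rewards, dtype=float)
    p = np.clip(rewards.mean(), p_clip, 1 - p_clip)
    f = np.exp(beta_log_density(p, alpha, beta))
    return f * (rewards - p)

# Example: 8 binary rewards; (alpha, beta) assumed fitted elsewhere.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
print(bnpo_advantage(rewards, alpha=0.7, beta=0.9))
```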
2.3 Adaptive Parameter Selection
Empirically, across queries $q$ sampled in a batch, estimate $p(q)$ from the $G$ outputs per query, compute the first two moments of these estimates, and fit Beta parameters via the method of moments. BNPO then chooses $(\alpha, \beta)$ adaptively to minimize the variance of the BNPO gradient estimator. The optimality theorem states:
- the variance of the BNPO gradient estimator is finite only for $(\alpha, \beta)$ within an admissible range,
- within that range the variance attains a unique minimum at a pair $(\alpha^{*}, \beta^{*})$, which BNPO adopts at each update.
This ensures the normalization density at each update is optimally matched to the current policy-induced reward landscape (Xiao et al., 3 Jun 2025).
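A sketch of the moment-matching step using the standard method-of-moments formulas for the Beta distribution; `fit_beta_moments`, the epsilon guards, and the example batch are illustrative, and the subsequent selection of the variance-minimizing $(\alpha^{*}, \beta^{*})$ follows the paper's theorem rather than this snippet.

```python
import numpy as np

def fit_beta_moments(p_hat, eps=1e-6):
    """Fit Beta(alpha, beta) to per-query success-rate estimates p_hat by
    method of moments: alpha = m*c, beta = (1-m)*c, c = m(1-m)/v - 1,
    where m is the sample mean and v the sample variance."""
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    m, v = p_hat.mean(), p_hat.var()
    v = max(v, eps)                    # guard against a degenerate batch
    c = max(m * (1 - m) / v - 1, eps)  # keep both parameters positive
    return m * c, (1 - m) * c

# Per-query success rates, each estimated from G rollouts (illustrative batch).
p_hat = [0.12, 0.50, 0.88, 0.31, 0.06, 0.75]
alpha_hat, beta_hat = fit_beta_moments(p_hat)
print(alpha_hat, beta_hat)
```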
3. Generalization of REINFORCE and GRPO
BNPO encompasses established techniques as special cases:
- $\alpha = \beta = 1$ yields $f(p; 1, 1) = 1$ and $A_{\mathrm{BNPO}} = R(q,o) - p(q)$, exactly REINFORCE with baseline.
- $\alpha = \beta = \tfrac{1}{2}$ gives $f(p; \tfrac{1}{2}, \tfrac{1}{2}) = \frac{1}{\pi\sqrt{p(1-p)}}$, and thus $A_{\mathrm{BNPO}} = \frac{R(q,o) - p(q)}{\pi\sqrt{p(q)(1-p(q))}}$, recovering GRPO up to a constant scaling.
By optimizing $(\alpha, \beta)$ online, BNPO generalizes these fixed normalization approaches, thereby achieving adaptive variance reduction throughout policy optimization.
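To make the correspondence concrete, here is a small numeric check under the reconstruction above (reward values illustrative): $\alpha = \beta = 1$ reproduces the REINFORCE-with-baseline advantage exactly, and $\alpha = \beta = \tfrac{1}{2}$ reproduces the GRPO advantage up to the constant $1/\pi$, using the fact that for binary rewards the group standard deviation equals $\sqrt{p(1-p)}$.

```python
import numpy as np
from math import lgamma

def beta_pdf(p, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(p) + (b - 1) * np.log(1 - p) - log_B)

rewards = np.array([1., 0., 0., 1., 1., 0., 0., 0.])
p = rewards.mean()

# alpha = beta = 1: Beta(1,1) pdf is identically 1 -> REINFORCE with baseline.
adv_11 = beta_pdf(p, 1, 1) * (rewards - p)
assert np.allclose(adv_11, rewards - p)

# alpha = beta = 1/2: pdf is 1 / (pi * sqrt(p(1-p))) -> GRPO up to a 1/pi factor,
# since for binary rewards rewards.std() == sqrt(p(1-p)).
adv_half = beta_pdf(p, 0.5, 0.5) * (rewards - p)
adv_grpo = (rewards - p) / rewards.std()
assert np.allclose(adv_half, adv_grpo / np.pi)
print("special cases verified")
```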
4. Extension to Multi-Component Rewards: Advantage Decomposition
For tasks with multi-component binary rewards $R_1(q,o), \dots, R_K(q,o) \in \{0,1\}$, BNPO decomposes the advantage component-wise:

$$A_{\mathrm{BNPO}}(q,o) = \sum_{k=1}^{K} f\big(p_k(q); \alpha_k, \beta_k\big)\,\big(R_k(q,o) - p_k(q)\big), \qquad p_k(q) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)}[R_k(q,o)].$$

Each reward stream is treated independently with its own moment-estimated Beta parameters and adaptive $(\alpha_k, \beta_k)$. This decomposition enables component-wise variance adaptation and normalization, improving efficiency when reward signals saturate or vary asynchronously among components.
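A sketch of this decomposition under the formulation above; the function name, the per-component parameters, and the example rewards are illustrative, and the snippet simply sums the per-component normalized advantages without any component weighting.

```python
import numpy as np
from math import lgamma

def beta_pdf(p, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(p) + (b - 1) * np.log(1 - p) - log_B)

def decomposed_advantage(component_rewards, params, p_clip=1e-3):
    """Sum of per-component BNPO advantages.
    component_rewards: array of shape (K, G) with binary rewards for K
    components over G outputs of one query; params: list of K (alpha, beta)."""
    component_rewards = np.asarray(component_rewards, dtype=float)
    total = np.zeros(component_rewards.shape[1])
    for r_k, (a_k, b_k) in zip(component_rewards, params):
        p_k = np.clip(r_k.mean(), p_clip, 1 - p_clip)
        total += beta_pdf(p_k, a_k, b_k) * (r_k - p_k)
    return total

# Two components (e.g., answer correctness and format compliance), G = 4 outputs.
rewards = [[1, 0, 1, 0],   # component 1
           [1, 1, 1, 0]]   # component 2
print(decomposed_advantage(rewards, params=[(0.8, 0.8), (1.2, 0.6)]))
```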
5. Algorithmic Procedure and Implementation
The main loop for BNPO is as follows:
- For each learning iteration, sample a batch of queries $\{q_i\}_{i=1}^{N}$.
- For each $q_i$, sample $G$ model outputs $o_{i,1}, \dots, o_{i,G} \sim \pi_\theta(\cdot \mid q_i)$ and observe rewards $R(q_i, o_{i,j})$.
- Estimate $p(q_i) = \frac{1}{G}\sum_{j=1}^{G} R(q_i, o_{i,j})$.
- Aggregate the mean and variance of $\{p(q_i)\}$ across the batch and fit Beta parameters $(\hat{\alpha}, \hat{\beta})$ by the method of moments.
- Compute the normalization densities $f(p(q_i); \alpha, \beta)$ with the adaptively selected parameters.
- Compute BNPO advantages $A_{\mathrm{BNPO}}(q_i, o_{i,j}) = f(p(q_i); \alpha, \beta)\,\big(R(q_i, o_{i,j}) - p(q_i)\big)$.
- Use $A_{\mathrm{BNPO}}$ within PPO's clipped surrogate loss for policy updates.
- Optionally, utilize multi-component reward decomposition as described above.
Computational overhead beyond PPO is minimal; the per-batch Beta moment calculations are linear in the number of sampled outputs. Numerical stability is maintained by operating in log-density space and clipping $p(q)$ away from the boundaries $\{0, 1\}$ as required (Xiao et al., 3 Jun 2025).
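The following runnable toy sketch walks through this loop end to end on a stand-in problem (a contextual-bandit task with binary rewards) rather than an LLM, and applies a plain policy-gradient step instead of PPO's clipped surrogate; all constants, names, and the environment are illustrative assumptions.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

# Toy stand-in for the LLM setting: each "query" is one of Q contexts, each
# "output" one of A discrete actions; reward is 1 iff the sampled action
# matches the context's target.  The policy is a per-context softmax.
Q, A, G, BATCH, STEPS, LR = 8, 5, 16, 32, 200, 0.5
targets = rng.integers(A, size=Q)
logits = np.zeros((Q, A))

def beta_pdf(p, a, b):
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(p) + (b - 1) * np.log(1 - p) - log_B)

def fit_beta_moments(p_hat, eps=1e-6):
    p_hat = np.clip(p_hat, eps, 1 - eps)
    m, v = p_hat.mean(), max(p_hat.var(), eps)
    c = max(m * (1 - m) / v - 1, eps)
    return m * c, (1 - m) * c

for step in range(STEPS):
    grad = np.zeros_like(logits)
    queries = rng.integers(Q, size=BATCH)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    # Roll out G outputs per query and record binary rewards.
    actions = np.stack([rng.choice(A, size=G, p=probs[q]) for q in queries])
    rewards = (actions == targets[queries][:, None]).astype(float)

    # Estimate p(q) per query and fit Beta parameters to the batch.
    p_hat = np.clip(rewards.mean(axis=1), 1e-3, 1 - 1e-3)
    alpha, beta = fit_beta_moments(p_hat)

    # BNPO advantages, then a plain (unclipped) policy-gradient ascent step;
    # the paper plugs these advantages into PPO's clipped loss instead.
    adv = beta_pdf(p_hat, alpha, beta)[:, None] * (rewards - p_hat[:, None])
    for i, q in enumerate(queries):
        for j in range(G):
            onehot = np.eye(A)[actions[i, j]]
            grad[q] += adv[i, j] * (onehot - probs[q])  # grad of log-softmax
    logits += LR * grad / (BATCH * G)

print("final mean success rate:", rewards.mean())
```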
6. Experimental Evaluation and Empirical Performance
BNPO has been evaluated on mathematical reasoning tasks using Qwen2.5-Math-1.5B and Qwen2.5-Math-7B models. Training utilized MATH (7,500 problems) with batch size $32$, $16$ outputs per question, single PPO epoch per step, and $5$ total training epochs.
On standardized benchmarks (pass@1):
- For Qwen2.5-Math-1.5B, BNPO achieves an average score of $39.4$, outperforming REINFORCE, ReMax, GRPO, and REINFORCE++.
- For Qwen2.5-Math-7B, the average rises to $47.8$, with an especially strong improvement on AMC23 (+4.3 points over GRPO).
Gradient stability is assessed via the gradient norm over training steps: BNPO consistently yields the smallest fluctuations, while other methods show greater variance or slower convergence. In an ablation on multi-component rewards (e.g., Qwen2.5-1.5B-Instruct), advantage decomposition grants further modest improvements, and BNPO-based methods consistently attain the best results (Xiao et al., 3 Jun 2025).
| Model/Method | MATH500 | AMC23 | AIME2024 | AIME2025 | Avg |
|---|---|---|---|---|---|
| Qwen2.5-1.5B Base | 28.0 | 27.3 | 6.0 | 3.1 | 16.1 |
| REINFORCE | 74.8 | 51.6 | 18.3 | 11.3 | 39.0 |
| GRPO | 75.2 | 52.3 | 19.0 | 9.4 | 39.0 |
| BNPO | 73.4 | 54.5 | 18.3 | 11.5 | 39.4 |
| Qwen2.5-7B Base | 41.4 | 32.5 | 11.0 | 5.0 | 22.5 |
| REINFORCE | 78.4 | 61.7 | 34.2 | 14.6 | 47.2 |
| GRPO | 78.6 | 64.5 | 32.3 | 12.9 | 47.1 |
| BNPO | 77.0 | 68.8 | 32.1 | 13.3 | 47.8 |
7. Practical Considerations and Technical Guidance
- Use a sufficiently large query batch and number of outputs per query (e.g., $32$ queries with $16$ outputs each, as in the reference setup) to stabilize estimation of $p(q)$ and the Beta moments.
- Recompute Beta moments and normalization parameters every step.
- Operate in log-density space for $f(p; \alpha, \beta)$ to maintain numerical stability.
- Clip $p(q)$ away from the boundaries $\{0, 1\}$, where the normalization density, and hence the estimator variance, blows up (see the sketch after this list).
- Gradient clipping is recommended as in PPO to mitigate potential heavy-tailed advantages.
- Temperature scheduling (e.g., $1.0$ for training, $0.6$ for evaluation) is effective for balancing exploration and evaluation.
- Monitor the $(\alpha, \beta)$ trajectories for smooth adaptation, indicating robust normalization dynamics.
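A small sketch covering the two numerical-stability bullets above; the helper name and the clipping threshold $10^{-3}$ are assumptions, not values from the paper.

```python
import numpy as np
from math import lgamma

def bnpo_normalizer(p, alpha, beta, p_clip=1e-3):
    """Evaluate the Beta(alpha, beta) density in log space, with p clipped
    away from {0, 1} so the normalizer (and hence the variance) stays finite."""
    p = np.clip(np.asarray(p, dtype=float), p_clip, 1 - p_clip)
    log_B = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    log_f = (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p) - log_B
    return np.exp(log_f)

# Near-boundary success rates no longer produce infinities or NaNs.
print(bnpo_normalizer([0.0, 1e-6, 0.5, 1.0], alpha=0.4, beta=0.4))
```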
BNPO provides a theoretically grounded, low-overhead, and deployable framework for adaptive reward normalization in policy gradient RL with nonstationary binary or multi-component signals. Its generalization of static normalization techniques and empirical superiority on reasoning tasks align it closely with the needs of modern large-model RL optimization (Xiao et al., 3 Jun 2025).