Beta Normalization Policy Optimization

Updated 16 December 2025
  • BNPO is an adaptive policy gradient method that leverages Beta distributions to normalize reward signals, reducing variance in RL tasks with binary or multi-component rewards.
  • It generalizes methods like REINFORCE and GRPO by dynamically adjusting Beta parameters to align with shifting reward statistics and enhance gradient stability.
  • Empirical results on large language model reasoning tasks show BNPO achieves superior performance and stability, making it effective for complex benchmarks.

Beta Normalization Policy Optimization (BNPO) is an adaptive policy gradient methodology for reinforcement learning that addresses variance reduction and stability when optimizing policies with binary-valued or multi-component reward signals. BNPO adaptively normalizes advantage estimates using Beta distributions parameterized to align with the evolving policy-induced reward statistics, thereby generalizing and subsuming widely used normalization techniques such as REINFORCE with baseline and Group Relative Policy Optimization (GRPO). BNPO is particularly designed for LLM reasoning tasks with rule-based, binary rewards, as exemplified by applications in mathematical problem solving and logic reasoning benchmarks (Xiao et al., 3 Jun 2025).

1. Theoretical Foundations and Motivation

Policy gradient algorithms aim to maximize expected cumulative reward by adjusting policy parameters $\theta$ according to the gradient estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(o \mid q)\, R(q,o)\right]$$

Directly using the reward $R$ can produce extremely high variance, impeding sample efficiency and stability. REINFORCE with baseline subtracts the empirical average reward $\mu$ to reduce variance:

$$A(q,o) = R(q,o) - \mu$$

GRPO further divides this residual by the standard deviation $\sigma$ to standardize per-batch fluctuations:

$$A(q,o) = \frac{R(q,o) - \mu}{\sigma}$$

However, both REINFORCE and GRPO employ static or batch-dependent normalization that does not account for the shifting distribution of rewards as policy learning progresses. This discrepancy is pronounced in tasks where the policy’s binary reward distribution changes substantially over training, as observed in large-scale reasoning tasks (e.g., DeepSeek-R1, Kimi-k1.5). BNPO formulates adaptive normalization by endowing the reward expectation with a Beta prior, dynamically estimating Beta parameters each policy update, and performing advantage normalization accordingly (Xiao et al., 3 Jun 2025).
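
For concreteness, the two fixed normalization schemes above can be sketched in a few lines of NumPy (an illustrative sketch; the array names and per-group scope are assumptions, not taken from the paper):

```python
import numpy as np

def reinforce_baseline_advantage(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE with baseline: subtract the empirical mean reward."""
    return rewards - rewards.mean()

def grpo_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize by the per-group mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Binary rewards for a group of sampled outputs to one query.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(reinforce_baseline_advantage(rewards))  # residuals around the mean
print(grpo_advantage(rewards))                # residuals scaled by the std
```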

2. Mathematical Formulation

2.1 Reward Distribution and Modeling

Let $q \sim \rho(q)$ denote sampled queries and $o \sim \pi_\theta(\cdot \mid q)$ model-generated outputs, with binary reward

$$R(q,o) = \begin{cases} 1 & \text{if } o \text{ is correct for } q \\ 0 & \text{otherwise} \end{cases}$$

The expected reward $p(q) = \mathbb{E}_{o \sim \pi_\theta}[R(q,o) \mid q]$ acts as the success probability of a Bernoulli random variable, and across queries $q$, $p(q)$ is modeled as $\operatorname{Beta}(a, b)$.

2.2 Beta Normalization and BNPO Advantage

BNPO designs a normalization density

$$f_N(p; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$$

and constructs the BNPO advantage:

$$A_{\alpha, \beta}(q,o) = \frac{R(q,o) - p(q)}{f_N(p(q); \alpha, \beta)}$$

The policy gradient estimator is thus:

$$\nabla_\theta J(\theta) = \mathbb{E}_{q, o}\left[\nabla_\theta \log \pi_\theta(o \mid q)\, A_{\alpha, \beta}(q,o)\right]$$
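
A direct transcription of the normalization density and the BNPO advantage might look as follows (a minimal sketch; the function names, the use of `scipy.special.betaln`, and the clipping of $p$ away from $\{0,1\}$ are assumptions added for numerical robustness, not prescribed by the paper):

```python
import numpy as np
from scipy.special import betaln  # log of the Beta function B(alpha, beta)

def beta_norm_density(p: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Normalization density f_N(p; alpha, beta), evaluated in log space for stability."""
    log_f = (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p) - betaln(alpha, beta)
    return np.exp(log_f)

def bnpo_advantage(reward: np.ndarray, p: np.ndarray, alpha: float, beta: float,
                   eps: float = 1e-6) -> np.ndarray:
    """BNPO advantage A_{alpha,beta}(q,o) = (R(q,o) - p(q)) / f_N(p(q); alpha, beta)."""
    p = np.clip(p, eps, 1.0 - eps)  # keep p(q) away from {0, 1} where the density degenerates
    return (reward - p) / beta_norm_density(p, alpha, beta)
```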

2.3 Adaptive Parameter Selection

Empirically, across sampled $q$ in a batch, estimate $p(q)$, compute its moments, and fit Beta parameters $(a, b)$ via the method of moments. BNPO then chooses $(\alpha, \beta)$ adaptively to minimize the variance of the BNPO gradient. The optimality theorem states:

  • $\operatorname{Var}[g_{\alpha,\beta}]$ is finite iff $\alpha < (a+3)/2$ and $\beta < (b+3)/2$
  • The unique variance minimum is attained at $\alpha^* = 1 + a/3$, $\beta^* = 1 + b/3$

This ensures the normalization density at each update is optimally matched to the current policy-induced reward landscape (Xiao et al., 3 Jun 2025).
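
The parameter-selection step can be sketched as follows; the moment-matching expressions are the standard Beta method-of-moments estimators, while the variance floor and positivity guard are assumptions added for robustness:

```python
import numpy as np

def fit_beta_moments(p_hat: np.ndarray, eps: float = 1e-6) -> tuple[float, float]:
    """Method-of-moments fit of Beta(a, b) to per-query success-rate estimates p_hat."""
    m, v = float(p_hat.mean()), float(p_hat.var())
    v = max(v, eps)                          # guard against a degenerate (zero-variance) batch
    common = max(m * (1 - m) / v - 1, eps)   # keep a and b strictly positive
    return m * common, (1 - m) * common

def optimal_alpha_beta(a: float, b: float) -> tuple[float, float]:
    """Variance-minimizing normalization parameters: alpha* = 1 + a/3, beta* = 1 + b/3."""
    return 1 + a / 3, 1 + b / 3
```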

3. Generalization of REINFORCE and GRPO

BNPO encompasses established techniques as special cases:

  • $\alpha = \beta = 1$ yields $f_N = 1$ and $A_{1,1}(q,o) = R(q,o) - p(q)$, exactly REINFORCE with baseline.
  • $\alpha = \beta = 3/2$ gives $f_N \propto \sqrt{p(1-p)}$, and thus $A_{3/2,3/2} \propto (R-p)/\sqrt{p(1-p)}$, recovering GRPO up to scaling.

By optimizing (α,β)(\alpha,\beta) online, BNPO generalizes these fixed normalization approaches, thereby achieving adaptive variance reduction throughout policy optimization.
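
A quick numerical check of these two special cases, using `scipy.stats.beta` as a stand-in for $f_N$ (the constant $B(3/2, 3/2) = \pi/8$ is the only fact used beyond the definitions above):

```python
import numpy as np
from scipy.stats import beta as beta_dist

p = np.linspace(0.05, 0.95, 19)

# alpha = beta = 1: the uniform density, f_N = 1, so A_{1,1} = R - p (REINFORCE with baseline).
assert np.allclose(beta_dist.pdf(p, 1, 1), 1.0)

# alpha = beta = 3/2: f_N(p) = sqrt(p(1-p)) / B(3/2, 3/2) with B(3/2, 3/2) = pi/8, so
# A_{3/2,3/2} equals (R - p) / sqrt(p(1-p)) up to the constant pi/8, i.e. the GRPO form.
assert np.allclose(beta_dist.pdf(p, 1.5, 1.5), np.sqrt(p * (1 - p)) * 8 / np.pi)
```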

4. Extension to Multi-Component Rewards: Advantage Decomposition

For tasks with multi-component binary rewards $\{R^{(i)}(q,o)\}_{i=1}^K$:

$$A(q,o) = \frac{1}{K}\sum_{i=1}^K \frac{R^{(i)}(q,o) - p^{(i)}(q)}{f_N\!\left(p^{(i)}(q);\, \alpha^{(i)}, \beta^{(i)}\right)}$$

Each reward stream is treated independently with its own moment-estimated Beta parameters $(a^{(i)}, b^{(i)})$ and adaptive $(\alpha^{(i)}, \beta^{(i)})$. This decomposition enables component-wise variance adaptation and normalization, improving efficiency when reward signals saturate or vary asynchronously among components.
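
Building on the single-component sketch above, the decomposition can be written as follows (a sketch that reuses the hypothetical helpers `fit_beta_moments`, `optimal_alpha_beta`, and `bnpo_advantage` from earlier; the array layout is an assumption):

```python
import numpy as np

def decomposed_bnpo_advantage(rewards: np.ndarray, p_hat: np.ndarray) -> np.ndarray:
    """Average of per-component BNPO advantages.

    rewards: (K, N) binary rewards, one row per reward component, one column per (q, o) sample.
    p_hat:   (K, N) success-rate estimates p^{(i)}(q) aligned with each sample.
    """
    advantages = np.zeros_like(rewards, dtype=float)
    for i in range(rewards.shape[0]):
        a_i, b_i = fit_beta_moments(p_hat[i])            # per-component Beta(a, b) fit
        alpha_i, beta_i = optimal_alpha_beta(a_i, b_i)   # per-component (alpha*, beta*)
        advantages[i] = bnpo_advantage(rewards[i], p_hat[i], alpha_i, beta_i)
    return advantages.mean(axis=0)                       # 1/K sum over components
```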

5. Algorithmic Procedure and Implementation

The main loop for BNPO is as follows:

  1. For each learning iteration, sample a batch of queries $\{q_j\}$.
  2. For each $q_j$, sample model outputs $\{o_{j,k}\}$ and observe rewards $R(q_j, o_{j,k})$.
  3. Estimate $p(q_j) \approx \frac{1}{m}\sum_{k} R(q_j, o_{j,k})$.
  4. Aggregate means and variances of $p(q)$ and fit Beta parameters $(a, b)$.
  5. Compute $(\alpha, \beta) = (1 + a/3,\ 1 + b/3)$.
  6. Compute BNPO advantages $A(q_j, o_{j,k})$.
  7. Use $A(q_j, o_{j,k})$ within PPO's clipped surrogate loss for policy updates.
  8. Optionally, utilize multi-component reward decomposition as described above.

Computational overhead beyond PPO is minimal; Beta moment calculations per batch are $O(n+m)$. Numerical stability is maintained by operating in log-density space and clipping $(\alpha, \beta)$ away from the boundaries of the finite-variance region, e.g. restricting $\alpha$ to $(10^{-3},\ (a+3)/2 - \epsilon)$ and $\beta$ analogously, as required (Xiao et al., 3 Jun 2025).
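
Putting the steps together, one BNPO iteration can be sketched as below; the `policy.sample`, `policy.reward`, and `policy.ppo_update` interfaces are placeholders invented for illustration, and the sketch reuses the hypothetical helpers defined above:

```python
import numpy as np

def bnpo_step(policy, queries, m: int = 16) -> None:
    """One BNPO update (steps 1-7), sketched against a hypothetical policy interface."""
    groups, p_hat = [], []
    for q in queries:                                        # step 1: batch of queries
        outputs = [policy.sample(q) for _ in range(m)]       # step 2: m outputs per query
        rewards = np.array([policy.reward(q, o) for o in outputs])
        groups.append((q, outputs, rewards))
        p_hat.append(rewards.mean())                         # step 3: estimate p(q_j)
    p_hat = np.array(p_hat)

    a, b = fit_beta_moments(p_hat)                           # step 4: fit Beta(a, b)
    alpha, beta = optimal_alpha_beta(a, b)                   # step 5: (alpha*, beta*)

    for (q, outputs, rewards), p in zip(groups, p_hat):
        adv = bnpo_advantage(rewards, np.full_like(rewards, p), alpha, beta)  # step 6
        policy.ppo_update(q, outputs, adv)                   # step 7: clipped surrogate loss
```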

6. Experimental Evaluation and Empirical Performance

BNPO has been evaluated on mathematical reasoning tasks using Qwen2.5-Math-1.5B and Qwen2.5-Math-7B models. Training utilized MATH (7,500 problems) with batch size 32, 16 outputs per question, a single PPO epoch per step, and 5 training epochs in total.

On standardized benchmarks (pass@1):

  • For Qwen2.5-Math-1.5B, BNPO achieves an average score of 39.4%, outperforming REINFORCE, ReMax, GRPO, and REINFORCE++.
  • For Qwen2.5-Math-7B, the average rises to 47.8%, with especially strong improvements on AMC23 (+4.3% over GRPO).

Gradient stability is assessed via the norm $\|\nabla_\theta J\|$ over training steps: BNPO consistently yields the smallest fluctuations, while other methods show greater variance or slower convergence. In ablations on multi-component rewards (e.g., Qwen2.5-1.5B-Instruct), advantage decomposition grants further modest improvements, and BNPO-based methods consistently attain the best results (Xiao et al., 3 Jun 2025).

| Model / Method | MATH500 | AMC23 | AIME2024 | AIME2025 | Avg |
|---|---|---|---|---|---|
| Qwen2.5-1.5B Base | 28.0 | 27.3 | 6.0 | 3.1 | 16.1 |
| REINFORCE | 74.8 | 51.6 | 18.3 | 11.3 | 39.0 |
| GRPO | 75.2 | 52.3 | 19.0 | 9.4 | 39.0 |
| BNPO | 73.4 | 54.5 | 18.3 | 11.5 | 39.4 |
| Qwen2.5-7B Base | 41.4 | 32.5 | 11.0 | 5.0 | 22.5 |
| REINFORCE | 78.4 | 61.7 | 34.2 | 14.6 | 47.2 |
| GRPO | 78.6 | 64.5 | 32.3 | 12.9 | 47.1 |
| BNPO | 77.0 | 68.8 | 32.1 | 13.3 | 47.8 |

7. Practical Considerations and Technical Guidance

  • Use batch size $\geq 32$ and $m \geq 16$ outputs per query to stabilize the $p(q)$ estimates.
  • Recompute Beta moments and normalization parameters at every step.
  • Operate in log-density space for $f_N$ to maintain numerical stability.
  • Clip $(\alpha, \beta)$ away from the boundaries where the gradient variance blows up (see the sketch after this list).
  • Gradient clipping is recommended, as in PPO, to mitigate potentially heavy-tailed advantages.
  • Temperature scheduling (e.g., 1.0 for training, 0.6 for evaluation) is effective for balancing exploration and evaluation.
  • Monitor the $(\alpha, \beta)$ trajectories for smooth adaptation, which indicates robust normalization dynamics.
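
A minimal illustration of the clipping guidance above, keeping $(\alpha, \beta)$ inside the finite-variance region from Section 2.3 (the lower bound of $10^{-3}$ follows Section 5; the helper name and the margin `eps` are assumptions):

```python
import numpy as np

def clip_norm_params(alpha: float, beta: float, a: float, b: float,
                     eps: float = 1e-2) -> tuple[float, float]:
    """Clip (alpha, beta) to (1e-3, (a+3)/2 - eps) x (1e-3, (b+3)/2 - eps),
    the region where Var[g_{alpha,beta}] stays finite."""
    alpha = float(np.clip(alpha, 1e-3, (a + 3) / 2 - eps))
    beta = float(np.clip(beta, 1e-3, (b + 3) / 2 - eps))
    return alpha, beta
```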

BNPO provides a theoretically grounded, low-overhead, and deployable framework for adaptive reward normalization in policy gradient RL with nonstationary binary or multi-component signals. Its generalization of static normalization techniques and empirical superiority on reasoning tasks align it closely with the needs of modern large-model RL optimization (Xiao et al., 3 Jun 2025).
