Beta Normalization Policy Optimization

Updated 16 December 2025
  • BNPO is an adaptive policy gradient method that leverages Beta distributions to normalize reward signals, reducing variance in RL tasks with binary or multi-component rewards.
  • It generalizes methods like REINFORCE and GRPO by dynamically adjusting Beta parameters to align with shifting reward statistics and enhance gradient stability.
  • Empirical results on large language model reasoning tasks show BNPO achieves superior performance and stability, making it effective for complex benchmarks.

Beta Normalization Policy Optimization (BNPO) is an adaptive policy gradient methodology for reinforcement learning that addresses variance reduction and stability when optimizing policies with binary-valued or multi-component reward signals. BNPO adaptively normalizes advantage estimates using Beta distributions parameterized to align with the evolving policy-induced reward statistics, thereby generalizing and subsuming widely used normalization techniques such as REINFORCE with baseline and Group Relative Policy Optimization (GRPO). BNPO is particularly designed for LLM reasoning tasks with rule-based, binary rewards, as exemplified by applications in mathematical problem solving and logic reasoning benchmarks (Xiao et al., 3 Jun 2025).

1. Theoretical Foundations and Motivation

Policy gradient algorithms aim to maximize expected cumulative reward by adjusting policy parameters $\theta$ according to the gradient estimate:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(o \mid q)\, R(q,o)\right]$$

Directly using the reward $R$ can produce extremely high variance, impeding sample efficiency and stability. REINFORCE with baseline subtracts the empirical average reward $\mu$ to reduce variance:

$$A(q,o) = R(q,o) - \mu$$

GRPO further divides this residual by the standard deviation $\sigma$ to standardize per-batch fluctuations:

$$A(q,o) = \frac{R(q,o) - \mu}{\sigma}$$

However, both REINFORCE and GRPO employ static or batch-dependent normalization that does not account for the shifting distribution of rewards as policy learning progresses. This discrepancy is pronounced in tasks where the policy’s binary reward distribution changes substantially over training, as observed in large-scale reasoning tasks (e.g., DeepSeek-R1, Kimi-k1.5). BNPO formulates adaptive normalization by endowing the reward expectation with a Beta prior, dynamically estimating Beta parameters each policy update, and performing advantage normalization accordingly (Xiao et al., 3 Jun 2025).
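
For concreteness, the two fixed normalization schemes above can be sketched in a few lines of NumPy (an illustrative sketch; the array names and per-group scope are assumptions, not taken from the paper):

```python
import numpy as np

def reinforce_baseline_advantage(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE with baseline: subtract the empirical mean reward."""
    return rewards - rewards.mean()

def grpo_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: standardize by the per-group mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Binary rewards for a group of sampled outputs to one query.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0])
print(reinforce_baseline_advantage(rewards))  # residuals around the mean
print(grpo_advantage(rewards))                # residuals scaled by the std
```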

2. Mathematical Formulation

2.1 Reward Distribution and Modeling

Let $q \sim \rho(q)$ denote sampled queries and $o \sim \pi_\theta(\cdot \mid q)$ model-generated outputs, with binary reward

$$R(q,o) = \begin{cases} 1 & \text{if } o \text{ is correct for } q \\ 0 & \text{otherwise} \end{cases}$$

The expected reward $p(q) = \mathbb{E}_{o \sim \pi_\theta}[R(q,o) \mid q]$ acts as the success probability of a Bernoulli random variable, and across queries $q$, $p(q)$ is modeled as $\operatorname{Beta}(a, b)$.

2.2 Beta Normalization and BNPO Advantage

BNPO designs a normalization density

$$f_N(p; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, p^{\alpha-1}(1-p)^{\beta-1}$$

and constructs the BNPO advantage:

$$A_{\alpha, \beta}(q,o) = \frac{R(q,o) - p(q)}{f_N(p(q); \alpha, \beta)}$$

The policy gradient estimator is thus:

$$\nabla_\theta J(\theta) = \mathbb{E}_{q, o}\left[\nabla_\theta \log \pi_\theta(o \mid q)\, A_{\alpha, \beta}(q,o)\right]$$
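
A direct transcription of the normalization density and the BNPO advantage might look as follows (a minimal sketch; the function names, the use of `scipy.special.betaln`, and the clipping of $p$ away from $\{0,1\}$ are assumptions added for numerical robustness, not prescribed by the paper):

```python
import numpy as np
from scipy.special import betaln  # log of the Beta function B(alpha, beta)

def beta_norm_density(p: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Normalization density f_N(p; alpha, beta), evaluated in log space for stability."""
    log_f = (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p) - betaln(alpha, beta)
    return np.exp(log_f)

def bnpo_advantage(reward: np.ndarray, p: np.ndarray, alpha: float, beta: float,
                   eps: float = 1e-6) -> np.ndarray:
    """BNPO advantage A_{alpha,beta}(q,o) = (R(q,o) - p(q)) / f_N(p(q); alpha, beta)."""
    p = np.clip(p, eps, 1.0 - eps)  # keep p(q) away from {0, 1} where the density degenerates
    return (reward - p) / beta_norm_density(p, alpha, beta)
```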

2.3 Adaptive Parameter Selection

Empirically, across sampled $q$ in a batch, estimate $p(q)$, compute its moments, and fit Beta parameters $(a, b)$ via the method of moments. BNPO then chooses $(\alpha, \beta)$ adaptively to minimize the variance of the BNPO gradient. The optimality theorem states:

  • $\operatorname{Var}[g_{\alpha,\beta}]$ is finite iff $\alpha < (a+3)/2$ and $\beta < (b+3)/2$
  • The unique variance minimum is attained at $\alpha^* = 1 + a/3$, $\beta^* = 1 + b/3$

This ensures the normalization density at each update is optimally matched to the current policy-induced reward landscape (Xiao et al., 3 Jun 2025).
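
The parameter-selection step can be sketched as follows; the moment-matching expressions are the standard Beta method-of-moments estimators, while the variance floor and positivity guard are assumptions added for robustness:

```python
import numpy as np

def fit_beta_moments(p_hat: np.ndarray, eps: float = 1e-6) -> tuple[float, float]:
    """Method-of-moments fit of Beta(a, b) to per-query success-rate estimates p_hat."""
    m, v = float(p_hat.mean()), float(p_hat.var())
    v = max(v, eps)                          # guard against a degenerate (zero-variance) batch
    common = max(m * (1 - m) / v - 1, eps)   # keep a and b strictly positive
    return m * common, (1 - m) * common

def optimal_alpha_beta(a: float, b: float) -> tuple[float, float]:
    """Variance-minimizing normalization parameters: alpha* = 1 + a/3, beta* = 1 + b/3."""
    return 1 + a / 3, 1 + b / 3
```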

3. Generalization of REINFORCE and GRPO

BNPO encompasses established techniques as special cases:

  • $\alpha = \beta = 1$ yields $f_N = 1$ and $A_{1,1}(q,o) = R(q,o) - p(q)$, exactly REINFORCE with baseline.
  • $\alpha = \beta = 3/2$ gives $f_N \propto \sqrt{p(1-p)}$, and thus $A_{3/2,3/2} \propto (R-p)/\sqrt{p(1-p)}$, recovering GRPO up to scaling.

By optimizing (α,β)(\alpha,\beta) online, BNPO generalizes these fixed normalization approaches, thereby achieving adaptive variance reduction throughout policy optimization.
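
A quick numerical check of these two special cases, using `scipy.stats.beta` as a stand-in for $f_N$ (the constant $B(3/2, 3/2) = \pi/8$ is the only fact used beyond the definitions above):

```python
import numpy as np
from scipy.stats import beta as beta_dist

p = np.linspace(0.05, 0.95, 19)

# alpha = beta = 1: the uniform density, f_N = 1, so A_{1,1} = R - p (REINFORCE with baseline).
assert np.allclose(beta_dist.pdf(p, 1, 1), 1.0)

# alpha = beta = 3/2: f_N(p) = sqrt(p(1-p)) / B(3/2, 3/2) with B(3/2, 3/2) = pi/8, so
# A_{3/2,3/2} equals (R - p) / sqrt(p(1-p)) up to the constant pi/8, i.e. the GRPO form.
assert np.allclose(beta_dist.pdf(p, 1.5, 1.5), np.sqrt(p * (1 - p)) * 8 / np.pi)
```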

4. Extension to Multi-Component Rewards: Advantage Decomposition

For tasks with multi-component binary rewards $\{R^{(i)}(q,o)\}_{i=1}^K$:

$$A(q,o) = \frac{1}{K}\sum_{i=1}^K \frac{R^{(i)}(q,o) - p^{(i)}(q)}{f_N\!\left(p^{(i)}(q);\, \alpha^{(i)}, \beta^{(i)}\right)}$$

Each reward stream is treated independently with its own moment-estimated Beta parameters $(a^{(i)}, b^{(i)})$ and adaptive $(\alpha^{(i)}, \beta^{(i)})$. This decomposition enables component-wise variance adaptation and normalization, improving efficiency when reward signals saturate or vary asynchronously among components.
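
Building on the single-component sketch above, the decomposition can be written as follows (a sketch that reuses the hypothetical helpers `fit_beta_moments`, `optimal_alpha_beta`, and `bnpo_advantage` from earlier; the array layout is an assumption):

```python
import numpy as np

def decomposed_bnpo_advantage(rewards: np.ndarray, p_hat: np.ndarray) -> np.ndarray:
    """Average of per-component BNPO advantages.

    rewards: (K, N) binary rewards, one row per reward component, one column per (q, o) sample.
    p_hat:   (K, N) success-rate estimates p^{(i)}(q) aligned with each sample.
    """
    advantages = np.zeros_like(rewards, dtype=float)
    for i in range(rewards.shape[0]):
        a_i, b_i = fit_beta_moments(p_hat[i])            # per-component Beta(a, b) fit
        alpha_i, beta_i = optimal_alpha_beta(a_i, b_i)   # per-component (alpha*, beta*)
        advantages[i] = bnpo_advantage(rewards[i], p_hat[i], alpha_i, beta_i)
    return advantages.mean(axis=0)                       # 1/K sum over components
```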

5. Algorithmic Procedure and Implementation

The main loop for BNPO is as follows:

  1. For each learning iteration, sample a batch of queries $\{q_j\}$.
  2. For each $q_j$, sample model outputs $\{o_{j,k}\}$ and observe rewards $R(q_j, o_{j,k})$.
  3. Estimate $p(q_j) \approx \frac{1}{m}\sum_{k} R(q_j, o_{j,k})$.
  4. Aggregate means and variances of $p(q)$ and fit Beta parameters $(a, b)$.
  5. Compute $(\alpha, \beta) = (1 + a/3,\ 1 + b/3)$.
  6. Compute BNPO advantages $A(q_j, o_{j,k})$.
  7. Use $A(q_j, o_{j,k})$ within PPO's clipped surrogate loss for policy updates.
  8. Optionally, utilize multi-component reward decomposition as described above.

Computational overhead beyond PPO is minimal; Beta moment calculations per batch are $O(n+m)$. Numerical stability is maintained by operating in log-density space and clipping $(\alpha, \beta)$ away from the boundaries of the finite-variance region, e.g. restricting $\alpha$ to $(10^{-3},\ (a+3)/2 - \epsilon)$ and $\beta$ analogously, as required (Xiao et al., 3 Jun 2025).
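
Putting the steps together, one BNPO iteration can be sketched as below; the `policy.sample`, `policy.reward`, and `policy.ppo_update` interfaces are placeholders invented for illustration, and the sketch reuses the hypothetical helpers defined above:

```python
import numpy as np

def bnpo_step(policy, queries, m: int = 16) -> None:
    """One BNPO update (steps 1-7), sketched against a hypothetical policy interface."""
    groups, p_hat = [], []
    for q in queries:                                        # step 1: batch of queries
        outputs = [policy.sample(q) for _ in range(m)]       # step 2: m outputs per query
        rewards = np.array([policy.reward(q, o) for o in outputs])
        groups.append((q, outputs, rewards))
        p_hat.append(rewards.mean())                         # step 3: estimate p(q_j)
    p_hat = np.array(p_hat)

    a, b = fit_beta_moments(p_hat)                           # step 4: fit Beta(a, b)
    alpha, beta = optimal_alpha_beta(a, b)                   # step 5: (alpha*, beta*)

    for (q, outputs, rewards), p in zip(groups, p_hat):
        adv = bnpo_advantage(rewards, np.full_like(rewards, p), alpha, beta)  # step 6
        policy.ppo_update(q, outputs, adv)                   # step 7: clipped surrogate loss
```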

6. Experimental Evaluation and Empirical Performance

BNPO has been evaluated on mathematical reasoning tasks using Qwen2.5-Math-1.5B and Qwen2.5-Math-7B models. Training utilized MATH (7,500 problems) with batch size 32, 16 outputs per question, a single PPO epoch per step, and 5 training epochs in total.

On standardized benchmarks (pass@1):

  • For Qwen2.5-Math-1.5B, BNPO achieves an average score of 39.4%, outperforming REINFORCE, ReMax, GRPO, and REINFORCE++.
  • For Qwen2.5-Math-7B, the average rises to 47.8%, with especially strong improvements on AMC23 (+4.3% over GRPO).

Gradient stability is assessed via the norm $\|\nabla_\theta J\|$ over training steps: BNPO consistently yields the smallest fluctuations, while other methods show greater variance or slower convergence. In ablations on multi-component rewards (e.g., Qwen2.5-1.5B-Instruct), advantage decomposition grants further modest improvements, and BNPO-based methods consistently attain the best results (Xiao et al., 3 Jun 2025).

| Model / Method | MATH500 | AMC23 | AIME2024 | AIME2025 | Avg |
|---|---|---|---|---|---|
| Qwen2.5-1.5B Base | 28.0 | 27.3 | 6.0 | 3.1 | 16.1 |
| REINFORCE | 74.8 | 51.6 | 18.3 | 11.3 | 39.0 |
| GRPO | 75.2 | 52.3 | 19.0 | 9.4 | 39.0 |
| BNPO | 73.4 | 54.5 | 18.3 | 11.5 | 39.4 |
| Qwen2.5-7B Base | 41.4 | 32.5 | 11.0 | 5.0 | 22.5 |
| REINFORCE | 78.4 | 61.7 | 34.2 | 14.6 | 47.2 |
| GRPO | 78.6 | 64.5 | 32.3 | 12.9 | 47.1 |
| BNPO | 77.0 | 68.8 | 32.1 | 13.3 | 47.8 |

7. Practical Considerations and Technical Guidance

  • Use batch size $\geq 32$ and $m \geq 16$ outputs per query to stabilize the $p(q)$ estimates.
  • Recompute Beta moments and normalization parameters at every step.
  • Operate in log-density space for $f_N$ to maintain numerical stability.
  • Clip $(\alpha, \beta)$ away from the boundaries where the gradient variance blows up (see the sketch after this list).
  • Gradient clipping is recommended, as in PPO, to mitigate potentially heavy-tailed advantages.
  • Temperature scheduling (e.g., 1.0 for training, 0.6 for evaluation) is effective for balancing exploration and evaluation.
  • Monitor the $(\alpha, \beta)$ trajectories for smooth adaptation, which indicates robust normalization dynamics.
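
A minimal illustration of the clipping guidance above, keeping $(\alpha, \beta)$ inside the finite-variance region from Section 2.3 (the lower bound of $10^{-3}$ follows Section 5; the helper name and the margin `eps` are assumptions):

```python
import numpy as np

def clip_norm_params(alpha: float, beta: float, a: float, b: float,
                     eps: float = 1e-2) -> tuple[float, float]:
    """Clip (alpha, beta) to (1e-3, (a+3)/2 - eps) x (1e-3, (b+3)/2 - eps),
    the region where Var[g_{alpha,beta}] stays finite."""
    alpha = float(np.clip(alpha, 1e-3, (a + 3) / 2 - eps))
    beta = float(np.clip(beta, 1e-3, (b + 3) / 2 - eps))
    return alpha, beta
```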

BNPO provides a theoretically grounded, low-overhead, and deployable framework for adaptive reward normalization in policy gradient RL with nonstationary binary or multi-component signals. Its generalization of static normalization techniques and empirical superiority on reasoning tasks align it closely with the needs of modern large-model RL optimization (Xiao et al., 3 Jun 2025).
