Impact of Large AREW Reweighting Strength on Training Stability

The AREW reweighting-strength parameter Au scales the likelihood-margin auxiliary objective used to reweight advantages. Establish whether large values of Au amplify the variance of advantage estimates and over-emphasize a small subset of steps, thereby making policy updates brittle and sensitive to noise during reinforcement learning of large language model (LLM) agents.

Background

The paper introduces AREW, a critique-driven advantage reweighting method that injects stepwise directional critiques through a likelihood-margin auxiliary objective. A scalar parameter (denoted Au in the paper) controls the strength of this reweighting during policy optimization.
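The section does not reproduce AREW's exact reweighting rule, so the sketch below is only a hypothetical illustration of the general shape of such a mechanism: each step's advantage is scaled by a weight derived from its critique-based likelihood margin, with Au controlling how sharply the weights respond. The function name reweight_advantages, the exponential weighting, and the mean-one normalization are all assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def reweight_advantages(advantages, likelihood_margins, au):
    """Hypothetical advantage reweighting sketch (not the paper's rule).

    Scales each step's advantage by a weight derived from its
    critique-based likelihood margin. au is the reweighting strength:
    larger au makes the weights respond more sharply to the margins.
    The exponential form and mean-one normalization are assumptions.
    """
    weights = np.exp(au * np.asarray(likelihood_margins, dtype=float))
    weights /= weights.mean()  # keep the average update magnitude fixed
    return weights * np.asarray(advantages, dtype=float)
```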

Empirically, the authors observe that reweighting that is too weak fails to escape the self-locking regime, while reweighting that is too strong speeds early optimization but often leads to instability and eventual performance collapse (illustrated in Fig. 5d). Based on these observations, they explicitly conjecture a mechanism linking large Au to increased variance in advantage estimates and to overemphasis on a small subset of steps, which would make policy updates brittle and sensitive to noise.
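The conjectured mechanism is easy to reproduce in a toy simulation. The snippet below is an illustration under the exponential-weighting assumption sketched above, using synthetic margins and advantages rather than anything from the paper: as au grows, the weights concentrate on a few steps, the effective sample size of the update collapses, and the variance of the reweighted advantages inflates.

```python
import numpy as np

# Toy illustration (not the paper's experiment): under the exponential
# weighting assumed above, larger au concentrates weight on a few steps,
# shrinking the effective sample size (ESS) and inflating the variance
# of the reweighted advantage estimates.
rng = np.random.default_rng(0)
margins = rng.normal(size=1000)  # synthetic likelihood margins
advs = rng.normal(size=1000)     # synthetic per-step advantages

for au in (0.1, 1.0, 5.0):
    w = np.exp(au * margins)
    w /= w.mean()                          # mean-one normalization
    ess = w.sum() ** 2 / np.sum(w ** 2)    # effective sample size
    print(f"au={au}: var={np.var(w * advs):.3g}  ESS={ess:.1f}")
```

With a mean-one normalization the unweighted case corresponds to ESS equal to the number of steps, so the drop in ESS directly measures how few steps end up dominating the policy update.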

References

"We conjecture that large Au amplifies high-variance advantage estimates and over-emphasizes a small subset of steps, making the policy update brittle and sensitive to noise."

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents (2603.12109, Zou et al., 12 Mar 2026), Section 5.2, Effect of the reweighting strength Au.