Impact of Large AREW Reweighting Strength on Training Stability
Establish whether large values of the AREW reweighting-strength parameter (Au) make policy updates brittle and sensitive to noise during reinforcement learning of large language model agents. Au scales the likelihood-margin auxiliary objective used to reweight advantages; the hypothesized mechanism is that large values amplify the variance of advantage estimates and over-emphasize a small subset of steps.
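The conjectured mechanism can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the per-step weight form `1 + strength * margin`, the sparse margin distribution, and the names `simulate` and `effective_sample_size` are hypothetical and are not taken from the AREW formulation in the cited paper. Here `strength` stands in for Au, and the Kish effective sample size is used as a rough proxy for how many steps effectively drive the update.

```python
import random
import statistics


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Equals len(weights) when all weights are equal and shrinks as a few
    weights dominate.
    """
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def simulate(strength, n_steps=1000, seed=0):
    """Return (variance of reweighted advantages, effective sample size).

    ASSUMPTION: weights are 1 + strength * margin. This is an illustrative
    stand-in, not the reweighting rule from the paper.
    """
    rng = random.Random(seed)
    # Noisy advantage estimates with zero mean and unit variance.
    advantages = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    # ASSUMPTION: margins are sparse, so only about 5% of steps get a
    # nonzero likelihood-margin signal.
    margins = [
        rng.expovariate(1.0) if rng.random() < 0.05 else 0.0
        for _ in range(n_steps)
    ]
    weights = [1.0 + strength * m for m in margins]
    reweighted = [a * w for a, w in zip(advantages, weights)]
    return statistics.pvariance(reweighted), effective_sample_size(weights)


if __name__ == "__main__":
    for strength in (0.0, 1.0, 5.0, 10.0):
        variance, ess = simulate(strength)
        print(f"strength={strength:5.1f}  variance={variance:8.3f}  ess={ess:8.1f}")
```

Because the seed is fixed, every call sees the same advantages and margins, so differences between runs come only from `strength`. In this toy setting a larger `strength` raises the variance of the reweighted advantages and lowers the effective sample size, which matches the direction of the conjecture but says nothing about whether it holds for the actual method.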
References
We conjecture that large Au amplifies high-variance advantage estimates and over-emphasizes a small subset of steps, making the policy update brittle and sensitive to noise.
— On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
(2603.12109 - Zou et al., 12 Mar 2026) in Section 5.2, Effect of the reweighting strength Au