Advantage Reweighting with Directional Critiques
- The paper introduces AREW, a method that injects lightweight binary directional critiques into policy-gradient updates to restore effective learning signals.
- It uses action-selection (AS) and belief-tracking (BT) critiques to counteract information self-locking in complex multi-turn reasoning tasks.
- Experimental results show significant performance boosts across diverse tasks and RL backbones, with improvements robust to noise in critique signals.
Advantage Reweighting by Easy-to-Obtain Directional Critiques (AREW) is a reinforcement learning (RL) augmentation designed to address the phenomenon of information self-locking in active reasoning agents, particularly LLM agents engaged in complex multi-turn tasks such as preference elicitation, medical diagnosis, and troubleshooting. AREW injects stepwise, lightweight binary directional critiques—derived from informativeness and belief-gain heuristics—into policy-gradient methods to reweight per-step advantages, restoring effective learning signals for both the agent’s action selection and belief-tracking processes (Zou et al., 12 Mar 2026).
1. Information Self-Locking in Active Reasoning
Interactive reasoning tasks require agents to decide which queries to make (Action Selection, AS) and to update their internal beliefs effectively given the responses (Belief Tracking, BT). Under standard outcome-based RL, where only a terminal reward is provided, a feedback loop can arise: weak BT causes the agent to undervalue informative queries, so it stops asking exploratory questions and becomes entrenched in low-information states. This phenomenon, termed information self-locking, impedes the agent’s ability to internalize evidence and improve performance. The RL formulation typically models these problems as POMDPs: the agent maintains a belief over a hidden state, selects queries according to its policy, and applies a belief-update operator after each observation. The objective, the expected terminal reward, is optimized via policy gradient.
However, deficient AS or BT restricts exploration, which in turn hinders improvement of both capabilities, reinforcing information self-locking (Zou et al., 12 Mar 2026).
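For reference, the policy-gradient estimator mentioned above can be written in its standard textbook form. The notation ($\pi_\theta$, $h_t$, $a_t$, $A_t$, $T$) is an assumption chosen for illustration and may differ from the symbols used by Zou et al.:

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} A_t \, \nabla_\theta \log \pi_\theta(a_t \mid h_t)
    \right]
```

Here $h_t$ stands for the interaction history (or belief) at turn $t$, $a_t$ for the query issued, and $A_t$ for an advantage estimate derived from the terminal reward. Because every $A_t$ in a trajectory is derived from the same outcome signal, the estimator has little step-level information with which to separate informative queries from uninformative ones.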
2. Easy-to-Obtain Directional Critiques
AREW introduces two classes of binary critiques:
- AS-critique: indicates whether the query at a given step was informative.
  - Example (MediQ): positive if new clinical facts are elicited, negative for “Unknown” responses or redundant queries, and $0$ if no query is made.
- BT-critique: reflects whether the belief update increased the agent’s confidence in the ground truth.
  - Example (PE-G): positive if confidence in the ground truth increases after the update, negative otherwise, and $0$ if the step is uninformative.
These critiques exploit readily available user feedback and simple agent confidence metrics. Their ease of extraction enables their direct deployment in general multi-turn reasoning settings without modification to the reward structure or the introduction of additional critics.
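The two critiques can be sketched as small labeling functions. This is a minimal illustration, not the paper's implementation: the `ResponseKind` labels, the function names, the $\pm 1$ encoding, and the use of a scalar confidence in the ground-truth answer are all assumptions introduced here.

```python
from enum import Enum


class ResponseKind(Enum):
    """Hypothetical labels for what a query elicited (not from the paper)."""
    NEW_FACT = "new_fact"      # response revealed previously unknown information
    UNKNOWN = "unknown"        # environment answered "Unknown"
    REDUNDANT = "redundant"    # information was already known
    NO_QUERY = "no_query"      # the agent did not ask anything this turn


def as_critique(kind: ResponseKind) -> int:
    """Action-selection critique: +1 informative, -1 uninformative, 0 no query."""
    if kind is ResponseKind.NEW_FACT:
        return 1
    if kind is ResponseKind.NO_QUERY:
        return 0
    return -1  # "Unknown" or redundant responses


def bt_critique(conf_before: float, conf_after: float, informative: bool) -> int:
    """Belief-tracking critique from the agent's confidence in the ground truth.

    Assumption: confidence is the probability the agent assigns to the
    ground-truth answer before and after its belief update.
    """
    if not informative:
        return 0  # no new evidence arrived, so the update is not judged
    return 1 if conf_after > conf_before else -1
```

Both functions need only the environment's response type and the agent's own confidence values, which is what makes such critiques cheap to obtain compared with training an auxiliary critic.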
3. Policy Gradient Reweighting and Objective Modification
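The AREW method augments the standard policy-gradient update with an auxiliary within-trajectory margin loss that reallocates probability mass from negatively critiqued steps to positively critiqued ones. The steps of each trajectory are partitioned into a positively critiqued set and a negatively critiqued set, and the margin term is defined over these two sets. Adding the margin term, scaled by a reweighting coefficient, to the outcome-based objective gives the augmented objective. In the combined gradient, each step's log-probability gradient carries an additional weight that is positive for steps in the positive set, negative for steps in the negative set, and $0$ otherwise. This can be interpreted as reweighting the per-step advantage: each advantage is shifted by a critique-dependent term, amplifying learning from positively critiqued steps and suppressing negatively critiqued ones.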
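One way to write this reweighting, as a sketch under assumed notation rather than the paper's exact formulation: let $\mathcal{P}$ and $\mathcal{N}$ be the positively and negatively critiqued steps of a trajectory, $\mu \ge 0$ a reweighting coefficient, and $u_t$ a signed indicator. All of these symbols are illustrative assumptions, and the paper's margin term may include normalization that this sketch omits.

```latex
u_t =
\begin{cases}
+1 & t \in \mathcal{P} \\
-1 & t \in \mathcal{N} \\
\phantom{+}0 & \text{otherwise}
\end{cases}
\qquad
\tilde{A}_t = A_t + \mu\, u_t
\qquad
\nabla_\theta \tilde{J}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=1}^{T} \tilde{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid h_t)
    \right]
```

Under this form, the extra term is the gradient of $\mu \big( \sum_{t \in \mathcal{P}} \log \pi_\theta(a_t \mid h_t) - \sum_{t \in \mathcal{N}} \log \pi_\theta(a_t \mid h_t) \big)$, which acts as a margin between the log-probabilities of positively and negatively critiqued steps.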
4. Practical Implementation and Training Loop
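In typical deployments, advantages are estimated via generalized advantage estimation (GAE) or a learned value baseline. AREW modifies the PPO-style actor loss as follows:
- Standard actor loss: the PPO-style surrogate evaluated with the estimated per-step advantages.
- AREW actor loss: the same surrogate with each advantage replaced by its critique-reweighted counterpart.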
All other RL components, including value-function regression, entropy regularization, and KL-penalties, remain unmodified.
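The change to the actor loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code: it uses the common clipped PPO surrogate, takes one combined signed critique per step (+1, -1, or 0), and the names `reweight_advantages`, `ppo_actor_loss`, `mu`, and `clip_eps` are hypothetical.

```python
import math
from typing import List, Sequence


def reweight_advantages(adv: Sequence[float],
                        critiques: Sequence[int],
                        mu: float) -> List[float]:
    """Shift each advantage by mu times its signed critique (assumed form)."""
    return [a + mu * c for a, c in zip(adv, critiques)]


def ppo_actor_loss(logp_new: Sequence[float],
                   logp_old: Sequence[float],
                   adv: Sequence[float],
                   clip_eps: float = 0.2) -> float:
    """Clipped PPO surrogate, averaged over steps and negated to form a loss."""
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, adv):
        ratio = math.exp(ln - lo)  # pi_new(a|h) / pi_old(a|h)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * a, clipped * a)
    return -total / len(adv)


# Usage: the only difference between the two losses is the advantage vector.
adv = [0.5, 0.5, 0.5]
critiques = [1, -1, 0]  # informative step, uninformative step, neutral step
logp = [-1.0, -2.0, -0.5]
standard_loss = ppo_actor_loss(logp, logp, adv)
arew_loss = ppo_actor_loss(logp, logp, reweight_advantages(adv, critiques, mu=0.2))
```

Because the reweighting touches only the advantage vector, the value loss, entropy bonus, and KL penalty can be left exactly as they are in an existing PPO implementation.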
A typical AREW training loop involves:
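- Initializing the policy and the value function.
- Collecting episodes under the current policy, recording the per-step data needed for the critiques.
- Computing outcome rewards and GAE-based advantages.
- Calculating the AS and BT critiques at each step and determining the positively and negatively critiqued step sets.
- Updating the policy with the AREW-modified actor loss.
- Updating the value function to fit the returns (Zou et al., 12 Mar 2026).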
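The advantage and critique steps of this loop can be sketched as follows. The GAE recursion is the standard one; the function names and the representation of critiques as a list of signed integers are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def gae_advantages(rewards: Sequence[float],
                   values: Sequence[float],
                   gamma: float = 1.0,
                   lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation for one episode.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value, which is 0.0 when the episode has terminated.
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def critique_sets(critiques: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split step indices into positively and negatively critiqued sets."""
    positive = [t for t, c in enumerate(critiques) if c > 0]
    negative = [t for t, c in enumerate(critiques) if c < 0]
    return positive, negative


# Outcome-only episode: the single reward arrives at the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]  # untrained value function, terminal bootstrap 0
uniform = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
positive, negative = critique_sets([1, -1, 0])
```

With `gamma = lam = 1` and an uninformative value function, every step receives the same advantage, so the outcome signal alone cannot distinguish the informative first step from the uninformative second one; the critique sets supply that missing distinction.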
5. Experimental Findings
Experiments were conducted on seven active-reasoning tasks over three domains: preference estimation (PE-G, PE-F), medical diagnosis (MediQ), and troubleshooting (FloDial). The tested backbones included Qwen-2.5B and LLaMA-3-8B, with RL baselines PPO, GRPO, and GSPO. Key metrics were terminal task reward and per-turn AS/BT proxies.
AREW demonstrated substantial improvements. On PE-G, baseline PPO produced a final reward of 18.3%, while AREW (AS+BT) achieved 80.3%. Gains were also observed on MediQ (+10.8%) and on troubleshooting tasks (FloDial-Hard: +21%). The gains were robust across backbone architectures and baseline choices, supporting the generality of the method (Zou et al., 12 Mar 2026).
6. Ablations, Hyperparameters, and Robustness
Ablation studies highlighted the importance of incorporating both AS and BT critiques: adding BT critiques to AS-only shaping can increase performance by 5–25 points. The choice of the reweighting coefficient is critical: suitably small values yield stable, effective learning, while higher values may induce instability. Noise-robustness experiments showed that AREW’s improvements persist even with 30–50% random label flips in the binary critiques, in line with the theoretical expectation that the method only requires critique accuracy above 50%.
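The 50% threshold can be motivated by a short calculation. Assume, for illustration only, that each nonzero critique $c_t \in \{+1, -1\}$ is flipped independently with probability $p$ to give a noisy critique $\tilde{c}_t$; this symmetric noise model and the symbols are assumptions made here, not taken from the source.

```latex
\mathbb{E}[\tilde{c}_t] = (1 - p)\, c_t + p\,(-c_t) = (1 - 2p)\, c_t
```

For $p = 0.3$ the expected critique keeps the correct sign with its magnitude scaled by $0.4$. The factor $1 - 2p$ stays positive for any $p < 0.5$, reaches $0$ at $p = 0.5$, and changes sign beyond that, which is consistent with the requirement that critique accuracy exceed 50%.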
The approach directly reallocates advantage mass toward actually informative steps, circumventing the masking effect of poor BT on AS (and vice versa) without reward shaping or auxiliary critics. This direct reweighting enables sustained exploration and improvement of both agent capabilities.
7. Significance and Applicability
AREW constitutes a lightweight, plug-in strategy for any policy-gradient learning algorithm in multi-turn reasoning settings. Its reliance on easily computed binary critiques enables integration without architectural or environmental modifications. By addressing the feedback loop of information self-locking, AREW enables agents to recover effective exploratory behavior and belief-tracking in environments where learning signals are severely delayed or sparse. The generality and empirical strength across diverse RL backbones and task domains indicate its broad applicability for researchers seeking to enhance LLM-based active reasoning (Zou et al., 12 Mar 2026).