Advantage Reweighting with Directional Critiques

Updated 19 March 2026

The paper introduces AREW, a method that injects lightweight binary directional critiques into policy-gradient updates to restore effective learning signals.
It uses AS and BT critiques to counteract information self-locking, enhancing action selection and belief tracking in complex multi-turn reasoning tasks.
Experimental results show significant performance boosts across diverse tasks and RL backbones, with improvements robust to noise in critique signals.

Advantage Reweighting by Easy-to-Obtain Directional Critiques (AREW) is a reinforcement learning (RL) augmentation designed to address the phenomenon of information self-locking in active reasoning agents, particularly LLM agents engaged in complex multi-turn tasks such as preference elicitation, medical diagnosis, and troubleshooting. AREW injects stepwise, lightweight binary directional critiques—derived from informativeness and belief-gain heuristics—into policy-gradient methods to reweight per-step advantages, restoring effective learning signals for both the agent’s action selection and belief-tracking processes (Zou et al., 12 Mar 2026).

1. Information Self-Locking in Active Reasoning

Interactive reasoning tasks require agents to determine which queries to make (Action Selection, AS) and to effectively update their internal beliefs given responses (Belief Tracking, BT). Under standard outcome-based RL, where only a terminal reward is provided, a feedback loop can arise: weak BT causes the agent to undervalue informative queries, so it stops asking exploratory questions and becomes entrenched in low-information states. This phenomenon, termed information self-locking, impedes the agent’s ability to internalize evidence and improve performance. The RL formulation typically models these problems as POMDPs: the agent maintains a belief $b_t \in \Delta(S)$ over hidden state $s^*\in S$, selects queries according to policy $\pi_\theta(a_t|b_t)$, and updates beliefs with operator $\mathcal{T}_u(b_t, a_t, o_t)$ after observing $o_t$. The objective $J(\theta)=\mathbb{E}_{T\sim \pi_\theta}[R(T)]$ is optimized via policy-gradient:

$\nabla_\theta J(\theta) = \mathbb{E}_{T}\left[R(T) \sum_t \nabla_\theta\log \pi_\theta(a_t|b_t)\right].$

However, deficient AS or BT restricts exploration, which in turn hinders improvement of both capabilities, reinforcing information self-locking (Zou et al., 12 Mar 2026).
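To make the outcome-only signal concrete, the following is a minimal sketch of the single-trajectory estimate $R(T)\sum_t \nabla_\theta \log \pi_\theta(a_t|b_t)$. It assumes, purely for illustration, a softmax policy whose logits do not depend on the belief; the function names are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_gradient(logits, actions, terminal_reward):
    """Single-trajectory estimate of R(T) * sum_t grad log pi(a_t).

    Assumption: a belief-independent softmax policy, so that
    d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in actions:
        for k in range(len(logits)):
            indicator = 1.0 if k == a else 0.0
            # Every step is scaled by the same terminal reward.
            grad[k] += terminal_reward * (indicator - probs[k])
    return grad
```

Because every step shares the same factor $R(T)$, a trajectory with zero terminal reward contributes no gradient at all, regardless of how informative its individual queries were.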

2. Easy-to-Obtain Directional Critiques

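AREW introduces two classes of directional critiques, each taking the value $+1$ or $-1$, with $0$ reserved for steps where the critique does not apply: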

- AS-critique $z_t^Q \in \{+1, -1, 0\}$: Indicates whether query $a_t$ was informative.
  - Example (MediQ): $z_t^Q=+1$ if new clinical facts are elicited, $-1$ for “Unknown” or redundancies, $0$ if no query.
- BT-critique $z_t^U \in \{+1, -1, 0\}$: Reflects whether the belief update increased the agent’s confidence in the ground-truth.
  - Example (PE-G): $z_t^U=+1$ if $\cos(w_t,w^*) - \cos(w_{t-1},w^*) > 0$; $-1$ otherwise, $0$ if uninformative.
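The two examples above can be sketched in a few lines. This is an illustration under assumptions: the function names, the list-of-facts representation of a response, and the `informative` flag used to decide when the BT-critique is $0$ are all hypothetical, since the source does not spell out how those conditions are detected.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def as_critique(asked_query, new_facts):
    """AS-critique: 0 if no query, +1 if new facts were elicited, else -1."""
    if not asked_query:
        return 0
    return 1 if new_facts else -1

def bt_critique(w_prev, w_curr, w_star, informative=True):
    """BT-critique in the PE-G style: sign of the cosine gain toward w_star.

    Assumption: `informative` is supplied by the caller and marks steps
    where the critique should be 0.
    """
    if not informative:
        return 0
    gain = cosine(w_curr, w_star) - cosine(w_prev, w_star)
    return 1 if gain > 0 else -1
```

For instance, with `w_star = [1, 0]`, moving the estimate from `[0, 1]` to `[1, 1]` raises the cosine similarity and yields a critique of `+1`, while the reverse move yields `-1`.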

These critiques exploit readily available user feedback and simple agent confidence metrics. Their ease of extraction enables their direct deployment in general multi-turn reasoning settings without modification to the reward structure or the introduction of additional critics.

3. Policy Gradient Reweighting and Objective Modification

The AREW method augments the standard policy-gradient update with an auxiliary within-trajectory margin loss that systematically reallocates probability mass:

Given the sets $P = \{t : z_t=+1\}$ and $N = \{t : z_t=-1\}$, the margin term is defined as:

$C(\theta;T) = \frac{1}{|P|}\sum_{t \in P} \log \pi_\theta(a_t|b_t) - \frac{1}{|N|}\sum_{t \in N} \log \pi_\theta(a_t|b_t)$
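As a small numerical illustration, the margin term can be computed from per-step log-probabilities and critiques as sketched below. The handling of an empty $P$ or $N$ (treating the corresponding mean as zero) is an assumption made here to avoid division by zero; the source does not state how that case is handled.

```python
def margin_term(log_probs, critiques):
    """C(theta; T): mean log-prob over P minus mean log-prob over N.

    Assumption: if P or N is empty, its mean is taken to be 0.0.
    """
    pos = [lp for lp, z in zip(log_probs, critiques) if z == 1]
    neg = [lp for lp, z in zip(log_probs, critiques) if z == -1]
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    neg_mean = sum(neg) / len(neg) if neg else 0.0
    return pos_mean - neg_mean

# Two positively critiqued steps, one negative, one neutral.
example = margin_term([-1.0, -2.0, -3.0, -0.5], [1, 1, -1, 0])
```

Here the positive steps average $-1.5$ and the negative step is $-3.0$, so `example` is `1.5`; raising the log-probability of positive steps or lowering that of negative steps increases the margin.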

The resulting augmented objective is:

$L_{\rm AREW}(\theta) = J(\theta) + \eta \mathbb{E}_{T}[C(\theta; T)], \quad \eta > 0$

The combined gradient is:

$\nabla_\theta L_{\rm AREW}(\theta) = \mathbb{E}_T\left[\sum_{t=0}^{H-1} (A_t + \eta u_t) \nabla_\theta \log \pi_\theta(a_t|b_t)\right]$

where $u_t = +1/|P|$ for $z_t=+1$, $-1/|N|$ for $z_t=-1$, and $0$ otherwise. This can be interpreted as reweighting the per-step advantage, which is the form used in the actor loss of Section 4:

$w_t = 1 + \eta u_t$

so that each advantage $A_t$ is scaled by $w_t$, amplifying learning from positively critiqued steps and suppressing negatively critiqued ones.
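The coefficients $u_t$ and weights $w_t$ follow directly from the critique sequence. The sketch below uses a hypothetical helper name and assumes a single combined critique $z_t$ per step, since the source does not specify how the AS and BT critiques are merged.

```python
def critique_weights(critiques, eta=0.1):
    """Return (u, w) with u_t = +1/|P|, -1/|N|, or 0 and w_t = 1 + eta * u_t.

    Assumption: `critiques` holds one combined value in {+1, -1, 0} per step.
    """
    num_pos = sum(1 for z in critiques if z == 1)
    num_neg = sum(1 for z in critiques if z == -1)
    u = []
    for z in critiques:
        if z == 1:
            u.append(1.0 / num_pos)
        elif z == -1:
            u.append(-1.0 / num_neg)
        else:
            u.append(0.0)
    w = [1.0 + eta * u_t for u_t in u]
    return u, w

u, w = critique_weights([1, 1, -1, 0], eta=0.1)
```

With two positive steps and one negative step, `u` is `[0.5, 0.5, -1.0, 0.0]` and `w` is approximately `[1.05, 1.05, 0.9, 1.0]`. Note that the additive shift $A_t + \eta u_t$ and the scaled advantage $w_t A_t$ are not numerically identical in general, so the sketch returns only $u_t$ and $w_t$ and leaves the choice of form to the caller.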

4. Practical Implementation and Training Loop

In typical deployments, advantages $A_t$ are estimated via generalized advantage estimation (GAE) or a learned baseline $\hat{V}(b_t)$. AREW modifies the PPO-style actor loss as follows:

Standard actor loss:

$L_{\text{actor}} = \sum_t \min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)$

AREW actor loss:

$L_{\text{actor}}^{\rm AREW} = \sum_t \min\left(r_t(\theta) w_t A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) w_t A_t\right)$

All other RL components, including value-function regression, entropy regularization, and KL-penalties, remain unmodified.
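The difference between the two actor terms is only the factor $w_t$ on the advantage, as the following sketch shows. It evaluates the clipped sum for given probability ratios $r_t(\theta)$; the function names are illustrative, and in a gradient-based trainer this quantity would typically be maximized (or its negative minimized).

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_actor_term(ratios, advantages, weights=None, eps=0.2):
    """Sum over steps of min(r * wA, clip(r, 1 - eps, 1 + eps) * wA).

    With weights=None every w_t is 1, which recovers the standard
    PPO clipped term; passing AREW weights gives the reweighted term.
    """
    if weights is None:
        weights = [1.0] * len(ratios)
    total = 0.0
    for r, a, w in zip(ratios, advantages, weights):
        weighted_adv = w * a
        total += min(r * weighted_adv, clip(r, 1 - eps, 1 + eps) * weighted_adv)
    return total

standard = clipped_actor_term([1.5, 0.5], [1.0, -1.0])
reweighted = clipped_actor_term([1.5, 0.5], [1.0, -1.0], weights=[1.05, 0.9])
```

In this toy case `standard` is about `0.4` and `reweighted` is about `0.54`: the clipping thresholds are unchanged, and only the magnitude of each step's contribution shifts with $w_t$.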

A typical AREW training loop involves:

- Initializing policy $\pi_\theta$ and value $V_\phi$.
- Collecting $N$ episodes under $\pi_\theta$, recording $(b_t, a_t, o_t)$.
- Computing outcome rewards and GAE-based advantages.
- Calculating AS and BT critiques $z_t^Q$, $z_t^U$ at each step, determining $u_t$, $w_t$.
- Updating $\theta$ with the AREW-modified actor loss.
- Updating $\phi$ to fit $V_\phi(b_t)$ to returns (Zou et al., 12 Mar 2026).
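The advantage-computation step of the loop above can be sketched with the standard GAE recursion, $\delta_t = r_t + \gamma V(b_{t+1}) - V(b_t)$ and $A_t = \delta_t + \gamma\lambda A_{t+1}$, followed by the per-step reweighting. The default values of `gamma` and `lam` and the example weights are assumptions chosen for illustration, not settings reported in the source.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation for one finished episode.

    Assumption: the episode terminates after the last step, so the
    bootstrap value beyond it is 0.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

# Outcome-only reward: zero everywhere except the terminal step.
adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
# Hypothetical weights w_t for critiques [+1, -1, 0] with eta = 0.1.
reweighted_adv = [w * a for w, a in zip([1.1, 0.9, 1.0], adv)]
```

With `gamma = lam = 1` the advantages reduce to return minus value, here `0.5` at every step, which illustrates why an outcome-only signal cannot distinguish informative steps from uninformative ones until the weights are applied.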

5. Experimental Findings

Experiments were conducted on seven active-reasoning tasks over three domains: preference estimation (PE-G, PE-F), medical diagnosis (MediQ), and troubleshooting (FloDial). The tested backbones included Qwen-2.5B and LLaMA-3-8B, with RL baselines PPO, GRPO, and GSPO. Key metrics were terminal task reward and per-turn AS/BT proxies.

AREW demonstrated substantial improvements. On PE-G ($S=3$), baseline PPO produced a final reward of 18.3%, while AREW (AS+BT) achieved 80.3%. Comparable uplifts appeared in MediQ (+10.8%) and troubleshooting tasks (FloDial-Hard: +21%). Gains were robust across backbone architectures and baseline choices, supporting the generality of the method (Zou et al., 12 Mar 2026).

6. Ablations, Hyperparameters, and Robustness

Ablation studies highlighted the importance of incorporating both AS and BT critiques: adding BT critiques to AS-only shaping can increase performance by 5–25 points. Choice of reweight parameter $\eta$ is critical; values in $[0.01, 0.1]$ yield stable, effective learning, while higher values may induce instability. Noise robustness experiments showed that AREW’s improvements persist even with up to 30–50% random label flips in binary critiques, aligning with theoretical expectations that the method only requires critique accuracy exceeding 50%.
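The 50% accuracy threshold can be illustrated with a short simulation. If each nonzero critique is flipped independently with probability $p$, the expected product of the noisy and true critique is $(1-p) - p = 1 - 2p$, which stays positive for any $p < 0.5$. The sketch below checks this empirically; the symmetric, independent flip model and the helper names are assumptions, and the source's own noise protocol may differ.

```python
import random

def flip_critiques(critiques, flip_prob, rng):
    """Flip each nonzero critique independently with probability flip_prob."""
    return [-z if z != 0 and rng.random() < flip_prob else z for z in critiques]

def expected_alignment(flip_prob):
    """E[z_noisy * z_true] for nonzero critiques under symmetric flips."""
    return 1.0 - 2.0 * flip_prob

rng = random.Random(0)  # fixed seed so the run is repeatable
true_z = [1 if i % 2 == 0 else -1 for i in range(20000)]
noisy_z = flip_critiques(true_z, 0.3, rng)
empirical = sum(a * b for a, b in zip(true_z, noisy_z)) / len(true_z)
```

At a 30% flip rate the empirical alignment lands close to the predicted `0.4`, so the noisy critiques still point in the correct direction on average; at exactly 50% the expected alignment is zero and the critiques carry no directional information.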

The approach directly reallocates advantage mass toward actually informative steps, circumventing the masking effect of poor BT on AS (and vice versa) without reward shaping or auxiliary critics. This direct reweighting enables sustained exploration and improvement of both agent capabilities.

7. Significance and Applicability

AREW constitutes a lightweight, plug-in strategy for any policy-gradient learning algorithm in multi-turn reasoning settings. Its reliance on easily computed binary critiques enables integration without architectural or environmental modifications. By addressing the feedback loop of information self-locking, AREW enables agents to recover effective exploratory behavior and belief-tracking in environments where learning signals are severely delayed or sparse. The generality and empirical strength across diverse RL backbones and task domains indicate its broad applicability for researchers seeking to enhance LLM-based active reasoning (Zou et al., 12 Mar 2026).

References

Zou et al. (2026). On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents.
