
Advantage Reweighting with Directional Critiques

Updated 19 March 2026
  • The paper introduces AREW, a method that injects lightweight binary directional critiques into policy-gradient updates to restore effective learning signals.
  • It uses action-selection (AS) and belief-tracking (BT) critiques to counteract information self-locking, improving both capabilities in complex multi-turn reasoning tasks.
  • Experimental results show significant performance boosts across diverse tasks and RL backbones, with improvements robust to noise in critique signals.

Advantage Reweighting by Easy-to-Obtain Directional Critiques (AREW) is a reinforcement learning (RL) augmentation designed to address the phenomenon of information self-locking in active reasoning agents, particularly LLM agents engaged in complex multi-turn tasks such as preference elicitation, medical diagnosis, and troubleshooting. AREW injects stepwise, lightweight binary directional critiques—derived from informativeness and belief-gain heuristics—into policy-gradient methods to reweight per-step advantages, restoring effective learning signals for both the agent’s action selection and belief-tracking processes (Zou et al., 12 Mar 2026).

1. Information Self-Locking in Active Reasoning

Interactive reasoning tasks require agents to determine which queries to make (Action Selection, AS) and to effectively update their internal beliefs given responses (Belief Tracking, BT). Under standard outcome-based RL, where only a terminal reward is provided, a feedback loop can arise: weak BT causes the agent to undervalue informative queries, leading to a cessation of explorative question-asking and entrenchment in low-information states. This phenomenon, termed information self-locking, impedes the agent’s ability to internalize evidence and improve performance. The RL formulation typically models these problems as POMDPs: the agent maintains a belief $b_t \in \Delta(S)$ over hidden state $s^* \in S$, selects queries according to policy $\pi_\theta(a_t|b_t)$, and updates beliefs with operator $\mathcal{T}_u(b_t, a_t, o_t)$ after observing $o_t$. The objective $J(\theta)=\mathbb{E}_{T\sim \pi_\theta}[R(T)]$ is optimized via policy-gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{T}\left[R(T) \sum_t \nabla_\theta\log \pi_\theta(a_t|b_t)\right].$$

However, deficient AS or BT restricts exploration, which in turn hinders improvement of both capabilities, reinforcing information self-locking (Zou et al., 12 Mar 2026).
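To make the credit-assignment problem concrete, the following is a minimal sketch of a single-trajectory estimate of the policy gradient above. It assumes a deliberately simplified, belief-independent softmax policy over a fixed set of queries (so that the gradient of $\log \pi_\theta(a)$ with respect to the logits is the one-hot vector of $a$ minus the action probabilities); the function names are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, actions, terminal_reward):
    """Single-trajectory estimate of R(T) * sum_t grad log pi(a_t).

    Assumption: the policy ignores the belief b_t and is a plain softmax
    over logits, so grad_logits log pi(a) = onehot(a) - pi.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in actions:
        for k in range(len(logits)):
            grad[k] += (1.0 if k == a else 0.0) - probs[k]
    # Every step is scaled by the same scalar R(T).
    return [terminal_reward * g for g in grad]
```

Because every step in the trajectory is scaled by the same $R(T)$, an informative query and a redundant one receive identical credit, and a trajectory with zero terminal reward contributes no gradient at all.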

2. Easy-to-Obtain Directional Critiques

AREW introduces two classes of binary critiques:

  • AS-critique $z_t^Q \in \{+1, -1, 0\}$: Indicates whether query $a_t$ was informative.
    • Example (MediQ): $z_t^Q=+1$ if new clinical facts are elicited, $-1$ for “Unknown” or redundancies, $0$ if no query.
  • BT-critique $z_t^U \in \{+1, -1, 0\}$: Reflects whether the belief update increased the agent’s confidence in the ground-truth.
    • Example (PE-G): $z_t^U=+1$ if $\cos(w_t,w^*) - \cos(w_{t-1},w^*) > 0$; $-1$ otherwise, $0$ if uninformative.

These critiques exploit readily available user feedback and simple agent confidence metrics. Their ease of extraction enables their direct deployment in general multi-turn reasoning settings without modification to the reward structure or the introduction of additional critics.
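As a rough illustration of how cheap these critiques are to compute, the sketch below implements the two examples. The helper names (`bt_critique`, `as_critique`), the `seen_facts` set, and the tolerance used to decide that a belief update was "uninformative" are assumptions made for illustration; the source does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bt_critique(w_prev, w_curr, w_star, tol=1e-9):
    """BT-critique in the style of the PE-G example.

    Assumption: a change in cosine similarity smaller than `tol`
    counts as an uninformative update and is labeled 0.
    """
    gain = cosine(w_curr, w_star) - cosine(w_prev, w_star)
    if abs(gain) <= tol:
        return 0
    return 1 if gain > 0 else -1


def as_critique(response, seen_facts):
    """AS-critique in the style of the MediQ example.

    Assumption: `response` is None when no query was asked, the literal
    string "Unknown" when the query elicited nothing, and otherwise a
    fact that is redundant if it is already in `seen_facts`.
    """
    if response is None:
        return 0
    if response == "Unknown" or response in seen_facts:
        return -1
    return 1
```

For instance, `bt_critique([1, 0], [1, 1], [0, 1])` returns `1`, because the updated estimate moved closer in direction to the target vector.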

3. Policy Gradient Reweighting and Objective Modification

The AREW method augments the standard policy-gradient update with an auxiliary within-trajectory margin loss that systematically reallocates probability mass:

Given the sets $P = \{t : z_t=+1\}$ and $N = \{t : z_t=-1\}$, the margin term is defined as:

$$C(\theta;T) = \frac{1}{|P|}\sum_{t \in P} \log \pi_\theta(a_t|b_t) - \frac{1}{|N|}\sum_{t \in N} \log \pi_\theta(a_t|b_t)$$

The resulting augmented objective is:

$$L_{\rm AREW}(\theta) = J(\theta) + \eta \mathbb{E}_{T}[C(\theta; T)], \quad \eta > 0$$

The combined gradient is:

$$\nabla_\theta L_{\rm AREW}(\theta) = \mathbb{E}_T\left[\sum_{t=0}^{H-1} (A_t + \eta u_t) \nabla_\theta \log \pi_\theta(a_t|b_t)\right]$$

where $u_t = +1/|P|$ for $z_t=+1$, $u_t = -1/|N|$ for $z_t=-1$, and $u_t = 0$ otherwise. This can be interpreted as reweighting the per-step advantage:

$$w_t = 1 + \eta u_t$$

so that each advantage $A_t$ is multiplied by $w_t$, amplifying learning from positively critiqued steps and suppressing negative ones.
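The definitions of $u_t$, $w_t$, and the margin term translate almost directly into code. The sketch below is one possible reading, not the authors' implementation: the function names and the default `eta` are illustrative, and treating an empty $P$ or $N$ as contributing zero to the margin is an assumption, since the source does not say how that case is handled.

```python
def critique_weights(z, eta=0.05):
    """Map per-step critiques z_t in {+1, -1, 0} to (u_t, w_t).

    u_t = +1/|P| for positive steps, -1/|N| for negative steps, 0 otherwise;
    w_t = 1 + eta * u_t.
    """
    n_pos = sum(1 for v in z if v == 1)
    n_neg = sum(1 for v in z if v == -1)
    u = []
    for v in z:
        if v == 1:
            u.append(1.0 / n_pos)   # n_pos >= 1 whenever this branch runs
        elif v == -1:
            u.append(-1.0 / n_neg)  # n_neg >= 1 whenever this branch runs
        else:
            u.append(0.0)
    w = [1.0 + eta * u_t for u_t in u]
    return u, w


def margin_term(log_probs, z):
    """C(theta; T): mean log-prob over P minus mean log-prob over N.

    Assumption: an empty P or N contributes 0 instead of raising.
    """
    pos = [lp for lp, v in zip(log_probs, z) if v == 1]
    neg = [lp for lp, v in zip(log_probs, z) if v == -1]
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    mean_neg = sum(neg) / len(neg) if neg else 0.0
    return mean_pos - mean_neg
```

With critiques `[1, -1, 0, 1]` and `eta=0.1`, the positive steps each get $u_t = 0.5$ and the single negative step gets $u_t = -1$, so the weights are approximately `[1.05, 0.9, 1.0, 1.05]`. When both sets are non-empty the $u_t$ sum to zero, so the critique term moves probability mass within the trajectory without changing its overall scale.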

4. Practical Implementation and Training Loop

In typical deployments, advantages $A_t$ are estimated via generalized advantage estimation (GAE) or a learned baseline $\hat{V}(b_t)$. AREW modifies the PPO-style actor loss as follows:

  • Standard actor loss:

$$L_{\text{actor}} = \sum_t \min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)$$

  • AREW actor loss:

$$L_{\text{actor}}^{\rm AREW} = \sum_t \min\left(r_t(\theta) w_t A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) w_t A_t\right)$$

All other RL components, including value-function regression, entropy regularization, and KL-penalties, remain unmodified.
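A minimal sketch of the reweighted clipped surrogate follows, assuming $r_t(\theta)$ is the usual PPO probability ratio between the new and old policies, computed here from per-step log-probabilities. The function name and argument layout are hypothetical. The value returned is the quantity to maximize; a gradient-descent implementation would minimize its negative.

```python
import math


def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def arew_actor_objective(new_logp, old_logp, advantages, weights, eps=0.2):
    """Clipped PPO surrogate with each advantage A_t scaled by w_t.

    Passing weights of all 1.0 recovers the standard PPO surrogate.
    """
    total = 0.0
    for lp_new, lp_old, a_t, w_t in zip(new_logp, old_logp, advantages, weights):
        ratio = math.exp(lp_new - lp_old)  # r_t(theta)
        scaled = w_t * a_t                 # reweighted advantage
        unclipped = ratio * scaled
        clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * scaled
        total += min(unclipped, clipped)
    return total
```

Because the weight only rescales $A_t$ before the `min`, the clipping range on the ratio is unchanged, which is consistent with the statement that the other PPO components are left as they are.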

A typical AREW training loop involves:

  1. Initializing policy $\pi_\theta$ and value $V_\phi$.
  2. Collecting $N$ episodes under $\pi_\theta$, recording $(b_t, a_t, o_t)$.
  3. Computing outcome rewards and GAE-based advantages.
  4. Calculating AS and BT critiques $z_t^Q$, $z_t^U$ at each step, determining $u_t$, $w_t$.
  5. Updating $\theta$ with the AREW-modified actor loss.
  6. Updating $\phi$ to fit $V_\phi(b_t)$ to returns (Zou et al., 12 Mar 2026).
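Step 3 of the loop above can be sketched with the standard GAE recursion. The discount `gamma`, the smoothing parameter `lam`, and the convention that `values` carries one extra bootstrap entry (zero at a terminal state) are assumptions chosen for illustration, not settings reported in the source.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value after the final step (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With an outcome-only reward such as `rewards=[0, 0, 1]` and an uninformed value function of all zeros, every step receives a discounted share of the terminal reward regardless of whether its query was informative. This is the signal that the per-step weights $w_t$ are meant to sharpen.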

5. Experimental Findings

Experiments were conducted on seven active-reasoning tasks over three domains: preference estimation (PE-G, PE-F), medical diagnosis (MediQ), and troubleshooting (FloDial). The tested backbones included Qwen-2.5B and LLaMA-3-8B, with RL baselines PPO, GRPO, and GSPO. Key metrics were terminal task reward and per-turn AS/BT proxies.

AREW demonstrated substantial improvements. On PE-G ($S=3$), baseline PPO produced a final reward of 18.3%, while AREW (AS+BT) achieved 80.3%. Comparable uplifts appeared in MediQ (+10.8%) and troubleshooting tasks (FloDial-Hard: +21%). Gains were robust across backbone architectures and baseline choices, supporting the generality of the method (Zou et al., 12 Mar 2026).

6. Ablations, Hyperparameters, and Robustness

Ablation studies highlighted the importance of incorporating both AS and BT critiques: adding BT critiques to AS-only shaping can increase performance by 5–25 points. Choice of reweight parameter $\eta$ is critical; values in $[0.01, 0.1]$ yield stable, effective learning, while higher values may induce instability. Noise robustness experiments showed that AREW’s improvements persist even with up to 30–50% random label flips in binary critiques, aligning with theoretical expectations that the method only requires critique accuracy exceeding 50%.
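One simple way to see why accuracy above 50% can suffice: if each non-zero critique is flipped independently with probability $p$, the expected noisy label is $(1-2p)\,z_t$, which keeps the sign of $z_t$ whenever $p < 0.5$. The simulation below checks this under that independent-flip assumption. It is an illustrative argument, not the paper's analysis, and the function names are hypothetical.

```python
import random


def flip_critiques(z, flip_prob, rng):
    """Flip each non-zero critique independently with probability flip_prob."""
    return [-v if v != 0 and rng.random() < flip_prob else v for v in z]


def mean_agreement(z, flip_prob, trials=20000, seed=0):
    """Average of z_t * noisy_z_t over non-zero steps and many trials.

    Under independent flips this should be close to 1 - 2 * flip_prob.
    """
    rng = random.Random(seed)
    nonzero = [v for v in z if v != 0]
    total = 0.0
    for _ in range(trials):
        noisy = flip_critiques(nonzero, flip_prob, rng)
        total += sum(a * b for a, b in zip(nonzero, noisy)) / len(nonzero)
    return total / trials
```

At a 30% flip rate the agreement stays near $0.4$, so the reweighting still points in the right direction on average, only with a smaller effective $\eta$. At 50% it falls to roughly zero and the critiques carry no directional information.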

The approach directly reallocates advantage mass toward actually informative steps, circumventing the masking effect of poor BT on AS (and vice versa) without reward shaping or auxiliary critics. This direct reweighting enables sustained exploration and improvement of both agent capabilities.

7. Significance and Applicability

AREW constitutes a lightweight, plug-in strategy for any policy-gradient learning algorithm in multi-turn reasoning settings. Its reliance on easily computed binary critiques enables integration without architectural or environmental modifications. By addressing the feedback loop of information self-locking, AREW enables agents to recover effective exploratory behavior and belief-tracking in environments where learning signals are severely delayed or sparse. The generality and empirical strength across diverse RL backbones and task domains indicate its broad applicability for researchers seeking to enhance LLM-based active reasoning (Zou et al., 12 Mar 2026).
