Unified Feedback Integration (Reagent-U)
- Unified Feedback Integration (Reagent-U) is a training methodology in agentic RL that unifies textual critiques and scalar scores as learning signals.
- It employs a two-stage reinforcement learning loop with initial trajectory generation and critique-guided refinement to optimize policy performance.
- Empirical benchmarks show that Reagent-U achieves up to a +22.3 percentage point Pass@1 gain over baselines, highlighting its robust impact on agent reasoning and tool-use.
Unified Feedback Integration (Reagent-U) denotes a training methodology within Agentic RL frameworks that integrates multiple channels of structured feedback—explicit reasoning traces, focused textual critiques, and scalar process scores—emitted by a reward model (Agent-RRM) for each agentic trajectory. This unified feedback is used both as contextual guidance during trajectory refinement and as numerically dense learning signals in the reinforcement learning (RL) objective. Extensive benchmark evaluation demonstrates substantial improvements in agent reasoning and tool-use capabilities when adopting Reagent-U.
## 1. Formal Definition and Reward Structuring
Reagent-U leverages the multi-modal output of the Agent Reasoning Reward Model (Agent-RRM), which produces, for each trajectory $\tau$ (given a query $q$):
- a `<think>` reasoning trace (textual sequence);
- a `<critique>` targeted summary (text);
- an overall scalar `<score>`.

These outputs are formally interpreted as:

- $R_{\text{think}}(\tau)$: a numeric mapping from the reasoning trace (functionally, $\text{think} \mapsto \mathbb{R}$);
- $R_{\text{critique}}(\tau)$: a numeric mapping from the critique (functionally, $\text{critique} \mapsto \mathbb{R}$);
- $R_{\text{score}}(\tau)$: the scalar score emitted by Agent-RRM.

The unified reward for each trajectory $\tau$ is defined as

$$R_{\text{RRM}}(\tau) = \alpha\, R_{\text{think}}(\tau) + \beta\, R_{\text{critique}}(\tau) + \gamma\, R_{\text{score}}(\tau),$$

where $\alpha, \beta, \gamma$ are weighting coefficients. In the evaluated implementation, only $R_{\text{score}}$ is plugged into the loss numerically: $\alpha$ and $\beta$ are set to zero and $\gamma$ is a tunable weight (typically $0.3$).

Outcome correctness is incorporated via

$$R_{\text{outcome}}(\tau) = \mathbb{1}\big[\text{final answer of } \tau \text{ is correct}\big].$$

The total per-trajectory reward is

$$R(\tau) = R_{\text{outcome}}(\tau) + R_{\text{RRM}}(\tau) = R_{\text{outcome}}(\tau) + \gamma\, R_{\text{score}}(\tau).$$

## 2. Training Algorithm and Optimization Objective

Reagent-U employs a two-stage RL training loop based on Group Relative Policy Optimization (GRPO):

1. Initial Rollout Generation: For each query $q$, $G$ agentic trajectories are sampled using the current policy $\pi_\theta$. Each trajectory is evaluated by Agent-RRM to obtain `<think>`, `<critique>`, and `<score>`.

2. Critique-Guided Refinement: Each trajectory is refined by conditioning the agent's sampling policy on the original output and its critique.

3. Pooling and Reward Computation: Both initial and refined trajectories ($2G$ in total) are evaluated, and their rewards $R(\tau_i)$ (initial) and $R(\tau_i')$ (refined) are jointly normalized to advantages $\hat{A}_i$.

4. GRPO Loss Computation: For each trajectory $\tau_i$, the clipped policy loss and KL penalty are combined:

   $$\mathcal{L}(\theta) = -\frac{1}{2G} \sum_{i=1}^{2G} \min\!\Big( r_i(\theta)\, \hat{A}_i,\; \operatorname{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) + \beta_{\text{KL}}\, D_{\text{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),$$

   where $r_i(\theta) = \pi_\theta(\tau_i \mid q) / \pi_{\theta_{\text{old}}}(\tau_i \mid q)$ and $\epsilon$ is the clipping range.

5. Parameter Update: Policy parameters $\theta$ are updated using AdamW.

Key hyperparameters include: $\gamma$ (score weight, typically $0.3$), $\epsilon$ (clip range), $\beta_{\text{KL}}$ (KL penalty), $G$ (rollouts per query), and the learning rate.
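The reward and advantage computation above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the $\gamma = 0.3$ default follows the text, while the helper names and the example rollout pool are assumptions.

```python
import statistics

def trajectory_reward(outcome_correct: bool, rrm_score: float, gamma: float = 0.3) -> float:
    """Total per-trajectory reward: binary outcome term plus the
    gamma-weighted Agent-RRM scalar score. With alpha = beta = 0, the
    think/critique channels contribute no numeric term."""
    r_outcome = 1.0 if outcome_correct else 0.0
    return r_outcome + gamma * rrm_score

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize the pooled rewards of the
    2G initial + refined rollouts by the group mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical pool of 2G = 4 rollouts for one query:
# (final answer correct?, Agent-RRM scalar score)
pool = [(True, 0.9), (False, 0.4), (True, 0.7), (False, 0.2)]
rewards = [trajectory_reward(ok, s) for ok, s in pool]
advantages = group_advantages(rewards)  # zero-mean, unit-variance group
```

Normalizing within the pooled group (rather than per stage) is what lets the refined rollouts compete directly against the initial ones for credit.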
## 3. Comparative Integration Strategies

Three feedback integration schemes are contrasted:

| Strategy  | Signal Modalities  | RL Update? | Pass@1 Gain (GAIA) |
|-----------|--------------------|------------|--------------------|
| Reagent-C | Critique text only | No         | +3.8pp             |
| Reagent-R | Scalar score only  | Yes        | +2.9pp             |
| Reagent-U | Text + score       | Yes        | +22.3pp            |

Reagent-U uniquely merges critique-guided sampling with scalar-signal reward densification, pooling initial and refined rollouts to maximize the learning signal.

- Reagent-C ("Text-augmented Refinement") conditions trajectory sampling on Agent-RRM's critique but does not optimize the policy via RL.
- Reagent-R ("Reward-augmented Guidance") uses only the scalar score for reward shaping and omits the critique from the policy context.
- Reagent-U ("Unified Feedback Integration") pools both modalities, leveraging the critique for context and the scalar score for the objective, and normalizes rewards across both stages.

## 4. Empirical Performance and Benchmark Results

Reagent-U achieves substantial performance gains across general-agent and knowledge-intensive benchmarks. On a Qwen3-8B backbone:

- GAIA (text) Pass@1: Reagent-U = 43.7% (vs. Reagent-R 36.9%, baseline 21.4%, ARPO-14B 43.7%)
- WebWalkerQA: Reagent-U = 46.2% (vs. Reagent-R 45.3%, baseline 29.0%)
- Bamboogle: 76.8% (Reagent-U), 72.8% (Reagent-R), 61.6% (baseline)
- AIME24: 60.0% (Reagent-U), 53.3% (Reagent-R), 46.7% (baseline)

Across 12 benchmarks, Reagent-U leads sparse-reward baselines by 5–20 percentage points, confirming strong empirical effectiveness.

## 5. Component Isolations and Synergistic Effect
Ablation analyses isolate and quantify the contribution of each feedback modality:

| Strategy  | GAIA Pass@1 | Bamboogle | GSM8K |
|-----------|-------------|-----------|-------|
| Baseline  | 21.4%       | 61.6%     | 94.6% |
| Reagent-C | 25.2%       | —         | 94.9% |
| Reagent-R | 36.9%       | 72.8%     | —     |
| Reagent-U | 43.7%       | 76.8%     | —     |

Stepwise comparison demonstrates that textual critique and scalar score each independently improve performance, and their combination in Reagent-U yields a synergistic lift.

## 6. Mechanistic Rationale and Future Extensions

The dual-channel feedback mechanism of Reagent-U is posited to facilitate:

- Local policy refinement via mid-trajectory, fine-grained critique-guided sampling (correcting logic and tool-use errors);
- Global policy shaping via dense scalar rewards for overall reasoning quality, mitigating the sparsity of final-answer correctness.

By jointly leveraging both information channels, Reagent-U aligns mid-level correction with episodic credit assignment, enabling the agent to internalize debugging instructions and adapt its sampling policy accordingly.

Potential extensions include:

- Directly incorporating `<think>` traces and `<critique>` outputs into training loss functions (e.g., margin-based penalties);
- Hierarchical RL that partitions the reasoning and action policies;
- Scaling the methodology to larger models and richer toolsets;
- Integrating human-provided critiques alongside Agent-RRM for more diverse feedback.

This suggests that general agentic-RL approaches should systematically employ both linguistic and numeric feedback signals to maximize the explanatory power of advanced reward models (Fan et al., 29 Jan 2026).
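The critique-guided refinement stage, through which the agent "internalizes debugging instructions," can be sketched as simple prompt conditioning. The function name, prompt wording, and example strings below are illustrative assumptions, not the paper's actual template.

```python
def build_refinement_prompt(query: str, initial_trajectory: str, critique: str) -> str:
    """Illustrative second-stage conditioning: fold the first-pass
    trajectory and Agent-RRM's targeted critique back into the sampling
    context so the policy can correct logic and tool-use errors."""
    return (
        f"Task: {query}\n\n"
        f"Previous attempt:\n{initial_trajectory}\n\n"
        f"Reviewer critique:\n{critique}\n\n"
        "Revise the attempt, addressing each point in the critique."
    )

# Hypothetical usage with a made-up first-pass rollout and critique:
prompt = build_refinement_prompt(
    "Which tool result supports the final answer?",
    "<think>...</think> tool_call: search(...) ... final answer: X",
    "The final answer cites no tool output; verify against the search result.",
)
```

Both the initial and the refined rollout then enter the same pooled GRPO update, so improvements elicited by the critique are reinforced numerically rather than discarded after sampling.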