Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

Published 20 Apr 2026 in cs.CL and cs.AI | (2604.18235v1)

Abstract: Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.


Authors (8)

Summary

- The paper introduces CalibAdv, which calibrates advantage signals in GRPO so that correct intermediate steps are not penalized for a wrong final answer and training does not collapse.
- It softly penalizes intermediate steps and dynamically rebalances the ratio of positive to negative advantage to stabilize model outputs.
- Experimental results show an average F1 improvement of 11.80% and no catastrophic collapse across the QA benchmarks tested.

Calibrating Negative Advantage in GRPO for Robust Deep Search Agents

Overview

"Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search" (2604.18235) addresses the instability and suboptimal learning dynamics observed in Group Relative Policy Optimization (GRPO) when applied to multi-turn, search-based QA agents. The paper introduces CalibAdv, a method for fine-grained calibration of advantage signals, particularly focusing on the nuanced use and attribution of negative advantage throughout the interaction process.

Challenges in GRPO-Based Deep Search

Deep search agents, which interact repeatedly with search engines on reasoning-intensive tasks, are typically trained with reinforcement learning with verifiable rewards (RLVR), where the reward depends only on the correctness of the final output. Standard GRPO assigns the resulting outcome-based advantage uniformly to all tokens or steps of a rollout, which leads to two major issues:

- Misattribution of Penalties: Tokens belonging to correct intermediate steps often receive negative advantage when the final output is wrong. This distorts the learning signal and discourages reasoning trajectories that were partially effective.
- Training Instability and Collapse: When negative advantage dominates for a prolonged period, output entropy and model perplexity rise. The end result is catastrophic language degradation or repetitive, nonsensical outputs, that is, training collapse.
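
To make the first issue concrete, the following is a minimal sketch of outcome-only credit assignment in a GRPO-style setup. It assumes the commonly used group-relative form, in which each rollout's reward is standardized against the other rollouts sampled for the same question; the use of the population standard deviation, the `eps` constant, and the helper names are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Outcome-only credit assignment: every token inherits the same value."""
    return [advantage] * num_tokens


# Hypothetical group of four rollouts for one question (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 0.0]
advs = grpo_advantages(rewards)

# A failed rollout with five tokens: all of them are penalized equally,
# including any that belong to a correct intermediate search step.
failed_tokens = broadcast_to_tokens(advs[1], num_tokens=5)
print(advs)
print(failed_tokens)
```

In this toy group, three of the four rollouts carry negative advantage, and every token of a failed rollout receives the same penalty regardless of whether its search step was useful.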

The CalibAdv Approach

CalibAdv augments standard GRPO with targeted interventions at both process and outcome levels:

- Soft Penalization of Intermediate Steps: CalibAdv estimates the correctness of intermediate retrieval steps with a "silver document" proxy, derived from the set of documents retrieved in correct (successful) rollouts. Penalties on intermediate steps are then reweighted: negative advantage is downscaled in proportion to a step's overlap with the silver documents. This reduces the penalty on steps that contributed usefully, even when the rollout ended in a wrong final answer.
- Advantage Rebalancing at the Final Step: Because sustained dominance of negative advantage triggers language collapse, CalibAdv monitors the ratio of positive to negative advantage for the answer-token group and rescales it dynamically. Positive advantage is amplified when the ratio skews negative, which stabilizes training and protects the model's natural language generation ability.
- Special Token Decoupling: The prompt structure of deep search agents often includes high-frequency format tokens such as >. CalibAdv decouples these from the advantage signal entirely by prepending them instead of having the model generate them, so that updates are not forced onto fixed-format regions where they would amplify instability.
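
The first two interventions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the overlap measure (fraction of a step's retrieved documents that are silver), the linear downscaling rule, the definition of advantage mass as a sum of magnitudes, and the `target_ratio` parameter are all hypothetical choices made for the example. The summary above only states that negative advantage is downscaled in proportion to silver-document overlap and that positive advantage is amplified when the ratio skews negative.

```python
def silver_documents(rollouts):
    """Collect the ids of documents retrieved by rollouts that answered correctly."""
    silver = set()
    for rollout in rollouts:
        if rollout["correct"]:
            for docs in rollout["steps"]:
                silver.update(docs)
    return silver


def soft_penalize(step_docs, advantage, silver):
    """Shrink a negative advantage per step according to silver-document overlap.

    Assumption: overlap is the fraction of a step's documents that are silver,
    and the penalty is scaled by (1 - overlap).
    """
    if advantage >= 0 or not silver:
        return [advantage] * len(step_docs)
    calibrated = []
    for docs in step_docs:
        overlap = len(set(docs) & silver) / len(docs) if docs else 0.0
        calibrated.append(advantage * (1.0 - overlap))
    return calibrated


def rebalance(answer_advs, target_ratio=1.0):
    """Amplify positive answer advantages when negative mass dominates.

    Assumption: mass is the sum of magnitudes; target_ratio is hypothetical.
    """
    pos = sum(a for a in answer_advs if a > 0)
    neg = -sum(a for a in answer_advs if a < 0)
    if pos == 0 or neg == 0 or pos / neg >= target_ratio:
        return list(answer_advs)
    scale = target_ratio * neg / pos
    return [a * scale if a > 0 else a for a in answer_advs]


# Hypothetical rollouts: each step is the list of document ids it retrieved.
rollouts = [
    {"correct": True, "steps": [["d1", "d2"], ["d3"]]},
    {"correct": False, "steps": [["d1", "d9"], ["d7", "d8"]]},
]
silver = silver_documents(rollouts)
step_advs = soft_penalize(rollouts[1]["steps"], -0.5, silver)
balanced = rebalance([1.0, -1.0, -1.0, -1.0])
print(step_advs)
print(balanced)
```

In the failed rollout, the first step retrieved one silver document out of two, so its penalty is halved, while the second step retrieved none and keeps the full penalty.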

Experimental Evaluation

CalibAdv is evaluated on three backbone LLMs (Qwen2.5-7B, Qwen2.5-3B, Llama-3.2-3B) and across seven benchmarks, including both multi-hop (e.g., HotpotQA, 2WikiMultiHopQA) and single-hop (e.g., Natural Questions, TriviaQA, PopQA) QA tasks.

Key findings include:
- Performance: CalibAdv yields an average F1 improvement of 11.80% over standard GRPO-based agents, outperforming advanced process-supervision baselines such as StepSearch, MT-GRPO, and GiGPO, while requiring zero additional annotation or sampling overhead.
- Stability: CalibAdv completely avoids catastrophic collapse on all models and datasets tested, as measured by the absence of significant PPL or entropy spikes and the successful completion of all training runs.
- Ablations: Each CalibAdv component contributes additively—prepended prompting addresses format collapse, soft penalization reduces incorrect penalties, and advantage rebalancing eliminates high-perplexity outputs altogether.
- Efficiency and Proxy Accuracy: The "silver document" proxy for correctness is shown, via both human and LLM-based assessment, to reliably identify useful intermediate steps (89% human-verified), matching the performance of LLM-based evaluators with much lower resource costs.
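
The stability finding above refers to spikes in perplexity (PPL) and entropy. As a rough sketch of how such signals can be tracked during training, the code below computes perplexity from the log-probabilities of sampled tokens and the mean Shannon entropy of the next-token distributions. The `spiked` helper and its `factor` threshold are hypothetical; the summary does not state how a "significant" spike is defined.

```python
import math


def perplexity(token_logprobs):
    """Exponential of the mean negative log-likelihood of the sampled tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-position distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def spiked(history, current, factor=3.0):
    """Flag a spike when the current value exceeds factor x the running mean.

    Assumption: the factor-based threshold is an illustrative choice.
    """
    return bool(history) and current > factor * (sum(history) / len(history))


# Hypothetical values: four tokens, each sampled with probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
ent = mean_entropy([[0.5, 0.5], [0.5, 0.5]])
print(ppl, ent, spiked([2.0, 2.1, 1.9], 9.0))
```

Logging both quantities per training step makes the kind of collapse described earlier visible as a sharp rise relative to the recent history.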

Relation to Prior Work and Theoretical Implications

Prior efforts to stabilize GRPO in deep search scenarios have addressed collapse through low-level numerical-precision changes (FP16/BF16), reward filtering (SimpleTIR), post-hoc regularization (LLD), or external supervision of intermediate states (StepSearch, MT-GRPO, CriticSearch, E-GRPO). These approaches are typically limited in generality, incur annotation cost, or leave the underlying problems of misattributed penalties and accumulating negative signal unaddressed.

By shifting the focus to direct, step-level calibration of advantage as a core training signal, CalibAdv reframes the design of RLVR for search-intensive reasoning: it aligns signals with process-level correctness, yielding more interpretable, robust, and generalizable agents. The proxy-based advantage reassignment demonstrates that, for complex multi-step inference, approximate signals derived from the internal retrieval structure can be sufficient for strong downstream improvement.

Implications and Future Directions

The results presented suggest several practical and theoretical avenues:
- Practical Robustness: CalibAdv is model-agnostic and does not need expensive annotation or post-processing. Its incorporation into RL-based training could stabilize and improve a broad class of LLM-augmented search agents.
- Generalized Credit Assignment: This work underscores the importance of process-aligned reward assignment in multi-turn RL, which could motivate analogous strategies in planning, dialogue, and other task-structured RL domains.
- Scalability and Cost: The demonstration that cost-effective proxies suffice for nuanced credit assignment opens opportunities for large-scale, fine-grained RL in reasoning-centric applications.
- Extension to Other Tool-Augmented Agents: While developed for search-based QA, the core approach is applicable to tool-integrated agents performing multi-turn interactions (e.g., database QA, program synthesis, scientific reasoning), contingent on the availability of step-level signal proxies.

Conclusion

CalibAdv addresses core deficiencies in GRPO-based training for deep search agents by combining process-aware, proxy-driven, and stability-oriented advantage calibration. The reported results show higher performance than prior methods and no training collapse in the tested settings, which points to the importance of fine-grained control of negative advantage in multi-hop reasoning environments. The method offers an approach to RL credit assignment in agentic LLMs, with implications for the design, evaluation, and deployment of resilient reasoning agents.