Experimental Evaluation
CalibAdv is evaluated on three backbone LLMs (Qwen2.5-7B, Qwen2.5-3B, Llama-3.2-3B) across seven benchmarks covering both multi-hop (e.g., HotpotQA, 2WikiMultiHopQA) and single-hop (e.g., Natural Questions, TriviaQA, PopQA) QA tasks.
Key findings include:
- Performance: CalibAdv yields an average F1 improvement of 11.80% over standard GRPO-based agents, outperforming advanced process-supervision baselines such as StepSearch, MT-GRPO, and GiGPO, while requiring zero additional annotation or sampling overhead.
- Stability: CalibAdv completely avoids catastrophic collapse on all models and datasets tested, as measured by the absence of significant PPL or entropy spikes and the successful completion of all training runs.
- Ablations: Each CalibAdv component contributes additively: prepended prompting addresses format collapse, soft penalization reduces incorrect penalties, and advantage rebalancing eliminates high-perplexity outputs altogether.
- Efficiency and Proxy Accuracy: The "silver document" proxy for correctness is shown, via both human and LLM-based assessment, to reliably identify useful intermediate steps (89% human-verified), matching the performance of LLM-based evaluators at much lower resource cost.
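The stability finding above is stated in terms of perplexity (PPL) and entropy spikes. As a rough illustration of how such signals could be monitored during training, the sketch below computes sequence perplexity from token log-probabilities, mean token entropy from per-token distributions, and flags values that jump well above a recent baseline. The function names, the `window` size, and the `factor` threshold are assumptions made for this example, not details taken from the evaluation.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity of one sequence from natural-log token log-probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total = 0.0
    for dist in token_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def find_spikes(values, window=5, factor=3.0):
    """Indices where a value exceeds `factor` times the median of the
    preceding `window` values. Both parameters are hypothetical defaults."""
    spikes = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if values[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

Applied to a per-step log of training perplexity, `find_spikes` would return the steps at which a sudden jump occurs; an empty list corresponds to the "no significant spikes" outcome described above, under whatever threshold is chosen.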
Relation to Prior Work and Theoretical Implications
Prior efforts at stabilizing GRPO in deep search scenarios have tackled collapse through low-level representation strategies (such as FP16/BF16 precision changes), reward filtering (SimpleTIR), post-hoc regularization (LLD), or external supervision for intermediate states (StepSearch, MT-GRPO, CriticSearch, E-GRPO). These approaches typically suffer from limited generality, high annotation cost, or an inability to address the underlying issues of misattributed penalties and negative-signal accumulation.
By shifting the focus to direct, step-level calibration of advantage as a core training signal, CalibAdv reframes the design of RLVR for search-intensive reasoning: it aligns signals with process-level correctness, yielding more interpretable, robust, and generalizable agents. The proxy-based advantage reassignment demonstrates that, for complex multi-step inference, approximate signals derived from the internal retrieval structure can be sufficient for strong downstream improvement.
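To make the idea of step-level advantage calibration more concrete, the sketch below first computes standard group-relative advantages as in GRPO (reward minus group mean, divided by group standard deviation) and then softens the penalty on steps that a proxy marks as useful. The calibration rule itself, the `soft_factor` value, and the boolean `step_hits_silver` flags (standing in for a silver-document check) are illustrative assumptions; this is not CalibAdv's actual formulation, which is not specified in this summary.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).
    Uses the population standard deviation; implementations differ on this."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def calibrate_step_advantages(trajectory_advantage, step_hits_silver, soft_factor=0.1):
    """Assign a trajectory-level advantage to each step of that trajectory.

    Hypothetical rule: when the trajectory advantage is negative, steps
    flagged as useful by the proxy receive only `soft_factor` of the penalty;
    all other cases pass the advantage through unchanged.
    """
    step_advantages = []
    for hit in step_hits_silver:
        if trajectory_advantage < 0 and hit:
            step_advantages.append(trajectory_advantage * soft_factor)
        else:
            step_advantages.append(trajectory_advantage)
    return step_advantages


# Example: a group of four rollouts with binary outcome rewards.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# The second rollout failed overall, but its first and third steps
# retrieved a document the proxy considers useful.
per_step = calibrate_step_advantages(advantages[1], [True, False, True])
```

In this toy example the failed rollout's unflagged step keeps the full negative advantage while the flagged steps are penalized only lightly, which is the kind of misattributed-penalty correction the paragraph above describes.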
Implications and Future Directions
The results presented suggest several practical and theoretical avenues:
- Practical Robustness: CalibAdv is model-agnostic and does not need expensive annotation or post-processing. Its incorporation into RL-based training could stabilize and improve a broad class of LLM-augmented search agents.
- Generalized Credit Assignment: This work underscores the importance of process-aligned reward assignment in multi-turn RL, which could motivate analogous strategies in planning, dialogue, and other task-structured RL domains.
- Scalability and Cost: The demonstration that cost-effective proxies suffice for nuanced credit assignment opens opportunities for large-scale, fine-grained RL in reasoning-centric applications.
- Extension to Other Tool-Augmented Agents: While developed for search-based QA, the core approach is applicable to tool-integrated agents performing multi-turn interactions (e.g., database QA, program synthesis, scientific reasoning), contingent on the availability of step-level signal proxies.
Conclusion
The introduction of CalibAdv resolves core deficiencies in GRPO-based training for deep search agents by integrating process-aware, proxy-driven, and stability-centric advantage calibration. Empirical results demonstrate superior performance and robust prevention of training collapse relative to prior methods, highlighting the necessity of fine-grained negative advantage control in multi-hop reasoning environments. This method constitutes a new paradigm for RL credit assignment in agentic LLMs, with implications for the design, evaluation, and deployment of resilient, high-utility reasoning agents.