AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards (2512.19126v1)

Published 22 Dec 2025 in cs.CL

Abstract: While reinforcement learning (RL) shows promise in training tool-use LLMs using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, naively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

Summary

  • The paper presents AWPO, a novel framework that adaptively integrates reasoning rewards with traditional outcome-based reinforcement learning to enhance LLM tool-use.
  • It introduces variance-aware gating, difficulty-aware weighting, and adaptive clipping to stabilize optimization and improve multi-turn, compositional decision-making.
  • Empirical results show significant accuracy gains on tool-use benchmarks, while general language understanding on the out-of-distribution MMLU-Pro benchmark is preserved or slightly improved.

AWPO: Integration of Reasoning Rewards for Tool-Use RL in LLMs

Motivation and Problem Formulation

Recent advances in LLMs have expanded their utility through tool-use—compositional, multi-step invocation of external APIs and functions. Standard post-training paradigms for tool-capable LLMs rely on either supervised fine-tuning (SFT) or reinforcement learning (RL) with outcome-based rewards, yet both approaches exhibit fundamental limitations. Supervised approaches suffer from overfitting to demonstration trajectories and insufficient generalization. RL-based methods, such as Group-Relative Policy Optimization (GRPO) and its variants, focus on verifiable outcome rewards but largely overlook explicit feedback on the intermediate reasoning steps, which is critical for robust reasoning and compositional planning. Naive amalgamation of reasoning and outcome rewards generates optimization instability due to conflicting learning signals.
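For reference, GRPO-style baselines score each rollout only against the verifiable outcome rewards of its group. In the standard formulation (textbook GRPO, not notation taken from this paper), a group of G rollouts is sampled per prompt and each rollout's reward is normalized within the group:

```latex
% Standard GRPO-style group-relative advantage (outcome rewards only).
\[
  \hat{A}_i \;=\; \frac{r_i \;-\; \operatorname{mean}\!\big(\{r_j\}_{j=1}^{G}\big)}
                       {\operatorname{std}\!\big(\{r_j\}_{j=1}^{G}\big) + \varepsilon},
  \qquad i = 1, \dots, G.
\]
```

When all G rollouts succeed or all fail, the numerator vanishes for every sample and the group contributes no learning signal; this saturation is exactly what motivates adding an explicit reasoning reward.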

The AWPO Framework

To address reward integration challenges, "AWPO: Enhancing Tool-Use of LLMs through Explicit Integration of Reasoning Rewards" (2512.19126) introduces the Advantage-Weighted Policy Optimization (AWPO) framework. AWPO extends the GRPO family by adaptively integrating fine-grained reasoning rewards, derived using LLM-as-a-Judge protocols, into the RL post-training pipeline.

AWPO’s core architectural innovations include:

  • Variance-aware gating: Reasoning rewards are selectively weighted in advantage computation based on their group-relative discriminative variance compared to outcome rewards. AWPO exploits reward variance as a proxy for informative signal, following a theoretically-motivated upper bound on policy improvement.
  • Difficulty-aware weighting: Group-level weights prioritize samples of medium optimization difficulty, which offer maximal learning signal and prevent over-optimization on trivial or saturated trajectory groups.
  • Adaptive clipping: Dynamic adjustment of the trust region according to batch-level reliance on mixed (reasoning) rewards ensures optimization stability, especially when integrating high-variance supervisory signals.

AWPO leverages LLM-judged, multi-dimensional chain-of-thought (CoT) rubric scores covering logical consistency, tool selection, parameter correctness, and execution strategy. These fine-grained, structured signals complement rule-driven outcome supervision. Reward mixing is performed group-wise and dynamically gated, so AWPO avoids over-reliance on any single reward source and mitigates the instability observed with naive reward mixing; the sketch below illustrates how such group-wise blending could be wired.
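The paper specifies the exact gate, weighting, and clipping rules; the following Python sketch is only an illustration of the mechanism under stated assumptions. The soft variance gate, the difficulty-weight formula, the rubric aggregation, and all constants are illustrative choices, not the authors' definitions.

```python
import numpy as np

RUBRIC_DIMS = ["logical_consistency", "tool_selection",
               "parameter_correctness", "execution_strategy"]

def rubric_to_reward(scores: dict) -> float:
    """Collapse LLM-judge rubric scores (assumed to lie in [0, 1]) into one scalar.
    Equal weighting of the four dimensions is an illustrative choice."""
    return float(np.mean([scores[d] for d in RUBRIC_DIMS]))

def awpo_style_advantages(outcome_rewards, reasoning_rewards, eps: float = 1e-6):
    """Group-wise blending of outcome and reasoning advantages (illustrative only).

    - Variance-aware gating: the reasoning advantage is trusted in proportion to
      how discriminative the judge scores are relative to the outcome rewards
      within this group (soft gate in [0, 1]).
    - Difficulty-aware weighting: the whole group is weighted toward medium
      difficulty (success rate near 0.5); trivial or saturated groups are
      down-weighted but, thanks to the 0.25 floor, not discarded.
    """
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_rsn = np.asarray(reasoning_rewards, dtype=float)

    def norm_adv(r):
        return (r - r.mean()) / (r.std() + eps)

    adv_out, adv_rsn = norm_adv(r_out), norm_adv(r_rsn)

    # Soft variance-aware gate: approaches 1 when outcome rewards stop
    # discriminating between rollouts (e.g., nearly all succeed or all fail).
    gate = r_rsn.var() / (r_rsn.var() + r_out.var() + eps)

    # Difficulty-aware group weight, peaked at a 50% success rate.
    p = float((r_out > 0).mean())
    w_difficulty = 0.25 + 0.75 * (4.0 * p * (1.0 - p))

    return w_difficulty * (adv_out + gate * adv_rsn)

def adaptive_clip_range(base_eps: float, mean_gate: float, shrink: float = 0.5) -> float:
    """Tighten the PPO-style clip range when a batch leans heavily on
    judge-derived reasoning rewards (again, an illustrative rule)."""
    return base_eps * (1.0 - shrink * mean_gate)

if __name__ == "__main__":
    # One group of four rollouts for the same prompt: the outcome rewards barely
    # discriminate, so the judge-scored reasoning rewards carry most of the signal.
    outcome = [0.0, 0.0, 0.0, 1.0]
    reasoning = [rubric_to_reward({d: s for d in RUBRIC_DIMS})
                 for s in (0.2, 0.4, 0.7, 0.9)]
    print(awpo_style_advantages(outcome, reasoning))
    print(adaptive_clip_range(base_eps=0.2, mean_gate=0.6))
```

In this toy setup the reasoning advantage contributes most when the outcome rewards within a group are nearly indistinguishable, mirroring the variance-saturation argument developed in the theoretical analysis below.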

Theoretical Analysis

AWPO’s policy improvement is rigorously grounded in an extension of classical RL analysis. The authors decompose expected improvement into signal (policy gradient norm modulated by advantage variance and its Fisher-normalized alignment) and noise (stochastic gradient variance). Formal upper bounds demonstrate:

  • Sufficient variance in the effective advantage—augmented by LLM-judged reasoning signals—enlarges the theoretical optimization capacity, particularly in late-stage RL training scenarios where outcome reward variance is saturated.
  • Adaptive blending of reasoning rewards, constrained by empirical variance and difficulty level, can provably increase the signal-to-noise ratio in gradient estimation.

This analysis clarifies the limitations of outcome-only RL and provides a strong theoretical foundation for AWPO’s multi-level gating and weighting mechanisms.
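The paper's precise bound involves the Fisher-normalized alignment term described above; as a rough orientation only (our schematic, not the authors' statement), the familiar smoothness-based decomposition for one stochastic policy-gradient ascent step with step size eta and an unbiased gradient estimate reads:

```latex
% Schematic signal--noise split for one ascent step on an L-smooth objective J,
% with step size \eta and unbiased stochastic gradient \hat{g}_t
% (not the paper's exact bound).
\[
  \mathbb{E}\big[J(\theta_{t+1}) - J(\theta_t)\big]
  \;\ge\;
  \underbrace{\eta\Big(1 - \tfrac{\eta L}{2}\Big)\big\lVert \nabla_\theta J(\theta_t)\big\rVert^2}_{\text{signal}}
  \;-\;
  \underbrace{\tfrac{\eta^2 L}{2}\,\mathbb{E}\big\lVert \hat{g}_t - \nabla_\theta J(\theta_t)\big\rVert^2}_{\text{noise}}.
\]
```

Because the policy-gradient magnitude scales with the variance of the effective advantage, gating in reasoning rewards when outcome-reward variance collapses raises the signal term, while the difficulty weights and adaptive clipping keep the noise term from growing in step; this is the intuition behind the two points above.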

Experimental Results and Empirical Claims

AWPO achieves state-of-the-art performance on comprehensive tool-use benchmarks:

  • On the Berkeley Function-Calling Leaderboard (BFCL), AWPO's 4B-parameter model attains 52.12% multi-turn and 73.20% overall accuracy, representing a 25.2% relative improvement in multi-turn accuracy over strong GRPO-style baselines and outperforming closed-source models such as Grok-4 by 16.0% in multi-turn settings.
  • API-Bank results confirm AWPO's gains, with the 8B model improving Level-3 (most compositional, multi-step) task accuracy by 15.27 points over ToolRL, a 37.7% relative gain.
  • Importantly, on the out-of-distribution MMLU-Pro benchmark—targeting general language understanding without tool-use—AWPO preserves or marginally enhances the base model’s abilities (+1.06% to +1.47% accuracy increments), demonstrating high retention of core knowledge and absence of tool-use overfitting.

Ablation studies attribute most of the observed gains in multi-turn and compositional settings directly to variance-aware gating and difficulty-aware weighting; fixed or naive reward mixing, as well as removal of either mechanism, consistently degrades performance.

Practical, Theoretical, and Future Implications

In practical terms, AWPO provides a scalable, parameter-efficient, and robust approach to tool-use LLM post-training. With rigorous reward design and dynamic integration mechanisms, AWPO turns previously fragile auxiliary reasoning signals into stable drivers of policy improvement, enabling smaller LLMs to surpass much larger closed-source and open models on established tool-use benchmarks. This has implications for democratized, cost-effective agentic systems using sub-10B parameter LLMs.

Theoretically, the AWPO framework and its policy improvement bounds offer a template for integrating auxiliary signals—whether LLM-judged, process-supervised, or otherwise—into outcome-driven RL. The policy improvement decomposition applies broadly and could be leveraged for reward design in other domains with intermediate structure (e.g., program synthesis, planning, compositional reasoning).

Future developments may focus on:

  • Extending AWPO's variance-aware integration to additional auxiliary reward modalities (e.g., human preference, process-level safety metrics).
  • Joint learning of judge models and policy models, or exploration of adversarial judge-policy setups.
  • Automated curriculum learning over difficulty bands, further optimizing group weighting to maximize sample efficiency and transfer.

Conclusion

AWPO represents a principled and empirically validated solution to the reward integration problem in tool-use LLM post-training. By dynamically blending outcome and reasoning rewards through theoretically grounded variance-aware mechanisms, AWPO consistently advances tool-augmented LLM performance without sacrificing generalization—suggesting a general recipe for injecting structured reasoning supervision into outcome-driven RL for LLMs.

