- The paper introduces a novel turn-level credit assignment strategy that provides precise feedback for each decision in multi-turn LLM interactions.
- It models multi-turn tasks as Markov Decision Processes, effectively combining immediate and outcome rewards to guide LLM agents.
- Experiments demonstrate that MT-GRPO outperforms baselines in tool execution, exact match accuracy, and training stability.
This paper (2505.11821) addresses the challenge of training LLMs to act as effective agents in multi-turn environments using Reinforcement Learning (RL). While RL has shown promise in improving LLM reasoning, applying it to tasks requiring sequential interaction with external tools (like search engines, calculators, etc.) faces a key hurdle: poor credit assignment. Existing methods often model multi-turn tasks as bandit problems, assigning credit based on the final outcome of an entire interaction trajectory. This makes it difficult for the agent to learn which specific steps or "turns" contributed positively or negatively to the result, hindering performance on complex, long-horizon reasoning tasks.
To overcome this, the authors propose two main contributions:
- Modeling Multi-Turn Interaction as an MDP: They frame multi-turn tool-use tasks as Markov Decision Processes (MDPs), which inherently capture the sequential nature of decisions and environmental feedback. This moves away from the bandit formulation used in many prior works.
- Turn-Level Credit Assignment: They introduce a fine-grained strategy for estimating advantages at the turn level, rather than just the trajectory level. This allows the agent to learn from feedback on individual steps, incorporating both turn-level (e.g., successful tool use) and outcome-level (e.g., correct final answer) rewards more effectively.
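To make the turn-level view concrete, here is a minimal sketch (not the paper's code; the class and field names are our assumptions) of how a two-turn trajectory could be represented so that rewards and advantages attach to individual turns rather than only to the whole trajectory:

```python
# Illustrative sketch only: a per-turn trajectory representation so that
# immediate rewards and advantages can be stored alongside each decision.
from dataclasses import dataclass, field

@dataclass
class Turn:
    tokens: list[int]            # tokens the policy generated in this turn
    reward: float = 0.0          # turn-level (immediate) reward, e.g. successful tool call
    advantage: float = 0.0       # advantage applied to every token of this turn

@dataclass
class Trajectory:
    turns: list[Turn] = field(default_factory=list)
    outcome_reward: float = 0.0  # outcome-level reward, e.g. exact match on the final answer
```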
The proposed strategy is demonstrated by integrating it into the Group Relative Policy Optimization (GRPO) algorithm, resulting in Multi-Turn GRPO (MT-GRPO). The core idea of MT-GRPO's advantage estimation for a two-turn scenario is to combine turn-level advantages ($\hat{A}^T$) derived from immediate rewards with outcome-level advantages ($\hat{A}^O$) derived from the final rewards. Specifically, the advantage for the first turn (which covers reasoning and tool calling) combines the turn and outcome advantages, while the advantage for the second turn (which covers final reasoning and answer generation) is based solely on the outcome advantage.
For an interaction trajectory $i$ with turn reward $R_i^T$ and outcome reward $R_i^O$, the turn-level advantages in MT-GRPO are calculated as:

$$\hat{A}_{i,1}^{\text{MT-GRPO}} = \hat{A}_i^T + \lambda \hat{A}_i^O$$

$$\hat{A}_{i,2}^{\text{MT-GRPO}} = \hat{A}_i^O$$

where $\lambda$ is a scaling coefficient, and $\hat{A}_i^T$ and $\hat{A}_i^O$ are calculated using GRPO's group-relative approach over a group of $G$ sampled trajectories:

$$\hat{A}_i^T = \frac{R_i^T - \text{mean}(\{R_i^T\}_{i=1}^G)}{\text{std}(\{R_i^T\}_{i=1}^G)}$$

$$\hat{A}_i^O = \frac{R_i^O - \text{mean}(\{R_i^O\}_{i=1}^G)}{\text{std}(\{R_i^O\}_{i=1}^G)}$$
This formulation ensures that the agent gets direct feedback for its initial decision (tool use) while also considering the overall success of the trajectory for both turns. The authors note that this turn-level advantage estimation strategy can be adapted to other RL algorithms beyond GRPO.
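A minimal NumPy sketch of this advantage estimation, assuming a group of $G$ trajectories sampled for the same prompt (function and variable names are ours, and a small epsilon guards against zero group variance; the paper may handle that case differently):

```python
import numpy as np

def group_normalize(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style group-relative advantage: (R - mean) / std over the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def mt_grpo_advantages(turn_rewards: np.ndarray,
                       outcome_rewards: np.ndarray,
                       lam: float = 1.0):
    """Per-trajectory advantages for turn 1 (reasoning + tool call) and turn 2 (final answer)."""
    adv_turn = group_normalize(turn_rewards)        # \hat{A}^T
    adv_outcome = group_normalize(outcome_rewards)  # \hat{A}^O
    return adv_turn + lam * adv_outcome, adv_outcome

# Example with a group of G = 4 trajectories:
a1, a2 = mt_grpo_advantages(np.array([1.0, 1.0, 0.0, 1.0]),
                            np.array([1.0, 0.0, 0.0, 1.0]))
```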
To evaluate their approach, the authors implement a simplified two-turn agent using the Qwen2.5-7B model that interacts with a Wikipedia search tool to answer questions from the TriviaQA dataset. The interaction flow is defined as reasoning -> search -> result -> reasoning -> answer, enforced by strict XML tagging in the system prompt and environment parsing. The environment provides verifiable rewards (a hedged code sketch of such checks follows the list below):
- Turn-Level: Tool Execution (checking correct tool call and no environment error), Search Result Answer Presence (checking if the ground truth appears in search results).
- Outcome-Level: Final Answer Presence, Exact Match (comparing agent's answer to ground truth), XML Format, XML Tag Usage (checking output structure and tag correctness).
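As referenced above, here is a hedged sketch of what such verifiable reward checks might look like; the exact tag names, scoring scales, and normalization used in the paper's environment may differ:

```python
import re

def tool_execution_reward(completion: str, tool_error: bool) -> float:
    """Turn-level: reward a well-formed <search> call that executed without error."""
    called = re.search(r"<search>.+?</search>", completion, re.DOTALL) is not None
    return 1.0 if called and not tool_error else 0.0

def search_result_presence_reward(search_results: str, ground_truth: str) -> float:
    """Turn-level: did the ground-truth answer appear in the retrieved results?"""
    return 1.0 if ground_truth.lower() in search_results.lower() else 0.0

def exact_match_reward(completion: str, ground_truth: str) -> float:
    """Outcome-level: compare the content of <answer>...</answer> against the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0
```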
Experiments compare MT-GRPO against two baseline GRPO variants: GRPO-OR (using only outcome rewards) and GRPO-MR (merging outcome and turn rewards at the trajectory level, $R_i = R_i^O + R_i^T$).
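For contrast with the MT-GRPO snippet above, a sketch of how both baselines assign a single trajectory-level advantage to every turn (same assumptions and naming conventions as before, with an illustrative group of four trajectories):

```python
import numpy as np

def trajectory_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standard GRPO normalization of a single trajectory-level reward over the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

turn_rewards = np.array([1.0, 1.0, 0.0, 1.0])     # example group of G = 4
outcome_rewards = np.array([1.0, 0.0, 0.0, 1.0])

adv_grpo_or = trajectory_advantage(outcome_rewards)                 # GRPO-OR: outcome rewards only
adv_grpo_mr = trajectory_advantage(outcome_rewards + turn_rewards)  # GRPO-MR: merged R_i = R_i^O + R_i^T
```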
The results demonstrate the practical benefits of turn-level credit assignment:
- Tool Execution: MT-GRPO achieves 100% success in correctly invoking the search tool during both training and evaluation. GRPO-MR, which also incorporates turn rewards, performs well on this metric, but GRPO-OR, lacking turn-specific feedback, often fails to use the tool correctly.
- Answer Accuracy: MT-GRPO significantly outperforms baselines in exact match accuracy (50% vs. 33.46% for GRPO-MR and 0% for GRPO-OR on validation).
- Training Stability: MT-GRPO shows more stable training curves and lower variance across multiple runs compared to the baselines, indicating more reliable learning of the desired multi-turn behavior.
The implementation relies on verifiable rewards and structured interaction via XML tags, defining a clear state and action space within the multi-turn sequence. The use of vLLM for efficient rollouts and Huggingface TRL for training demonstrates a practical setup for applying RL to LLMs. The code for the project is available, which is a valuable resource for practitioners wanting to implement similar multi-turn RL agents.
While the current work focuses on a two-turn environment, the authors highlight that the core idea of turn-level credit assignment is general and crucial for scaling RL to more complex, longer multi-turn agent tasks. Future work will explore extending the approach to these more complex scenarios and potentially moving beyond predefined verifiable rewards.