Create a Video View Paper

Turn-Level Rewards for Multi-Turn Agent Training

This presentation explores a breakthrough in training multi-turn reasoning agents through turn-level reward design. The research addresses the critical problem of credit assignment in reinforcement learning for agents that interact across multiple turns, introducing MT-GRPO and MT-PPO methods that provide fine-grained learning signals at each interaction step rather than sparse final outcomes.

Script

Imagine training an AI agent to search and reason across multiple steps, but only knowing if it succeeded at the very end. Which search query helped? Which reasoning step failed? This paper tackles the credit assignment problem in multi-turn agent training through turn-level reward design.

Let's start by understanding why current approaches fall short.

Traditional reinforcement learning for multi-turn agents faces a fundamental challenge. When agents reason, search, and act across multiple turns, sparse final rewards create a credit assignment nightmare.

The key insight is moving from trajectory-level to turn-level reward design. Instead of waiting until the end, the authors provide learning signals at every interaction step.

Now let's explore how they redesigned the training process.

The authors formalize multi-turn interactions as a turn-level MDP. Each turn becomes a decision point with its own state, action, and crucially, its own reward signal.

They introduce two approaches. MT-GRPO extends group-relative policy optimization with turn-specific advantages, while MT-PPO uses a critic network for scalable token-level credit assignment.

Let's examine how this works in practice with a concrete agent design.

Their case study focuses on a reasoning-augmented search agent that iteratively queries Wikipedia. Each turn involves reasoning, searching, and incorporating new information before the final answer.

The reward structure combines intermediate signals for each search turn with final outcome evaluation. Intermediate rewards track retrieval success and format compliance, while outcome rewards focus on answer accuracy.

The implementation uses Qwen2.5-7B as the base model with carefully tuned PPO hyperparameters. Key engineering choices include masking retrieved tokens during policy updates to focus learning on generated content.

The experimental results demonstrate clear advantages across multiple benchmarks.

They evaluate across 6 datasets spanning both general knowledge and multi-hop reasoning tasks. The evaluation focuses on exact match accuracy and format compliance.

MT-PPO delivers the strongest performance with 44.7% average accuracy and virtually perfect format compliance. More importantly, training becomes significantly more stable with faster convergence.

The comparison reveals stark differences in training stability. While traditional methods often crash or require early stopping, turn-level approaches maintain stable learning throughout.

Several important findings emerge from their analysis.

The ablation studies reveal that search penalties are crucial for preventing excessive querying behavior. Without proper regulation, agents can fall into degenerate patterns that collapse training.

The training dynamics show clear benefits. Dense intermediate rewards provide learning signals that traditional sparse rewards simply cannot match, leading to more efficient strategy acquisition.

Like any approach, this method has important limitations to consider.

The approach faces scalability challenges with MT-GRPO and requires careful reward engineering. The evaluation also focuses primarily on search tasks, leaving broader applicability questions.

The broader implications extend well beyond search agents.

This work establishes turn-level reward design as a general mechanism for improving multi-turn agent training. The principles extend to any interactive environment where agents need fine-grained learning signals.

Turn-level reward design transforms how we train multi-turn agents by providing the fine-grained learning signals that sparse trajectory rewards simply cannot deliver. Visit EmergentMind.com to explore more cutting-edge AI research.