
Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents (2509.14480v1)

Published 17 Sep 2025 in cs.CL, cs.AI, and cs.MA

Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a LLM as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.

Summary

  • The paper introduces TARL, a turn-level adjudicated RL framework, to achieve fine-grained credit assignment in multimodal, multi-turn tool-use interactions.
  • The approach combines robust sandbox environments and mixed-task curricula to yield significant improvements in performance metrics and extended response lengths.
  • Experimental results underscore the necessity of mixed-modality training for stable and generalizable tool-use in both text and speech settings.

Process-Supervised RL for Interactive Multimodal Tool-Use Agents

Introduction and Motivation

The paper addresses the challenge of training interactive agents capable of robust tool use in multimodal, multi-turn settings, with a particular focus on integrating speech and text modalities. The core problem is the development of agents that can perform Tool Integrated Reasoning (TIR): planning and executing sequences of tool calls in response to complex, often ambiguous user requests, which may be delivered via text or speech. The authors identify two primary obstacles: (1) effective credit assignment in long-horizon, multi-turn interactions, and (2) maintaining exploration during RL training, as models tend to become overconfident and exploitative, reducing their ability to discover improved strategies.

To address these, the paper introduces a sandbox environment for RL-based agent training, a novel Turn-level Adjudicated Reinforcement Learning (TARL) framework for fine-grained credit assignment, and a mixed-task curriculum that incorporates mathematical reasoning to sustain exploration. The approach is validated on both text and speech-based tool-use tasks, demonstrating significant improvements over strong RL baselines.

Sandbox Environment and Agent Architecture

The sandbox environment supports realistic, multi-turn interactions between agents, users, and tools in both text and speech modalities. It consists of a backend application with a relational database, a user simulator leveraging LLMs (e.g., GPT-4), and a rule-based verifier for trajectory evaluation.

Figure 1: The environment setup for interactive tool-use agents, supporting both text and speech modalities for user simulation and tool execution.

Agents interact with the environment by generating tool calls and receiving feedback from either simulated users or tool execution results. The agent's policy backbone is an autoregressive LLM, and the state is the token sequence observed so far. The environment alternates between agent actions and environment feedback, supporting interleaved speech-text rollouts for multimodal training.
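As a concrete reference for this alternation between agent actions and environment feedback, the sketch below shows a minimal rollout loop. The names `env`, `agent`, and the message handling are hypothetical stand-ins for illustration, not the paper's actual interfaces.

```python
# Minimal sketch of the agent-environment alternation described above.
# The env/agent interfaces are hypothetical, not the paper's actual API.

def rollout(env, agent, max_turns=20):
    """Alternate agent actions and environment feedback until the task ends."""
    state = env.reset()                    # initial user request (text or transcribed speech)
    trajectory = []
    for _ in range(max_turns):
        action = agent.act(state)          # LLM emits a tool call or a user-facing reply
        feedback, done = env.step(action)  # tool result or simulated-user response
        trajectory.append((state, action, feedback))
        state = state + action + feedback  # state is the token sequence observed so far
        if done:
            break
    terminal_reward = env.verify(trajectory)  # rule-based verifier on the final trajectory
    return trajectory, terminal_reward
```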

RL Algorithms and Credit Assignment

The agent training is formulated as an MDP, with the objective of maximizing expected reward over complete trajectories. The paper benchmarks several RL algorithms:

  • PPO: Standard token-level policy gradient with clipping.
  • GRPO: Group-based normalization of rewards for improved stability.
  • RLOO: Leave-one-out baseline for variance reduction.

A key insight is that vanilla RL with trajectory-level rewards leads to sparse credit assignment, especially problematic in long, multi-turn dialogues. This results in suboptimal learning, as the agent cannot easily attribute success or failure to specific actions.
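To make the difference between the group-based baselines concrete, the sketch below computes per-trajectory advantages in the GRPO style (normalization by group mean and standard deviation) and in the RLOO style (leave-one-out mean). It is an illustrative simplification under trajectory-level rewards, not the paper's training code.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """GRPO-style: normalize each trajectory's reward by the group statistics."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(group_rewards):
    """RLOO-style: subtract the leave-one-out mean of the other trajectories."""
    r = np.asarray(group_rewards, dtype=float)
    n = len(r)
    loo_mean = (r.sum() - r) / (n - 1)
    return r - loo_mean

# Example: four rollouts of the same task with trajectory-level rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))  # group-normalized advantages
print(rloo_advantages(rewards))  # leave-one-out advantages
```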

Turn-level Adjudicated Reinforcement Learning (TARL)

To address the credit assignment problem, the authors propose TARL, which employs an LLM-based judge to provide turn-level rewards. After each turn, the judge evaluates the agent's action in the context of the full conversation and ground-truth annotations, assigning a reward of $-1$, $0$, or $1$ per turn, with a single $-1$ allowed per trajectory to penalize major deviations.

Figure 2: The LLM-based judge provides turn-level rewards by assessing each turn in the context of the full conversation and ground-truth annotations.

The final reward for a trajectory combines the terminal (task-completion) reward with the sum of scaled turn-level rewards, yielding a more informative and discriminative signal for policy optimization. Integrating TARL into GRPO is straightforward (the turn-level rewards are summed over the trajectory), whereas for PPO the authors find that applying the aggregated trajectory-level reward uniformly across all tokens is more stable than per-turn assignment.
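A minimal sketch of how such a combined reward could be assembled is shown below; the scaling coefficient `turn_scale` is an assumed placeholder, and the exact weighting used in the paper may differ.

```python
def tarl_trajectory_reward(terminal_reward, turn_labels, turn_scale=0.1):
    """Combine the terminal task-completion reward with scaled turn-level judgments.

    turn_labels: per-turn scores in {-1, 0, 1} produced by the LLM judge,
    with at most one -1 allowed per trajectory.
    turn_scale: hypothetical scaling coefficient, not the paper's reported value.
    """
    assert turn_labels.count(-1) <= 1, "judge allows a single -1 per trajectory"
    return terminal_reward + turn_scale * sum(turn_labels)

# For GRPO, this scalar serves as the trajectory reward fed into group normalization.
# For PPO, the paper finds it more stable to broadcast this aggregated value
# uniformly across all tokens rather than assigning rewards per turn.
reward = tarl_trajectory_reward(terminal_reward=1.0, turn_labels=[1, 0, 1, -1])
```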

Mixed-Task Training for Exploration

To counteract the reduction in exploration observed during RL training, the authors introduce a mixed-task curriculum that interleaves medium-difficulty math problems (from DeepScaleR) with tool-use tasks. This leverages the base model's pretraining on mathematical reasoning, encouraging longer chain-of-thought (CoT) trajectories and increased self-reflection, which empirically leads to more exploratory behavior.
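One simple way to realize this interleaving is a probabilistic batch sampler that mixes math prompts with tool-use tasks, sketched below. The mixing ratio is an assumed illustration; the paper interleaves medium-difficulty DeepScaleR problems but does not prescribe this specific ratio.

```python
import random

def sample_mixed_batch(tool_tasks, math_tasks, batch_size=16, math_ratio=0.3):
    """Draw a training batch that interleaves tool-use tasks with math problems.

    math_ratio is a hypothetical mixing coefficient used here for illustration.
    """
    batch = []
    for _ in range(batch_size):
        source = math_tasks if random.random() < math_ratio else tool_tasks
        batch.append(random.choice(source))
    return batch
```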

Experimental Results: Text-Based Agents

The proposed methods are evaluated on the Retail domain of the $\tau$-bench environment, using Qwen3-8B as the base model. The main findings are:

  • TARL yields consistent improvements over vanilla RL baselines, with a 2.6–3.5% absolute gain in pass^k metrics.
  • Mixed-task training with TARL achieves a 6% relative improvement over GRPO and a 15% improvement over the base model, reaching 57.4% pass^1.
  • Models trained with TARL and the mixed-task curriculum exhibit longer responses and more frequent self-correction, as measured by the number of "wait" tokens and response length.

Figure 3: PPO-based training strategies show that trajectory-level reward assignment with turn-level evaluation (TARL traj-level) achieves the best performance.

Figure 4: Mixed-task training with fine-grained turn-level evaluation (GRPO+Math+TARL) achieves the highest average reward and response length, indicating enhanced exploration and generalization.

Multimodal Agent Training and Evaluation

The framework is extended to train agents on interleaved speech-text rollouts using Qwen2.5-Omni-7B as the base model. Speech-based user simulation is achieved via SeedTTS. The authors employ a curriculum learning phase to warm up the agent's tool-use abilities, followed by mixed-modality training alternating between math, text-only, and speech-text tasks.
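A schematic of this two-phase schedule is sketched below; the warm-up length and the round-robin order of task types are illustrative assumptions, not the paper's reported configuration.

```python
def curriculum_phase(step, warmup_steps=500):
    """Return the task type for a training step: warm up tool use first,
    then alternate math, text-only, and speech-text tasks.

    warmup_steps and the round-robin order are illustrative assumptions.
    """
    if step < warmup_steps:
        return "text_tool_use"  # curriculum warm-up on tool calling
    return ["math", "text_tool_use", "speech_text_tool_use"][step % 3]
```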

Key results:

  • GRPO+Math+TARL improves pass^1 by over 20% compared to the base model in both text and speech evaluation modes.
  • Training exclusively on text degrades speech-mode performance, underscoring the necessity of mixed-modality training for robust voice agents.

Analysis of Training Strategies

The paper provides a detailed analysis of reward granularity and exploration incentives:

  • Trajectory-level reward assignment outperforms per-turn assignment in PPO, as the latter leads to unstable training and degraded performance.
  • Mixed-task training increases exploration (longer responses), but only yields substantial generalization gains when combined with TARL.
  • Entropy-based loss modifications and real-time intervention via LLM-based verifiers do not improve performance and can destabilize training, highlighting the importance of simple, robust reward structures.

Figure 5: Comparison of GRPO and GRPO+Intervention shows that real-time intervention destabilizes training, as evidenced by increased KL loss and reduced entropy.

Implications and Future Directions

The results demonstrate that process-supervised RL with fine-grained, LLM-judged turn-level rewards and mixed-task curricula can substantially improve the reliability and generalization of tool-use agents in both text and speech modalities. The findings suggest that:

  • Simple, robust reward structures and curriculum design are more effective than complex interventions for long-horizon, multi-turn RL in LLM agents.
  • Mixed-modality training is essential for maintaining and improving speech understanding and tool-use capabilities in multimodal foundation models.
  • The approach is generalizable and can be extended to more diverse domains and more challenging user-agent-tool interaction environments as benchmarks evolve.

Theoretically, the work advances the understanding of credit assignment and exploration in RL for LLM-based agents, providing empirical evidence that process-level supervision (via LLM judges) is a scalable and effective alternative to purely outcome-based or rule-based reward modeling.

Conclusion

This paper presents a comprehensive framework for training interactive, multimodal tool-use agents via process-supervised RL. By combining a sandbox environment, turn-level adjudicated rewards from LLM judges, and a mixed-task curriculum, the approach achieves strong improvements over established RL baselines in both text and speech settings. The results highlight the importance of fine-grained credit assignment and sustained exploration, and provide a foundation for future research on robust, generalizable agentic LLMs capable of complex, real-world tool use.
