UserRL Framework for Interactive Agents
- UserRL is a unified framework that combines standardized gym environments with simulated user interactions to train multi-turn, user-centric agents.
- It employs diverse reward shaping and trajectory scoring methods, such as Reward-to-Go and exponential mapping, to enhance agent learning.
- The framework pairs cost-effective open-source user simulators for large-scale training with higher-fidelity simulated users for evaluation, optimizing reinforcement learning in complex interactive tasks.
UserRL is a unified framework for training and evaluating interactive, user-centric intelligent agents through reinforcement learning in standardized gym environments with simulated user interaction. Designed to address the diversity, dynamics, and challenges of real-world user collaboration, UserRL systematically integrates reward assignment schemes, trajectory-level scoring, and simulator fidelity to advance robust agentic models beyond static benchmarks (Qian et al., 24 Sep 2025).
1. Framework Structure and Components
UserRL is architected around a collection of task-specific gym environments, each modeling distinct user-centric abilities including intent clarification, creative reasoning, persuasion, tool usage, and travel planning. Each environment implements a rule-based task automaton (managing environment progression via deterministic reset() and step() functions) interleaved with simulated user responses generated by an LLM. This dual design enables both rigorous control via environment automata and realism via LLM-driven user feedback.
Agents interact with these environments through a standardized tool interface ("interact_with_env") supporting three operations:
- Action: Engage the user directly (e.g., seeking clarification).
- Search: Invoke retrieval or tool actions.
- Answer: Submit candidate solutions or finalize decisions.
This abstraction facilitates seamless benchmarking and RL training across diverse multi-turn interactive tasks.
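A minimal interface sketch of this abstraction is given below; apart from reset(), step(), and the three operation names described above, all class, field, and environment names are hypothetical illustrations rather than the actual UserRL API.

```python
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class ToolCall:
    # The three operations exposed through the "interact_with_env" tool interface
    operation: Literal["action", "search", "answer"]
    content: str  # message to the user, search/tool query, or candidate answer

class UserGymEnv:
    """Illustrative base class: a rule-based task automaton plus an LLM user simulator."""

    def reset(self) -> str:
        """Deterministically reset the task automaton and return the initial observation."""
        raise NotImplementedError

    def step(self, call: ToolCall) -> Tuple[str, float, bool]:
        """Advance the automaton, query the simulated user, and return
        (user_feedback, turn_reward, done)."""
        raise NotImplementedError

# Example agent turn against a hypothetical concrete environment:
# env = TravelPlanningEnv(); obs = env.reset()
# feedback, reward, done = env.step(ToolCall("action", "Which dates work for your trip?"))
```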
2. Simulated User Integration and Role
A key technical component is the simulated user, implemented via LLM calls (e.g., Qwen3-32B or GPT-4o) to provide context-aware, dynamic responses. During training, an open-source model such as Qwen3-32B is preferred for cost-effective large-scale simulation. Evaluation and final model deployment utilize a stronger simulated user (like GPT-4o) for more realistic interaction fidelity and improved downstream performance. This design enables scalable training and robust transfer to higher-quality environments without incurring prohibitive computational cost.
The simulated user acts as an evaluator, responding to agent actions with feedback aligned to predefined rules or realistic behaviors, thereby supplying the learning signal for reinforcement updates.
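As a rough sketch, such a user simulator can be a thin wrapper around a chat-completion call. The example below assumes the OpenAI Python client against an OpenAI-compatible endpoint; the function name, system prompt, and message handling are illustrative, not the UserRL implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving Qwen3-32B or GPT-4o

def simulated_user_reply(model: str, task_rules: str, dialogue: list[dict]) -> str:
    """Return a context-aware simulated-user response constrained by the task's rules.

    `model` would be, e.g., a Qwen3-32B deployment during training and GPT-4o
    during evaluation, mirroring the setup described above.
    """
    messages = [{"role": "system",
                 "content": "You are simulating a user for this task. "
                            "Follow these rules when responding:\n" + task_rules}]
    messages += dialogue  # prior turns, alternating agent/user roles
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content
```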
3. Reward Assignment and Trajectory-Level Scoring
UserRL systematically explores reward shaping methodologies at both the turn and trajectory levels:
- Turn-Level Reward Shaping: raw per-turn rewards $r_t$ are transformed into shaped rewards $\tilde{r}_t$ by one of several schemes:
  - Naive: the raw reward is used directly, $\tilde{r}_t = r_t$.
  - Equalized: every turn receives the same constant reward.
  - Reward-to-Go (R2G): each turn is credited with the cumulative future reward, $\tilde{r}_t = \sum_{k=t}^{T} r_k$.
  - Exponential Mapping (EM): rewards are rescaled through an exponential transform with a positive scaling parameter, amplifying high-reward turns.
- Trajectory-Level Score Calculation: shaped rewards are aggregated into a scalar score $S(\tau)$ for grouped advantage estimation:
  - Sum: $S(\tau) = \sum_{t=1}^{T} r_t$, the total reward accumulated over the trajectory.
  - R2G: the trajectory score aggregates the per-turn reward-to-go values, distinguishing trajectories by when progress is achieved.
Grouped trajectories for a given query are normalized as
$$A_i = \frac{S(\tau_i) - \mu_G}{\sigma_G + \epsilon},$$
with $\mu_G$ and $\sigma_G$ representing the mean and standard deviation of the scores $S(\tau)$ in the group $G$; $\epsilon$ is a small constant.
The PPO-style update (excluding KL penalty) is applied using these estimates, enabling adaptive multi-turn credit assignment.
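A minimal NumPy sketch of these computations is given below. The Naive, R2G, Sum, and grouped-normalization forms follow the definitions above; the exact constant used by Equalized, the EM transform, and the trajectory-level R2G aggregation are assumptions based on their prose descriptions.

```python
import numpy as np

def shape_turn_rewards(rewards, scheme="r2g", alpha=1.0):
    """Turn-level reward shaping over one trajectory's raw rewards r_1..r_T."""
    r = np.asarray(rewards, dtype=float)
    if scheme == "naive":
        return r                                   # raw per-turn reward
    if scheme == "equalized":
        return np.full_like(r, r.sum() / len(r))   # same constant each turn (average assumed)
    if scheme == "r2g":
        return np.cumsum(r[::-1])[::-1]            # reward-to-go: current + future rewards
    if scheme == "em":
        return np.expm1(alpha * r)                 # exponential mapping, alpha > 0 (form assumed)
    raise ValueError(f"unknown scheme: {scheme}")

def trajectory_score(rewards, scheme="sum"):
    """Collapse one trajectory's rewards into a scalar for grouped advantage estimation."""
    r = np.asarray(rewards, dtype=float)
    if scheme == "sum":
        return float(r.sum())                      # total accumulated reward
    if scheme == "r2g":
        return float(shape_turn_rewards(r, "r2g").sum())  # aggregate reward-to-go (assumed)
    raise ValueError(f"unknown scheme: {scheme}")

def grouped_advantages(scores, eps=1e-6):
    """GRPO-style normalization over trajectories sampled for the same query."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + eps)
```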
4. Training Methodology and Key Experimental Insights
UserRL is trained under the Group-Relative Policy Optimization (GRPO) algorithm, with strong emphasis on initial supervised fine-tuning (SFT) for cold start. This SFT initialization is found to be critical: it unlocks baseline interaction capabilities and prevents early plateaus in RL optimization.
Three central findings from experiments on Qwen3 models:
- SFT is critical for effective RL: SFT cold start enables sustained RL improvement and avoids stagnation, with reported gains exceeding 100% versus raw RL initialization.
- Trajectory-level scoring, especially R2G, is decisive: Combining Equalized turn-level rewards with R2G trajectory aggregation delivers the most efficient multi-turn interaction improvements.
- Strong simulated users accelerate training and yield superior models: Training with a powerful user simulator (e.g., GPT-4o) provides faster convergence and higher final performance, yet open-source simulators (Qwen3-32B) remain sufficient for cost-effective development and transfer.
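For concreteness, a training recipe reflecting these findings might be configured as follows; every field name and value here is illustrative rather than taken from the released code.

```python
# Hypothetical training recipe capturing the findings above (illustrative only).
training_recipe = {
    "cold_start": {"method": "SFT", "purpose": "unlock baseline multi-turn interaction"},
    "rl": {
        "algorithm": "GRPO",              # grouped advantage normalization, no KL penalty
        "turn_reward_shaping": "equalized",
        "trajectory_score": "r2g",        # best-performing combination reported
    },
    "simulated_user": {
        "training": "Qwen3-32B",          # cost-effective large-scale simulation
        "evaluation": "GPT-4o",           # higher-fidelity interaction
    },
}
```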
5. Practical Implications and Benchmarking
The unified design of gym environments, coupled with standardized tool interfaces and dynamic simulated user integration, establishes UserRL as a practical foundation for training interactive, user-assistive models. Key practical implications include:
- Flexible reward shaping and trajectory scoring directly influence agent learning, encouraging efficient accumulation of progress and effective multi-turn dynamics.
- Simulator fidelity is an essential axis: balancing computational cost against interaction realism affects both training speed and transferability.
- The framework's extensibility and benchmarking capability facilitate systematic comparison across diverse user-centric skills.
All code and data resources are publicly available via https://github.com/SalesforceAIResearch/UserRL, supporting reproducibility and further research.
6. Formalization and Mathematical Foundations
UserRL formalizes multi-turn user-centric RL over trajectories of states, actions, and rewards, $\tau = (s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$, with reward shaping governed by a turn-level transformation
$$\tilde{r}_t = f(r_t, r_{t+1}, \ldots, r_T),$$
and trajectory scoring by an aggregation
$$S(\tau) = g(\tilde{r}_1, \ldots, \tilde{r}_T),$$
where $f$ and $g$ instantiate the schemes described above (Naive, Equalized, R2G, EM at the turn level; Sum and R2G at the trajectory level). Group-wise advantages are normalized by the mean and standard deviation of $S(\tau)$ over each group of sampled trajectories, supporting stable and robust PPO updates.
GRPO is adapted to this environment, enabling grouped normalization and trajectory-level reward propagation, tailored for user-centric multi-turn interactions.
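For reference, with the grouped advantage $\hat{A}_i$ in place of a learned critic's estimate, the clipped PPO-style surrogate (KL penalty omitted, as noted above) takes the standard form
$$J(\theta) = \mathbb{E}\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_i,\ \operatorname{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i\right)\right],$$
maximized over the policy parameters $\theta$.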
7. Research Impact and Future Directions
By formalizing and empirically dissecting reward shaping, scoring, and simulator fidelity as key levers, UserRL provides a rigorous methodology for advancing agentic models in interactive user environments. The framework establishes that reward assignment schemes and user simulation choices are as pivotal as model scale for scalable multi-turn RL.
A plausible implication is that systematically varying these factors can further unlock efficient RL in human-facing, dynamic contexts. Future research may extend UserRL to real user interaction, richer conversation tasks, and broader agentic capabilities.
UserRL thus serves as a foundation for robust, generalizable user-centric RL research, advancing capabilities in collaborative assistance, tool use, and complex interactive reasoning.