UserVille: Simulation & Evaluation Framework

Updated 8 November 2025
  • UserVille is a configurable simulation environment with dynamic LLM-driven user simulators that model diverse preferences for realistic agent training.
  • It integrates a PPP reinforcement learning framework that optimizes productivity, proactivity, and personalization to enhance agent robustness and adaptability.
  • Empirical results on software engineering and research tasks demonstrate that PPP agents outperform baselines by effectively managing ambiguous instructions and diverse user personas.

UserVille is a configurable simulation environment and evaluation framework for interactive language agents, centered on user preference modeling, preference-aware reinforcement learning, and the systematic measurement of agent productivity, proactivity, and personalization. Designed to fill the gap between static agent benchmarks and real-world user interaction requirements, UserVille provides agents with dynamic, LLM-driven user simulators that possess diverse and explicitly parameterizable preferences, allowing agents to be trained and evaluated not only for task completion, but also for their strategic interaction behaviors and personalization capabilities. The associated PPP (Productivity, Proactivity, Personalization) multi-objective RL framework capitalizes on UserVille's environment structure to produce LLM-based agents demonstrating significantly enhanced robustness, user-adaptive behavior, and generalization to unseen personas, as shown through empirical results on complex software engineering and research tasks (Sun et al., 4 Nov 2025).

1. Core Design Principles and Environment Architecture

UserVille is architected as a three-stage simulation environment for training and evaluating LLM-based agents:

  1. Prompt Vaguenization: Task instructions from standard agent evaluation benchmarks are automatically transformed by an LLM into “vague” prompts that omit crucial specification details, reproducing the information asymmetry found in real-world scenarios; agents must detect underspecification and proactively seek clarification.
  2. Preference-Aware LLM User Simulators: Users in UserVille are modeled by LLM-based simulators, parameterized by explicit preference settings. Twenty distinct preferences encapsulate a spectrum of communication, language, format, and interaction constraints (e.g., response style, language, formatting, brevity, humor, structure).
  3. User-Centric Scoring and Rewarding: At session end, user simulators provide multi-dimensional feedback, evaluating the session on productivity (task completion), proactivity (strategic and effective clarification), and personalization (preference compliance). Reward signals are designed for RL optimization and capture both objective achievements and subjective user satisfaction.

The environment supports both training (via simulation) and evaluation (via held-out user preference splits and swapped LLM simulators for generalization studies).
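
As a rough illustration of how the three stages compose, the following Python sketch runs one episode end to end. The interfaces are assumptions made for exposition (the task, agent, and simulator objects and methods such as vaguenize, answer, and the judge calls are hypothetical, not the published UserVille API):

```python
from dataclasses import dataclass

@dataclass
class SessionScores:
    productivity: float     # task completion (e.g., unit tests passed, F1/EM)
    proactivity: float      # clarifications asked when (and only when) needed
    personalization: float  # compliance with the active persona's preferences

def run_episode(task, agent, user_sim, persona):
    """One UserVille-style episode against a preference-aware user simulator."""
    # Stage 1: an LLM rewrites the benchmark instruction into a vague prompt,
    # withholding details the agent may need to ask about.
    observation = user_sim.vaguenize(task.instruction)

    # Stage 2: the agent works on the task; clarifying questions are routed to
    # the simulator, which answers in line with its fixed persona.
    transcript = []
    while not agent.done():
        action = agent.act(observation)
        if action.is_question:
            observation = user_sim.answer(action.text, persona=persona)
        else:
            observation = task.execute(action)  # e.g., run code or a search query
        transcript.append((action, observation))

    # Stage 3: user-centric scoring at session end (rule-based and/or LLM judge),
    # producing the three reward components used by PPP.
    return SessionScores(
        productivity=task.evaluate(agent.final_output()),
        proactivity=user_sim.judge_proactivity(transcript, persona),
        personalization=user_sim.judge_personalization(transcript, persona),
    )
```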

2. User Preference Modeling and Simulator Diversity

UserVille’s user simulation layer encodes rich, structured user preferences as fixed, randomly sampled personas for each episode. The preference set includes, for example:

  • Interaction scope: concise_question (“questions must be short”), detailed_contextual_question, multiple_questions, answer_at_beginning, questions_needed (“minimize unnecessary questions”)
  • Format and language: answer_mc (“answer only multiple choice”), lang_ita, caps (“ALL CAPS”), json (“question must be in JSON”)
  • Stylistic requirements: include_joke, fixed_length (e.g., exactly three sentences)

In each trajectory, the user simulator LLM is initialized with the specific persona settings and is responsible for both answering agent queries (when necessary) and delivering post-hoc scoring (rule-based, LLM-judge, or both). Twelve preferences are available at training time; eight are sequestered for generalization testing.
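
A minimal sketch of this persona setup, assuming preferences are plain string tags as in the examples above (the pool below lists only the illustrated tags, not the paper's full set of twenty, and the names EXAMPLE_PREFERENCES and sample_persona are hypothetical):

```python
import random

EXAMPLE_PREFERENCES = [
    # interaction scope
    "concise_question", "detailed_contextual_question", "multiple_questions",
    "answer_at_beginning", "questions_needed",
    # format and language
    "answer_mc", "lang_ita", "caps", "json",
    # stylistic requirements
    "include_joke", "fixed_length",
]

def sample_persona(pool, rng=None):
    """Fix one preference tag as the persona for an entire episode."""
    rng = rng or random.Random()
    return rng.choice(pool)

persona = sample_persona(EXAMPLE_PREFERENCES, rng=random.Random(0))
# The user simulator LLM is initialized with this persona; it both answers the
# agent's queries in the persona's style and scores the session against it.
```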

Configurations allow systematic studies of agent adaptation and robustness to diverse, and partly adversarial, user communication styles.

3. Proactive and Personalized Multi-Objective RL: The PPP Framework

PPP defines a reinforcement learning objective that explicitly incorporates all user-centered dimensions crucial for interactive agent deployment:

  • Productivity ($R_\mathrm{Prod}$): Quantifies task completion, e.g., code passing unit tests, or F1/exact-match (EM) accuracy.
  • Proactivity ($R_\mathrm{Proact}$): Measures the agent's ability to ask clarifying questions at critical junctures. User simulators assess each query for required effort (Low/Medium/High), penalizing unnecessary high-effort queries as well as omissions when the agent fails to clarify crucial ambiguities.
  • Personalization ($R_\mathrm{Pers}$): Assesses how closely the agent's interaction matches the user's declared preferences, via explicit rule-based scoring (e.g., JSON formatting, Italian, brevity).

The overall reward is additive:

$$R = R_\mathrm{Prod} + R_\mathrm{Proact} + R_\mathrm{Pers}$$

RL optimization is performed using a token-level clipped policy gradient with reward normalization:

$$\mathcal{J} = \mathbb{E}_{q,\{\tau_i\}}\left[ \frac{1}{\sum_{i=1}^{G} |\tau_i|} \sum_{i=1}^{G}\sum_{t=1}^{|\tau_i|} \min\left\{ r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t} \right\} \right]$$

where the advantage $\hat{A}_{i,t}$ is normalized within each minibatch to stabilize learning.
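
A compact PyTorch sketch of this reward and objective follows; it is an illustrative reimplementation rather than the authors' training code, and the tensor layout and the 1e-8 stabilizer are assumptions:

```python
import torch

def ppp_reward(r_prod, r_proact, r_pers):
    """Additive trajectory reward R = R_Prod + R_Proact + R_Pers."""
    return r_prod + r_proact + r_pers

def clipped_objective(logp_new, logp_old, traj_rewards, traj_ids, eps=0.2):
    """Token-level clipped policy-gradient objective with normalized advantages.

    logp_new, logp_old: (T,) log-probs of the sampled tokens under the current
                        and behavior policies, all G trajectories concatenated.
    traj_rewards:       (G,) PPP reward per trajectory.
    traj_ids:           (T,) index of the trajectory each token belongs to.
    """
    # Broadcast the trajectory-level reward to every token of that trajectory.
    rewards = traj_rewards[traj_ids]
    # Advantage normalized within the minibatch (group) to stabilize learning.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Averaging over all tokens matches the 1 / sum_i |tau_i| factor in J.
    return torch.minimum(unclipped, clipped).mean()
```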

4. Empirical Performance, Personalization, and Generalization

On software engineering (SWE-Bench Func-Loc, SWE-Bench Full) and research (BrowseComp-Plus) agent benchmarks, agents trained with PPP-augmented RL in UserVille achieve marked improvements:

  • Average PPP Score: Up to 21.6 points above GPT-5 baselines.
  • Submetric Breakdown (SWE-Bench-Func-Loc, PPP agent): Productivity 56.26, Proactivity 75.55, Personalization 89.26 (vs. GPT-5 at 40.40/53.17/55.85).
  • Ablation: Removing Proactivity or Personalization components directly reduces the related subscore and also impairs task success, demonstrating mutual benefit in optimizing all three objectives.

PPP agents exhibit learned behavior such as asking concise, low-effort clarifications only when necessary (rather than indiscriminate querying), and strictly adapting question/response form to the active persona (e.g., switching language or output format).

Robust generalization is demonstrated:

  • Unseen user personas at test time produce minimal performance degradation.
  • Swapping user-simulator LLM (test-time) yields only minor performance change, indicating the framework does not overfit to the quirks of a particular LLM simulator.

5. Environment Features, Implementation, and Development Guidance

UserVille supports systematic studies on agent interaction policies under controlled yet realistic task vagueness, preference drift, and simulator heterogeneity. Practical recommendations for developing agents or companion systems with UserVille include:

  • Persona Coverage: Train across a broad spectrum of preferences; include rare/complex constraints for anticipated hard-to-handle personas.
  • Interaction Management: Agents should learn to detect ambiguity, decide when clarification is needed, and modulate their question format and style based on persona.
  • Personalization Layer: Implement run-time content-shaping modules that enforce persona compliance at the output stage (e.g., apply format transformations or force language switching as needed); see the sketch after this list.
  • Scalability: UserVille’s preference pool and LLM-based simulation allow scaling evaluation beyond what is feasible with human-in-the-loop studies.
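
One way to realize such a personalization layer is a run-time shaping function that post-processes agent messages before they reach the user; the rules below and the translate helper are illustrative assumptions, not part of UserVille:

```python
import json

def shape_output(text: str, persona: str, translate=None) -> str:
    """Post-process an outgoing agent message to comply with the active persona."""
    if persona == "caps":
        return text.upper()                        # ALL CAPS preference
    if persona == "json":
        return json.dumps({"question": text})      # wrap the question in JSON
    if persona == "concise_question":
        # Crude brevity constraint: keep only the first sentence.
        first = text.split(".")[0].strip()
        return first if first.endswith("?") else first + "?"
    if persona == "lang_ita" and translate is not None:
        return translate(text, target_lang="it")   # external MT/LLM call (assumed)
    return text                                    # no shaping rule for this persona
```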

When deploying PPP agents outside the simulation, expect superior task resilience to vague instructions, improved user satisfaction, and increased acceptance across user groups with divergent communication preferences.

6. Broader Impact, Limitations, and Future Directions

The UserVille and PPP combination demonstrates that explicit, multi-dimensional optimization of user-centered metrics is essential for real-world interactive AI agent development. It addresses gaps in conventional benchmarks that measure only task accuracy, yielding agents that are both robust to instruction vagueness and personalized to user needs (Sun et al., 4 Nov 2025).

Limitations include the reliance on LLM-based simulated users for evaluation (which may not capture all nuances of human behavior), and the need for broader, cross-domain validation. Possible extensions involve modeling evolving user preferences, integrating multimodal interaction constraints, and transferring UserVille/PPP methodology to novel application domains requiring rich agent-user co-adaptation.


UserVille sets a new standard for agent evaluation and RL agent development by explicitly foregrounding productivity, proactivity, and personalization in interactive language agent scenarios, with empirical support for both technical gains and practical user-centered impact.
