Sim2Real Gap in User Simulation

Updated 4 July 2026

Sim2Real gap in user simulation is the discrepancy between LLM-based simulated users and real human behaviors, highlighting both behavioral and evaluative mismatches.
Empirical studies on τ‑bench reveal that simulators are overly cooperative, uniformly styled, and misaligned with human feedback, leading to inflated agent success rates.
The introduction of the composite User-Sim Index (USI) offers a quantifiable metric to measure fidelity, guiding improvements in agent development practices.

The Sim2Real gap in user simulation is the mismatch between what happens when agents are trained or evaluated with LLM-based simulated users and what happens with real human users in multi-turn, tool-using, agentic tasks. In this setting, the simulator typically plays two roles: it acts as the interactive user, generating goals, questions, and reactions, and it also acts as an evaluator or reward provider. The gap therefore has both behavioral and evaluative forms. A systematic study on $\tau$-bench formalized this gap, replaced the benchmark’s default LLM user with 451 real people across 165 tasks, and benchmarked 31 LLM simulators, showing that current simulators are excessively cooperative, stylistically uniform, and systematically misaligned with human feedback.

1. Conceptualization and taxonomy

Within user simulation, the Sim2Real gap is defined as a discrepancy between simulated-user interactions and real-user interactions under otherwise matched task and agent conditions. The formalization separates the phenomenon into two principal components. The behavioral gap concerns how a simulator behaves as a user: surface style, politeness, verbosity, timing of information revelation, clarification behavior, and reactions to agent errors. The evaluative gap concerns how a simulator, or a rule-based evaluator, scores trajectories relative to human judgments: whether a task was successful, how efficient or natural the interaction was, and whether the user would use the agent again.

This framing matters because the two gaps compound. Unrealistic user behavior produces unrealistic trajectories; misaligned evaluation then mis-scores those trajectories. In the now-standard paradigm exemplified by $\tau$-bench, LLMs are embedded directly in the evaluation loop as user simulators and evaluators, yet they are often treated as implicit ground truth. The central methodological move in the formalization is therefore to replace that assumption with measurement: a taxonomy of gaps, explicit metrics, and a composite fidelity score, the User-Sim Index (USI).

A related line of work frames user-simulation Sim2Real as a distributional gap between the distribution of real user behaviors and the distribution induced by a simulator, emphasizing missed behaviors and hallucinated behaviors at the population level (Mehri et al., 8 May 2026). This suggests that user simulation fidelity cannot be reduced to isolated response plausibility; it is fundamentally about whether the simulator reproduces the heterogeneous behavioral distribution of real users.

2. Experimental instantiation on $\tau$-bench

The formal study is instantiated on $\tau$-bench, a customer-service benchmark with two domains: Airline and Retail. Each task contains a textual goal description, a structured database, and policies the agent must respect. Evaluation proceeds by pairing a user with a tool-augmented agent that can inspect and modify the database, and then checking whether the final database state matches the reference outcome. To isolate the Sim2Real gap attributable to user simulation, the protocol keeps the agent and reward fixed and varies only the user: either real humans or different LLM simulators.

The human study recruited 451 US-based adults through Prolific, aged 18–80 with a median age of 37, and covered 165 $\tau$-bench tasks. The same 165 tasks were run with three independent annotator batches, yielding 495 total human sessions. Each annotator saw the task description and role-playing instructions, chatted with the fixed agent in a web UI until typing /stop, and then completed a post-interaction survey. An LLM-based quality-control judge (GPT-5), calibrated on human-labeled examples, filtered out low-quality or spam traces; its reported precision for acceptance is 0.94.

The simulator side comprised 31 models across three categories. The proprietary family included GPT, Claude, and Gemini variants; the open-source family included DeepSeek‑V3.1, Llama, Qwen, GPT‑oss, MiniMax, and Kimi variants; and the specialized user models included CoSER‑8B, UserLM‑8B, HumanLike‑7B, and HumanLM‑opinion. For each model, the setup was held constant: the simulator was prompted to act as the user on the same 165 tasks with the same agent, and behavioral and evaluative metrics were computed against each of the three human batches, with final numbers reported as mean $\pm$ std across batches.

The study mainly fixed the agent to GPT‑5.2, with robustness checks using Gemini‑3.1‑Pro. The rule-based reward remained binary: reward 1 if the final database state matched the expected outcome, otherwise 0. Because only the user varied, the protocol supports a direct comparison between human-induced and simulator-induced interaction trajectories.
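The reward and aggregation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, database states are modeled as plain dictionaries, and the sample standard deviation is used because the source says only "std".

```python
from statistics import mean, stdev


def binary_reward(final_state: dict, expected_state: dict) -> int:
    """Return 1 only if the final database state equals the expected one."""
    return 1 if final_state == expected_state else 0


def success_rate(rewards: list[int]) -> float:
    """Fraction of sessions in one batch that received reward 1."""
    return sum(rewards) / len(rewards)


def mean_and_std(per_batch_values: list[float]) -> tuple[float, float]:
    """Aggregate one metric across batches.

    Assumption: sample standard deviation (statistics.stdev).
    """
    return mean(per_batch_values), stdev(per_batch_values)


# Toy rewards for three annotator batches (illustrative values only).
batch_rewards = [[1, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 0]]
rates = [success_rate(batch) for batch in batch_rewards]
average, spread = mean_and_std(rates)
print(f"success rate: {average:.3f} ± {spread:.3f}")
```

Because the agent and the reward are fixed, any difference in these aggregates between a simulator and the human batches can be attributed to the user side.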

3. Quantification through the User-Sim Index

The User-Sim Index (USI) is a composite 0–100 score designed to summarize how similar a simulator is to real human users. It combines four behavioral dimensions, an outcome-calibration component, and an evaluative-alignment component. The four behavioral dimensions are constructed from metric-wise Dice similarity, where for a behavioral metric $m$ with simulator aggregate value $M_m$ and human aggregate value $H_m$,

$\mathrm{Dice}_m = \frac{2\,\min(M_m,\, H_m)}{M_m + H_m} \times 100.$

If both human and model values are zero, the ratio is undefined and $\mathrm{Dice}_m$ is assigned a fixed value by convention. Dimension scores are the mean Dice values across metrics within each dimension.
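A minimal Python sketch of the metric-wise Dice similarity and the dimension score follows. The function names are hypothetical, and the value returned when both aggregates are zero (`both_zero_score`, defaulting to 100 so that identical zeros count as a perfect match) is an assumption rather than something taken from the study.

```python
from statistics import mean


def dice_similarity(model_value: float, human_value: float,
                    both_zero_score: float = 100.0) -> float:
    """Scalar Dice similarity on a 0-100 scale: 2*min(M, H) / (M + H) * 100."""
    if model_value < 0 or human_value < 0:
        raise ValueError("metric aggregates are expected to be non-negative")
    total = model_value + human_value
    if total == 0:
        # Assumption: two zero aggregates are treated as a perfect match.
        return both_zero_score
    return 2.0 * min(model_value, human_value) / total * 100.0


def dimension_score(model_metrics: dict[str, float],
                    human_metrics: dict[str, float]) -> float:
    """Mean Dice similarity over the metrics that make up one dimension."""
    return mean(
        dice_similarity(model_metrics[name], human_metrics[name])
        for name in human_metrics
    )


# Example with the identifiers-per-turn figures quoted later in the article
# (4.8 for UserLM-8B versus 2.6 for humans).
print(round(dice_similarity(4.8, 2.6), 2))
```

The score is symmetric in its two arguments and penalizes overshooting and undershooting the human aggregate by the same ratio.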

The behavioral dimensions are: D1: Communication styles, D2: Information pattern, D3: Clarification behavior, and D4: Error reaction. D1 includes politeness, verbosity, short-turn percentage, acknowledgement rate, repetition, and identity confusion. D2 covers front-loading of information, identifiers per turn, and opening-turn verbosity. D3 covers uncertainty markers, certainty markers, pushback questions, clarification questions, and information-seeking questions. D4 covers emotional markers, accusatory language, and pivot behavior after errors.
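Several of these metrics are simple per-turn statistics. The Python sketch below illustrates three of them; the regular expressions, the word list, and the `max_words` threshold are hypothetical stand-ins chosen for illustration, not the lexicons or patterns used in the study.

```python
import re

# Hypothetical marker list; the study relies on its own lexicons and patterns.
UNCERTAINTY_PATTERN = re.compile(
    r"\b(maybe|not sure|i think|i guess|probably)\b", re.IGNORECASE
)
# Hypothetical identifier pattern: any token that contains at least one digit.
IDENTIFIER_PATTERN = re.compile(r"\b[A-Za-z]*\d[A-Za-z0-9]*\b")


def short_turn_percentage(turns: list[str], max_words: int = 3) -> float:
    """Percentage of user turns with at most max_words words (assumed threshold)."""
    if not turns:
        return 0.0
    short = sum(1 for turn in turns if len(turn.split()) <= max_words)
    return 100.0 * short / len(turns)


def marker_turn_percentage(turns: list[str], pattern: re.Pattern) -> float:
    """Percentage of user turns containing at least one match of pattern."""
    if not turns:
        return 0.0
    return 100.0 * sum(1 for turn in turns if pattern.search(turn)) / len(turns)


def identifiers_per_turn(turns: list[str]) -> float:
    """Average number of identifier-like tokens per user turn."""
    if not turns:
        return 0.0
    return sum(len(IDENTIFIER_PATTERN.findall(turn)) for turn in turns) / len(turns)


user_turns = [
    "Wrong reservation.",
    "I think my order is W1234567",
    "Hi, I need help with a return under Sarah",
]
print(short_turn_percentage(user_turns))
print(marker_turn_percentage(user_turns, UNCERTAINTY_PATTERN))
print(identifiers_per_turn(user_turns))
```

Each statistic is computed once for the simulator and once for the human sessions, and the two aggregates are then compared with the Dice similarity.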

Outcome calibration is measured through an adaptation of Expected Calibration Error (ECE):

$\mathrm{ECE} = \sum_{b} \frac{|B_b|}{N} \left| \bar{s}^{\mathrm{sim}}_b - \bar{s}^{\mathrm{human}}_b \right|.$

Here $B_b$ is the set of tasks in bin $b$, $N$ is the total number of tasks, and $\bar{s}^{\mathrm{sim}}_b$ and $\bar{s}^{\mathrm{human}}_b$ are simulator and human success rates for the same tasks within the bin. Lower ECE indicates better alignment of task difficulty; for USI, calibration is converted to a higher-is-better score.
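The bin-weighted comparison of simulator and human success rates can be sketched as follows. The function name and the way tasks are assigned to bins (an explicit `task_bins` mapping supplied by the caller) are assumptions; the binning rule itself is not specified here.

```python
from collections import defaultdict


def outcome_calibration_error(sim_success: dict[str, float],
                              human_success: dict[str, float],
                              task_bins: dict[str, str]) -> float:
    """Bin-weighted absolute gap between simulator and human success rates.

    sim_success / human_success map task id -> success rate in [0, 1].
    task_bins maps task id -> bin label (assumed to be given).
    """
    if not task_bins:
        raise ValueError("at least one task is required")
    bins: dict[str, list[str]] = defaultdict(list)
    for task_id, bin_label in task_bins.items():
        bins[bin_label].append(task_id)

    n_tasks = len(task_bins)
    ece = 0.0
    for tasks in bins.values():
        sim_rate = sum(sim_success[t] for t in tasks) / len(tasks)
        human_rate = sum(human_success[t] for t in tasks) / len(tasks)
        # Weight each bin by its share of all tasks.
        ece += len(tasks) / n_tasks * abs(sim_rate - human_rate)
    return ece


# Toy example: the simulator makes both bins look easier than humans find them.
bins = {"t1": "easy", "t2": "easy", "t3": "hard", "t4": "hard"}
sim = {"t1": 1.0, "t2": 1.0, "t3": 1.0, "t4": 0.0}
human = {"t1": 1.0, "t2": 0.0, "t3": 0.0, "t4": 0.0}
print(outcome_calibration_error(sim, human, bins))
```

A value of zero means the simulator reproduces human success rates in every bin; inflated simulator success in any bin raises the error.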

Evaluative alignment is measured through Mean Absolute Error (MAE) between LLM-evaluator scores and human post-task survey scores, after mapping ordinal ratings to a common numerical scale. The MAE is then converted to a higher-is-better evaluation score (Eval).

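When all components are present, the overall index averages the four behavioral dimension scores, the calibration score (Cal), and the evaluation score (Eval):

$\mathrm{USI} = \frac{1}{6}\left(D_1 + D_2 + D_3 + D_4 + \mathrm{Cal} + \mathrm{Eval}\right).$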

Some models without survey data use a five-component variant averaging D1–D4 and calibration.
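The composition of the index can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the helper names are hypothetical, errors are taken to lie in [0, 1] and are converted with 100 × (1 − error), and the components are combined with an unweighted mean, which matches the "averaging" wording of the five-component variant.

```python
from typing import Optional


def mean_absolute_error(predicted: list[float], reference: list[float]) -> float:
    """MAE between evaluator scores and human survey scores on the same scale."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def error_to_score(error: float) -> float:
    """Assumed conversion of an error in [0, 1] to a 0-100 score."""
    return 100.0 * (1.0 - error)


def user_sim_index(d1: float, d2: float, d3: float, d4: float,
                   calibration: float,
                   evaluation: Optional[float] = None) -> float:
    """Unweighted mean of the available 0-100 component scores.

    Without survey-based evaluation data, only D1-D4 and calibration are used.
    """
    components = [d1, d2, d3, d4, calibration]
    if evaluation is not None:
        components.append(evaluation)
    return sum(components) / len(components)


# Toy values, not results from the study.
eval_score = error_to_score(mean_absolute_error([0.5, 1.0], [1.0, 1.0]))
print(user_sim_index(80.0, 90.0, 70.0, 60.0, 100.0))
print(user_sim_index(80.0, 90.0, 70.0, 60.0, 100.0, eval_score))
```

Passing `evaluation=None` reproduces the five-component variant, so models with and without survey data go through the same function.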

The human inter-annotator ceiling is reported with component scores of D1 87.4, D2 97.9, D3 88.0, D4 93.5, and Eval 97.4, and an ECE of 0.0. The best simulator overall is DeepSeek‑V3.1. The best proprietary model is Gemini‑2.0‑Flash at 73.3. The specialized user models are reported as being in the mid‑60s or lower, and thus remain far below the human ceiling.

4. Behavioral mismatch and the construction of “easy mode”

The behavioral analysis shows that current simulators diverge from real users systematically rather than idiosyncratically. In communication style (D1), simulators are too polite, too verbose, and too uniform. For example, only 1.0% of GPT‑4o user turns are short, versus 29.0% for humans, while 49.0% of GPT‑4o turns are marked as polite versus 15.3% for humans. Human conversations vary substantially in length and style, whereas simulators tend to produce polished, full sentences with stable politeness and formality. Some specialized models also exhibit identity confusion; for UserLM‑8B, simulated users can produce agent-like self-identification such as “Hi, I’m a customer service agent…”.

In information pattern (D2), simulators front-load and overspecify information. UserLM‑8B includes nearly twice as many identifiers per turn as humans, 4.8 vs. 2.6. Real users often begin with minimal descriptions such as “Hi, I need help with a return under Sarah,” whereas simulators tend to immediately provide structured details. The consequence is reduced ambiguity, reduced incremental disclosure, and fewer authentic disambiguation demands on the agent.

In clarification behavior (D3), simulators show miscalibrated uncertainty. GPT‑4o exhibits uncertainty markers in 14.6% of turns versus 7.3% for humans, while UserLM‑8B shows very low uncertainty (3.0%) but high certainty markers (10.6% vs. 1.0% for humans). These patterns do not match human question-asking and commitment behavior, and can therefore mis-shape how an agent learns to clarify, confirm, or infer goals.

In error reaction (D4), simulators are notably under-frustrated and over-cooperative. Humans display more accusatory behavior, direct frustration, and blunt repetition, such as “You already asked me that - can you just fix it?” or “Wrong reservation.” By contrast, simulators often respond by politely pivoting to alternative strategies. GPT‑4o and CoSER show higher pivot rates than humans: 19.1% and 16.5%, respectively, versus 8.4%. This deprives the agent of exposure to damaged trust, explicit blame, and repeated complaints.

Taken together, these divergences create what the study explicitly terms an “easy mode.” Simulated users overshare useful information, remain politely cooperative after errors, and rarely express realistic friction or disengage. The effect is directly visible in outcome statistics: the human baseline success rate is 63.6%, while many general-purpose simulators yield agent success rates of up to 77.8%. The consequence is not merely cosmetic. If an agent appears robust under LLM-simulated users, that does not imply robustness to real users.

A related distributional study on coding and writing tasks reaches a compatible conclusion from a different measurement angle: all evaluated simulators remain far from the real-user distribution, many simulators resemble each other more than they resemble real users, and interpretable analysis shows underproduction of terse and transactional behaviors and overproduction of friendly or enthusiastic behaviors (Mehri et al., 8 May 2026). This suggests that “easy mode” is not only a local stylistic artifact on $\tau$-bench, but also a broader distributional phenomenon.

5. Evaluative misalignment and the inadequacy of binary rewards

The evaluative gap appears both in LLM-as-judge settings and in rule-based reward settings. Human annotators in the $\tau$-bench study completed an eight-dimensional survey after each interaction. The dimensions were Task success, Efficiency, Number of questions, Answer effort, Human-likeness, Interaction flow, Overall performance score, and Reuse intent, together with free-text fields for specific good behaviors, errors, and policy violations. These responses were converted to numerical values for evaluation-alignment analysis.

Using GPT‑5.1 as a representative evaluator, the study finds systematic asymmetry. Relative to humans, GPT‑5.1 is lenient on user experience but conservative on task completion. It gives higher scores on Human-likeness and on Reuse intent, but lower scores on Task success. In practical terms, LLM evaluators tend to overestimate how natural or pleasant the interaction feels while underestimating completion relative to human perceptions.

The rule-based reward in $\tau$-bench is even narrower. It is a binary signal based solely on whether the final database state exactly matches the expected outcome. The study reports that 70.6% of reward = 0 interactions are nevertheless judged successful by humans (a “Yes” or “Fully” rating), while 33% of reward = 1 interactions are judged unsuccessful or only partially successful by humans. Moreover, human quality distributions for overall score, efficiency, and human-likeness are reported as nearly identical for the reward = 0 and reward = 1 groups.
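The two disagreement rates above are conditional frequencies over (reward, human rating) pairs. A minimal Python sketch, assuming each session is recorded as such a pair: the labels “Yes” and “Fully” count as success as in the text, while the other labels and the function name are hypothetical.

```python
from typing import Iterable, Optional

# "Yes" and "Fully" count as human-judged success; other labels are hypothetical.
SUCCESS_LABELS = {"Yes", "Fully"}


def reward_human_disagreement(
    records: Iterable[tuple[int, str]],
) -> dict[str, Optional[float]]:
    """Percentage of sessions where the binary reward and the human rating disagree.

    records: (reward, human_label) pairs with reward in {0, 1}.
    Returns None for a group that has no sessions.
    """
    totals = {0: 0, 1: 0}
    disagreements = {0: 0, 1: 0}
    for reward, label in records:
        human_success = label in SUCCESS_LABELS
        totals[reward] += 1
        # Disagreement: reward 0 but judged success, or reward 1 but not success.
        if human_success != bool(reward):
            disagreements[reward] += 1

    def pct(group: int) -> Optional[float]:
        if totals[group] == 0:
            return None
        return 100.0 * disagreements[group] / totals[group]

    return {
        "reward0_judged_success_pct": pct(0),
        "reward1_judged_not_success_pct": pct(1),
    }


sessions = [(0, "Yes"), (0, "No"), (0, "Fully"),
            (1, "Yes"), (1, "Partially"), (1, "Yes"), (1, "No")]
print(reward_human_disagreement(sessions))
```

Reporting the two rates separately matters because they point in opposite directions: the first measures how often exact-match scoring misses outcomes humans accept, the second how often it credits outcomes humans do not.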

These findings establish that exact-match database state is only one narrow aspect of quality. It does not capture partial success, policy-constrained refusals, alternative valid outcomes, efficiency, intelligibility, or trustworthiness. The rule-based reward therefore cannot substitute for human feedback when the goal is to assess user satisfaction or experience quality.

The broader methodological lesson aligns with other Sim2Real settings in which optimizing a proxy can diverge from optimizing real performance. In reinforcement-learning work on Sim2Real, proxy objectives such as predictive fidelity or simulator variability are shown not to necessarily correlate with real-world performance, and simulator adaptation can instead be framed as a bi-level problem that directly optimizes real-world return (Anand et al., 20 Oct 2025). A plausible implication is that user simulation evaluation should likewise be calibrated against real human judgments rather than treated as self-validating.

6. Capability, predictivity, mitigation, and open problems

A notable result is that higher general model capability does not reliably imply better user simulation. The study compares USI with Chatbot Arena Elo and finds that overall assistant quality is not a dependable proxy for simulator fidelity. Outside the GPT series, the correlation between general capability and USI is reported as weak. Some smaller or mid-tier models, including DeepSeek‑V3.1 and Llama‑4‑Maverick, achieve higher USI than some more capable general LLMs. The specialized “human-like” models (CoSER‑8B: 66.9, UserLM‑8B: 61.7, HumanLike‑7B: 59.6, and HumanLM‑opinion: 46.5) also remain far from the human ceiling, showing that fine-tuning for “human-like” responses does not guarantee realistic user behavior as defined by USI.

Methodologically, the work contributes the first full human-run $\tau$-bench, a multi-dimensional behavioral operationalization using LIWC2015, NRC emotion lexicons, and regex patterns, an adaptation of ECE for outcome calibration, multi-dimensional human quality surveys, and robustness checks across three human batches and an alternative agent (Gemini‑3.1‑Pro) for a subset of tasks. These choices locate user-simulation Sim2Real within a broader Sim2Real research tradition that treats domain shift as measurable rather than anecdotal.

Several adjacent literatures sharpen this perspective. In robotics and perception, Sim2Real is often framed as a distribution shift between simulated and real domains, and mitigation can proceed by bringing the training distribution closer to the real one, as in diffusion-based augmentation for pedestrian detection (Farley et al., 2023). In visuomotor transfer, a representation-centric view treats the encoder as the bottleneck, emphasizing task-relevant but domain-invariant features (Biruduganti et al., 26 Jan 2025). In RL, when direct transfer fails, simulation may still be useful for learning exploratory policies that support efficient real-world learning (Wagenmaker et al., 2024). Formal-control work introduces simulation-gap functions and stochastic simulation-gap functions that model the gap as a bounded, state- and input-dependent disturbance, enabling certified transfer under explicit assumptions (Sangeerth et al., 2024). Another strand evaluates simulators by predictivity, using a Sim-vs-Real Correlation Coefficient (SRCC) to ask whether improvements in simulation preserve their ordering in reality. For user simulation specifically, a distributional framework shows that combining behaviorally complementary simulators can bring the mixture distribution closer to real users than either component alone (Mehri et al., 8 May 2026).

Taken together, these results motivate several disciplined practices for agent development. Simulators should be treated as tools, not ground truth; the gap should be quantified before simulator-based evaluation is trusted; and human-in-the-loop validation should remain part of the development cycle. Mixed protocols that combine simulation-heavy iteration with targeted human evaluation on hard cases, new domains, or new user segments are consistent with the evidence. So is the recommendation that next-generation simulators explicitly model information withholding, incremental disclosure, rich emotional reactions, frustration, disengagement, non-uniform politeness, and realistic clarification behavior.

The current evidence is also bounded by clear limitations. The principal study is confined to customer service in airline and retail domains; annotators come from a US-based crowd-worker pool; the main agent is GPT‑5.2; and USI, although comprehensive, still compresses complex behavioral and evaluative phenomena into a single index. The work evaluates existing models rather than proposing a new simulator-training procedure. Future directions therefore include extending the framework to other agentic domains, training simulators against empirical human behavior distributions, and examining demographic and sociolinguistic dimensions of the gap.

In this sense, the Sim2Real gap in user simulation is not merely a benchmark artifact. It is a general warning that as evaluation moves from static benchmarks to interactive settings, both the trajectories agents experience and the signals used to score them can drift away from human reality. Measuring that drift is therefore a prerequisite for trusting simulation as an instrument of optimization, comparison, or deployment readiness.