User Simulation with LLM Agents

Updated 13 November 2025
  • User simulation via LLM agents is a method that uses large language models with modular, memory-augmented architectures to mimic diverse human behaviors in controlled environments.
  • The simulation protocol employs turn-based cycles where agents use context-aware prompts and reflection modules to adapt their actions dynamically.
  • Evaluation focuses on metrics like evasion rate and believability while addressing challenges such as prompt dependency, memory limits, and scalability.

User simulation via LLM agents refers to the practice of employing LLMs as artificial users in multi-agent environments to reproduce, study, or augment human behaviors across diverse domains, including social communication, education, web search, product interaction, dialogue systems, and teamwork. This approach enables rigorous analysis, large-scale experimentation, and system testing where human data is scarce, costly, or impractical to collect.

1. Core System Architectures and Multi-Agent Designs

LLM-based user simulation systems instantiate agents as wrappers around LLMs, each endowed with explicit roles, memory, and decision-making modules. Two principal agent types are prevalent:

  • Participant/User Agents: Simulate human users, each with a profile, memory (usually layered as sensory, short-term, and long-term), and a policy module for generating responses or actions based on history and environment context.
  • Supervisor/Moderator Agents: Enforce rules, policies, or constraints in scenarios involving regulation or instruction-following (e.g., simulating content moderation (Cai et al., 5 May 2024)).

A typical multi-agent architecture includes the following modules for each user agent:

  • Profile module: Encodes static/user-specific attributes (demographics, interests, psychological traits).
  • Memory module: Aggregates past interactions, environmental observations, and reflections, often stratified by recency and importance.
  • Planning/dialogue/action module: Employs prompt engineering to generate or select next actions/utterances.
  • Reflection/adaptation module: Periodically analyzes violation logs or feedback to adjust future strategies and plans.
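
A minimal sketch of this modular layout is shown below; the class shape, method names, and the `llm` callable are illustrative assumptions, not any specific paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class UserAgent:
    """Illustrative user agent bundling the four modules described above."""
    profile: dict                                   # static attributes: demographics, interests, traits
    sensory: list = field(default_factory=list)     # raw recent observations
    short_term: list = field(default_factory=list)  # summarized recent turns
    long_term: list = field(default_factory=list)   # high-importance reflections
    plan: str = ""                                  # current strategy from the planning module

    def observe(self, event: str) -> None:
        # Memory module: new events enter sensory memory; a real system would
        # periodically summarize overflow into the short- and long-term layers.
        self.sensory.append(event)

    def act(self, llm, context: str) -> str:
        # Planning/dialogue/action module: prompt the LLM with profile,
        # memory summary, current plan, and environment context.
        prompt = (
            f"Profile: {self.profile}\n"
            f"Recent memory: {self.short_term[-5:]}\n"
            f"Plan: {self.plan}\n"
            f"Context: {context}\n"
            "Respond as this user:"
        )
        return llm(prompt)

    def reflect(self, llm, violation_log: list) -> None:
        # Reflection/adaptation module: review flagged feedback, revise the plan.
        if violation_log:
            self.plan = llm(
                f"These messages were flagged: {violation_log}. "
                f"Revise this strategy to comply while achieving the goal: {self.plan}"
            )
```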

Agent interaction graphs $G = (V, E)$ formalize communication flows between participant and supervisory agents, facilitating granular analysis of message passing, flagging, and adaptation (Cai et al., 5 May 2024).
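
The graph itself can be a plain vertex set plus directed-edge set; this toy sketch (identifiers are hypothetical) delivers messages only along declared edges, which is where flagging and adaptation hooks would attach:

```python
# G = (V, E): vertices are agent ids, directed edges are permitted message flows.
V = {"participant_a", "participant_b", "supervisor"}
E = {("participant_a", "participant_b"), ("participant_b", "participant_a"),
     ("participant_a", "supervisor"), ("participant_b", "supervisor")}

def send(sender: str, receiver: str, message: str) -> bool:
    """Deliver a message only along an existing edge of E."""
    if (sender, receiver) not in E:
        return False  # communication not permitted by the interaction graph
    print(f"{sender} -> {receiver}: {message}")
    return True
```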

2. Simulation Protocols, Prompting, and Learning Loops

The simulation protocol typically operates as a discrete-time, turn-based loop orchestrated by an environment manager:

  • At each turn, agents receive context (e.g., user profile, memory summary, history of prior actions) and execute one of several possible moves (e.g., composing an utterance, requesting information, providing feedback).
  • Prompt templates are structured to incorporate current scenario context, regulation constraints, past violations, and ordering of actions. For example, the Dialogue Module prompt may contain: background info, last k dialogue turns, current plan, and explicit instructions to avoid forbidden content (Cai et al., 5 May 2024).
  • Supervisory feedback is immediately logged, and violation history is appended to agent memory for subsequent reflection.
  • Reflection and planning modules iteratively update agent behavior by reviewing what succeeded or was flagged, inducing an evolution of communication strategies (coded language, metaphors, analogies) over simulation rounds.
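
Putting these pieces together, the loop might look like the following sketch, which assumes the illustrative `UserAgent` above and a hypothetical `supervisor.check` call returning a list of violations:

```python
def run_simulation(agents, supervisor, llm, scenario: str, rounds: int = 10):
    """Discrete-time, turn-based loop run by an environment manager."""
    history = []
    for t in range(rounds):
        for agent in agents:
            # Each turn: assemble context from scenario and recent history.
            context = f"Scenario: {scenario}\nLast turns: {history[-3:]}"
            utterance = agent.act(llm, context)
            history.append(utterance)

            # Supervisory feedback is logged immediately into agent memory.
            violations = supervisor.check(utterance)  # hypothetical moderator API
            agent.long_term.extend(violations)

        # Periodic reflection lets agents evolve strategies across rounds.
        if (t + 1) % 3 == 0:
            for agent in agents:
                agent.reflect(llm, agent.long_term[-10:])
    return history
```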

In more complex settings (e.g., education with psychologically grounded agents (Yuan et al., 7 Aug 2025)), agents' knowledge states $K_t \in \mathbb{R}^m$ and growth are formally tracked, with learning rates and forgetting dynamics parameterized by profile.
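
The cited frameworks do not share a single update rule, but a schematic version of profile-parameterized learning and forgetting dynamics could look like this (rates and functional form are illustrative assumptions):

```python
import numpy as np

def update_knowledge(K_t, exposure, learn_rate=0.3, forget_rate=0.05):
    """One step of knowledge-state dynamics for K_t in R^m (values in [0, 1]).

    Learning pulls exposed concepts toward mastery with diminishing returns;
    forgetting decays concepts that received no reinforcement this step.
    Both rates would be parameterized by the agent's profile.
    """
    learned = learn_rate * exposure * (1.0 - K_t)
    forgotten = forget_rate * K_t * (exposure == 0)
    return np.clip(K_t + learned - forgotten, 0.0, 1.0)

# Example: m = 4 concepts; the current lesson covers only the first two.
K = np.array([0.2, 0.5, 0.9, 0.0])
K = update_knowledge(K, exposure=np.array([1.0, 1.0, 0.0, 0.0]))
```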

Reward formulations (when explicit) balance task success (e.g., information evasion or retrieval, acquisition of correct concepts) and penalties (e.g., being flagged, making mistakes), though in many frameworks, adaptation is achieved through chain-of-thought reflection and memory-based selection rather than explicit policy gradients.
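
Where a reward is made explicit, it is typically a weighted balance of success terms and penalty terms; a schematic example (weights and terms are placeholders, not a published formulation):

```python
def reward(task_success: float, n_flags: int, n_mistakes: int,
           w_flag: float = 0.5, w_mistake: float = 0.2) -> float:
    """Schematic reward: task success minus supervisory and error penalties."""
    return task_success - w_flag * n_flags - w_mistake * n_mistakes
```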

3. Modeling Language Evolution, Believability, and Behavioral Diversity

A central aim in user simulation via LLM agents is the emergence of realistic, diverse, and adaptive behaviors:

  • Language evolution under regulatory pressure is modeled as an iterative process in which participant agents, forced by violation feedback, progress from using direct synonyms to employing metaphors, analogies, slang, or domain-specific codes to convey information without detection (Cai et al., 5 May 2024). LLM sampling randomness supplies the mutation operation, and selection occurs by preferentially adopting successful strategies identified via reflection.
  • Behavioral believability is evaluated by comparing simulated outputs to human references. Metrics include turn-level accuracy, plausibility (as rated by human judges or discriminators), and the ability to resist detection or convey information under constraint (evasion rate, comprehension score, etc.).
  • Personality and diversity are induced by varying prompt instructions (e.g., deep vs. surface vs. lazy learner (Yuan et al., 7 Aug 2025)), profile attributes, sampling temperatures, and guidance. Diversity of generated needs, ideas, or responses is quantifiable using embedding-based measures: convex hull volume, distance-to-centroid, and silhouette score (Ataei et al., 4 Apr 2024).
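
These measures are simple to compute over response embeddings with standard scientific-Python tooling; in this sketch the embedding matrix is assumed to come from any sentence encoder, and the two-cluster choice for the silhouette score is arbitrary:

```python
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def diversity_metrics(embeddings: np.ndarray) -> dict:
    """Embedding-based diversity: centroid spread, hull volume, silhouette."""
    centroid = embeddings.mean(axis=0)
    spread = float(np.linalg.norm(embeddings - centroid, axis=1).mean())

    # ConvexHull needs more points than dimensions; in practice embeddings
    # are first projected down (e.g., via PCA) before computing hull volume.
    hull_volume = (float(ConvexHull(embeddings).volume)
                   if embeddings.shape[0] > embeddings.shape[1] else None)

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
    return {"distance_to_centroid": spread,
            "convex_hull_volume": hull_volume,
            "silhouette": float(silhouette_score(embeddings, labels))}
```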

4. Evaluation Methodologies and Domains

Evaluation strategies are scenario-dependent and typically target both quantitative and qualitative aspects:

| Scenario / Domain | Key Metrics / Benchmarks | Notable Findings |
|---|---|---|
| Language evasion (regulation) | Evasion rate, comprehension score, dialogue attempts | LLM agents evolve robust coded communication; model capacity impacts speed and fidelity (Cai et al., 5 May 2024) |
| Classroom/education simulation | Monthly accuracy, trap detection, self-concept, CoI/FIAS | Deep learners alone sustain cumulative growth (Yuan et al., 7 Aug 2025; Zhang et al., 27 Jun 2024) |
| Requirements elicitation | Latent-need discovery, diversity metrics | Serial agent generation maximizes coverage; CoT aids latent-need detection (Ataei et al., 4 Apr 2024) |
| Dialogue systems (DST) | Joint goal accuracy, slot F1 | LLM-generated data meaningfully augments real data (Niu et al., 17 May 2024) |
| Usability/user experience | Task success, steps, session time, help/trust ratings | Agents enable rapid parallel pilot-testing, though realism of skimming and gist-taking is limited (Lu et al., 13 Apr 2025) |
| Team/embodied interaction | Task completion, coordination, believability | Agents reveal gaps in emergent leadership and collaborative adaptation (Almutairi et al., 9 Oct 2025; Philipov et al., 31 Oct 2024) |

Common techniques include human-in-the-loop ratings, ablation studies, adversarial/hybrid experiments, and alignment of aggregate simulation outcomes with real-world distributions (state-by-state predictions in elections (Zhang et al., 14 Apr 2025), population-level behavioral curves).
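
Alignment with real-world distributions reduces to comparing simulated and observed aggregates; a minimal sketch with synthetic, purely illustrative numbers:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical state-level outcome shares: simulated population vs. ground truth.
simulated = np.array([0.52, 0.48, 0.61, 0.44, 0.55])
observed = np.array([0.50, 0.47, 0.63, 0.45, 0.53])

r, _ = pearsonr(simulated, observed)               # cross-state correlation
mae = float(np.abs(simulated - observed).mean())   # mean absolute error
print(f"state-level correlation r={r:.3f}, MAE={mae:.3f}")
```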

5. Representative Use Cases and Key Strategies

LLM-based user simulation frameworks are instrumental in several high-impact contexts:

  • Adversarial input generation for robustness testing: By simulating users that can adaptively evade content moderation (via metaphor, analogy, etc.), these agents enable pre-deployment stress-testing of platform policies (Cai et al., 5 May 2024).
  • Scalable behavioral data generation: Synthetic user-agent traces augment or replace expensive-to-obtain real interaction datasets for recommender training, dialogue state tracking, information retrieval, and educational applications (Ren et al., 27 Feb 2024, Niu et al., 17 May 2024).
  • User-centered design and requirements elicitation: Pools of diverse, context-aware user agents simulate product interaction scenarios, answer follow-up interviews, and reveal latent needs more efficiently than traditional empathic lead-user human interviews (Ataei et al., 4 Apr 2024).
  • Sociolinguistic and social science research: Large-scale simulations with demographically sampled LLM agents enable the study of language evolution, conformity, herding, and policy impact at population scale (Zhang et al., 14 Apr 2025, Liu et al., 12 Dec 2024).

Salient implementation strategies include modular, highly parameterized agent generation (favoring serial, context-aware LLM calls for diversity), explicit violation/memory logs to drive language or strategy evolution, reflective and iterative prompt chains for plan adaptation, and both quantitative and qualitative alignment verification against human baselines.
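
The serial, context-aware generation pattern is straightforward: each new persona request carries all personas generated so far, pushing the model away from repeats. A sketch (prompt wording and the `llm` callable are placeholders):

```python
def generate_personas_serially(llm, n: int, domain: str) -> list:
    """Serial, context-aware persona generation: each call sees prior outputs,
    which yields more diverse user pools than independent parallel calls."""
    personas = []
    for _ in range(n):
        prior = "\n".join(f"- {p}" for p in personas) or "(none yet)"
        prompt = (
            f"Existing simulated users for {domain}:\n{prior}\n"
            "Invent one NEW user persona that differs from all of the above "
            "in demographics, goals, and expertise. Answer in one line."
        )
        personas.append(llm(prompt).strip())
    return personas
```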

6. Limitations, Challenges, and Open Directions

Despite notable empirical advances, several persistent limitations and open challenges remain:

  • Fidelity and overfitting to prompt structure: Simulated behaviors are highly prompt-dependent and may not robustly generalize across LLM backends. Turn-taking, abstraction, and subtle cues (especially in embodied or multimodal settings) are less accurately reproduced (Cai et al., 5 May 2024, Philipov et al., 31 Oct 2024).
  • Memory and long-term adaptation: Most systems rely on bounded memory windows; hierarchical or episodic memory, and mechanisms for true long-term adaptation and forgetting, are rare but promising (Liu et al., 12 Dec 2024, Liu et al., 19 Feb 2025).
  • Evaluation gaps: There is no universal metric suite for behavioral realism across interaction, memory, and social emergence. Human-in-the-loop Turing-style evaluations, paired with entropy and diversity metrics, are current practice but lack standardization (Peng et al., 14 Feb 2025, Ataei et al., 4 Apr 2024).
  • Scalability and cost: Large populations of agents incur linear cost growth in API tokens and latency; hybrid approaches (diffusion-model + LLM core) have been proposed for tractability in social simulations (Li et al., 18 Oct 2025).
  • Ethics and risk of over-reliance: Agent outputs may not capture human biases or out-of-training-distribution behaviors, and may encourage premature substitution for real human participant studies without proper validation (Lu et al., 13 Apr 2025).
  • Open research domains: Integrating more complex cognitive models (emotion, trust, persuasion), richer multi-modal grounding, event-driven (continuous-time) simulation, and adaptive, fine-tuned policies remain major directions for future research.

In summary, user simulation via LLM agents operationalizes modular, memory-augmented, and adaptive multi-agent systems capable of reproducing a broad spectrum of human communication and decision-making patterns, including the evolution of coded language under constraint, educational progression, social conformity, and product interaction. Such frameworks are now fundamental both for controlled scientific inquiry into human–machine and human–human interaction and for building robust, adaptive, and user-aligned AI systems.
