
Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Published 9 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.08362v1)

Abstract: The emergence of LLMs has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

Summary

  • The paper introduces OmniBehavior, a benchmark constructed from real-world Kuaishou logs to capture long-horizon, multi-scenario user behavior with high fidelity.
  • It details a rigorous pipeline including data cleaning, multi-modal fusion, and anonymization, ensuring robust simulation across five diverse scenarios.
  • Evaluation reveals that even state-of-the-art LLMs struggle with long-context reasoning and bias, highlighting the need for advanced memory architectures.

Real-world Human Behavior Simulation: OmniBehavior Benchmark for LLM-based User Simulators

Benchmark Motivation and Construction

LLMs have shown potential for user simulation in interactive systems, but prior benchmarks are confined to narrow, single-scenario datasets or synthetic data, and so fail to reflect the complexity of authentic human behavior. "Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces" (2604.08362) introduces OmniBehavior, a benchmark built entirely from real-world Kuaishou industrial logs that captures multi-scenario, long-horizon, and heterogeneous traces. The construction pipeline aggregates data from five major scenarios and applies two-layer data cleaning, multi-modal fusion, and strict anonymization. The result is a high-fidelity dataset that spans diverse user actions, demographic profiles, and sequence lengths of up to 100k steps. The benchmark addresses fragmented evaluation settings and sets a rigorous standard for LLM-based user simulation.

Figure 1: Overview of OmniBehavior, detailing real-world log collection, cleaning, sampling, and anonymization across scenario domains.

Empirical Analysis of Real User Behavior

The paper presents extensive analyses verifying that authentic human behavior modeling requires multi-scenario and long-horizon data. Profile reconstruction illustrates that single-scenario logs yield fragmented and biased user representations, while multi-scenario aggregation increases information coverage by 20–30%, eliminating tunnel vision. Quantitative tracing of 180 conversion events reveals that 81.8% of causal chains span multiple scenarios and persist over days, substantially exceeding traditional short-session windows. A case study demonstrates that a purchase decision accumulates over a 12-day sequence involving heterogeneous actions, confirming the necessity of benchmarks preserving causal integrity and cross-domain dependencies.

Figure 2: User profile reconstruction shows richer, less biased modeling with multi-scenario data.

Figure 3: Cross-scenario causal chain leading to purchase, evidencing long-term dependencies in user decision-making.
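
The cross-scenario statistic discussed in this section can be illustrated with a minimal sketch. It assumes each conversion's causal chain is already available as a list of timestamped events; the summary does not describe how the paper traces these chains, so the `Event` fields, the scenario labels, the action names, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One logged behavior (hypothetical schema, not the benchmark's own)."""
    timestamp: datetime
    scenario: str  # e.g. "video", "search", "e-commerce"
    action: str    # illustrative action label


def chain_stats(chain):
    """Return (number of distinct scenarios, time span in days) for one chain."""
    scenarios = {event.scenario for event in chain}
    times = [event.timestamp for event in chain]
    span_days = (max(times) - min(times)).total_seconds() / 86400
    return len(scenarios), span_days


def cross_scenario_fraction(chains):
    """Fraction of chains that touch more than one scenario."""
    multi = sum(1 for chain in chains if chain_stats(chain)[0] > 1)
    return multi / len(chains)


# Toy data: one chain that crosses three scenarios over 12 days,
# and one that stays inside a single scenario within an hour.
long_chain = [
    Event(datetime(2026, 1, 1), "video", "watch"),
    Event(datetime(2026, 1, 5), "search", "query"),
    Event(datetime(2026, 1, 13), "e-commerce", "purchase"),
]
short_chain = [
    Event(datetime(2026, 1, 2, 10), "e-commerce", "view_item"),
    Event(datetime(2026, 1, 2, 11), "e-commerce", "purchase"),
]

print(chain_stats(long_chain))                             # (3, 12.0)
print(cross_scenario_fraction([long_chain, short_chain]))  # 0.5
```

A single-scenario, single-session benchmark would only ever see chains like `short_chain`, which is the limitation the paper's analysis points to.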

Distributional analysis further finds that synthetic data lacks the stochastic, intertwined interest evolution observed in real trajectories, exhibiting mechanical and abrupt shifts instead. This underscores the irreplaceable value of benchmarks grounded in authentic logs.

Benchmark Scope and Diversity

OmniBehavior encompasses five principal scenarios: video browsing, live streaming, advertisement, e-commerce (including customer service dialogue), and search behavior, covering 22 distinct actions. Sequence lengths exhibit high variance, challenging models to reason over short-term and ultra-long interactions. The sampled user population spans diverse genders, age groups, and interests, capturing heterogeneous preferences and backgrounds essential for robust simulation.

Figure 4: Demographic and behavioral distributions evidence diversity across user population and interests.

Figure 5: OmniBehavior benchmark schema requires agents to predict heterogeneous behaviors across scenario-specific contexts.

Evaluation of SOTA LLMs

A comprehensive evaluation of both closed-source (Claude-4.5-Opus, Gemini-3-Flash, GPT-5.2, etc.) and open-source (GLM-4.7, DeepSeek-V3, Qwen3-235B, etc.) LLMs on OmniBehavior reveals significant limitations in high-fidelity simulation. Even the best model, Claude-4.5-Opus, achieves an overall score of only 44.55, with F1 scores not exceeding 40% for binary actions, indicating that instruction-tuned LLMs fall short of capturing long-tail, stochastic, and cross-scenario dependencies. Notably, larger context windows (up to 128K tokens) do not yield consistent improvement: current architectures struggle with long-context reasoning and appear to need more advanced memory management.
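
For readers less familiar with the metric, the following is a minimal sketch of the standard F1 score for a binary action (for example, whether a user performs a given action on an item). It uses the textbook definition only; how the paper aggregates per-action scores into its overall score is not described in this summary, and the toy labels below are made up for illustration.

```python
def binary_f1(y_true, y_pred):
    """Standard F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: the real user acts on 2 of 10 items,
# while the simulated user acts on 7 of them.
real = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
simulated = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

print(round(binary_f1(real, simulated), 3))  # 0.444
```

In this toy case recall is perfect, yet the five false positives pull precision down to 2/7 and F1 to 4/9, which shows how predicting a rare action too often keeps F1 low.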

Structural Bias in LLM Simulation

A systematic comparison reveals a fundamental structural bias termed positivity-and-average bias. LLM-based simulators systematically overestimate action probabilities (hyper-activity bias), erasing negative feedback signals critical for applications like churn prediction. Utterance and sentiment analysis demonstrate a pronounced "Utopian Tendency": LLMs generate highly polite, positive, and conflict-avoiding language, suppressing adversarial and dissatisfied interactions prevalent in real-world logs.

Figure 6: Real user vocabulary reveals friction; LLM simulators default to polite, sanitized language.
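
One simple way to make hyper-activity bias concrete is to compare, per action, how often the simulator triggers the action with how often real users do. The sketch below is an illustrative check under that assumption, not the paper's measurement procedure; the action names and 0/1 outcome lists are hypothetical.

```python
def action_rate(outcomes):
    """Share of opportunities on which the action was taken (0/1 outcomes)."""
    return sum(outcomes) / len(outcomes)


def activity_gaps(real, simulated):
    """Per-action difference: simulated rate minus real rate.

    Positive values mean the simulator is more active than real users.
    """
    return {
        action: action_rate(simulated[action]) - action_rate(real[action])
        for action in real
    }


# Hypothetical outcomes over four opportunities per action.
real_log = {"like": [0, 0, 0, 1], "follow": [0, 0, 0, 0]}
simulated_log = {"like": [1, 1, 0, 1], "follow": [0, 1, 0, 0]}

print(activity_gaps(real_log, simulated_log))  # {'like': 0.5, 'follow': 0.25}
```

Consistently positive gaps across actions would correspond to the over-estimation pattern described above, in which "no action" and negative feedback are under-represented.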

Quantitative behavioral analysis shows severe persona homogenization: distributions of intra- and inter-user behavioral distances substantially overlap in simulated populations, unlike the distinct separation found in real users. This convergence toward a generic "average person" undermines the modeling of individual differences and long-tail behaviors, limiting ecological validity.

Figure 7: Comparison of behavioral distance distributions; LLM simulators homogenize user personas compared to authentic populations.
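
The intra- versus inter-user comparison can be sketched as follows. The sketch assumes each user is represented by several behavior vectors (say, action frequencies per time window) and uses Euclidean distance; this summary does not specify the paper's behavior representation or distance measure, so both choices, along with the toy vectors, are assumptions made for illustration.

```python
import math
from itertools import combinations
from statistics import mean


def intra_inter_distances(users):
    """Split pairwise distances into same-user and different-user groups.

    users: dict mapping user id -> list of behavior vectors (equal length).
    """
    intra, inter = [], []
    for vectors in users.values():
        intra += [math.dist(a, b) for a, b in combinations(vectors, 2)]
    for (_, vecs_a), (_, vecs_b) in combinations(users.items(), 2):
        inter += [math.dist(a, b) for a in vecs_a for b in vecs_b]
    return intra, inter


def separation_ratio(users):
    """Mean inter-user distance divided by mean intra-user distance."""
    intra, inter = intra_inter_distances(users)
    return mean(inter) / mean(intra)


# Toy populations: two windows per user, two behavior dimensions.
real_users = {
    "u1": [(0.9, 0.1), (0.8, 0.2)],
    "u2": [(0.1, 0.9), (0.2, 0.8)],
}
simulated_users = {
    "s1": [(0.5, 0.5), (0.6, 0.4)],
    "s2": [(0.55, 0.45), (0.45, 0.55)],
}

print(separation_ratio(real_users) > 1)       # True: users are distinguishable
print(separation_ratio(simulated_users) > 1)  # False: personas blur together
```

A ratio well above 1 means users differ from each other more than from themselves over time. A ratio near or below 1 corresponds to the overlapping distributions reported for simulated populations.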

Implications and Future Directions

OmniBehavior exposes structural limitations of current LLMs for user simulation at scale. Practically, these findings undermine the reliability of LLM simulators for realistic interactive system evaluation, recommender testing, and behavioral modeling in social sciences. Theoretically, the results highlight deficiencies in long-context reasoning, causal modeling, and representation of behavioral heterogeneity. Addressing these gaps requires advances in memory architectures, structure-aware context integration, calibration for negative and long-tail signals, and techniques to counter alignment-induced persona homogenization. Future developments will likely focus on integrating real-world feedback, causal chain preservation, and explicit modeling of temporal dynamics and adversarial user patterns.

Conclusion

OmniBehavior delivers a new standard for high-fidelity, cross-scenario, long-horizon user simulation benchmarking, revealing substantial capability gaps and structural biases in contemporary LLMs. The benchmark enables rigorous evaluation and will guide future research in modeling authentic, heterogeneous, and causally rich human behaviors. The findings stress the importance of grounding LLM-based simulators in real-world data, with attention to preserving behavioral diversity and causal structure for robust deployment in AI-driven interactive systems.
