How Far Are LLMs from Believable AI? A Benchmark for Evaluating the Believability of Human Behavior Simulation (2312.17115v2)

Published 28 Dec 2023 in cs.CL and cs.CY

Abstract: In recent years, AI has demonstrated remarkable capabilities in simulating human behaviors, particularly in simulations built with LLMs. However, due to the lack of systematic evaluation of LLMs' simulated behaviors, their believability to humans remains ambiguous, i.e., it is unclear which LLM behaviors are convincingly human-like and which need further improvement. In this work, we design SimulateBench to evaluate the believability of LLMs when simulating human behaviors. Specifically, we evaluate believability along two critical dimensions: 1) consistency: the extent to which LLMs can behave consistently with the given information of the human they simulate; and 2) robustness: the ability of LLMs' simulated behaviors to remain stable when faced with perturbations. SimulateBench includes 65 character profiles and a total of 8,400 questions to examine LLMs' simulated behaviors. Based on SimulateBench, we evaluate the performance of 10 widely used LLMs when simulating characters. The experimental results reveal that current LLMs struggle to align their behaviors with assigned characters and are vulnerable to perturbations in certain factors.
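To make the two scored dimensions concrete, below is a minimal sketch of how a SimulateBench-style consistency and robustness evaluation could be computed over profile-grounded multiple-choice questions. Everything here is an illustrative assumption: the function names, the prompt wording, the `model` callable, and the robustness ratio are not taken from the paper, whose actual prompts and metrics may differ.

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    options: list[str]   # multiple-choice options shown to the model
    answer: str          # gold option letter consistent with the profile

def simulate_answer(model, profile: str, question: Question) -> str:
    """Ask `model` (a callable: prompt str -> completion str) to answer in character."""
    prompt = (
        f"You are role-playing the following person:\n{profile}\n\n"
        f"Question: {question.text}\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(question.options))
        + "\nAnswer with a single option letter."
    )
    return model(prompt).strip()[:1]  # assume the reply starts with the option letter

def consistency_score(model, profile: str, questions: list[Question]) -> float:
    """Fraction of questions the simulated character answers consistently with its profile."""
    correct = sum(simulate_answer(model, profile, q) == q.answer for q in questions)
    return correct / len(questions)

def robustness_ratio(model, profile: str, perturbed: list[str],
                     questions: list[Question]) -> float:
    """Worst-case consistency under perturbed profiles, relative to the original.

    A hypothetical proxy for the robustness dimension: perturb profile factors
    (e.g. age or nationality) and measure how far consistency degrades.
    """
    base = consistency_score(model, profile, questions)
    scores = [consistency_score(model, p, questions) for p in perturbed]
    return min(scores) / base if base else 0.0

# Tiny smoke test with a stand-in "model" that always answers "A".
if __name__ == "__main__":
    qs = [Question("Which drink do you prefer?", ["Tea", "Coffee"], "A")]
    print(consistency_score(lambda _: "A", "Age: 30. Enjoys tea.", qs))  # 1.0
```

The sketch reduces believability to accuracy against profile-consistent gold answers, which matches the benchmark's question-based design at a high level; the paper's 8,400 questions and perturbation scheme would slot into `questions` and `perturbed` respectively.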

Authors (6)
  1. Yang Xiao
  2. Yi Cheng
  3. Jinlan Fu
  4. Jiashuo Wang
  5. Wenjie Li
  6. Pengfei Liu