RoleInteract: Evaluating the Social Interaction of Role-Playing Agents (2403.13679v3)

Published 20 Mar 2024 in cs.CL

Abstract: LLMs have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, more than 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents excelling at the individual level do not necessarily perform well at the group level. Moreover, the behavior of individuals may drift under the influence of other agents within the group. Experimental results on RoleInteract confirm its significance as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly accessible at https://github.com/X-PLUG/RoleInteract.
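Because the abstract distinguishes individual-level from group-level social interaction, a natural way to use such a benchmark is to score a model's answers separately per level. The sketch below is a minimal illustration of that bookkeeping, not the official evaluation code: the file name `roleinteract_individual.json` and the record fields `level`, `prompt`, and `answer` are assumptions for illustration; the actual data layout in the X-PLUG/RoleInteract repository may differ.

```python
import json
from collections import defaultdict

# Hypothetical file name; check https://github.com/X-PLUG/RoleInteract
# for the real release layout.
BENCHMARK_PATH = "roleinteract_individual.json"

def load_benchmark(path):
    """Load benchmark records (list of dicts) from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def accuracy_by_level(records, predict):
    """Score a model separately on individual- and group-level items.

    `predict` is any callable that maps a question prompt to an answer;
    the field names below ("level", "prompt", "answer") are assumed.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in records:
        level = item["level"]  # assumed: "individual" or "group"
        total[level] += 1
        if predict(item["prompt"]) == item["answer"]:
            correct[level] += 1
    return {lvl: correct[lvl] / total[lvl] for lvl in total}
```

Reporting the two levels separately, rather than one pooled score, is what lets the finding above surface: an agent can score well on individual-level items while still degrading in group settings.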

Authors (11)
  1. Hongzhan Chen
  2. Hehong Chen
  3. Ming Yan
  4. Wenshen Xu
  5. Xing Gao
  6. Weizhou Shen
  7. Xiaojun Quan
  8. Chenliang Li
  9. Ji Zhang
  10. Fei Huang
  11. Jingren Zhou