RoleInteract: Evaluating the Social Interaction of Role-Playing Agents (2403.13679v3)
Abstract: Large language models (LLMs) have advanced the development of various AI conversational agents, including role-playing conversational agents that mimic diverse characters and human behaviors. While prior research has predominantly focused on enhancing the conversational capability, role-specific knowledge, and stylistic attributes of these agents, there has been a noticeable gap in assessing their social intelligence. In this paper, we introduce RoleInteract, the first benchmark designed to systematically evaluate the sociality of role-playing conversational agents at both the individual and group levels of social interaction. The benchmark is constructed from a variety of sources and covers 500 characters, more than 6,000 question prompts, and 30,800 multi-turn role-playing utterances. We conduct comprehensive evaluations on this benchmark using mainstream open-source and closed-source LLMs. We find that agents that excel at the individual level are not necessarily proficient at the group level. Moreover, the behavior of individual agents may drift under the influence of other agents in the group. Experimental results on RoleInteract confirm its value as a testbed for assessing the social interaction of role-playing conversational agents. The benchmark is publicly available at https://github.com/X-PLUG/RoleInteract.
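To make the individual- versus group-level comparison described above concrete, here is a minimal evaluation sketch. The JSONL layout and the field names `character`, `level`, `question`, and `reference`, as well as the `answer_fn`/`score_fn` callables, are assumptions for illustration only; the actual data schema and scoring protocol are defined in the RoleInteract repository.

```python
import json
from collections import defaultdict

def load_benchmark(path):
    """Load benchmark items from a JSONL file (hypothetical layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def evaluate(items, answer_fn, score_fn):
    """Score a model separately per interaction level.

    answer_fn(character, question) -> model response string
    score_fn(response, reference)  -> numeric score
    """
    per_level = defaultdict(list)
    for item in items:
        # "character", "level", "question", "reference" are assumed
        # field names, not the repository's confirmed schema.
        response = answer_fn(item["character"], item["question"])
        per_level[item["level"]].append(score_fn(response, item["reference"]))
    # Averaging per level allows individual- and group-level sociality to be
    # compared directly, mirroring the paper's individual-vs-group finding.
    return {level: sum(s) / len(s) for level, s in per_level.items()}
```

Reporting one aggregate score per interaction level, rather than a single overall number, is what makes the paper's central observation visible: a model can rank highly at the individual level while scoring poorly at the group level.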
Authors: Hongzhan Chen, Hehong Chen, Ming Yan, Wenshen Xu, Xing Gao, Weizhou Shen, Xiaojun Quan, Chenliang Li, Ji Zhang, Fei Huang, Jingren Zhou