RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models (2312.16132v2)

Published 26 Dec 2023 in cs.CL

Abstract: The rapid evolution of LLMs necessitates effective benchmarks for evaluating their role knowledge, which is essential for establishing connections with the real world and providing more immersive interactions. This paper introduces RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge. RoleEval comprises RoleEval-Global (including internationally recognized characters) and RoleEval-Chinese (including characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions focusing on 300 influential people and fictional characters drawn from a variety of domains including celebrities, anime, comics, movies, TV series, games, and fiction. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe various aspects such as personal information, relationships, abilities, and experiences of the characters. To maintain high standards, we perform a hybrid quality check process combining both automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative. Our extensive evaluations with RoleEval across various open-source and proprietary LLMs, under both the zero- and few-shot settings, reveal insightful findings. Notably, while GPT-4 outperforms other models on RoleEval-Global, Chinese LLMs excel on RoleEval-Chinese, highlighting significant knowledge distribution differences. We expect that RoleEval will highlight the significance of assessing role knowledge for LLMs across various languages and cultural settings.

Introduction

LLMs have transformed the computational linguistics landscape, demonstrating impressive proficiency in understanding and generating human language. These advancements have opened up new possibilities for AI applications that can interact with users in complex and nuanced ways. Evaluating the role knowledge of these models is crucial as it underpins their ability to maintain coherent and contextually appropriate dialogues, especially in scenarios where character portrayal or personality consistency is required.

Benchmarking Role Knowledge

To benchmark role knowledge in LLMs, a new bilingual evaluation framework called RoleEval was introduced. RoleEval systematically assesses the ability of LLMs to memorize, utilize, and reason with role knowledge, encompassing real-world figures and fictional characters from diverse domains such as celebrities, anime, comics, movies, TV series, games, and fiction. The benchmark includes 6,000 Chinese-English parallel multiple-choice questions, which are divided into two components: RoleEval-Global and RoleEval-Chinese, designed to evaluate LLMs on their understanding of global and China-specific influential characters.
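To make the data layout concrete, here is a minimal sketch of how one RoleEval item could be represented. The schema is an illustrative assumption (field names such as `question_zh`, `split`, and `skill` are not from the paper's released format); it simply mirrors the properties described above: parallel Chinese-English text, one of seven domains, a Global or Chinese split, and a knowledge or reasoning question type.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoleEvalItem:
    """Hypothetical schema for one Chinese-English parallel multiple-choice item."""
    character: str       # e.g. "Harry Potter"
    domain: str          # celebrities, anime, comics, movies, TV series, games, or fiction
    split: str           # "global" (RoleEval-Global) or "chinese" (RoleEval-Chinese)
    question_zh: str     # question as originally written in Chinese
    question_en: str     # GPT-4 translation, revised by human annotators
    choices: List[str]   # four answer options
    answer: int          # index of the correct option
    skill: str           # "knowledge" (memorization) or "reasoning" (multi-hop)

# Example item (contents invented for illustration):
item = RoleEvalItem(
    character="Harry Potter",
    domain="fiction",
    split="global",
    question_zh="哈利·波特的守护神是什么动物？",
    question_en="What animal form does Harry Potter's Patronus take?",
    choices=["A stag", "An otter", "A phoenix", "A doe"],
    answer=0,
    skill="knowledge",
)
```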

Quality Assurance and Translation

RoleEval's construction involves a hybrid quality assurance process that combines automatic verification, using tools such as GPT-3, with human oversight. This check ensures that the questions are comprehensive and diverse and that they reach the intended levels of difficulty and discriminative power, making the benchmark robust and challenging. The questions are initially written in Chinese and then translated into English with GPT-4, followed by careful human revision to preserve the accuracy and integrity of role-related information across languages.
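As a rough illustration of the translation step, the sketch below drafts an English version of a Chinese question with GPT-4 and flags it for the human revision pass described above. It is a minimal sketch assuming the OpenAI Python client; the prompt wording and the `translate_question` helper are assumptions for illustration, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_question(question_zh: str, choices_zh: list[str]) -> dict:
    """Draft an English translation of a Chinese multiple-choice question.

    The draft is not final: per the paper, machine translation is followed
    by human revision to keep role-related facts accurate across languages.
    """
    prompt = (
        "Translate this Chinese multiple-choice question about a person or "
        "fictional character into English. Keep names and role-specific "
        "facts exactly as they are.\n\n"
        f"Question: {question_zh}\nOptions: {' / '.join(choices_zh)}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic drafts simplify human review
    )
    return {"draft_en": resp.choices[0].message.content, "needs_human_review": True}
```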

Model Evaluations and Insights

LLMs of varying sizes and languages were evaluated with RoleEval under both zero-shot and few-shot settings, yielding nuanced insights into their performance. GPT-4 leads on RoleEval-Global, while Chinese LLMs such as Qwen-72B and Yi-34B excel on RoleEval-Chinese. These findings point to significant differences in knowledge distribution across models and underscore the need for further research on cross-lingual and culture-specific role knowledge in LLMs. RoleEval aims to serve as a stepping stone for more precise evaluation of the role-playing abilities of LLMs.
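For intuition, here is a minimal sketch of how such zero- and few-shot multiple-choice evaluation could be scored. The prompt template and the `model_answer` callable (anything that maps a prompt string to a predicted letter) are assumptions for illustration, not the paper's exact setup.

```python
LETTERS = "ABCD"

def format_prompt(item: dict, demos: tuple = ()) -> str:
    """Build a multiple-choice prompt: zero-shot if demos is empty,
    few-shot if demos holds answered example items."""
    blocks = []
    for d in (*demos, item):
        options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(d["choices"]))
        answer = LETTERS[d["answer"]] if d is not item else ""  # leave target blank
        blocks.append(f"Question: {d['question_en']}\n{options}\nAnswer: {answer}")
    return "\n\n".join(blocks)

def accuracy(model_answer, dataset, demos=()):
    """model_answer: callable taking a prompt string, returning 'A'-'D'."""
    hits = sum(model_answer(format_prompt(x, demos)) == LETTERS[x["answer"]]
               for x in dataset)
    return hits / len(dataset)
```

Swapping `demos` from an empty tuple to a handful of answered items switches the same harness from the zero-shot to the few-shot setting.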

Authors (4)
  1. Tianhao Shen
  2. Sun Li
  3. Deyi Xiong
  4. Quan Tu