Introduction
Large language models (LLMs) have transformed the computational linguistics landscape, demonstrating impressive proficiency in understanding and generating human language. These advances open up new possibilities for AI applications that interact with users in complex and nuanced ways. Evaluating the role knowledge of these models is crucial, as it underpins their ability to maintain coherent and contextually appropriate dialogues, especially in scenarios that require consistent character portrayal or personality.
Benchmarking Role Knowledge
To benchmark role knowledge in LLMs, a new bilingual evaluation framework called RoleEval was introduced. RoleEval systematically assesses the ability of LLMs to memorize, utilize, and reason with role knowledge, covering real-world figures and fictional characters from diverse domains such as celebrities, anime, comics, movies, TV series, games, and fiction. The benchmark comprises 6,000 Chinese-English parallel multiple-choice questions, divided into two components, RoleEval-Global and RoleEval-Chinese, which evaluate LLMs on their understanding of globally influential and China-specific characters, respectively.
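To make the data layout concrete, the sketch below models one benchmark item as a small Python dataclass. The field names (question_zh, question_en, choices, answer, subset, domain) and the sample content are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class RoleEvalItem:
    """One bilingual multiple-choice item (hypothetical schema)."""
    question_zh: str    # original Chinese question
    question_en: str    # human-revised English translation
    choices: list[str]  # answer options, labeled A-D
    answer: str         # gold answer letter, e.g. "A"
    subset: str         # "global" or "chinese"
    domain: str         # e.g. "anime", "movies", "celebrities"

# Example item; the content here is invented for illustration only.
item = RoleEvalItem(
    question_zh="孙悟空的师父是谁？",
    question_en="Who is Sun Wukong's master?",
    choices=["Tang Sanzang", "Zhu Bajie", "Sha Wujing", "Guanyin"],
    answer="A",
    subset="chinese",
    domain="fiction",
)
```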
Quality Assurance and Translation
RoleEval's construction involves a hybrid quality-assurance process that combines automatic verification with tools like GPT-3 and human oversight. This check enforces the comprehensiveness and diversity of the questions, as well as their discrimination and difficulty, making the benchmark robust and challenging. Questions are initially written in Chinese and then translated into English with GPT-4, followed by careful human revision to preserve the accuracy and integrity of role-related information across languages.
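As a rough illustration of the translate-then-revise step, the snippet below calls the OpenAI chat API to draft an English translation that a human reviewer would then correct. This is a minimal sketch under assumed settings (model name, prompt wording), not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_translation(question_zh: str) -> str:
    """Machine-translate a Chinese question; a human reviser corrects the draft."""
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name, matching the GPT-4 step described above
        messages=[
            {"role": "system",
             "content": "Translate the following Chinese multiple-choice question "
                        "into English. Keep character names and role-specific "
                        "details exactly as they are."},
            {"role": "user", "content": question_zh},
        ],
        temperature=0,  # deterministic output for reproducible drafts
    )
    return response.choices[0].message.content

# draft = draft_translation("孙悟空的师父是谁？")  # draft then goes to a human reviewer
```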
Model Evaluations and Insights
LLMs of varying sizes and languages were put through rigorous zero-shot and few-shot evaluations on RoleEval, uncovering nuanced insights into their performance. GPT-4 leads on RoleEval-Global, while Chinese LLMs such as Qwen-72B and Yi-34B perform commendably on RoleEval-Chinese. These findings reveal significant discrepancies in how knowledge is distributed across models and underscore the need for further research into cross-lingual and culture-specific role knowledge in LLMs. RoleEval aims to be a stepping stone for future work on precisely evaluating the role-playing abilities of LLMs.
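To make the evaluation protocol concrete, here is a hedged sketch of few-shot multiple-choice scoring: k solved exemplars are prepended to each test question, the model's reply is reduced to an answer letter, and accuracy is averaged over the set. The prompt format and the `query` callable are assumptions for illustration, not the paper's exact setup.

```python
from typing import Callable

LETTERS = "ABCD"

def format_item(question: str, choices: list[str], answer: str | None = None) -> str:
    """Render one multiple-choice item; include the answer for few-shot exemplars."""
    lines = [question] + [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def evaluate(items: list[dict], exemplars: list[dict],
             query: Callable[[str], str]) -> float:
    """Few-shot accuracy: prepend solved exemplars, then score each test item."""
    shots = "\n\n".join(format_item(e["question"], e["choices"], e["answer"])
                        for e in exemplars)
    correct = 0
    for it in items:
        prompt = (shots + "\n\n" if shots else "") + format_item(it["question"], it["choices"])
        reply = query(prompt).strip()
        # Take the first A-D letter in the reply as the predicted answer.
        pred = next((ch for ch in reply if ch in LETTERS), "")
        correct += pred == it["answer"]
    return correct / len(items)

# Zero-shot is the special case exemplars=[]; `query` wraps whichever LLM is under test.
```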