
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation (2401.01275v2)

Published 2 Jan 2024 in cs.CL

Abstract: Recently, the advent of LLMs has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

An Analytical Overview of "CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation"

The paper "CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation" by Quan Tu et al. introduces CharacterEval, a benchmark aimed at evaluating Role-Playing Conversational Agents (RPCAs) using LLMs. This research addresses the gap in comprehensive benchmarks necessary for the advancement of RPCAs, particularly those embedded in the Chinese cultural context.

Dataset Construction and Characteristics

CharacterEval is built upon a richly detailed dataset of Chinese multi-turn role-playing dialogues derived from Chinese novels and scripts. The dataset comprises 1,785 dialogues, 23,020 examples, and 77 distinct characters. The dialogues were constructed through a meticulous process: initial dialogue extraction with GPT-4, thorough human-led quality control, and augmentation with character profiles sourced from Baidu Baike. This process ensures that the dataset maintains both depth and authenticity, reflecting the cultural nuances embedded in Chinese literary works.
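
To make the data layout concrete, the following is a minimal sketch of how a single CharacterEval example could be represented in Python; the class and field names (Turn, RolePlayExample, profile, context, and so on) are illustrative assumptions rather than the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One utterance in a multi-turn role-playing dialogue."""
    speaker: str  # the user or the character being played
    text: str     # utterance text (Chinese in CharacterEval)


@dataclass
class RolePlayExample:
    """Hypothetical record for one evaluation example.

    Field names are assumptions for illustration; the released data
    may use a different schema.
    """
    character: str       # character drawn from a Chinese novel or script
    profile: str         # character profile sourced from Baidu Baike
    context: List[Turn]  # preceding dialogue turns given to the RPCA
    reference: Turn      # the character's gold response for this example
```

Under this assumed layout, an RPCA is conditioned on the profile and the dialogue context and asked to produce the next in-character turn, which is then judged against the evaluation metrics described in the next section.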

Evaluation Metrics

The CharacterEval framework adopts a multifaceted evaluation methodology, characterized by thirteen metrics distributed across four principal dimensions: conversational ability, character consistency, role-playing attractiveness, and personality back-testing. Such an approach permits a nuanced assessment of an RPCA's technical prowess and its capacity to resonate emotively and authentically with users.

  1. Conversational Ability: This dimension evaluates the fluency, coherence, and consistency of the conversational responses generated by the RPCA.
  2. Character Consistency: Particularly vital, this dimension examines knowledge exposure, accuracy, and hallucination, as well as persona behavior and utterance, to ensure responses align with a character's established personality and knowledge base.
  3. Role-Playing Attractiveness: Through metrics such as human-likeness and empathy, this dimension assesses how strongly an RPCA can engage users emotionally, reflecting the value these agents promise in entertainment contexts.
  4. Personality Back-Testing: Using the Myers-Briggs Type Indicator (MBTI), this dimension measures how accurately an RPCA embodies and portrays the intended personality (see the scoring sketch after this list).
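
As a rough illustration of how the thirteen metrics roll up into dimension-level results, here is a minimal sketch assuming each response receives a per-metric score on a 1-to-5 scale (whether from human annotators or a learned reward model) and that dimension scores are simple averages; the metric names and their grouping below are paraphrased from the paper rather than exact identifiers, and personality back-testing is treated separately as an MBTI-matching accuracy.

```python
from statistics import mean
from typing import Dict, List

# Assumed grouping of twelve scored metrics into three of the four
# dimensions; names are paraphrased, not the paper's exact identifiers.
DIMENSIONS: Dict[str, List[str]] = {
    "conversational_ability": ["fluency", "coherence", "consistency"],
    "character_consistency": [
        "knowledge_exposure", "knowledge_accuracy", "knowledge_hallucination",
        "persona_behavior", "persona_utterance",
    ],
    "role_playing_attractiveness": [
        "human_likeness", "communication_skills", "expression_diversity", "empathy",
    ],
}


def dimension_scores(metric_scores: Dict[str, float]) -> Dict[str, float]:
    """Average per-metric scores (assumed 1-5 scale) within each dimension."""
    return {
        dim: mean(metric_scores[m] for m in metrics)
        for dim, metrics in DIMENSIONS.items()
    }


def mbti_accuracy(predicted_types: List[str], labelled_types: List[str]) -> float:
    """Personality back-testing: fraction of characters whose MBTI type,
    inferred from the agent's questionnaire answers, matches the label."""
    assert len(predicted_types) == len(labelled_types)
    matches = sum(p == t for p, t in zip(predicted_types, labelled_types))
    return matches / len(labelled_types)
```

In this sketch a model is reported with four numbers, one per dimension, plus the MBTI accuracy, rather than a single aggregate score, mirroring the per-dimension comparisons between models discussed next.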

Empirical Insights

The experimental evaluations reveal a significant insight: Chinese LLMs demonstrate more promising capabilities than GPT-4 at capturing the nuances of Chinese role-playing dialogue. Across the dimensions, Baichuan2 and InternLM, among other Chinese LLMs, perform strongly, whereas GPT-4's comparatively weaker showing underscores the challenges faced by models not primarily trained on Chinese linguistic and cultural data.

Theoretical and Practical Implications

The introduction of CharacterEval marks an important step toward the systematic assessment and development of RPCAs. The paper's findings suggest that tailored LLMs, closely aligned with culturally and contextually rich datasets, will drive the optimization of RPCA design. This supports the view that specificity in model training and evaluation, informed by cultural context, is likely to yield superior performance.

Moving forward, this benchmark can be pivotal in further exploration of RPCA capabilities, suggesting a path toward the refinement of models specializing in culturally specific role-playing interactions. It may also pave the way for robust AI systems tasked with deep, human-like emulations in diverse literary and fictional domains.

Overall, CharacterEval stands as a critical resource for evaluating conversational AI focused on emotional engagement and culturally rich interactions, thereby catalyzing further research and development in the field.

Authors (4)
  1. Quan Tu (16 papers)
  2. Shilong Fan (5 papers)
  3. Zihang Tian (3 papers)
  4. Rui Yan (250 papers)
Citations (40)