
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models (2405.18027v1)

Published 28 May 2024 in cs.CL

Abstract: While LLMs can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.

Authors (7)
  1. Jaewoo Ahn (7 papers)
  2. Taehyun Lee (3 papers)
  3. Junyoung Lim (1 paper)
  4. Jin-Hwa Kim (42 papers)
  5. Sangdoo Yun (71 papers)
  6. Hwaran Lee (31 papers)
  7. Gunhee Kim (74 papers)