TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models (2405.18027v1)
Abstract: While LLMs can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users' narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters' identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study.
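The core evaluation idea — a character fixed at a specific narrative moment must not exhibit knowledge of later events — can be illustrated with a minimal sketch. All names, the event list, and the keyword-based check below are hypothetical simplifications for illustration, not the paper's actual automated pipeline:

```python
# Hypothetical sketch: flag responses that mention events occurring after
# the character's fixed point in the narrative timeline (a simple proxy
# for point-in-time character hallucination).

from dataclasses import dataclass

@dataclass
class Event:
    name: str        # e.g., "the Battle of Hogwarts"
    book_index: int  # position in the series timeline

def find_future_event_mentions(response: str,
                               events: list[Event],
                               current_book: int) -> list[str]:
    """Return names of events mentioned in `response` that happen
    after `current_book` -- i.e., knowledge the character should not have."""
    return [e.name for e in events
            if e.book_index > current_book and e.name.lower() in response.lower()]

events = [
    Event("the Triwizard Tournament", 4),
    Event("the Battle of Hogwarts", 7),
]

# A character situated at book 2 should know about neither event.
flags = find_future_event_mentions(
    "I can't wait for the Battle of Hogwarts!", events, current_book=2)
print(flags)  # ['the Battle of Hogwarts']
```

A real benchmark would of course need far more than keyword matching (paraphrase, implication, and free-form judgments), which is why the paper relies on an automated generation pipeline and LLM-based evaluation rather than string lookup.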
Authors: Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim