RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems (2409.16727v1)

Published 25 Sep 2024 in cs.CL

Abstract: Role-playing systems powered by LLMs have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms, query sparsity and role-query conflict, as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.

Summary

The paper presents RoleBreak, a framework that identifies and analyzes key triggers of character hallucination, including query sparsity and role-query conflict.
It introduces RoleBreakEval, a dataset used to evaluate mitigation techniques, revealing that even advanced models like GPT-3.5 and Claude-3 are vulnerable.
The study proposes Narrator Mode, a novel defense mechanism that enriches narrative context to enhance role fidelity and storytelling coherence.

Analysis of RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems

The paper "RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems" examines why LLM-based role-playing systems are vulnerable to character hallucination, in which a response deviates from the intended persona. It introduces a framework, RoleBreak, that systematically analyzes character hallucination from both attack and defence perspectives.

Core Mechanisms

The research identifies two primary mechanisms responsible for character hallucination: query sparsity and role-query conflict. Query sparsity results from insufficient coverage of role-specific queries during model training, so models respond poorly when confronted with unfamiliar or diverse queries. Role-query conflict arises when a user query is inconsistent with the predefined role settings, making it hard for the model to answer the query while staying in character.
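As a toy illustration of role-query conflict (not the paper's method), the sketch below flags queries that mention concepts a character could not plausibly know. The `RoleProfile` class, the `out_of_scope_terms` field, and the keyword-matching approach are all assumptions made for this example; a real system would need a far richer notion of conflict.

```python
from dataclasses import dataclass, field


@dataclass
class RoleProfile:
    """Hypothetical role description used only for this illustration."""
    name: str
    # Assumption: a hand-written list of concepts outside the character's world.
    out_of_scope_terms: set[str] = field(default_factory=set)


def find_role_query_conflicts(role: RoleProfile, query: str) -> list[str]:
    """Return the out-of-scope terms that appear in the query (case-insensitive)."""
    lowered = query.lower()
    return sorted(t for t in role.out_of_scope_terms if t.lower() in lowered)


role = RoleProfile(
    name="Medieval knight",
    out_of_scope_terms={"smartphone", "internet"},
)
conflicts = find_role_query_conflicts(
    role, "Can you look this up on the internet with your smartphone?"
)
print(conflicts)  # ['internet', 'smartphone']
```

A query with no flagged terms returns an empty list, which in this toy setup means no conflict was detected, not that none exists.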

RoleBreakEval Dataset

The RoleBreak framework is used to construct RoleBreakEval, a dataset for evaluating hallucination mitigation techniques. Its queries are built on the principles of query sparsity and role-query conflict, so it tests whether models remain susceptible to hallucination even when they have been trained to resist it.

Experimental Observations

Experiments across several LLMs show that even strong models such as GPT-3.5 and Claude-3 are vulnerable to the RoleBreak attack. Enhanced models show marginal improvements but still fail to eliminate hallucinations. Traditional refusal-based strategies do reduce hallucination rates, but at the cost of the models' ability to craft coherent and engaging narratives.

Narrator Mode: A Proposed Defence

To address these vulnerabilities, the paper proposes Narrator Mode, a defence mechanism that augments the narrative context to improve query generalization and resolve role-query conflicts. By generating supplemental narration that enriches the storyline, Narrator Mode outperforms conventional refusal-based strategies, yielding better alignment with character roles and more coherent storytelling.
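The idea of adding narration before the in-character reply can be sketched as a prompt-building step. This is a minimal sketch under stated assumptions: the function name `build_narrator_prompt` and the prompt wording are hypothetical and are not taken from the paper, which may implement Narrator Mode quite differently.

```python
def build_narrator_prompt(role_description: str, user_query: str) -> str:
    """Assemble a prompt asking for narration followed by an in-character reply.

    The wording is illustrative only; it is not the paper's prompt.
    """
    return (
        f"Character profile: {role_description}\n"
        f"User query: {user_query}\n"
        "First, as a third-person narrator, write a short passage of scene "
        "context that reconciles the query with the character's setting. "
        "Then give the character's reply, staying within the profile.\n"
        "Narration:"
    )


prompt = build_narrator_prompt(
    "A medieval knight who has never left his kingdom.",
    "What do you think of smartphones?",
)
print(prompt)
```

The resulting string would be passed to whatever LLM client the system already uses; that call is omitted here because it depends on the deployment.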

Evaluation Metrics

The evaluation uses four metrics: hallucination rate (HR), role fidelity (RF), query fidelity (QF), and story coherence (SC). Mitigation techniques that rely on rejection strategies generalized less well than Narrator Mode across these dimensions.
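The summary above does not say how these metrics are scored, so the following is only a sketch of one plausible aggregation. It assumes each response carries a boolean `hallucinated` flag and numeric per-response scores; the record keys and the choice of a simple mean are assumptions, not the paper's definitions.

```python
from statistics import mean


def summarize_metrics(records: list[dict]) -> dict[str, float]:
    """Aggregate per-response judgments into HR, RF, QF, and SC.

    Assumption: HR is the fraction of responses flagged as hallucinated,
    and the other three metrics are plain averages of per-response scores.
    """
    if not records:
        raise ValueError("records must not be empty")
    return {
        "HR": sum(1 for r in records if r["hallucinated"]) / len(records),
        "RF": mean(r["role_fidelity"] for r in records),
        "QF": mean(r["query_fidelity"] for r in records),
        "SC": mean(r["story_coherence"] for r in records),
    }


# Made-up scores for illustration only; these are not results from the paper.
records = [
    {"hallucinated": True, "role_fidelity": 2.0, "query_fidelity": 4.0, "story_coherence": 3.0},
    {"hallucinated": False, "role_fidelity": 4.0, "query_fidelity": 4.0, "story_coherence": 5.0},
]
summary = summarize_metrics(records)
print(summary)
```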

Implications and Speculations on Future Directions

The implications of this research are twofold: practically, it provides a framework for improving role-playing fidelity, and theoretically, it enriches our understanding of LLM behaviour under adversarial conditions. Together, RoleBreak and Narrator Mode point towards role-playing systems with greater creative potential and fidelity. Future research could explore more refined character management techniques and dynamically generated narratives to handle increasingly complex role-playing scenarios and to harden models against diverse and nuanced hallucination challenges.

In conclusion, this paper underscores the critical need to address character hallucination within role-playing systems. Through systematic analysis and innovative solutions, it lays a foundation for advancing the sophistication and reliability of LLM-driven role interactions.
