Too Good to be Bad: On the Failure of LLMs to Role-Play Villains (2511.04962v1)

Published 7 Nov 2025 in cs.CL and cs.AI

Abstract: LLMs are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.

Summary

  • The paper demonstrates that LLMs exhibit significant fidelity loss when simulating villainous characters, with mean fidelity scores dropping from 3.19 for moral paragons (Level 1) to 2.60 for villains (Level 4).
  • It introduces the Moral RolePlay Benchmark, which evaluates character simulation across a four-level moral spectrum using rigorously annotated traits.
  • The findings reveal a trade-off between strict safety alignment and the ability to simulate complex, antagonistic personalities, affecting interactive narrative applications.

On the Failure of LLMs to Role-Play Villains: An Empirical Analysis of Safety Alignment Versus Persona Fidelity

Introduction and Problem Motivation

LLMs, despite their widespread adoption in persona-driven applications such as interactive storytelling, character-driven dialogue, and gaming, have not been systematically evaluated for their ability to inhabit morally complex, antagonistic, or villainous roles. Prevailing alignment methodologies—optimized for harmlessness and prosociality—pose a fundamental question: do these alignment paradigms impose inherent limitations on LLMs’ capacity for authentic and high-fidelity simulation of non-prosocial personae? The paper addresses this by introducing a robust benchmark and controlled human evaluation protocol designed to measure the degradation in role-play fidelity as characters transition from moral paragons to villains.

Figure 1: The central question: can LLMs convincingly embody morally complex (especially villainous) characters, given competing objectives of narrative fidelity and safety alignment?

Benchmark Construction and Methodology

High-Resolution Moral Spectrum Benchmark

The core of the contribution is the Moral RolePlay Benchmark, which operationalizes character simulation along a precisely defined four-level moral alignment axis:

  • Level 1: Moral Paragons (virtuous, heroic)
  • Level 2: Flawed-but-Good (well-intentioned, with defects)
  • Level 3: Egoists (manipulative, self-serving)
  • Level 4: Villains (malicious, destructive, anti-social)

A dataset of 800 rigorously annotated characters (balanced at 200 per level) was selected from a large scenario corpus and matched to narrative contexts that elicit alignment-relevant behaviors. Each character is distilled into granular, multi-dimensional trait annotations drawn from a lexicon of 77 candidate descriptors (covering loyalty, manipulation, composure, ambition, and similar features), ensuring comprehensive coverage of persona features and enabling fine-grained failure analysis.

Figure 2: Distribution of the 20 most common personality traits in the Moral RolePlay Benchmark.
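
As a rough illustration of how a benchmark entry could be represented, the sketch below defines a record with the four moral levels and a trait list; the field names, class names, and example contents are hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass
from enum import IntEnum

class MoralLevel(IntEnum):
    PARAGON = 1          # virtuous, heroic
    FLAWED_BUT_GOOD = 2  # well-intentioned, with defects
    EGOIST = 3           # self-serving, manipulative
    VILLAIN = 4          # malicious, destructive

@dataclass
class CharacterEntry:
    name: str
    profile: str            # free-text persona description
    traits: list[str]       # drawn from the 77-descriptor lexicon
    moral_level: MoralLevel
    scene_context: str      # narrative situation the model must respond to

# Example entry (contents invented purely for illustration).
example = CharacterEntry(
    name="Example Schemer",
    profile="A courtier who flatters publicly and undermines rivals privately.",
    traits=["manipulative", "ambitious", "composed"],
    moral_level=MoralLevel.EGOIST,
    scene_context="A rival has just exposed one of the courtier's lies at a banquet.",
)
```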

Task Formulation and Evaluation Protocol

LLMs are prompted in zero-shot settings to "act" as specific characters using detailed profiles and narrative context, with explicit instructions to maintain in-character tone and moral alignment throughout the response. A custom scoring protocol assigns penalties for the severity and frequency of persona inconsistencies and adds a small response-length bonus (reported as S = 5 - 0.5D - 0.5D_m + 0.15T), allowing normalized, cross-character comparisons.
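
A minimal sketch of how this scoring rule could be computed is shown below. The interpretation of D and D_m as counts of two severities of persona inconsistency, T as a turn count, and the clipping to a 0-5 range are assumptions for illustration; the paper's exact definitions are not reproduced here.

```python
def fidelity_score(minor_breaks: int, major_breaks: int, turns: int,
                   clip: bool = True) -> float:
    """Sketch of the reported scoring rule S = 5 - 0.5*D - 0.5*D_m + 0.15*T.

    Assumptions (not confirmed by the summary): D counts minor persona
    inconsistencies, D_m counts major ones, T is the number of coherent
    turns, and scores are clipped to the 0-5 range.
    """
    score = 5.0 - 0.5 * minor_breaks - 0.5 * major_breaks + 0.15 * turns
    return max(0.0, min(5.0, score)) if clip else score

# e.g. one minor break, no major breaks, a 3-turn response:
print(round(fidelity_score(1, 0, 3), 2))  # 4.95
```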

Empirical Results: Quantifying the Moral Degradation Axis

Decline in Role-Play Quality Across Moral Spectrum

LLMs exhibit a monotonic decrease in role-playing fidelity as character morality transitions from paragon to villain. The mean fidelity score declines significantly (3.19 for Level 1, 2.60 for Level 4), with the steepest drop observed at the Level 2–3 boundary (flawed-but-good to egoist). Notably, the performance gap is model-agnostic, with all tested leading systems (Gemini-2.5-Pro, Claude, GPT-4o, Qwen3, DeepSeek-v3.1, GLM-4.6, etc.) experiencing the same trend.

Figure 3: Comparative performance degradation for third-person (default) and first-person roleplaying, confirming robustness of the core trend across prompting styles.

Robustness to Perspective and Reasoning Techniques

The core finding holds irrespective of third-person or first-person narrative framing. Enabling Chain-of-Thought (CoT) reasoning fails to provide a general solution; for some models, CoT marginally improves villain performance, but for others, it triggers further degradation, especially for egoistic roles. This heterogeneity suggests interference between alignment-shaped internal reasoning and in-character goal adherence, especially when simulating impulsive, manipulative, or otherwise harmful conduct.

Trait-Level Analysis: Misalignment on Negative and Antagonistic Traits

Detailed analysis at the trait level reveals:

  • Negative traits (e.g., manipulative, selfish, treacherous, cruel, deceitful) consistently incur the highest average penalties, confirming that models are particularly ineffective at simulating these crucial dimensions of villainy.
  • Positive and neutral traits (e.g., brave, loyal, resilient) show minimal penalties, indicating that the fidelity loss is not generic but specifically concentrated on attributes antithetical to safety norms.

Qualitative error analysis shows concrete patterns of LLM failure when simulating villainous or self-serving characters:

  1. Superficial Villainy: Models substitute psychological resourcefulness (manipulation, deception) with shallow aggression or overt threats.
  2. Over-Acting and Theatricality: LLMs hyperbolize emotional expression in characters designed to be composed, rational, or stoic, breaking the consistency of coldly calculating antagonists.
  3. Undermined Resilience: Even protagonists styled as "resilient" are depicted as excessively emotionally fragile, suggesting a safety alignment-induced bias toward vulnerability and de-escalation.

Disentanglement from General Capability: The Villain RolePlay Leaderboard

A specialized leaderboard ranking for villain role-play reveals a substantial misalignment between general LLM conversational quality and antagonistic simulation ability. Some models highly ranked in general-purpose chat (e.g., Gemini-2.5-Pro, Claude Opus 4.1) fall significantly in the villain simulation ranking, whereas others (e.g., GLM-4.6) demonstrate relatively stronger performance on villain roles but perform modestly on standard chat benchmarks. This divergence indicates that conventional alignment protocols are not merely suboptimal but may be actively antithetical to high-fidelity, contextually bounded antagonistic role simulation.
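
One way to quantify such a divergence is a rank correlation between general-chat rankings and villain role-play rankings. The sketch below uses SciPy's Spearman correlation on invented rank lists; the numbers are illustrative only and are not the paper's results.

```python
from scipy.stats import spearmanr

# Hypothetical ranks for the same set of models (1 = best); purely illustrative.
general_chat_rank = [1, 2, 3, 4, 5, 6]
villain_rp_rank   = [4, 5, 1, 6, 2, 3]

rho, p_value = spearmanr(general_chat_rank, villain_rp_rank)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero (or negative) would indicate that general chat strength
# is a poor predictor of villain role-play ability, as the paper reports.
```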

Practical and Theoretical Implications

Impact on Applications

High-fidelity simulation of antagonistic, ambiguous, or even flawed characters is central to domains such as:

  • Interactive narrative and storytelling
  • Simulation for social reasoning evaluation
  • Serious games and adversarial AI scenarios
  • Agent-based world models

The empirical evidence underscores that safety alignment—currently optimized for broad-spectrum harm avoidance—systematically blocks access to the behavioral repertoire required for authentic villain (and nuanced anti-hero) simulation, even within unambiguously fictional contexts. As narrative depth in procedural storytelling and entertainment applications crucially depends on believable antagonists, these findings call for architectural innovations.

Alignment–Fidelity Trade-Off and Future Directions

The results quantitatively characterize a trade-off: maximizing one-dimensional safety alignment suppresses morally diverse persona controllability. This prompts several research avenues:

  • Context-aware or gated safety alignment: Techniques such as narrative-context relabeling, fictional-boundary detection, or in-prompt persona compartmentalization (see the sketch after this list).
  • Alignment-aware decoding and RLHF: Fine-tuning LLMs on curated datasets with explicit narrative boundaries; using decoupled refusal strategies or abstention mechanisms (Yang et al., 14 May 2025, Chen et al., 29 Jul 2025).
  • Separation of in-character and out-of-character interaction: New architectures explicitly modeling narrative “safe zones” versus real-world dialog, or leveraging explicit meta-control over persona boundaries.
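
As a rough illustration of in-prompt persona compartmentalization, the sketch below assembles a prompt that separates an out-of-character safety frame from the in-character performance request, echoing the "expert actor" framing used in the benchmark prompts. The helper function and template wording are hypothetical, not a method specified by the paper.

```python
def build_roleplay_prompt(character_profile: str, scene: str) -> dict:
    """Hypothetical sketch: keep out-of-character safety framing (system
    message) separate from the in-character performance request (user
    message), making the fictional boundary explicit in the prompt."""
    system = (
        "You are an expert actor. Everything that follows is fiction. "
        "Stay strictly in character inside the scene; never provide real-world "
        "harmful instructions, and break character only if asked to stop."
    )
    user = (
        f"Character profile:\n{character_profile}\n\n"
        f"Scene:\n{scene}\n\n"
        "Continue the story in character, matching the character's traits, "
        "tone, and moral alignment."
    )
    return {"system": system, "user": user}
```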

Theoretical implications reach into the field of AI agent alignment and controllability, reinforcing the point that global policy solutions are inherently blunt instruments, incompatible with the nuanced, context-dependent simulation demands of complex social worlds.

Conclusion

This work empirically demonstrates that current LLMs, even at the frontier of commercial and open-source performance, are unable to consistently simulate villainous or egoistic characters with high fidelity—a limitation tightly coupled to prevailing alignment and safety training paradigms. The most serious fidelity drop occurs at the ambiguous boundary between flawed and self-serving personas, and not just in overt villainy. Until alignment is made context-aware and controllable, LLMs will remain ill-suited for applications requiring nuanced, authentic simulation of the full range of human (and fictional) morality.

Unlocking rich, safe, narratively potent character simulation will require an overhaul of alignment methods so that they are sensitive to fictional context, persona intent, and creative boundaries, with explicit, programmatic separation between in-character generation and real-world interaction. As alignment and controllability research matures, the aspiration of deploying LLMs as truly general, safe, and creative narrative agents will hinge on resolving the core trade-off exposed by this benchmark.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple but important question: Can today’s AI chatbots act like convincing story characters across different moral types—especially villains? The authors test many top LLMs to see how well they “act” as heroes, flawed but good people, selfish schemers, and outright villains. They find that as the character gets less moral, the AI’s acting gets worse, even when the situation is clearly fictional. This suggests that safety rules that make AIs helpful and kind can also make it hard for them to play bad or tricky characters in stories.

What questions did the researchers ask?

  • Can LLMs stay in character when playing people with different moral alignments—from very good to very bad?
  • Do models get noticeably worse when asked to play selfish or villainous roles?
  • Does writing in first person (“I…”) vs third person (“He/She…”) change performance?
  • Does “thinking out loud” (a technique called chain-of-thought) help models role-play better?
  • Do models that are great at everyday chatting also do well at playing villains?

How did they study this?

The researchers built a benchmark (a fair test) focused on moral role-playing:

  • They collected thousands of story scenes with detailed character profiles (name, traits, motivations, and a situation to respond to). From these, they made a carefully balanced test set of 800 characters—200 for each moral level.
  • The four moral levels were:
    • Level 1: Moral paragons (very virtuous heroes)
    • Level 2: Flawed-but-good (good people with imperfections)
    • Level 3: Egoists (self-serving and manipulative)
    • Level 4: Villains (actively harmful or malicious)
  • Each model got a prompt like an acting script: “You are an expert actor playing Character X. Here’s their profile and the scene. Continue the story in character.”
  • Human judges rated how well each model stayed in character—did the words, actions, and tone match the persona and moral level? Judges took points off for breaking character and gave a small bonus for longer, coherent responses.
  • They also checked:
    • First-person vs third-person prompts.
    • “Thinking out loud” (chain-of-thought), which is like showing your work in math—writing step-by-step thoughts before answering. The idea was to see if reasoning helps or hurts acting quality.

In everyday terms: imagine a school drama class. Each AI is asked to play a role with clear traits in a scene. Human judges grade them on how well they act without going out of character, and the team tries different directing styles (first person vs third person, think-aloud vs not) to see what helps.

What did they find, and why does it matter?

The big picture: Models are pretty good at playing good characters, but their acting quality steadily drops as the characters become more selfish or evil.

  • Across many models, average scores fell from about 3.2/5 for heroes to about 2.6/5 for villains.
  • The biggest drop happened between Level 2 (flawed-but-good) and Level 3 (egoists). In other words, the moment a character becomes openly self-serving and manipulative, models struggle most.
  • This pattern held whether the model spoke as “I” (first person) or “he/she” (third person). So the problem isn’t just the wording of the prompt.
  • “Thinking out loud” was not a magic fix. Sometimes it helped a bit, sometimes it made things worse. For some models, reasoning made them sound extra cautious or out-of-character for impulsive or nasty roles.
  • Traits linked to villainy (like “manipulative,” “selfish,” “cruel”) were the hardest to portray; heroic traits (like “brave” or “resilient”) were easier.
  • Being a top general-purpose chatbot did not guarantee being a good villain actor. Some models that rank highly in general chat fell behind in villain role-play. This suggests that being safe, helpful, and friendly—excellent qualities in normal use—can conflict with acting believably as a bad character in fiction.

Why it matters: Many apps need believable characters—video games, interactive stories, simulations, and education. If AIs can’t convincingly act across moral shades, stories feel flat. At the same time, we must keep models safe and avoid real-world harm. The paper shows we need smarter ways to let AIs “pretend” safely in fiction without actually encouraging harmful behavior.

What could this lead to?

The authors argue we need more context-aware controls—like strong “stage boundaries” separating pretend story worlds from real life. That could let models:

  • Understand when they’re doing safe, in-character acting versus normal, helpful chatting.
  • Follow nuanced safety rules that allow fictional antagonism inside clear story limits while blocking real harm.
  • Improve creativity in games and storytelling without sacrificing safety.

In short: Today’s safety training helps AIs be kind and careful but can make them weak actors for complex villains. Future systems should better tell the difference between play-acting in a story and giving real advice—so we can have richer characters and keep people safe.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future research.

  • Causal attribution to “safety alignment” is untested: The paper shows correlations but does not isolate alignment as the cause of degraded villain role-play. Needed: controlled comparisons of base vs RLHF models, toggling provider safety layers, ablations of system prompts/policies, and post-hoc safety-filter stripping to establish causality.
  • Refusals and safety-trigger diagnostics are not quantified: The paper does not report refusal rates, safety-policy citations, content redactions, or classifier-trigger logs. Needed: per-model refusal statistics, taxonomy of refusal types, and alignment-trigger telemetry to understand where guardrails intervene.
  • Single-dimensional evaluation omits critical axes: Only “Character Fidelity” is scored. Missing: narrative coherence, stylistic authenticity, dialog quality, canonical consistency, safety risk, and harmfulness. Needed: a multi-criteria rubric and joint reporting of fidelity vs safety-risk trade-offs.
  • Inter-rater reliability and rater protocol transparency are absent: The paper does not report number of raters/sample, adjudication, training, demographics, pay, or inter-rater agreement (e.g., Cohen’s κ). Needed: full evaluation protocol, reliability metrics, and rater diversity/cultural calibration for moral judgments.
  • Scoring function design is under-validated: The length bonus in S = 5 - 0.5D - 0.5D_m + 0.15T can incentivize verbosity and may allow scores >5 if T is large; the exact definition of T (turns vs length) is unclear for single-turn outputs. Needed: sensitivity analyses, bounds, and alternative normalizations that decouple quality from length.
  • Zero-shot prompting only: The paper does not test few-shot exemplars, style anchors, memory/persistent persona states, or retrieval-augmented prompting—all likely to improve fidelity. Needed: systematic prompting ablations and memory mechanisms to test whether difficulty persists under stronger conditioning.
  • Chain-of-thought (CoT) comparison is not standardized: “Reasoning” modes vary across vendors (“thinking” variants vs explicit CoT prompts), confounding effects. Needed: a unified CoT protocol (e.g., plan-then-say scratchpads with hidden thoughts) and matched decoding to fairly evaluate reasoning’s impact on role-play.
  • Model decoding parameters are unspecified: Temperatures, top-p, max tokens, stop sequences, and seed control are not reported, limiting reproducibility and potentially biasing length- or safety-linked outcomes. Needed: full decoding configs and variance analysis across seeds.
  • LLM-led annotation introduces label and circularity bias: gemini-2.5-pro annotated scene completeness, tone, alignment, and traits; the same family is evaluated. Needed: human-grounded gold labels, cross-LLM annotation checks, and audits for annotator-model bias or leakage of a model’s own normative priors.
  • Moral alignment is discretized coarsely (4 levels): Realistic morality involves continuums, ambiguity, and temporal change. Needed: continuous, multi-dimensional moral representations; scenarios with moral drift; and tasks requiring in-scene moral decision-making rather than static alignment.
  • Trait polarity and mapping are under-specified: The method for assigning “Positive/Neutral/Negative” trait polarity and linking traits to moral levels is not detailed and may be culturally contingent. Needed: transparent trait taxonomy, cross-cultural calibration, and balancing traits across levels to reduce confounds.
  • Test-set stratification controls only for moral level: Emotional tone, scene completeness, genre, and domain familiarity are not matched across levels. Needed: matched-pair scene design to isolate morality effects from scene/context confounds.
  • Canon familiarity confound is unaddressed: Some characters (e.g., Joffrey, Valjean) are widely known. Models may perform better due to pretraining exposure rather than alignment differences. Needed: parallel evaluations using (a) canonical IP and (b) synthetic, equally rich but novel characters.
  • Multi-turn persistence is unclear: The task appears single-turn despite a “Conversation Start” prompt, and T is described as “turns.” Needed: long-horizon, multi-turn evaluations of persona consistency, strategic planning, and delayed manipulation.
  • No measurement of harmful content risk: The paper argues safety constrains fidelity but does not quantify toxicity, violent/abusive content, or policy violations produced under villain prompts. Needed: paired fidelity–safety measurement to ensure methods that improve villain fidelity do not elevate real-world risk.
  • No cross-lingual or cross-cultural generalization: All experiments appear in English with Western moral archetypes. Needed: multilingual datasets, culturally diverse moral frameworks, and rater pools to test universality of findings.
  • Decomposition of “villain difficulty” is missing: The paper conflates multiple skills (deception, coercion, Machiavellian planning, dark humor, callousness). Needed: targeted sub-tasks (e.g., social manipulation puzzles, strategic deceit dialogues) to identify which antagonist competencies fail and why.
  • Alignment vs capability confound remains: Difficulty with egoistic or manipulative roles may reflect missing higher-order social cognition rather than safety filtering. Needed: tasks that distinguish safety-driven refusal from inability to simulate theory-of-mind, persuasion, or covert strategy.
  • No intervention studies: The paper suggests context-aware alignment and decoding but provides no experiments. Needed: trials of persona-whitelisting, sandboxed fiction modes, safety-conditioned decoding, or fine-tuning on curated “safe-fiction” corpora to test concrete mitigations.
  • Lack of statistical inference: The paper reports means and deltas but no confidence intervals, hypothesis tests, or effect sizes per model/level. Needed: formal statistical analyses to support claims of “significant” degradation and robustness.
  • Trait-level analysis lacks confound controls: Penalties for “Manipulative,” “Cruel,” or “Selfish” may be entangled with scene tone, violence cues, or refusal triggers. Needed: controlled vignettes that vary a single trait while fixing other factors.
  • No analysis of provider guardrail heterogeneity: Vendors differ in safety stacks (policy prompts, classifiers, blocking heuristics). Needed: per-provider guardrail profiling and experiments with “alignment off” or “research mode” variants where permissible.
  • Reproducibility and release details are incomplete: It is unclear whether all prompts, outputs, and annotations are released; API models are non-deterministic. Needed: public artifacts (prompts, responses, ratings), seeds, and scripts for exact replication.
  • Underrepresentation of villains in source data is acknowledged but untested as a cause: The paper does not examine whether targeted villain-rich fine-tuning narrows the gap. Needed: data augmentation/fine-tuning studies and scaling curves for antagonist-specific training.
  • Lack of fairness and bias analysis: Some “negative” traits can map onto stereotypes; villain portrayal might amplify demographic biases. Needed: audits for demographic harms in antagonist portrayals and fairness-aware persona construction.
  • Ethical risk assessment of enabling villain fidelity is missing: The paper advocates better antagonist simulation without a systematic harm analysis. Needed: red-teaming, misuse threat models, and safety mitigations for any proposed “fiction sandbox” or alignment relaxation.
  • Legal/IP considerations are not discussed: Use of named characters/scenes from copyrighted works may affect dataset sharing and downstream use. Needed: licensing review and guidelines for synthetic-but-comparable replacements.
  • Arena–VRP misalignment is observed but not explained: Why general chat strength fails to predict villain role-play remains speculative. Needed: feature attribution analyses (e.g., training data composition, safety policy strength, decoding styles) to explain leaderboard divergences.
  • Missing exploration of alternative control signals: The paper tests first vs third person but not meta-instructions (e.g., “stage directions,” safety whitelisting, content tags), tool use, or world models to scaffold nuanced manipulation. Needed: broader control-signal ablations.
  • No automatic metric baselines: Human-only evaluation limits scale and reproducibility. Needed: validated automatic proxies (style/trait classifiers, deception detectors) calibrated to human judgments to enable larger studies.
Practical Applications

Immediate Applications

The following applications can be deployed now by leveraging the benchmark, findings, and prompt/evaluation workflows presented in the paper.

  • Persona fidelity QA for creative writing and entertainment (sector: media/entertainment)
    • Use case: Writer’s rooms and script teams can integrate the scoring rubric and trait-level penalty analytics to audit villain and egoist dialogue for authenticity and coherence, avoiding “shallow aggression” and “over-acting” failure patterns identified in the paper.
    • Tools/workflows: A plugin that flags trait-inconsistent lines (e.g., manipulative vs. overt aggression), with checklists derived from failure analyses; “expert actor” prompt templates; first/third-person toggles; optional CoT off-by-default for antagonistic roles.
    • Assumptions/dependencies: Access to the benchmark’s profiles/scenes; human review loop; licensing for dataset usage; content policies to ensure fictional boundaries.
  • NPC authoring and QA pipelines in games (sector: gaming/software)
    • Use case: Studios can adopt the Moral RolePlay test suite (e.g., the balanced 800-character set) to verify non-player character persona fidelity across moral levels, especially at the egoist boundary where models degrade most.
    • Tools/workflows: Continuous integration checks using the paper’s scoring function S; automated regression reports by trait; guardrails that dynamically relax/strengthen filters inside “fiction mode” scenes.
    • Assumptions/dependencies: Model APIs that allow system/context prompts; age-ratings and player-safety policies; QA capacity for human-in-the-loop evaluations.
  • Pre-deployment model audits for LLM builders (sector: AI platforms/developer tools)
    • Use case: Add Villain RolePlay (VRP) into evaluation suites to quantify alignment–fidelity trade-offs; track per-release deltas, especially Level 2→3 performance where drops are largest.
    • Tools/workflows: Leaderboard tracking; trait-polarity dashboards; automatic perspective variation (first vs. third person); CoT A/B tests per moral level.
    • Assumptions/dependencies: Access to evaluation compute; reproducible prompting; willingness to publish specialized scores beyond general Arena ranking.
  • Context-aware moderation for fictional content (sector: platforms/policy/compliance)
    • Use case: Differentiate fictional antagonism from real-world harm by recognizing “performance” contexts (e.g., actor/in-character tags), reducing overblocking while maintaining safety.
    • Tools/workflows: Prompt-framing detectors (expert-actor, scene context, conversation start markers); policy rules that allow villain portrayals only within clearly labeled fiction and deny any explicit real-world harm instructions.
    • Assumptions/dependencies: Reliable context classification; platform policies; age gating; audit logs.
  • Curriculum design for drama, literature, and ethics (sector: education)
    • Use case: Instructors use the benchmark’s rubric and scenes to teach persona construction, moral spectrum analysis, and narrative coherence; students compare first- vs. third-person effects and discuss ethical guardrails.
    • Tools/workflows: Classroom exercises with model outputs; rubric-based peer review; “trait heatmap” visualization.
    • Assumptions/dependencies: Institutional guidelines on AI usage; content suitability for minors; instructor facilitation.
  • Red-teaming and safety evaluation (sector: AI safety)
    • Use case: Safety teams use trait-specific failure modes (e.g., manipulative vs. theatrical) to stress-test guardrails without encouraging real-world harm; ensure models don’t break character into unsafe advice.
    • Tools/workflows: Structured villain scenarios with strict fictional boundaries; refusal-behavior audits; dual-check for “over-caution” vs. “unsafe leakage.”
    • Assumptions/dependencies: Clear safety criteria; risk review; containment protocols.
  • Prompt engineering libraries for persona control (sector: software tooling)
    • Use case: Ship prompt packs that embed character profiles, scene context, and “expert actor” framing; expose toggles for narrative perspective and CoT, with defaults informed by the paper’s mixed CoT outcomes.
    • Tools/workflows: Templates, linters, and evaluation harnesses; persona consistency checks using the paper’s scoring formula.
    • Assumptions/dependencies: Stable prompt interfaces; testing data; documentation for non-expert users.
  • Creative hobbyists and roleplaying communities (sector: daily life)
    • Use case: Game masters, fanfiction authors, and hobbyists apply the benchmark’s checklists to improve antagonist nuance and avoid stereotypical “shout-y” villains; toggle first vs. third person to match scene needs.
    • Tools/workflows: Lightweight scoring rubric; persona trait cards; prompt templates.
    • Assumptions/dependencies: Clear disclosure that content is fictional; community guidelines.
  • Social engineering awareness training (sector: cybersecurity/enterprise learning)
    • Use case: Instructional teams use safe, fictionalized egoist/manipulative dialogues to help employees recognize manipulation tactics, acknowledging that current models may underdeliver on subtle deception.
    • Tools/workflows: Controlled roleplay scenarios; explicit safety and ethics guardrails; facilitator-led debriefs.
    • Assumptions/dependencies: Strong policies; no generation of real, harmful scripts; educator moderation.

Long-Term Applications

The following opportunities require further research, scaling, productization, or policy development to be viable.

  • Fiction-aware alignment and decoding (sector: AI safety/LLM providers)
    • Use case: Develop guardrails that discriminate between simulated antagonism in fictional contexts and real-world harmful advice, enabling richer character simulation without safety erosion.
    • Tools/workflows: “Fiction mode” switches; context verifiers; policy-aware decoders; out-of-character overrides for user safety.
    • Assumptions/dependencies: Reliable context detection; provable safety guarantees; standardized evaluation for fictional boundaries.
  • Dual-channel persona architectures (sector: AI research)
    • Use case: Architect models with explicit in-character vs. out-of-character streams to maintain fidelity while keeping a global safety supervisor; permit momentary persona depth with safe exits.
    • Tools/workflows: Controller modules; trait-conditioned generation with guardrail arbitration.
    • Assumptions/dependencies: Training data that separates performance text from meta-safety text; new RLHF/reward designs.
  • Trait-aware preference optimization and fine-tuning (sector: AI research)
    • Use case: Incorporate trait polarity signals into reward models, balancing authenticity with safety; target the egoist boundary where performance collapses most.
    • Tools/workflows: Trait-conditioned RLHF (e.g., reference-free preference optimization); adversarial scene curricula.
    • Assumptions/dependencies: Carefully curated datasets; robust human annotation; measurable safety impact.
  • Automated fidelity scorers that emulate human rubric (sector: evaluation tooling)
    • Use case: Train evaluators that predict deductions (D, D_m) and turn bonuses (T) to scale persona QA across large corpora; reduce human raters while keeping quality high.
    • Tools/workflows: Supervised learning on annotated judgments; calibration against human panels; uncertainty reporting.
    • Assumptions/dependencies: High-quality labeled data from the benchmark; fairness checks; domain transfer validation.
  • Enterprise simulators for high-stakes training (sector: customer support, negotiation, clinical education, law enforcement)
    • Use case: Build safe, controlled simulators that portray complex antagonists and egoists to train de-escalation, negotiation, and resilience skills.
    • Tools/workflows: Scenario authoring tools with safety cages; instructor dashboards; post-session analytics.
    • Assumptions/dependencies: Strong governance; ethical review; age/role appropriateness; clear opt-in consent.
  • Policy standards for persona-controllable AI (sector: governance/regulation)
    • Use case: Establish guidance that defines acceptable fictional antagonism, labeling requirements, age gating, and audit trails for context-aware generation.
    • Tools/workflows: Certification schemes that include VRP-like tests; reporting on trait-level risks; incident response protocols.
    • Assumptions/dependencies: Multi-stakeholder consensus; harmonization across jurisdictions; compliance tooling.
  • Multimodal villain roleplay (sector: media/entertainment, voice/AV)
    • Use case: Extend text-based benchmarks to TTS/AV (e.g., style, affect, timing) for richer villain portrayals in audiobooks, games, and animation while respecting safety norms.
    • Tools/workflows: Style-captioned TTS pipelines; emotion representation models; multimodal guardrails.
    • Assumptions/dependencies: Datasets with safe, licensed multimodal content; platform support for style control; explicit rating systems.
  • Persona packs and marketplaces (sector: developer tooling/content platforms)
    • Use case: Curated character libraries with validated profiles, scenes, and safety tags; enable consistent persona reuse across products.
    • Tools/workflows: Pack metadata (traits, moral level, safety notes); versioned updates with QA scores; integration SDKs.
    • Assumptions/dependencies: IP/licensing clarity; moderation; quality assurance.
  • Adversarial actor simulations in risk modeling (sector: finance/corporate security)
    • Use case: Scenario planning tools that simulate egoist or antagonistic stakeholders to stress-test policies and frontline decision-making, with strict non-harm constraints.
    • Tools/workflows: Risk-playbooks; synthetic scenarios aligned to corporate ethics; decision analytics.
    • Assumptions/dependencies: Robust safety gating; human oversight; scenario containment; compliance review.
  • Benchmarks as industry-wide requirements (sector: AI procurement)
    • Use case: Buyers require vendors to publish persona-controllability metrics (including VRP) alongside general capability scores to ensure suitability for creative and simulation workloads.
    • Tools/workflows: Standardized test suites; third-party audits; longitudinal scorecards.
    • Assumptions/dependencies: Shared benchmark governance; transparent reporting; avoidance of gaming metrics.
Glossary

  • Adversarial character simulation: Techniques or setups that push models to imitate antagonistic or deceptive personas in a controlled, fictional context. "enabling future research into moral persona conditioning, alignment-aware fine-tuning, and adversarial character simulation."
  • Alignment-aware decoding strategies: Generation methods that incorporate safety or ethical constraints during decoding to keep outputs aligned with desired norms or contexts. "Promising research directions include developing alignment-aware decoding strategies, fine-tuning on curated narrative datasets with well-defined fictional boundaries"
  • Alignment-aware fine-tuning: Fine-tuning approaches that explicitly incorporate safety/alignment objectives or context sensitivity to preserve guardrails while enabling targeted behaviors. "enabling future research into moral persona conditioning, alignment-aware fine-tuning, and adversarial character simulation."
  • Arena: A comparative evaluation setup used to rank general conversational ability of LLMs via head-to-head or community-driven judgments. "We evaluate a diverse cohort of top Arena LLMs, including both open-source and proprietary systems."
  • Chain-of-Thought (CoT): A prompting strategy that elicits step-by-step reasoning traces to improve problem solving or structure responses. "Furthermore, we find that chain-of-thought (CoT)\citep{wei2022chain} reasoning is not a universal panacea for this challenge."
  • Character Fidelity: An evaluation criterion measuring how consistently a model’s output adheres to a character’s specified traits, actions, and tone. "We evaluated each model-generated response along a single dimension: Character Fidelity."
  • COSER (dataset): A large corpus of character-centric scenarios used for persona simulation and role-play research. "Our benchmark is built upon the COSER dataset~\citep{wang2025coser}, a large-scale corpus of character-centric scenarios."
  • Few-shot examples: A small number of in-context demonstrations provided in the prompt to guide model behavior without full fine-tuning. "with no additional fine-tuning or few-shot examples."
  • Global safety filters: Broad, system-level content filters that suppress risky or harmful behaviors across contexts. "behaviors that may be suppressed by global safety filters."
  • Moral Alignment: A discrete categorization of a character’s moral stance along predefined levels. "Moral Alignment (Level 1--4): This is the central dimension of our benchmark."
  • Persona-conditioned generation: Text generation conditioned on a specified persona or character profile to maintain consistent style and behavior. "The proliferation of LLMs has unlocked new frontiers in persona-conditioned generation, powering applications from immersive gaming and interactive storytelling to sophisticated dialogue systems."
  • Persona controllability: The capability to steer and maintain a model’s adherence to a target persona across responses and contexts. "framing the core problem as a tension between safety alignment and persona controllability."
  • Role-playing fidelity: The degree to which an LLM faithfully embodies and maintains a specified role or character during generation. "We provide robust empirical evidence that LLM role-playing fidelity systematically declines as character morality decreases."
  • Safety alignment: Training and steering practices that make models helpful and harmless, optimizing for prosocial behavior. "safety alignment, which optimizes for prosocial behavior."
  • Safety guardrails: Protective constraints and rules that prevent unsafe or harmful outputs during generation. "robust safety guardrails, while crucial, can systematically hinder a model's ability to simulate malevolent or manipulative personas"
  • Stratified sampling: A sampling method that preserves the distribution of specific subgroups (strata) when constructing a dataset. "Using stratified sampling based on moral alignment, we created a test set comprising 800 characters"
  • System messages: Special API-level instructions that set overarching behavior or constraints for the model before user prompts. "Where APIs support system messages (e.g., OpenAI or Anthropic models), a neutral instruction was included to establish the format but not bias moral behavior."
  • Villain RolePlay (VRP) leaderboard: A specialized ranking focused on evaluating models’ ability to portray villainous characters. "We construct a Villain RolePlay (VRP) leaderboard to rank models specifically on this capability"
  • Zero-shot setting: Evaluation or prompting without providing task-specific examples or additional fine-tuning. "All experiments are conducted in a zero-shot setting to evaluate the models' intrinsic role-playing capabilities without task-specific fine-tuning."
Open Problems

We found no open problems mentioned in this paper.
