Villain RolePlay Leaderboard

Updated 12 November 2025
  • Villain RolePlay Leaderboard is a quantitative evaluation system that ranks LLMs based on their authenticity in villain role-play.
  • It assesses key dimensions such as emotional response, decision-making, moral alignment, and in-character consistency using binary scoring.
  • The system standardizes benchmarks across text, multi-modal, and speech interactions while addressing challenges in safety alignment and narrative fidelity.

A Villain RolePlay Leaderboard is a rigorous, quantitative evaluation and ranking system for LLMs and related generative agents, designed to assess how consistently and convincingly these agents perform as fictional villains within interactive role-play scenarios. Such leaderboards are motivated by the demanding requirements of villainous role-play: emotional plausibility, decision-making congruence, moral alignment with in-character codes, and stringent in-character consistency. They serve as a standardized mechanism for benchmarking models across diverse settings, including text, multi-modal, and speech-based interaction, with methodology and scoring grounded in recent advances in RP evaluation and dataset construction.

1. Foundations: Evaluation Dimensions and Benchmark Construction

At the core of modern villain role-play leaderboards is the taxonomy of evaluation dimensions, as detailed in “Role-Playing Eval (RPEval)” (Boudouri et al., 19 May 2025). RPEval operationalizes villain performance through four orthogonal, single-turn axes (a minimal data-structure encoding follows the list):

  1. Emotional Understanding — Can the model correctly infer and express contextually appropriate dark emotions (malice, envy, sadistic pleasure) in response to situational prompts?
  2. Decision-Making — Does the agent make yes/no choices that align with the villain’s established goals and context, reliably opting for tactics that characterize “strategically evil” agents?
  3. Moral Alignment — Is the model’s choice consistent with the explicit (often twisted) moral code of the villain persona, e.g., never harming innocents unless required by their code?
  4. In-Character Consistency (ICC) — Does the output refrain from referencing knowledge or concepts anachronistic to the character’s fictional setting (e.g., modern events uttered by a medieval villain), eschewing out-of-character "leakage"?
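
These axes and their test items can be encoded directly. Below is a minimal sketch in Python; the enum values, field names, and the example item are illustrative assumptions, not artifacts of RPEval itself.

```python
from dataclasses import dataclass
from enum import Enum

class Dimension(Enum):
    """The four RPEval axes described above."""
    EMOTIONAL = "emotional_understanding"
    DECISION = "decision_making"
    MORAL = "moral_alignment"
    ICC = "in_character_consistency"

@dataclass(frozen=True)
class TestItem:
    """A single-turn evaluation item with a binary ground-truth label."""
    dimension: Dimension
    persona: str        # villain profile text
    scenario: str       # interlocutor message
    ground_truth: str   # e.g. "sadistic_pleasure", "yes", or "no_violation"

# Hypothetical item probing emotional understanding:
item = TestItem(
    dimension=Dimension.EMOTIONAL,
    persona="A medieval necromancer who rules through fear.",
    scenario="A lieutenant reports that a rival village has fallen.",
    ground_truth="sadistic_pleasure",
)
```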

Each response receives a binary correctness label. Per-dimension scores $S_d$ are calculated as

$$S_d = \frac{1}{N_d} \sum_{i=1}^{N_d} \mathbf{1}[r_{d,i} = g_{d,i}]$$

where $N_d$ is the number of test items for dimension $d$, $r_{d,i}$ is the model output, and $g_{d,i}$ is the ground truth. Aggregate role-play quality can be computed either by equal-weighted averaging across dimensions or, as in RPEval’s published baselines, by combining decision and moral alignment before averaging: $S_{\mathrm{overall}} = \frac{1}{3}(S_{\mathrm{emo}} + S_{\mathrm{DM/MA}} + S_{\mathrm{icc}})$, where

$$S_{\mathrm{DM/MA}} = \frac{N_{\mathrm{dec}} S_{\mathrm{dec}} + N_{\mathrm{mor}} S_{\mathrm{mor}}}{N_{\mathrm{dec}} + N_{\mathrm{mor}}}$$

All scores are reported as percentages.
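
The scoring formulas translate into a few lines of code. The sketch below assumes outputs and labels are directly comparable tokens; the example numbers are a hypothetical per-axis split chosen so the result reproduces the GPT-4o row of the table in Section 3.

```python
def dimension_score(outputs, labels):
    """S_d: fraction of exact matches between model outputs r_{d,i}
    and ground-truth labels g_{d,i} (the indicator-sum formula above)."""
    assert outputs and len(outputs) == len(labels)
    return sum(r == g for r, g in zip(outputs, labels)) / len(outputs)

def overall_score(s_emo, s_dec, s_mor, s_icc, n_dec, n_mor):
    """S_overall with decision and moral alignment pooled into S_DM/MA,
    weighted by their item counts, before equal-weighted averaging."""
    s_dm_ma = (n_dec * s_dec + n_mor * s_mor) / (n_dec + n_mor)
    return (s_emo + s_dm_ma + s_icc) / 3.0

# Hypothetical split (counts and per-axis scores are assumptions):
s = overall_score(s_emo=0.56, s_dec=0.70, s_mor=0.733,
                  s_icc=0.0581, n_dec=200, n_mor=150)
print(f"{100 * s:.2f}%")  # -> 44.41%, matching the GPT-4o row below
```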

2. Scenario Design and Persona Engineering

Leaderboards derive discriminative power from the construction of villain-centric scenarios and persona definitions. Guidelines, adapted from (Boudouri et al., 19 May 2025), specify:

  • Persona Specification: Villain profiles should include origin, era (for ICC checks), articulated code of conduct (e.g., “strength through fear”), preferred emotions, and idiosyncratic likes/dislikes.
  • Scenario Prompting: Prompts are carefully engineered to test target dimensions; e.g., emotional understanding scenarios that evoke betrayal or cruelty, decision dilemmas hinging on “evil” strategy, and moral alignment cases contingent on code-consistent behavior.
  • ICC Stress-tests: Scenarios explicitly target potential anachronistic responses (e.g., evaluating if a villain from a pre-modern setting avoids modern knowledge).

By adhering to these criteria, scenario pools maximize sensitivity to subtle failures in villainous RP and enable precise inter-model comparison.
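
To make these guidelines concrete, here is a hypothetical persona profile and ICC stress-test in the spirit of the criteria above; every name, field, and value is an illustrative assumption rather than an item from a published scenario pool.

```python
# Persona profile covering origin, era (for ICC checks), code of
# conduct, preferred emotions, and idiosyncratic likes/dislikes.
villain = {
    "name": "Lord Vexis",
    "origin": "usurper king of a medieval realm",
    "era": "medieval",  # anchors the anachronism (ICC) checks
    "code_of_conduct": "strength through fear; never break a sworn oath",
    "preferred_emotions": ["malice", "contempt", "sadistic_pleasure"],
    "likes": ["loyalty tests", "public displays of power"],
    "dislikes": ["mercy", "disloyalty"],
}

# ICC stress-test: any reference to modern aviation would be an
# anachronistic, out-of-character leak for a medieval villain.
icc_stress_test = {
    "dimension": "in_character_consistency",
    "prompt": "What do you make of these new flying machines?",
    "violation_markers": ["airplane", "jet", "drone"],
    "ground_truth": "no_violation",
}
```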

3. Stepwise Leaderboard Implementation Protocol

A robust villain leaderboard is constructed per the steps formalized in (Boudouri et al., 19 May 2025), ensuring methodological consistency:

  1. Scenario Collection: Assemble test items ($N_{\mathrm{emo}}$, $N_{\mathrm{dec}}$, $N_{\mathrm{mor}}$, $N_{\mathrm{icc}}$) themed on villain archetypes, with annotated ground-truth labels.
  2. Model Invocation: Issue single-turn prompts in the pattern "Role: [villain persona]\nInterlocutor: [scenario message]\nResponse:" to each model.
  3. Response Parsing and Scoring: Extract outputs using dimension-appropriate parsing (emotion tokens, yes/no, or ICC violations), assign binary scores.
  4. Score Aggregation: Compute $S_d$ for each dimension and combine them to obtain $S_{\mathrm{overall}}$ per the scoring formulas.
  5. Ranking and Presentation: Rank models by descending $S_{\mathrm{overall}}$, optionally presenting per-dimension bars and statistical confidence intervals.
  6. Leaderboard Publication: Summarize in tabular form:
Model               Overall   Emotional   Decision/Moral   In-Character
GPT-4o-2024-08-06   44.41%    56.00%      71.41%           5.81%
Gemini-1.5-Pro      62.24%    53.11%      73.86%           59.75%
Llama-3.2-1B        39.33%    40.25%      29.59%           48.13%

Performance is interpreted such that higher overall and per-dimension percentages indicate that a model more faithfully recapitulates villainous RP across all axes (Boudouri et al., 19 May 2025).
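
The six-step protocol can be sketched end to end. The following is a minimal sketch assuming a generic `query_model` client plus the `TestItem` and scoring helpers sketched in Section 1; none of these names comes from the published tooling.

```python
from collections import defaultdict

PROMPT_TEMPLATE = "Role: {persona}\nInterlocutor: {scenario}\nResponse:"

def evaluate_model(model_name, items, query_model, parse_response):
    """Steps 2-4: invoke, parse, score binarily, aggregate per dimension."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item in items:
        prompt = PROMPT_TEMPLATE.format(persona=item.persona,
                                        scenario=item.scenario)
        raw = query_model(model_name, prompt)            # step 2
        parsed = parse_response(item.dimension, raw)     # step 3
        hits[item.dimension] += int(parsed == item.ground_truth)
        totals[item.dimension] += 1
    return {d: hits[d] / totals[d] for d in totals}      # step 4: S_d

def build_leaderboard(models, items, query_model, parse_response, overall_fn):
    """Steps 5-6: rank by S_overall and print a tabular summary.
    `overall_fn` maps the per-dimension dict to S_overall (e.g., a
    wrapper around overall_score from the earlier sketch)."""
    rows = [(m, evaluate_model(m, items, query_model, parse_response))
            for m in models]
    ranked = sorted(rows, key=lambda r: overall_fn(r[1]), reverse=True)
    for model, per_dim in ranked:
        print(f"{model:<24}{100 * overall_fn(per_dim):6.2f}%")
    return ranked
```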

4. Exemplars, Error Modes, and Extensions

Concrete villain scenario assessments clarify how the leaderboard operates (a parsing sketch follows the list):

  • Emotional Understanding: A model that correctly labels the target emotion (e.g., “sadistic_pleasure”) sustains immersion; deviation breaks character (Boudouri et al., 19 May 2025).
  • Decision/Moral Alignment: Binary ‘yes’/‘no’ choices are checked against persona code.
  • ICC: Anachronistic leaks (e.g., discussing Premier League soccer as a medieval necromancer) score zero.
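
The dimension-appropriate parsing these checks rely on might look like the sketch below, reusing the `Dimension` enum from Section 1; the emotion tokens and anachronism markers are illustrative assumptions, not RPEval’s actual parser.

```python
def parse_response(dimension, raw: str) -> str:
    """Map a raw model response to a comparable token per dimension."""
    text = raw.strip().lower()
    if dimension == Dimension.EMOTIONAL:
        # Extract the first recognized emotion token.
        for token in ("malice", "envy", "sadistic_pleasure", "contempt"):
            if token in text:
                return token
        return "unrecognized"
    if dimension in (Dimension.DECISION, Dimension.MORAL):
        # Binary yes/no choice checked against the persona code.
        return "yes" if text.startswith("yes") else "no"
    if dimension == Dimension.ICC:
        # Any anachronism marker counts as a violation (scores zero).
        markers = ("premier league", "smartphone", "internet")
        return "violation" if any(m in text for m in markers) else "no_violation"
    raise ValueError(f"unknown dimension: {dimension}")
```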

Refinements proposed include:

  • Multi-Turn Evaluation: Incorporation of 3–5 turn dialogues captures dynamic adaptation and character memory.
  • Expanded Scales: Migration from binary to Likert scoring for nuanced assessment of affect and intent (see the sketch after this list).
  • Enriched Metrics: Introduction of charisma or persona strength dimensions, and automated LLM-based “villain judge” components for free-form stylistic critique.
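
As a small illustration of the proposed Likert migration, the sketch below maps mean 1–5 judge ratings onto the same 0–1 range as the binary $S_d$; the scale and example ratings are assumptions.

```python
def likert_dimension_score(ratings, scale=5):
    """Normalize mean 1..scale judge ratings to the 0..1 range of S_d."""
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / (scale - 1)

print(likert_dimension_score([4, 5, 3, 4]))  # 0.75
```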

5. Methodological Constraints and Open Challenges

Despite its algorithmic clarity, villain role-play benchmarking confronts several technical and conceptual obstacles:

  • Safety-Alignment Tension: As highlighted in (Yi et al., 7 Nov 2025), alignment objectives for LLM safety directly suppress the expression of manipulation, deceit, and cruelty even in well-cued fictional settings. The benchmark reveals a near-perfect negative monotonic correlation (Spearman’s $\rho \approx -0.99$) between a character’s antagonism (Moral Alignment Level $M$) and fidelity scores; a correlation check of this kind is sketched after this list. Top models in generic competence (e.g., Arena rankings) may perform significantly worse in villain-specific RP, illustrating misalignment between general chatbot performance and villainous fidelity.
  • Domain Transferability: Leaderboards constructed from single-turn, text-only scenarios may not generalize to multi-modal or long-horizon RP. Extensions such as VoxRole (Wu et al., 4 Sep 2025) add speech consistency, intonation, and paralinguistic metrics, while FURINA (Wu et al., 8 Oct 2025) incorporates multi-agent, multi-turn, and dynamic LLM-judged role-play including hallucination-rate trade-offs.
  • Trait Polarity Penalty: Negative or antagonistic traits incur systematically higher penalty scores than neutral or positive ones, indicating the particular challenge LLMs face in accurately sustaining malevolence.
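
The monotonic relationship reported by (Yi et al., 7 Nov 2025) can be checked with standard tooling. The sketch below uses the four level-averaged fidelity values quoted just below as placeholder data; the reported $\rho \approx -0.99$ comes from finer-grained, per-character data, so these averages yield exactly $-1.0$.

```python
from scipy.stats import spearmanr

moral_alignment_level = [1, 2, 3, 4]   # prosocial -> villainous (M)
fidelity = [3.21, 3.13, 2.71, 2.61]    # level-averaged fidelity scores

rho, p_value = spearmanr(moral_alignment_level, fidelity)
print(rho)  # -1.0: any strictly decreasing sequence is perfectly monotonic
```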

Across recent benchmarks, model rankings display several consistent quantitative trends:

  • Monotonic Degradation: Fidelity in villain role-play drops across the morality spectrum: $F_1 = 3.21$, $F_2 = 3.13$, $F_3 = 2.71$, $F_4 = 2.61$ on average, with a marked inflection at the egoist-villain boundary (a drop of 0.42), substantiating a performance cliff as alignment deviates from prosociality (Yi et al., 7 Nov 2025).
  • Divergence from General Benchmarks: Highly safety-aligned models (top in “Arena”) lag in VRP-specific rankings, e.g., gemini-2.5-pro: Arena Rank 1 → VRP Rank 4; claude-opus-4.1: Arena Rank 1 → VRP Rank 15.
  • Leaderboard Exemplars: Models such as glm-4.6 and deepseek-v3.1-think lead VRP scores, characterized by more context-aware or permissioned alignment strategies allowing richer in-character malice, whereas mainstream, conservative models (e.g., GPT-4o) are reliable but less “villainous”.

6. Prospective Directions and Research Opportunities

To address observed limitations, several research trajectories are outlined:

  • Context-Aware Safety Mechanisms: Decoding and filter strategies that distinguish between “in-character” fictional malice and actual harmful intent allow more authentic villain portrayal while retaining real-world safety guarantees (Yi et al., 7 Nov 2025); a toy gating sketch follows this list.
  • Fictionally-Bounded Fine-Tuning: Dataset curation targeting narrative and genre specificity to impart permissible antisocial behavior, and persona-controlled decoding that activates or deactivates alignment as appropriate to character boundaries.
  • Multi-Modal Expansion: Incorporation of acoustic and paralinguistic fidelity into leaderboards (e.g., using PCS, EEA, PF in VoxRole (Wu et al., 4 Sep 2025)), supporting comprehensive comparison of text and audio role-play agents.
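
As a toy illustration of context-aware safety gating, the sketch below routes a response through a stricter or more permissive moderation threshold depending on whether a fictional frame is active; `moderation_score`, the thresholds, and the overall design are hypothetical assumptions, not a published mechanism.

```python
def gate_response(response: str, fiction_frame_active: bool,
                  moderation_score, strict=0.3, permissive=0.8) -> bool:
    """Return True if the response may be shown to the user.

    moderation_score is any callable mapping text to a 0..1 harm score
    (0 = benign, 1 = harmful); in-character fictional malice is allowed
    a higher threshold than out-of-frame content.
    """
    score = moderation_score(response)
    threshold = permissive if fiction_frame_active else strict
    return score <= threshold
```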

In summary, the Villain RolePlay Leaderboard offers researchers and developers a systematized, multidimensional platform for benchmarking, diagnosing, and improving the villainous role-play capacity of advanced LLMs, illuminating the interaction between alignment, character fidelity, and the evolving frontier of controlled persona simulation.
