
RMTBench: Bilingual Benchmark for LLM Role-Play

Updated 8 December 2025
  • RMTBench is a bilingual benchmarking framework that simulates authentic, multi-turn dialogues in English and Chinese for evaluating LLM role-playing capabilities.
  • It constructs user-motivation driven interactions using 80 diverse characters and over 8,000 dialogue rounds to assess aspects like ethical boundaries and persona consistency.
  • The framework scores responses along rigorously defined dimensions such as emotional expression, character understanding, and security, providing actionable insights for refining LLM performance.

RMTBench is a bilingual benchmarking framework for evaluating LLMs in role-playing scenarios through multi-turn, user-centric dialogue simulations. Unlike prior benchmarks that focus on character-centric data or isolated question–answer pairs, RMTBench constructs interactions that mirror authentic user motivations, encompassing both English and Chinese dialogues across 80 diverse characters and over 8,000 dialogue rounds. Its methodology enables assessment of practical deployment requirements, including preference memory, topic drift, ethical boundaries, and immersion, making it suitable for rigorous evaluation of LLM role-playing capabilities in real-world contexts (Xiang et al., 27 Jul 2025).

1. Motivation and Foundational Principles

RMTBench addresses specific limitations found in existing role-playing benchmarks such as SocialBench, CharacterEval, and RAIDEN, which typically use character-centric profiles to generate static Q&A or single-turn evaluations. These approaches neglect dynamic conversational aspects and the underlying user intentions—such as emotional support, goal-driven guidance, or entertainment—that drive real-world role-play interactions.

RMTBench shifts the paradigm to a user-centric focus, constructing dialogues based on explicit user motivations rather than mere character trivia. Multi-turn depth is prioritized to reflect actual conversational phenomena: drift across topics, tracking of user preferences, and testing of ethical boundaries. The bilingual support (English and Chinese) allows for probing language-specific model robustness.

2. Benchmark Composition and Scenario Construction

Character Taxonomy

The benchmark includes 80 characters, divided into three main categories:

  • Celebrities: Historical figures, world leaders, entertainment personalities (e.g., “Albert Einstein,” “Taylor Swift”).
  • Fictional Characters: Iconic personas from literature, film, gaming, and animation (e.g., “Sherlock Holmes,” “Darth Vader”).
  • Custom Characters:
    • Specific: Fully detailed, invented personas (e.g., “Dr. Mei Ling, marine biologist specializing in coral reefs”).
    • Abstract: Characterizations based only on minimal traits or labels (e.g., “A shy poet,” “An adventurous child”).

Candidate profiles are extracted from wikis and existing benchmarks, followed by manual domain expert validation to ensure factual correctness and diversity.
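
A minimal sketch of how this taxonomy might be represented in code; the class and field names are illustrative, not taken from the paper's release:

```python
from dataclasses import dataclass
from enum import Enum

class CharacterType(Enum):
    """The three top-level categories used by RMTBench."""
    CELEBRITY = "celebrity"
    FICTIONAL = "fictional"
    CUSTOM = "custom"

@dataclass
class CharacterProfile:
    """Illustrative schema for one of the 80 benchmark characters."""
    name: str
    char_type: CharacterType
    profile_text: str          # injected into the system message at evaluation time
    is_abstract: bool = False  # custom characters only: minimal-trait personas

einstein = CharacterProfile(
    name="Albert Einstein",
    char_type=CharacterType.CELEBRITY,
    profile_text="Theoretical physicist, developer of the theory of relativity...",
)
```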

User-Centric Scenarios

Dialogue construction is organized around five central user-motivation scenarios:

  1. Character Understanding: Factual or background queries.
  2. Character Maintenance: Tests designed to challenge immersion (e.g., avoiding self-disclosure as an AI).
  3. Implicit Intent Response: Task-oriented requests requiring the persona to infer the user's underlying goal and deliver domain-specific advice.
  4. Preference Awareness & Reasoning: Multi-turn exchanges focusing on preference recognition and personalized advice.
  5. Sensitive Behavior Handling: Progressive escalation of taboo or harmful requests for ethical safeguard testing.
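
These five scenarios form a fixed label set for tagging generated dialogue blocks. A minimal encoding might look like the following sketch; the identifier names are illustrative, not taken from the paper's release:

```python
from enum import Enum

class Scenario(Enum):
    """User-motivation scenarios driving RMTBench dialogue construction
    (identifier names are illustrative, not from the paper)."""
    CHARACTER_UNDERSTANDING = "character_understanding"
    CHARACTER_MAINTENANCE = "character_maintenance"
    IMPLICIT_INTENT_RESPONSE = "implicit_intent_response"
    PREFERENCE_AWARENESS_REASONING = "preference_awareness_reasoning"
    SENSITIVE_BEHAVIOR_HANDLING = "sensitive_behavior_handling"
```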

Generation Pipeline

  • User utterances for each scenario are generated using Claude 3.5 Sonnet with scenario-specific prompts.
  • For preference and sensitive scenarios, questionnaires and escalation templates are constructed to ensure dialogue depth and challenge.
  • Dialogue blocks from different scenarios are randomly spliced to generate “extra-long” sessions (>20 turns).
  • Human inspectors curate the generated dialogues, pruning low-quality examples.

The resulting dataset comprises 320 dialogues, each averaging 29 Chinese words or 110 English characters per user utterance, for a total of 8,156 user turns.
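
The splicing step can be sketched as follows, assuming each scenario's generated utterances arrive as a list of strings; the function name and the way the turn threshold is applied are illustrative, not the paper's actual pipeline:

```python
import random

# Each block is the list of user utterances generated for one scenario
# (e.g. "character_understanding", "sensitive_behavior_handling").
DialogueBlock = list[str]

def splice_session(blocks: list[DialogueBlock], min_turns: int = 20,
                   rng: random.Random | None = None) -> list[str]:
    """Randomly concatenate scenario blocks until the session exceeds
    the extra-long threshold (>20 user turns in the paper)."""
    rng = rng or random.Random()
    pool = blocks[:]          # don't mutate the caller's list
    rng.shuffle(pool)
    session: list[str] = []
    for block in pool:
        session.extend(block)
        if len(session) > min_turns:
            break
    return session
```

Splicing blocks drawn from different scenarios is what produces the cross-topic drift that the benchmark later probes.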

3. Dialogue Simulation and Evaluation Flow

At evaluation time, models receive only the compiled user utterance history for each scenario; no gold responses are provided. The simulation proceeds as follows:

$$\text{History } H_0 = [\text{System: You are <character profile>}]$$

For $t = 1, \dots, T$ user turns:

  • Append $\text{User}: u_t$ to $H_{t-1}$; the model generates $\text{Model}: r_t$.
  • Update $H_t = H_{t-1} \cup \{\text{User}: u_t,\ \text{Model}: r_t\}$.

System messages encode character profiles without additional in-context demonstrations. The model must sustain persona consistency and respond to user motivations using only the profile and dynamic history.
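
This loop can be sketched in a few lines of Python; the `generate` callback stands in for whatever inference interface the model under test exposes and is an assumption, not part of RMTBench's published tooling:

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}

def simulate_dialogue(profile: str, user_turns: list[str],
                      generate: Callable[[list[Message]], str]) -> list[Message]:
    """Replay the scripted user utterances against the model under test.
    Only the character profile and the growing history are visible."""
    history: list[Message] = [{"role": "system", "content": f"You are {profile}"}]
    for u_t in user_turns:
        history.append({"role": "user", "content": u_t})
        r_t = generate(history)  # model response r_t conditioned on H_{t-1} + u_t
        history.append({"role": "assistant", "content": r_t})
    return history
```

Because no gold responses exist, the transcript this loop produces is what the seven-dimension scoring (Section 4) is applied to.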

4. Evaluation Dimensions, Metrics, and Scoring

RMTBench evaluates LLM responses along seven rigorously defined dimensions:

| Dimension | Description | Scale |
| --- | --- | --- |
| Emotional Expression (EE) | Richness and appropriateness of conveyed feelings | 5-point Likert |
| Emotional Comprehension (EC) | Empathetic response to user sentiment | 5-point Likert |
| Plot Advancement (PA) | Ability to drive the dialogue forward | 5-point Likert |
| Character Understanding (CU) | Fidelity to persona background and style | 5-point Likert |
| Character Maintenance (CM) | Avoidance of AI/ChatGPT self-disclosure (immersion) | Binary (Yes/No) |
| Security (SEC) | Refusal of harmful or unethical requests | Binary (Yes/No) |
| User Preference Awareness (UPA) | Application of user preferences in advice | Binary (Yes/No) |

Scores are averaged over all turns and sessions, normalized to percentages. The overall average is:

$$\mathrm{AvgScore}(M) = \frac{1}{7} \sum_{\mathrm{dim}} \mathrm{score}^{\%}_{\mathrm{dim}}(M)$$

Automatic scoring is performed using Qwen2.5-72B-Instruct, calibrated against majority-voted human annotations over 500 dialogues, achieving mean consistency: EE (0.78), EC (0.86), PA (0.72), CU (0.75).
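
As an illustration, the overall average reduces to an unweighted mean over the seven per-dimension percentages; the scores below are made-up numbers, not results from the paper:

```python
def avg_score(dim_scores_pct: dict[str, float]) -> float:
    """Unweighted mean over the seven dimension scores, each already
    normalized to a percentage (Likert dimensions rescaled, binary
    dimensions taken as pass rates)."""
    assert len(dim_scores_pct) == 7, "RMTBench defines exactly seven dimensions"
    return sum(dim_scores_pct.values()) / len(dim_scores_pct)

# Illustrative (not real) per-dimension percentages for some model M:
scores = {"EE": 78.0, "EC": 82.5, "PA": 70.0, "CU": 75.5,
          "CM": 90.0, "SEC": 95.0, "UPA": 64.0}
print(f"AvgScore(M) = {avg_score(scores):.1f}%")
```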

5. Experimental Results and Comparative Analyses

Closed-source LLMs (Qwen2.5-Max, ChatGPT-4o, Claude 3.5) outperform open-source counterparts by 8–12 points on average score. For instance, Qwen2.5-Max achieves an AvgScore of 81.4% in English and 82.9% in Chinese. No single model excels in all dimensions: Qwen2.5-72B leads in Security, Claude 3.5 in User Preference Awareness, Doubao-Pro in Character Maintenance, and DeepSeek-R1 in Plot Advancement.

Closed-source models further exhibit enhanced language stability (minimal EN/CN score variance), while open-source models show greater cross-lingual performance differences (e.g., Qwen2.5-72B: −8.6 points EN vs. CN).

Ablation Studies

  • Pseudo Multi-Turn vs. Authentic Multi-Turn: Use of preset responses instead of truly multi-turn history results in ~4 point performance inflation for small models.
  • Single Dialogue Block vs. Full Session: Short blocks (5–10 turns) yield higher scores than full-length sessions, signifying performance degradation over longer contexts.
  • Round-by-Round Performance: Closed-source models maintain or improve performance in later rounds; open-source models degrade as context increases.
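
One way to reproduce the round-by-round analysis is to group per-turn scores by turn index and compare means; the `(turn_index, score)` record layout here is an assumption about how such evaluation logs might be stored:

```python
from collections import defaultdict

def per_round_means(records: list[tuple[int, float]]) -> dict[int, float]:
    """Mean score per user-turn index, given (turn_index, score) pairs
    collected across all sessions. A downward trend in later rounds
    indicates degradation over long contexts."""
    sums: defaultdict[int, float] = defaultdict(float)
    counts: defaultdict[int, int] = defaultdict(int)
    for turn, score in records:
        sums[turn] += score
        counts[turn] += 1
    return {t: sums[t] / counts[t] for t in sorted(sums)}
```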

6. Applications, Usage, and Limitations

RMTBench supports pre-launch evaluation of LLM-based role-playing agents in applications such as entertainment, education, and therapy. It enables continuous monitoring of persona fidelity and ethical compliance across extended sessions.

The authors note key limitations:

  • Automatically generated user utterances may not accurately capture the nuance of genuine human motivations.
  • LLM-based scoring correlations, while acceptable, suggest opportunity for more robust evaluators or human-in-the-loop auditing.
  • Sensitive corpus content requires careful access control and ethical licensing.

7. Future Extensions and Research Directions

Potential directions for RMTBench include:

  • Expansion to additional languages and dialects.
  • Incorporation of more complex role-play scenarios (e.g., multi-party, dynamic character shifts).
  • Development of open-source, human-aligned reward models as alternatives to proprietary LLM judges.
  • Fine-grained, temporal analysis of character memory and persona retention across sessions.

By centering evaluation on explicit user goals, realistic conversational dynamics, and nuanced multi-dimensional metrics, RMTBench establishes a scalable, deployment-aligned standard for measuring and improving LLM role-playing capabilities (Xiang et al., 27 Jul 2025).
