
RoleCS Dataset: Strategic Support Dialogue

Updated 18 May 2026
  • RoleCS is a synthetic, strategy-rich dataset simulating professional customer support dialogue with explicit stages and tactics.
  • It employs a five-agent role-playing framework that ensures strategic alignment, persona consistency, and rigorous quality filtering.
  • Empirical evaluations demonstrate notable improvements in dialogue coherence and strategic response accuracy when LLMs are fine-tuned with RoleCS.

RoleCS is a large-scale, synthetic, strategy-rich corpus for training and evaluating LLMs in the domain of customer support dialogue. Based entirely on role-playing between LLM-powered agents, RoleCS is constructed to encode explicit conversational structure, strategic alignment, and persona consistency, reflecting the professional standards found in real-world customer service, particularly in accordance with COPC guidelines. The dataset is designed to address the scarcity of accessible, high-quality, multi-turn support data for fine-tuning next-generation conversational agents, providing both comprehensive coverage of standard service scenarios and detailed guidance grounded in a staged support framework.

1. Conversational Framework and Strategy Taxonomy

RoleCS is defined by a structured Customer Support Conversation (CSC) framework that divides the dialogue into five sequential stages: Connecting, Identifying, Exploring, Resolving, and Maintaining. The stages are operationalized through twelve discrete strategies: Greeting (GT), Identity Verification (IV), Emotional Management (EM), Restatement/Paraphrasing (RP), Problem Refinement (PR), Providing Suggestions (PS), Information Delivery (ID), Resolution Implementation (RI), Feedback Request (FR), Appreciation & Closure (AC), Relationship Continuation (RC), and Others. Every supporter utterance is explicitly aligned with exactly one strategy, which makes tactic tracking transparent and simplifies downstream model training and evaluation (Zhu et al., 6 Aug 2025).
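
The taxonomy above can be written down as a small data structure. The following is a minimal sketch, not the dataset's actual schema: the class names `Stage`, `Strategy`, and `Utterance` are hypothetical, and the member name `OTHERS` is an assumption because the text gives no abbreviation for that strategy. The mapping from stages to strategies is not stated in the text, so it is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The five sequential CSC stages."""
    CONNECTING = 1
    IDENTIFYING = 2
    EXPLORING = 3
    RESOLVING = 4
    MAINTAINING = 5


class Strategy(Enum):
    """The twelve supporter strategies, keyed by the abbreviations in the text."""
    GT = "Greeting"
    IV = "Identity Verification"
    EM = "Emotional Management"
    RP = "Restatement/Paraphrasing"
    PR = "Problem Refinement"
    PS = "Providing Suggestions"
    ID = "Information Delivery"
    RI = "Resolution Implementation"
    FR = "Feedback Request"
    AC = "Appreciation & Closure"
    RC = "Relationship Continuation"
    OTHERS = "Others"  # assumed name: no abbreviation is given in the text


@dataclass(frozen=True)
class Utterance:
    """One turn; hypothetical record layout for illustration."""
    speaker: str                         # "supporter" or "customer"
    text: str
    strategy: Optional[Strategy] = None  # set only for supporter turns

    def __post_init__(self):
        if self.speaker not in ("supporter", "customer"):
            raise ValueError(f"unknown speaker: {self.speaker!r}")
        # Supporter turns carry exactly one strategy; customer turns carry none.
        if (self.speaker == "supporter") != (self.strategy is not None):
            raise ValueError("strategy must be set for supporter turns only")
```

Because `strategy` holds a single enum member rather than a list, the "exactly one strategy per supporter utterance" constraint is enforced by the type itself.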

2. Synthetic Data Generation via Multi-Agent Role-Playing

The core of RoleCS is a five-agent simulation pipeline. Generation starts with a Planner agent, which samples one of seven customer topics (e.g., technical support, account query, security review) and a customer persona from approximately 1,948 profiles. These profiles were automatically extracted from 15,980 real transcripts, then clustered and deduplicated by cosine similarity (threshold > 0.85). Each topic and persona combination yields a micro-scenario description and a communication goal that anchor one dialogue instance.
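
The pairing of topics and personas can be sketched as a Cartesian product. This is an illustration under stated assumptions: only three topic names appear in the text, so the remaining four are placeholders, the persona identifiers are synthetic stand-ins for the free-text profiles, and `sample_seed` is a hypothetical helper.

```python
import itertools
import random

NUM_TOPICS = 7
NUM_PERSONAS = 1948

# Three topic names come from the text; the other four are placeholders.
topics = ["technical support", "account query", "security review"] + [
    f"topic_{i}" for i in range(4, NUM_TOPICS + 1)
]
# Placeholder identifiers standing in for the free-text persona profiles.
personas = [f"persona_{i:04d}" for i in range(NUM_PERSONAS)]

# Every (topic, persona) pair is one candidate dialogue seed.
scenario_seeds = list(itertools.product(topics, personas))


def sample_seed(rng):
    """Pick one (topic, persona) pair, as a Planner might."""
    topic, persona = rng.choice(scenario_seeds)
    return {"topic": topic, "persona": persona}


print(len(scenario_seeds))
print(sample_seed(random.Random(0)))
```

With 7 topics and 1,948 personas the product has 13,636 pairs, which matches the number of initial dialogues reported for the corpus.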

Subsequent interaction involves sequential alternation between simulated Supporter and Customer agents. The Supporter Assistant recommends a strategy at every turn, upon which the Supporter generates a tactic-aware response. In parallel, the Customer Assistant and Customer agent ensure persona-consistent, context-appropriate customer replies. Prompts for all agents are crafted to robustly enforce strategy coverage, persona realism, and scenario coherence. This pipeline enforces a minimum of one Greeting, one Identity Verification, and one Appreciation & Closure per conversation, as well as strict turn alternation (Zhu et al., 6 Aug 2025).
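
The turn loop described above can be sketched with stub agents. This is a simplified illustration, not the authors' pipeline: the function `simulate_dialogue`, its parameters, the choice to let the supporter open, and the rule of stopping after an Appreciation & Closure turn are all assumptions. In a real setup each callable would wrap an LLM prompt.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str, str]  # (speaker, strategy label or "", text)


def simulate_dialogue(
    recommend: Callable[[List[Turn]], str],      # Supporter Assistant
    supporter: Callable[[List[Turn], str], str], # Supporter
    customer_hint: Callable[[List[Turn]], str],  # Customer Assistant
    customer: Callable[[List[Turn], str], str],  # Customer
    max_supporter_turns: int,
) -> List[Turn]:
    history: List[Turn] = []
    for _ in range(max_supporter_turns):
        # The assistant recommends a strategy, then the supporter writes the reply.
        strategy = recommend(history)
        history.append(("supporter", strategy, supporter(history, strategy)))
        if strategy == "AC":  # assumed stopping rule: closure ends the dialogue
            break
        # The customer side mirrors this: guidance first, then the reply.
        hint = customer_hint(history)
        history.append(("customer", "", customer(history, hint)))
    return history


# Scripted stubs in place of LLM calls.
script = iter(["GT", "IV", "PS", "AC"])
dialogue = simulate_dialogue(
    recommend=lambda history: next(script),
    supporter=lambda history, strategy: f"<supporter reply using {strategy}>",
    customer_hint=lambda history: "stay in persona",
    customer=lambda history, hint: "<customer reply>",
    max_supporter_turns=10,
)
for speaker, strategy, text in dialogue:
    print(speaker, strategy, text)
```

Appending exactly one supporter turn and one customer turn per iteration gives strict alternation by construction, so no separate repair step is needed for that constraint.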

3. Dataset Composition and Statistics

RoleCS comprises 11,232 dialogues, retained from an initial 13,636 (the Cartesian product of 7 topics and 1,948 profiles) after multi-stage automated filtering and manual spot checking. The resulting corpus contains 263,580 utterances, with an average of 23.47 turns per dialogue (12.23 supporter turns at 66.98 words per turn; 11.23 customer turns at 46.43 words per turn). Structural and quality control mechanisms remove dialogues with fewer than 10 or more than 50 utterances, any single utterance exceeding 500 characters, or a turn imbalance beyond 2:1. An LLM (Qwen2.5-72B) serves as a holistic quality gate for coherence, strategy fidelity, and empathy (Zhu et al., 6 Aug 2025).
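
The structural thresholds above translate directly into a filter predicate. The sketch below is a minimal version under two assumptions: a dialogue is represented as a list of `(speaker, text)` pairs, and "turn imbalance" is read as the ratio of supporter to customer turn counts. The function name is hypothetical.

```python
def passes_structural_filter(dialogue):
    """dialogue: list of (speaker, text) pairs; speaker is 'supporter' or 'customer'."""
    n = len(dialogue)
    # Length bounds: keep dialogues with 10 to 50 utterances.
    if n < 10 or n > 50:
        return False
    # No single utterance may exceed 500 characters.
    if any(len(text) > 500 for _, text in dialogue):
        return False
    # Turn imbalance: reject if one side has more than twice the other's turns.
    supporter_turns = sum(1 for speaker, _ in dialogue if speaker == "supporter")
    customer_turns = n - supporter_turns
    if min(supporter_turns, customer_turns) == 0:
        return False
    return max(supporter_turns, customer_turns) <= 2 * min(supporter_turns, customer_turns)


balanced = [("supporter" if i % 2 == 0 else "customer", "ok") for i in range(12)]
print(passes_structural_filter(balanced))
print(passes_structural_filter(balanced[:8]))
```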

Strategy coverage in the corpus is high: most dialogues employ eight to twelve distinct strategy types. Lexical diversity is quantified by Distinct-2 (22.35%), and semantic diversity by the mean TF-IDF cosine similarity (≈0.12 between random dialogue pairs), indicating both varied wording and varied scenario trajectories.
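
To make the semantic diversity measure concrete, the following sketch computes TF-IDF cosine similarity between dialogues in pure Python. It is an illustration only: whitespace tokenization and the smoothed IDF form `log((1 + N) / (1 + df)) + 1` are assumptions, since the text does not specify the exact weighting, and the three toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


docs = ["refund my atm fee", "refund my atm fee", "reset router password"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A low mean value over random dialogue pairs, such as the ≈0.12 reported above, means that two arbitrary dialogues share little weighted vocabulary.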

| Statistic | Value | Notes |
|---|---|---|
| Total dialogues | 11,232 | After quality and structural filtering |
| Total utterances | 263,580 | Includes both supporter and customer |
| Avg. turns per dialogue | 23.47 | ~12 supporter, ~11 customer |
| Distinct customer personas | ≈1,948 | Extracted from real transcripts |
| Strategy coverage rate | ≈0.92 | Mean (# used strategies)/12 |
| Lexical diversity (Distinct-2) | 22.35% | Distinct 2-grams |
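
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the variable names are illustrative.

```python
initial_dialogues = 7 * 1948        # topics x personas
kept_dialogues = 11_232
total_utterances = 263_580

removed = initial_dialogues - kept_dialogues
retention = kept_dialogues / initial_dialogues
avg_turns = total_utterances / kept_dialogues
avg_strategies = 0.92 * 12          # coverage rate times taxonomy size

print(f"initial dialogues: {initial_dialogues}")
print(f"removed by filtering: {removed} ({1 - retention:.1%})")
print(f"average turns per dialogue: {avg_turns:.2f}")
print(f"average distinct strategies per dialogue: {avg_strategies:.2f}")
```

The product reproduces the 13,636 initial dialogues, filtering removes 2,404 of them (about 17.6%), and the utterance total divided by the dialogue count gives the stated 23.47 turns. A coverage rate of 0.92 corresponds to roughly 11 of the 12 strategies per dialogue.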

4. Quality Control and Persona Engineering

Three filtering regimes protect corpus integrity. Structural filtering eliminates outliers in length and turn alternation; strategy-presence checks mandate that key strategies are used in each conversation; and a model-based filter excludes dialogues rated as low-quality on empathy, coherence, or strategic fit. Every dialogue alternates strictly between supporter and customer, modeling authentic conversational flow.
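
Two of the three regimes are simple rule checks and can be sketched as follows. This assumes that the "key strategies" are the three named as mandatory in the generation pipeline (Greeting, Identity Verification, Appreciation & Closure); the function names are hypothetical, and the model-based filter is omitted because it depends on an LLM judge.

```python
REQUIRED_STRATEGIES = {"GT", "IV", "AC"}  # assumed to be the mandated "key strategies"


def has_required_strategies(supporter_strategies):
    """True if every mandatory strategy label occurs at least once."""
    return REQUIRED_STRATEGIES.issubset(supporter_strategies)


def alternates_strictly(speakers):
    """True if no speaker takes two consecutive turns."""
    return all(a != b for a, b in zip(speakers, speakers[1:]))


print(has_required_strategies(["GT", "IV", "PS", "AC"]))
print(alternates_strictly(["supporter", "customer", "supporter"]))
```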

Customer personas originate from the distillation of real-world transcripts: an LLM extracts profile attributes (age, occupation, communication style, emotional baseline), followed by clustering and de-duplication at cosine similarity >0.85, and reconversion into free-text for flexible role-playing. Semantic alignment between customer utterances and their assigned profiles is measured via cumulative token overlap or TF-IDF cosine (Zhu et al., 6 Aug 2025).
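
De-duplication at a cosine threshold can be illustrated with a greedy pass over profile vectors. This is one plausible reading, not the authors' procedure: the text does not say which embedding or clustering method is used, so the vectors here are toy values and the greedy keep-first rule is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(vectors, threshold=0.85):
    """Return indices of profiles kept; a profile is dropped if its
    similarity to any already-kept profile exceeds the threshold."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: the second is a near-duplicate of the first.
profiles = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(deduplicate(profiles))
```

The greedy pass is order-dependent: the first profile in each near-duplicate group is the one that survives.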

5. Metrics and Evaluation Protocol

Coverage of the CSC strategies is formalized as

$$\text{coverage} = \frac{\#\,\text{distinct CSC strategies used}}{12}$$

with an observed average of ≈0.92 per dialogue. Lexical variety is measured via Distinct-N metrics (number of distinct n-grams divided by the total number of n-grams), and alignment between customer utterances and their assigned personas via TF-IDF-based cosine similarity.
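
Both metrics are short to compute. The sketch below assumes whitespace tokenization and lowercasing for Distinct-N, which the text does not specify; the function names are illustrative.

```python
def strategy_coverage(strategies, taxonomy_size=12):
    """Fraction of the taxonomy used at least once in one dialogue."""
    return len(set(strategies)) / taxonomy_size


def distinct_n(utterances, n=2):
    """Distinct n-grams divided by total n-grams over a list of utterances."""
    total = 0
    unique = set()
    for utterance in utterances:
        tokens = utterance.lower().split()
        # Slide a window of size n over the tokens of this utterance.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


print(strategy_coverage(["GT", "IV", "EM", "GT"]))
print(distinct_n(["a b c", "a b d"], n=2))
```

In the second call the bigram `("a", "b")` occurs twice, so three distinct bigrams out of four give a Distinct-2 of 0.75.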

Model evaluation on RoleCS leverages standard metrics—BLEU-n, ROUGE-L, BERTScore, BLEURT—as well as strategy-prediction accuracy (ACC). Fine-tuning large instruction-tuned models such as Qwen2.5-72B-Instruct, LLaMA3.1-8B/70B, and Qwen2.5-7B on RoleCS yields substantial empirical improvements on the human-annotated CSConv benchmark set. For instance, Qwen2.5-72B saw BLEU-2 rise from 8.61 to 12.15, BLEU-4 from 3.23 to 5.32, ROUGE-L from 5.41 to 7.97, and strategy ACC from 37.22% to 43.29%. Similar proportional gains were validated for the LLaMA family (Zhu et al., 6 Aug 2025).
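
The reported Qwen2.5-72B numbers can be turned into absolute and relative gains, and strategy accuracy itself is a plain exact-match rate. The sketch below uses only the figures quoted above; the function `strategy_accuracy` and the dictionary names are illustrative.

```python
def strategy_accuracy(predicted, gold):
    """Share of supporter turns whose predicted strategy matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# (before, after) fine-tuning, as reported for Qwen2.5-72B.
reported = {
    "BLEU-2": (8.61, 12.15),
    "BLEU-4": (3.23, 5.32),
    "ROUGE-L": (5.41, 7.97),
    "ACC (%)": (37.22, 43.29),
}

gains = {}
for name, (before, after) in reported.items():
    absolute = after - before
    relative = 100 * absolute / before
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.1f}% relative")
```

In relative terms the gains are roughly 41% for BLEU-2, 65% for BLEU-4, 47% for ROUGE-L, and 16% for strategy accuracy (6.07 percentage points), so the n-gram overlap metrics improve proportionally more than strategy prediction does.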

In human and LLM-based evaluations, RoleCS-finetuned models outperform both zero-shot baselines and competing models (DeepSeek-R1) across accuracy, empathy, and helpfulness. Agreement between the human raters and GPT-4o, measured with Fleiss’ κ, is 0.658, which indicates substantial agreement.
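
For readers unfamiliar with the agreement statistic, here is a minimal pure-Python sketch of Fleiss' κ. The input layout (one row per rated item, one column per category, each cell a rater count) is the standard formulation; the toy rating matrices are invented for illustration and are not data from the paper. On the conventional Landis and Koch scale, values from 0.61 to 0.80 are labelled substantial.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Overall proportion of ratings falling in each category.
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item, then averaged.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_observed - p_expected) / (1 - p_expected)


print(fleiss_kappa([[3, 0], [0, 3]]))  # three raters in full agreement
print(fleiss_kappa([[2, 1], [1, 2]]))  # three raters, split votes
```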

6. Representative Dialogue Example and Utility

The following excerpt illustrates strategic annotation and alternation, with explicit labels for supporter utterances:

  • Customer: “Hi – I noticed a \$25 ATM fee last night that I didn’t expect.”
  • Supporter (GT): “Good evening! Thank you for calling FinBank Support. My name is Li Mei, how may I assist you today?”
  • Customer: “I used an ATM near my office and saw the fee. I’m not sure why it applied.”
  • Supporter (IV): “May I have your account number and the last four digits of your registered phone for verification?”
  • Supporter (EM): “I understand it’s frustrating to incur extra charges unexpectedly. Let’s get to the bottom of this.”
  • Supporter (PS): “As a workaround, I can credit one fee back to your account right now.”
  • Customer: “No, that’s perfect—thank you!”
  • Supporter (AC): “Thank you for choosing FinBank. Have a great evening, and feel free to call us anytime.”

RoleCS thus offers a domain-agnostic, taxonomy-rich, persona-anchored synthetic corpus for support dialogue, bridging the constraints of commercial data and the demands of research. Empirical evidence demonstrates that models trained on RoleCS are more likely to generate coherent, strategic, and empathetic support responses, closely approximating industry best practices and serving as a resource for future work in dialogue systems, strategic communication, and customer-experience management (Zhu et al., 6 Aug 2025).
