RoleCS Dataset: Strategic Support Dialogue
- RoleCS is a synthetic, strategy-rich dataset simulating professional customer support dialogue with explicit stages and tactics.
- It employs a five-agent role-playing framework that ensures strategic alignment, persona consistency, and rigorous quality filtering.
- Empirical evaluations demonstrate notable improvements in dialogue coherence and strategic response accuracy when LLMs are fine-tuned with RoleCS.
RoleCS is a large-scale, synthetic, strategy-rich corpus for training and evaluating LLMs in the domain of customer support dialogue. Based entirely on role-playing between LLM-powered agents, RoleCS is constructed to encode explicit conversational structure, strategic alignment, and persona consistency, reflecting the professional standards found in real-world customer service, particularly in accordance with COPC guidelines. The dataset is designed to address the scarcity of accessible, high-quality, multi-turn support data for fine-tuning next-generation conversational agents, providing both comprehensive coverage of standard service scenarios and detailed guidance grounded in a staged support framework.
1. Conversational Framework and Strategy Taxonomy
RoleCS is defined by a structured Customer Support Conversation (CSC) framework that divides the dialogue into five sequential stages: Connecting, Identifying, Exploring, Resolving, and Maintaining. Across these stages, supporter behavior is operationalized through twelve discrete strategies: Greeting (GT), Identity Verification (IV), Emotional Management (EM), Restatement/Paraphrasing (RP), Problem Refinement (PR), Providing Suggestions (PS), Information Delivery (ID), Resolution Implementation (RI), Feedback Request (FR), Appreciation & Closure (AC), Relationship Continuation (RC), and Others. Every supporter utterance is explicitly aligned with exactly one strategy, which makes tactics transparent to track and simplifies downstream model training and evaluation (Zhu et al., 6 Aug 2025).
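The taxonomy above can be sketched as a small data model. This is an illustrative sketch, not the dataset's actual schema: the class names `Stage`, `Strategy`, and `Utterance`, the speaker strings, and the member name `OTHERS` are assumptions. The mapping from stages to strategies is not given in this section, so it is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The five sequential CSC stages."""
    CONNECTING = "Connecting"
    IDENTIFYING = "Identifying"
    EXPLORING = "Exploring"
    RESOLVING = "Resolving"
    MAINTAINING = "Maintaining"


class Strategy(Enum):
    """The twelve supporter strategies, keyed by their abbreviations."""
    GT = "Greeting"
    IV = "Identity Verification"
    EM = "Emotional Management"
    RP = "Restatement/Paraphrasing"
    PR = "Problem Refinement"
    PS = "Providing Suggestions"
    ID = "Information Delivery"
    RI = "Resolution Implementation"
    FR = "Feedback Request"
    AC = "Appreciation & Closure"
    RC = "Relationship Continuation"
    OTHERS = "Others"  # no abbreviation is given in the text; name is assumed


@dataclass(frozen=True)
class Utterance:
    speaker: str                          # "supporter" or "customer" (assumed)
    text: str
    strategy: Optional[Strategy] = None   # exactly one for supporter turns

    def __post_init__(self) -> None:
        if self.speaker not in ("supporter", "customer"):
            raise ValueError(f"unknown speaker: {self.speaker!r}")
        if self.speaker == "supporter" and self.strategy is None:
            raise ValueError("supporter utterances need exactly one strategy")
        if self.speaker == "customer" and self.strategy is not None:
            raise ValueError("customer utterances carry no strategy label")
```

Holding the strategy in a single optional field, rather than a list, encodes the one-strategy-per-supporter-utterance rule directly in the type.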
2. Synthetic Data Generation via Multi-Agent Role-Playing
The core of RoleCS is a five-agent simulation pipeline. A Planner agent governs data generation: it samples one of seven customer topics (e.g., technical support, account query, security review) and one customer persona from approximately 1,948 profiles. These profiles were automatically extracted from 15,980 real transcripts, then clustered and deduplicated by cosine similarity (threshold > 0.85). Each topic-persona combination yields a micro-scenario description and a communication goal that anchor the dialogue instance.
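A minimal sketch of the two Planner-side steps described above, under stated assumptions: profiles are already embedded as numeric vectors (the embedding model is not specified here), deduplication is approximated by a greedy pass rather than the clustering procedure actually used, and the function names `deduplicate` and `plan` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings: List[Sequence[float]],
                threshold: float = 0.85) -> List[int]:
    """Greedy near-duplicate removal: keep a profile only if its similarity
    to every previously kept profile does not exceed the threshold."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def plan(topics: Sequence[str], personas: Sequence[str],
         rng: random.Random) -> Tuple[str, str]:
    """Sample one (topic, persona) pair to seed a dialogue."""
    return rng.choice(topics), rng.choice(personas)


# Toy vectors: the second is nearly parallel to the first and is dropped.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept_indices = deduplicate(toy)
```

The greedy pass depends on input order, so it is only a stand-in for a proper clustering step; it does guarantee that no two kept profiles exceed the similarity threshold.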
Subsequent interaction involves sequential alternation between simulated Supporter and Customer agents. The Supporter Assistant recommends a strategy at every turn, upon which the Supporter generates a tactic-aware response. In parallel, the Customer Assistant and Customer agent ensure persona-consistent, context-appropriate customer replies. Prompts for all agents are crafted to robustly enforce strategy coverage, persona realism, and scenario coherence. This pipeline enforces a minimum of one Greeting, one Identity Verification, and one Appreciation & Closure per conversation, as well as strict turn alternation (Zhu et al., 6 Aug 2025).
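The turn loop described above can be outlined as follows. This is a hedged skeleton, not the authors' implementation: the four callables stand in for LLM-backed agents, their signatures are hypothetical, and the choices that the supporter opens the dialogue, that the loop ends on an AC turn, and that there is a cap of 25 supporter turns are all assumptions made for the sketch.

```python
from typing import Callable, List, Tuple

# One history entry: (speaker, strategy label or "", text)
History = List[Tuple[str, str, str]]


def role_play(
    recommend_strategy: Callable[[History], str],    # Supporter Assistant
    supporter_reply: Callable[[History, str], str],  # Supporter
    customer_guidance: Callable[[History], str],     # Customer Assistant
    customer_reply: Callable[[History, str], str],   # Customer
    closing_strategy: str = "AC",
    max_supporter_turns: int = 25,                   # arbitrary cap (assumed)
) -> History:
    """Alternate supporter and customer turns until the closing strategy."""
    history: History = []
    for _ in range(max_supporter_turns):
        # The assistant picks a strategy, then the supporter realizes it.
        strategy = recommend_strategy(history)
        history.append(("supporter", strategy,
                        supporter_reply(history, strategy)))
        if strategy == closing_strategy:
            break
        # The customer assistant supplies persona guidance for the reply.
        hint = customer_guidance(history)
        history.append(("customer", "", customer_reply(history, hint)))
    return history


# Stub agents following a fixed strategy script, for illustration only.
script = iter(["GT", "IV", "AC"])
demo = role_play(
    recommend_strategy=lambda hist: next(script),
    supporter_reply=lambda hist, s: f"[{s}] supporter text",
    customer_guidance=lambda hist: "stay in persona",
    customer_reply=lambda hist, hint: "customer text",
)
```

Because each loop iteration appends at most one supporter turn followed by one customer turn, strict alternation holds by construction.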
3. Dataset Composition and Statistics
RoleCS comprises 11,232 dialogues filtered from an initial 13,636 (the Cartesian product of 7 topics and 1,948 profiles) through multi-stage automated filtering and manual spot checking. The resulting corpus contains 263,580 utterances, with an average of 23.47 turns per dialogue (12.23 supporter turns at 66.98 words each; 11.23 customer turns at 46.43 words each). Structural and quality controls remove dialogues with fewer than 10 or more than 50 utterances, any single utterance exceeding 500 characters, or a turn imbalance beyond 2:1. An LLM (Qwen2.5-72B) serves as a holistic quality gate for coherence, strategy fidelity, and empathy (Zhu et al., 6 Aug 2025).
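The structural thresholds above translate into a short predicate. The sketch below assumes a dialogue is a list of `(speaker, text)` pairs and reads "turn imbalance beyond 2:1" as the ratio between the two speakers' utterance counts; both are interpretations, and the function name is hypothetical. For scale, 7 topics × 1,948 profiles gives the 13,636 initial dialogues, of which 11,232 (roughly 82%) survive filtering.

```python
from typing import List, Tuple

Dialogue = List[Tuple[str, str]]  # (speaker, text); layout is an assumption


def passes_structural_filter(
    dialogue: Dialogue,
    min_utts: int = 10,
    max_utts: int = 50,
    max_chars: int = 500,
    max_ratio: float = 2.0,
) -> bool:
    """Apply the length, utterance-size, and turn-balance thresholds."""
    # Reject dialogues that are too short or too long.
    if not (min_utts <= len(dialogue) <= max_utts):
        return False
    # Reject dialogues containing any over-long utterance.
    if any(len(text) > max_chars for _, text in dialogue):
        return False
    # Reject dialogues where one speaker dominates beyond max_ratio.
    supporter = sum(1 for speaker, _ in dialogue if speaker == "supporter")
    customer = len(dialogue) - supporter
    if min(supporter, customer) == 0:
        return False
    return max(supporter, customer) / min(supporter, customer) <= max_ratio
```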
Strategy usage is broad: most dialogues employ eight to twelve of the twelve strategy types. Lexical diversity is quantified by Distinct-2 (22.35%), and semantic diversity by mean TF-IDF cosine similarity (≈0.12 between random dialogue pairs), indicating both varied wording and varied scenario trajectories.
| Statistic | Value | Notes |
|---|---|---|
| Total dialogues | 11,232 | After quality and structural filtering |
| Total utterances | 263,580 | Includes both supporter and customer |
| Avg. turns per dialogue | 23.47 | ~12 supporter, ~11 customer |
| Distinct customer personas | ≈1,948 | Extracted from real transcripts |
| Strategy coverage rate | ≈0.92 | Mean (# used strategies)/12 |
| Lexical diversity (Distinct-2) | 22.35% | Distinct 2-grams |
4. Quality Control and Persona Engineering
Three filtering regimes protect corpus integrity. Structural filtering eliminates outliers in length and turn alternation; strategy-presence checks mandate that key strategies are used in each conversation; and a model-based filter excludes dialogues rated as low-quality on empathy, coherence, or strategic fit. Every dialogue alternates strictly between supporter and customer, modeling authentic conversational flow.
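The strategy-presence and alternation checks can be sketched as two small predicates. The required set below follows the three strategies the generation pipeline is said to enforce (Greeting, Identity Verification, Appreciation & Closure); the function names and the use of abbreviation strings as labels are assumptions. The model-based filter is omitted because its prompt and scoring scheme are not described here.

```python
from typing import FrozenSet, Iterable, Sequence

REQUIRED: FrozenSet[str] = frozenset({"GT", "IV", "AC"})


def has_required_strategies(strategies: Iterable[str],
                            required: FrozenSet[str] = REQUIRED) -> bool:
    """True if every required strategy label appears at least once."""
    return required.issubset(set(strategies))


def alternates_strictly(speakers: Sequence[str]) -> bool:
    """True if no two consecutive utterances share a speaker."""
    return all(a != b for a, b in zip(speakers, speakers[1:]))
```

For example, `has_required_strategies(["GT", "EM", "PS", "AC"])` is false because Identity Verification never occurs.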
Customer personas originate from the distillation of real-world transcripts: an LLM extracts profile attributes (age, occupation, communication style, emotional baseline), followed by clustering and de-duplication at cosine similarity >0.85, and reconversion into free-text for flexible role-playing. Semantic alignment between customer utterances and their assigned profiles is measured via cumulative token overlap or TF-IDF cosine (Zhu et al., 6 Aug 2025).
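As an illustration of the TF-IDF cosine measure mentioned above, here is a dependency-free sketch. The tokenizer, the smoothed IDF formula, and fitting IDF only on the documents passed in are all assumptions; the weighting scheme actually used is not specified in this section.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors with smoothed IDF: ln((1+N)/(1+df)) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Invented toy texts: a persona, an on-persona reply, an off-persona reply.
persona = "retired teacher, polite, anxious about online banking"
on_persona = "I'm a retired teacher and online banking makes me anxious"
off_persona = "my router keeps dropping the wifi connection"
vecs = tfidf_vectors([persona, on_persona, off_persona])
aligned = cosine(vecs[0], vecs[1])
unaligned = cosine(vecs[0], vecs[2])
```

In this toy case the on-persona reply shares several terms with the profile and scores higher than the off-persona reply, which shares none. The same function applied to pairs of whole dialogues would give the kind of inter-dialogue similarity used as a semantic diversity measure.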
5. Metrics and Evaluation Protocol
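Coverage of the CSC strategies is formalized per dialogue $d$ as
$$\mathrm{Coverage}(d) = \frac{|S_d|}{12},$$
where $S_d$ is the set of distinct strategies used in $d$, with an observed average of ≈0.92 per dialogue. Lexical variety is measured via Distinct-N metrics (number of distinct n-grams over total n-grams), and persona-strategy alignment via TF-IDF-based cosine similarity.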
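Both ratios are simple to compute. The sketch below takes coverage as the number of distinct strategies used divided by twelve, and Distinct-N as distinct n-grams over total n-grams pooled across all supplied token lists; pooling across the corpus, rather than averaging per utterance, is an assumption, as are the function names.

```python
from typing import Iterable, Sequence

NUM_STRATEGIES = 12


def strategy_coverage(strategies_used: Iterable[str],
                      total: int = NUM_STRATEGIES) -> float:
    """Fraction of the taxonomy that appears at least once in a dialogue."""
    return len(set(strategies_used)) / total


def distinct_n(token_lists: Sequence[Sequence[str]], n: int = 2) -> float:
    """Distinct n-grams divided by total n-grams, pooled over all inputs."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in token_lists
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

A dialogue that uses eleven of the twelve strategies has coverage 11/12 ≈ 0.92, which matches the reported corpus average.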
Model evaluation on RoleCS uses standard metrics (BLEU-n, ROUGE-L, BERTScore, BLEURT) as well as strategy-prediction accuracy (ACC). Fine-tuning large instruction-tuned models such as Qwen2.5-72B-Instruct, LLaMA3.1-8B/70B, and Qwen2.5-7B on RoleCS yields substantial improvements on the human-annotated CSConv benchmark set. For instance, for Qwen2.5-72B, BLEU-2 rose from 8.61 to 12.15, BLEU-4 from 3.23 to 5.32, ROUGE-L from 5.41 to 7.97, and strategy ACC from 37.22% to 43.29%. Proportionally similar gains were reported for the LLaMA family (Zhu et al., 6 Aug 2025).
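To make the reported deltas easier to compare, the sketch below computes strategy-prediction accuracy as exact label match and converts the Qwen2.5-72B before/after figures quoted above into relative gains. The function names are hypothetical, and exact-match accuracy is an assumed reading of ACC.

```python
from typing import Dict, Sequence, Tuple


def strategy_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of supporter turns whose predicted strategy equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0


# (before, after) pairs as quoted in the text for Qwen2.5-72B.
REPORTED: Dict[str, Tuple[float, float]] = {
    "BLEU-2": (8.61, 12.15),
    "BLEU-4": (3.23, 5.32),
    "ROUGE-L": (5.41, 7.97),
    "ACC (%)": (37.22, 43.29),
}

for name, (before, after) in REPORTED.items():
    print(f"{name}: {before} -> {after} "
          f"({relative_gain(before, after):+.1f}% relative)")
```

On these figures the relative gains come to roughly 41% (BLEU-2), 65% (BLEU-4), 47% (ROUGE-L), and 16% (ACC), the last corresponding to an absolute gain of 6.07 percentage points.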
In human and LLM-based evaluations, RoleCS-finetuned models outperform both zero-shot baselines and competing models (DeepSeek-R1) across accuracy, empathy, and helpfulness. Human raters and GPT-4o reach a Fleiss’ κ of 0.658, indicating substantial agreement.
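For readers unfamiliar with the agreement statistic, here is a minimal implementation of the standard Fleiss' κ formula. How the raters and rating categories were arranged to produce the 0.658 figure is not described in this section, so the input layout (a count matrix with GPT-4o treated as one more rater) is an assumption.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a count matrix.

    ratings[i][j] is the number of raters who put item i in category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(ratings[0])

    # Observed agreement: mean over items of the pairwise agreement rate.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1.0 - p_expected)
```

On the commonly cited Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which is the band the reported value falls in.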
6. Representative Dialogue Example and Utility
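The following abridged excerpt illustrates strategy annotation, with an explicit label on each supporter utterance. Some intermediate turns are omitted, so consecutive supporter lines here do not indicate a break in turn alternation: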
- Customer: “Hi – I noticed a \$25 ATM fee last night that I didn’t expect.”
- Supporter (GT): “Good evening! Thank you for calling FinBank Support. My name is Li Mei, how may I assist you today?”
- Customer: “I used an ATM near my office and saw the fee. I’m not sure why it applied.”
- Supporter (IV): “May I have your account number and the last four digits of your registered phone for verification?”
- Supporter (EM): “I understand it’s frustrating to incur extra charges unexpectedly. Let’s get to the bottom of this.”
- Supporter (PS): “As a workaround, I can credit one fee back to your account right now.”
- Customer: “No, that’s perfect—thank you!”
- Supporter (AC): “Thank you for choosing FinBank. Have a great evening, and feel free to call us anytime.”
RoleCS thus offers a domain-agnostic, taxonomy-rich, persona-anchored synthetic corpus for support dialogue, bridging the constraints of commercial data and the demands of research. Empirical evidence demonstrates that models trained on RoleCS are more likely to generate coherent, strategic, and empathetic support responses, closely approximating industry best practices and serving as a resource for future work in dialogue systems, strategic communication, and customer-experience management (Zhu et al., 6 Aug 2025).