
TeleSalesCorpus: Sales Dialogue Dataset

Updated 22 November 2025
  • TeleSalesCorpus is a domain-specific dataset characterized by annotated, multi-turn telemarketing dialogues and simulated scripts.
  • It integrates real-world call transcripts and LLM-generated simulations, enabling detailed analysis of sales strategies and agent behaviors.
  • Its structured annotations, comprehensive metadata, and PII redaction support robust benchmarking for advanced conversational AI research.

TeleSalesCorpus is a domain-specific, large-scale collection of dialogue resources for developing, training, and benchmarking intelligent telemarketing agents. It addresses data scarcity in goal-driven, persuasive conversational AI by providing multi-turn, scenario-grounded sales conversations that support fine-grained modeling of sales strategies, factual faithfulness, and customer-adaptive dialogue. Its design, composition, and annotation draw on both real-world and semi-synthetic assets and underpin research in reinforcement learning, LLMs, and autonomous call-agent pipelines in the telesales domain (Zhang et al., 15 Nov 2025, Dao et al., 30 Jun 2025, Kaewtawee et al., 5 Sep 2025).

1. Dataset Structure and Statistical Properties

The research literature refers to TeleSalesCorpus in two primary forms:

  • CallCenterEN (Real-World TeleSalesCorpus): Comprises 91,706 call transcripts (≈ 10,448 hours of audio prior to privacy filtering) from authentic call centers. It includes both inbound (91.3%, 83,724 calls) and outbound (8.7%, 7,982 calls) service and sales calls, with Indian, Filipino, and American accents represented. Domain labels divide the dialogues among nine commercial topics (Medicare, home services, automotive, insurance, and more), with Medicare_inbound the largest at 67.1% (Dao et al., 30 Jun 2025).
  • TeleSalesCorpus (LLM-Simulated/Syn-Data): A semi-synthetic dataset generated by LLM simulation over distilled real call assets. It contains 2,000 high-fidelity, multi-turn dialogues spanning the canonical sales pipeline: Opening → Business_Analysis → Promotion_Introduction → UI_Guidance → Objection_Handling → Polite_Closing. Dialogue lengths are long-tailed: ~15% have 4–6 turns, 50% have 7–12 turns, and the rest reach 15–20 turns. Each dialogue is annotated at the turn-chunk level by dialogue state (Zhang et al., 15 Nov 2025). Mean dialogue length and total utterance count are defined as:

$$\overline{T} = \frac{1}{N}\sum_{i=1}^{N} T_i, \quad U_{\mathrm{total}} = \sum_{i=1}^{N} 2T_i$$

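Here $N$ is the number of dialogues and $T_i$ is the number of turns in dialogue $i$; the factor of 2 indicates that each turn is counted as one user utterance plus one agent utterance. Speaker distributions are computed as $P_s = \frac{U_s}{U_{\mathrm{total}}}$ with $s \in \{\text{User}, \text{Agent}\}$, where $U_s$ is the number of utterances by speaker $s$.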
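As a minimal sketch of how these statistics could be computed, the Python snippet below applies the formulas to a toy corpus. The list-of-(speaker, text) representation, the sample utterances, and the `corpus_stats` helper are assumptions made for illustration; they are not the corpus's released format or tooling.

```python
from collections import Counter

# Hypothetical toy corpus: each dialogue is a list of (speaker, text) utterances.
dialogues = [
    [("Agent", "Hello"), ("User", "Hi"), ("Agent", "Offer"), ("User", "No thanks")],
    [("Agent", "Hello"), ("User", "Who is this?")],
]


def corpus_stats(dialogues):
    """Return (mean turns per dialogue, total utterances, speaker shares)."""
    n = len(dialogues)
    # Assumption: one turn is one User/Agent exchange, i.e. two utterances.
    turns = [len(dialogue) // 2 for dialogue in dialogues]
    mean_turns = sum(turns) / n                  # T-bar
    u_total = sum(2 * t for t in turns)          # U_total
    counts = Counter(speaker for d in dialogues for speaker, _ in d)
    shares = {s: counts[s] / u_total for s in ("User", "Agent")}  # P_s
    return mean_turns, u_total, shares


mean_turns, u_total, shares = corpus_stats(dialogues)
```

Under the two-utterances-per-turn assumption, the two speaker shares are equal by construction; they would differ only if a corpus allowed consecutive utterances from the same speaker.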

For voice AI agent research, in-house domain-specific corpora of ~5,000 calls (of which 1,000 are annotated and sampled) supplement TeleSalesCorpus, enabling analysis at sub-dialogue levels (opening, pitch, objection, closing) and clustering of agent behaviors (Kaewtawee et al., 5 Sep 2025).

2. Collection Pipeline and Annotation Schema

2.1. Real-World Data Acquisition and Redaction

CallCenterEN transcripts are derived from raw telephony audio transcribed using AssemblyAI’s premium ASR, which provides word-level timestamps and confidence scores (overall 86–98%). A 0.1% sample of transcripts undergoes human QA, yielding WER = 3.87%, i.e., ≈96.13% word accuracy.
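For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate: word-level edit distance divided by the reference length. The source does not describe how its WER was computed, so the function name, whitespace tokenization, and example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four.
wer = word_error_rate("the plan covers dental", "the plan cover dental")
accuracy = 1.0 - wer  # same convention as 1 - 0.0387 ≈ 0.9613 in the text
```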

A two-stage personally identifiable information (PII) removal pipeline is employed:

  1. Automated tagging using heuristics and dictionaries for numbers, dates, financial, and medical entities.
  2. Manual review in the QA subset, ensuring systematic redaction across 20+ categories (personal, financial, medical, legal, technical, etc.), compliant with CCPA and India’s DPDP 2023.
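The first, automated stage can be pictured with a small sketch that combines regular-expression heuristics with a term dictionary. Everything here is hypothetical: the patterns, the `MEDICAL_TERMS` set, the bracketed placeholder format, and the `redact` function are illustrative assumptions, far narrower than the 20+ categories described above, and not the pipeline used for CallCenterEN.

```python
import re

# Hypothetical heuristics. Order matters: dates and money are replaced before
# the generic long-number rule so their digits are not consumed by it.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")),
    ("NUMBER", re.compile(r"\b\d{3,}\b")),
]

# Hypothetical dictionary of medical terms to mask.
MEDICAL_TERMS = {"diabetes", "insulin"}
MEDICAL_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in sorted(MEDICAL_TERMS)) + r")\b",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace heuristic and dictionary matches with category placeholders."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return MEDICAL_RE.sub("[MEDICAL]", text)


redacted = redact("I paid $1,200.50 on 03/14/2024 for insulin, call 5551234567")
```

Rules of this kind tend to miss spelled-out numbers and unusual formats, which is one reason a manual review stage is paired with the automated one.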

Each call is released as a JSON object containing: call_id, call_type (inbound/outbound), domain, agent_accent, customer_locale, duration_seconds, overall_confidence, and a sequence of word-level entries (word, start_time, end_time, word_confidence, speaker_role) (Dao et al., 30 Jun 2025).
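To make the record layout concrete, the following sketch parses a hand-written record and merges consecutive words from the same speaker into utterances. The field names follow the list above, but the key holding the word sequence (`words`), all sample values, and the `to_utterances` helper are assumptions made for illustration.

```python
import json
from itertools import groupby

# Hypothetical record; values are invented and "words" is an assumed key name.
sample = json.loads("""
{
  "call_id": "example-0001",
  "call_type": "inbound",
  "domain": "medicare",
  "agent_accent": "filipino",
  "customer_locale": "US",
  "duration_seconds": 1.5,
  "overall_confidence": 0.93,
  "words": [
    {"word": "Hello", "start_time": 0.0, "end_time": 0.4,
     "word_confidence": 0.98, "speaker_role": "agent"},
    {"word": "there", "start_time": 0.4, "end_time": 0.8,
     "word_confidence": 0.95, "speaker_role": "agent"},
    {"word": "Hi", "start_time": 1.2, "end_time": 1.5,
     "word_confidence": 0.91, "speaker_role": "customer"}
  ]
}
""")


def to_utterances(call):
    """Group consecutive words with the same speaker_role into utterances."""
    utterances = []
    for role, group in groupby(call["words"], key=lambda w: w["speaker_role"]):
        words = list(group)
        utterances.append({
            "speaker": role,
            "text": " ".join(w["word"] for w in words),
            "start": words[0]["start_time"],
            "end": words[-1]["end_time"],
        })
    return utterances


utterances = to_utterances(sample)
```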

2.2. Semi-Synthetic Dialogue Simulation

TeleSalesCorpus (Syn-Data) leverages a three-agent simulation:

  • User Agent: Simulates customers with predefined personas.
  • Sales Agent: Generates agent turns guided by dialogue state.
  • Dialogue Manager: Manages state transitions and instantiates prompts using real dialogue-state-indexed chunks from vector stores.

Scenarios are seeded from real data to define promotion rules, and GPT-4 is prompted to author detailed product knowledge bases per scenario. Automatic and manual filtering discards short, redundant, or ungrounded conversations; expert curation ensures factual coherence and realism (Zhang et al., 15 Nov 2025).
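The three-agent loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the `user_agent` and `sales_agent` functions are stubs standing in for LLM calls, a plain dictionary stands in for the vector store of state-indexed chunks, and the Dialogue Manager advances through the six states strictly in order, whereas the actual system may repeat or branch between states.

```python
STATES = [
    "Opening", "Business_Analysis", "Promotion_Introduction",
    "UI_Guidance", "Objection_Handling", "Polite_Closing",
]


def user_agent(persona: str, agent_utterance: str) -> str:
    """Stub for the persona-conditioned customer simulator (an LLM in practice)."""
    return f"[{persona}] reply to: {agent_utterance}"


def sales_agent(state: str, reference_chunk: str) -> str:
    """Stub for the state-guided sales agent (an LLM in practice)."""
    return f"<{state}> {reference_chunk}".strip()


class DialogueManager:
    """Tracks the dialogue state and retrieves a reference chunk for it."""

    def __init__(self, chunk_store: dict):
        self.chunk_store = chunk_store  # stand-in for a vector store
        self.index = 0

    @property
    def state(self) -> str:
        return STATES[self.index]

    def retrieve(self) -> str:
        return self.chunk_store.get(self.state, "")

    def advance(self) -> bool:
        """Move to the next state; return False once the pipeline is finished."""
        if self.index + 1 >= len(STATES):
            return False
        self.index += 1
        return True


def simulate(persona: str, chunk_store: dict) -> list:
    """Run one simulated dialogue and return (state, speaker, text) triples."""
    manager = DialogueManager(chunk_store)
    transcript = []
    while True:
        state = manager.state
        agent_turn = sales_agent(state, manager.retrieve())
        transcript.append((state, "Agent", agent_turn))
        transcript.append((state, "User", user_agent(persona, agent_turn)))
        if not manager.advance():
            return transcript


transcript = simulate("price-sensitive", {"Opening": "Greet and confirm identity."})
```

Keeping the state on every transcript entry means the state tags used for annotation fall out of the simulation directly, rather than requiring a separate labeling pass.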

2.3. Phase and Intent Annotation

Both datasets enforce structured annotation:

  • Turn-level state tags: {Opening, Business_Analysis, Promotion_Introduction, UI_Guidance, Objection_Handling, Polite_Closing}
  • Intent representation in cloning pipelines: Discrete states (e.g., GREET, DISCOVER_NEEDS, PITCH_BENEFIT, HANDLE_OBJECTION, CLOSE_CALL) (Kaewtawee et al., 5 Sep 2025).

3. Applications and Benchmarks

TeleSalesCorpus forms the core for several advanced research applications:

  • Dialogue System Pretraining and Fine-Tuning: Used in supervised and RL-based frameworks (notably, Bayesian-supervised RL with GRPO), supporting strategy optimization under noisy and realistic conditions (Zhang et al., 15 Nov 2025).
  • Evaluation of Conversational Agents: Enables benchmarking with metrics such as turn-level F1 for intent classification, conversation-turn accuracy, and a summed composite quality score (across six sales capabilities and seven metrics). Reported composite scores: Baseline (5.46), SFT-only (5.69), GRPO w/ SFT + DOGA (5.77), direct RL + DOGA (6.49).
  • ASR/NLP Pipeline Validation: Baseline ASR metrics: WER = 3.87%, accuracy ≈96.13% (Dao et al., 30 Jun 2025).
  • Automated Dialogue Playbook Design: Knowledge extracted from top-performing agents (via playbooks, knowledge manuals, persona definition) guides prompt engineering for voice AI (Kaewtawee et al., 5 Sep 2025).
  • Predictive Analytics and Summarization: Enables models for call success prediction, intent/slot detection, real-world noisy sentiment/emotion classification, and abstractive summarization.
  • NER Benchmarking: Redacted entity spans serve as ground truth for de-identification experiments.
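As an illustration of the turn-level F1 metric mentioned above, the sketch below computes per-label and macro-averaged F1 from gold and predicted intent labels. The source does not state which averaging scheme is used, so macro averaging, the helper names, and the example labels are assumptions.

```python
def per_label_f1(gold, pred):
    """Return a {label: F1} mapping for turn-level label sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Hypothetical turn-level annotations for a four-turn call.
gold = ["GREET", "PITCH_BENEFIT", "HANDLE_OBJECTION", "CLOSE_CALL"]
pred = ["GREET", "PITCH_BENEFIT", "PITCH_BENEFIT", "CLOSE_CALL"]
score = macro_f1(gold, pred)
```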

4. Comparative Perspective and Data Uniqueness

Relative to prior open call center corpora:

  • Scale: CallCenterEN/TeleSalesCorpus is ≥90× larger than typical public releases (<1,000 dialogues), spanning nine sales and service domains (Dao et al., 30 Jun 2025).
  • Diversity: Includes Indian, Filipino, and American speakers; covers full call flows across inbound and outbound telephony.
  • Compliance and Metadata: Systematic PII redaction across >20 categories, with rich per-word metadata and CC BY-NC 4.0 licensing for non-commercial research.
  • Synthetic Benchmarks: LLM-based simulated conversations in TeleSalesCorpus uniquely enable ablation studies and model-centric evaluation frameworks, with quality-controlled human curation and multi-metric performance tracking (Zhang et al., 15 Nov 2025).

A summary comparison table:

| Characteristic | TeleSalesCorpus/CallCenterEN | Most Public Corpora |
|---|---|---|
| Dialogues | 2,000–91,706 | < 1,000 |
| Domains | 9 (sales/service) | 1–3 (often generic) |
| Accent Diversity | Indian, Filipino, American | Limited |
| Structured Annotation | Detailed state + intent | Often absent |
| PII Redaction | 20+ categories, CCPA/DPDP compliant | Sparse/incomplete |
| License | CC BY-NC 4.0 | Mixed/Restricted |

5. Impact on TeleSales AI and Design Methodology

TeleSalesCorpus catalyzes research in several advanced directions:

  • Dialogue Model Development: Enables strategy-aware training and robust evaluation. AI-Salesman’s superior performance (as measured by LLM-as-a-Judge and human scoring) is directly tied to its training on TeleSalesCorpus (Zhang et al., 15 Nov 2025).
  • Real-Time Conversational AI: Serves as ground truth and playbook source for prompt engineering in low-latency voice agent systems, supporting dynamic steering and compliance checks (Kaewtawee et al., 5 Sep 2025).
  • Human-AI Benchmarking: Supports the design of human evaluation rubrics (22 criteria) for call center agents, identifying gaps (e.g., objection handling, sales drive) and refining agent personas and tactics iteratively (Kaewtawee et al., 5 Sep 2025).
  • NER and PII Research: The PII-redacted corpus forms a basis for developing, benchmarking, and validating placeholder-based NER and de-identification systems (Dao et al., 30 Jun 2025).
  • Methodological Advances: Establishes a framework for integrating semi-synthetic, LLM-grounded corpora with seed data mining, scenario synthesis, and automated, expert-mediated filtering, enabling scalable and cost-effective data generation for low-resource domains.

6. Limitations and Future Developments

CallCenterEN omits audio data from the public release, limiting research into paralinguistic and speech synthesis components unless private collaboration is arranged. TeleSalesCorpus (LLM-synthesized) relies on scenario design and expert QA for realism, meaning rare or adversarial conversational paths may be underrepresented.

Future research directions proposed by Kaewtawee et al. (5 Sep 2025) and Zhang et al. (15 Nov 2025) include:

  • Large-scale simulation/self-play for dialogue diversity.
  • Retrieval-augmented generation for up-to-date, context-rich agent responses.
  • Emotion recognition for dynamic prosody and empathy.
  • Automated evaluation with LLM judges, calibrated with human rating baselines.
  • Expansion to adjacent domains such as healthcare and customer support using generalizable cloning pipelines.

The resource’s scale, annotation richness, and licensing position TeleSalesCorpus as a critical asset for advancing next-generation telephony-based conversational AI systems, particularly those focused on sales excellence and compliance in real-world, multi-accent settings (Dao et al., 30 Jun 2025, Zhang et al., 15 Nov 2025, Kaewtawee et al., 5 Sep 2025).
