CallCenterEN: English Call Center Transcripts

Updated 24 November 2025
  • CallCenterEN is a large-scale dataset of over 91,000 call center dialogues, with high transcription quality and an average WER of 3.87% from a 0.1% manually reviewed subset.
  • It employs a robust, two-stage privacy-preserving redaction process that combines automated NER with manual QA to anonymize over 40 types of PII across diverse support domains.
  • The rich metadata and per-word temporal annotations enable rigorous experimental design for research in ASR model training, dialogue system development, and intent detection.

CallCenterEN refers to a large-scale, open-source dataset of real-world English-language call center transcripts, constructed to support methodologically rigorous research and development in the domain of automated customer support, sales AI, and associated natural language processing systems. Characterized by industry-scale coverage, high transcription quality, privacy-centric PII redaction, and rich metadata, CallCenterEN is at present the most comprehensive public English call center transcript repository, substantially influencing model training, benchmarking, and domain adaptation for both speech and text-based customer service research (Dao et al., 30 Jun 2025).

1. Dataset Composition and Provenance

CallCenterEN comprises 91,706 agent–customer conversations transcribed from 10,448 hours of source audio; the audio itself is not publicly released. Dialogues are derived from authentic inbound (91.3%) and outbound (8.7%) call flows in business-process-outsourcing (BPO) operations serving primarily U.S.-based customers. Agent speech covers major English dialects from India, the Philippines, and the United States.

The dataset is stratified across multiple support and sales subdomains, such as healthcare (Medicare), insurance, automotive, and home services. The predominance of Medicare inbound calls (67.1%) influences vocabulary and topical distributions, requiring consideration in downstream domain generalization.

2. Transcription and Data Quality

Transcripts in CallCenterEN are generated using a premium commercial ASR pipeline provided by AssemblyAI, following standardized transcription guidelines. Each utterance is accompanied by word-level timestamps and per-word confidence scores. Per-conversation ASR confidence scores fall between 86% and 98%. To provide a ground-truth quality anchor, 0.1% of conversations were randomly selected for manual review, yielding an average Word Error Rate (WER) of 3.87%. WER is computed as

$$\mathrm{WER} = \frac{I + D + S}{N} \times 100\%$$

where $I$ (insertions), $D$ (deletions), and $S$ (substitutions) count the edit operations relative to a human reference transcription containing $N$ reference words.
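The formula can be illustrated with a minimal Python sketch that counts insertions, deletions, and substitutions through a standard word-level Levenshtein alignment. The function names `edit_counts` and `wer`, the whitespace tokenization, and the example sentences are assumptions made for illustration; the source does not describe the scoring tool or text normalization that was actually used.

```python
from typing import List, Tuple


def edit_counts(reference: List[str], hypothesis: List[str]) -> Tuple[int, int, int]:
    """Return (S, D, I) for one minimum-cost word alignment (hypothetical helper)."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total_edits, S, D, I) for reference[:i] vs hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            if reference[i - 1] == hypothesis[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match, no edit needed
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total edits first, so the minimum is optimal.
            cost[i][j] = min(substitution, deletion, insertion)
    _, s, d, n = cost[-1][-1]
    return s, d, n


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent: (I + D + S) / N * 100, with N the reference word count."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    s, d, i = edit_counts(ref_words, hyp_words)
    return (i + d + s) / len(ref_words) * 100.0


if __name__ == "__main__":
    # Illustrative sentences, not taken from the dataset.
    print(wer("the plan covers dental care", "the plan cover dental"))
```

In the example, one substitution ("covers" to "cover") and one deletion ("care") over five reference words give a WER of 40%. Because WER divides by the reference length, a hypothesis with many insertions can exceed 100%.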

The corresponding mean word accuracy on the reviewed subset is 96.13% (100% minus the 3.87% WER). Transcripts are organized as JSON objects containing dialogue turn lists, per-word segmentations (word, start_time, end_time, confidence), overall_conversation_confidence (float), and audio_duration_seconds (integer). This layout supports both fine-grained linguistic analyses and alignment for downstream acoustic studies.
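As a rough illustration of how such a record might be consumed, the sketch below parses one JSON object and derives two simple statistics from the per-word fields. Only `word`, `start_time`, `end_time`, `confidence`, `overall_conversation_confidence`, and `audio_duration_seconds` come from the description above; the `turns`, `speaker`, and `words` keys, the use of seconds for timestamps, the 0 to 1 confidence scale, and the sample values are all assumptions and may differ from the released files.

```python
import json
from collections import defaultdict

# Hypothetical record; the nesting and the values are illustrative only.
SAMPLE_RECORD = """
{
  "turns": [
    {"speaker": "agent", "words": [
      {"word": "thank", "start_time": 0.0, "end_time": 0.3, "confidence": 0.98},
      {"word": "you", "start_time": 0.3, "end_time": 0.5, "confidence": 0.97}
    ]},
    {"speaker": "customer", "words": [
      {"word": "medicare", "start_time": 1.0, "end_time": 1.6, "confidence": 0.71},
      {"word": "plan", "start_time": 1.6, "end_time": 2.0, "confidence": 0.93}
    ]}
  ],
  "overall_conversation_confidence": 0.94,
  "audio_duration_seconds": 312
}
"""


def iter_words(record):
    """Yield (speaker, word_entry) pairs in dialogue order."""
    for turn in record["turns"]:
        for entry in turn["words"]:
            yield turn["speaker"], entry


def low_confidence_words(record, threshold=0.8):
    """Words whose per-word confidence falls below the threshold."""
    return [e["word"] for _, e in iter_words(record) if e["confidence"] < threshold]


def speaking_time(record):
    """Sum of word durations per speaker, in the timestamp unit."""
    totals = defaultdict(float)
    for speaker, entry in iter_words(record):
        totals[speaker] += entry["end_time"] - entry["start_time"]
    return dict(totals)


record = json.loads(SAMPLE_RECORD)

if __name__ == "__main__":
    print(low_confidence_words(record))
    print(speaking_time(record))
```

Filtering on per-word confidence is one plausible way to flag spans for manual review, though a suitable threshold would need to be chosen empirically.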

3. PII Redaction and Privacy Compliance

Given the intrinsic sensitivity of call center data, PII protection in CallCenterEN is enforced through a two-stage process:

  1. Automated detection: A combination of rule-based and model-based NER components identifies over 40 PII types (e.g., names, phone/email, account and credit card numbers, SSN, dates, locations, and medical conditions).
  2. Manual quality assurance: A random 0.1% subset undergoes human review to capture any residual PII missed by automation.

All redacted content is replaced with standardized placeholder tokens. This design is intended to support compliance with CCPA, India’s DPDP 2023, and other major privacy regimes. No formal redaction recall figure is published, so the residual risk is not quantified, although the methodology aims to drive it toward zero.
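The rule-based half of the automated stage can be sketched with a few regular expressions. This is a toy example under stated assumptions, not the dataset's actual pipeline: the three patterns cover only U.S.-style phone numbers, email addresses, and SSNs, the placeholder format (`[PHONE]`, `[EMAIL]`, `[SSN]`) is hypothetical, and the model-based NER component needed for names, locations, or medical conditions is omitted entirely.

```python
import re
from typing import Dict, Tuple

# Hypothetical placeholder tokens paired with deliberately simple patterns.
# SSN (3-2-4 digits) and phone (3-3-4 digits) patterns cannot match the same text.
PII_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
]


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each pattern match with its placeholder and count replacements."""
    counts: Dict[str, int] = {}
    for token, pattern in PII_PATTERNS:
        text, n = pattern.subn(token, text)
        if n:
            counts[token] = n
    return text, counts


if __name__ == "__main__":
    utterance = (
        "My number is 555-867-5309 and my email is "
        "jane.doe@example.com, SSN 123-45-6789."
    )
    print(redact(utterance))
```

Returning the replacement counts alongside the text makes it straightforward to audit how often each rule fires, which is one way a manual QA pass could prioritize conversations for review.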

4. Metadata Schema and Licensing

Each record in CallCenterEN includes:

  • call_id: Unique global identifier
  • call_type: Categorical (inbound/outbound)
  • domain_tag: Subdomain descriptor (e.g., “Medicare_inbound”)
  • agent_accent: Categorical (where available)
  • customer_region: (where available)

This metadata enables controlled stratification and rigorous experimental design (e.g., for accent- or domain-specific ASR benchmarking, intent classification by channel). CallCenterEN is released under a CC BY-NC 4.0 license: attribution is required, all use must be non-commercial, and users are barred from resale, redistribution, and reidentification attempts (Dao et al., 30 Jun 2025).
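A minimal sketch of metadata-driven stratification follows, assuming each record has already been loaded as a Python dict keyed by the field names listed above. The helper names `group_by` and `stratified_sample`, the toy records, and every `domain_tag` value other than "Medicare_inbound" are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List


def group_by(records: List[dict], field: str) -> Dict[str, List[dict]]:
    """Bucket records by a metadata field; missing values go to 'unknown'."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[record.get(field) or "unknown"].append(record)
    return dict(groups)


def stratified_sample(records: List[dict], field: str, per_group: int, seed: int = 0) -> List[dict]:
    """Draw up to per_group records from each stratum, reproducibly."""
    rng = random.Random(seed)
    sample: List[dict] = []
    # Sort strata by name so the result does not depend on input order of groups.
    for _, members in sorted(group_by(records, field).items()):
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy metadata records; values are illustrative, not taken from the dataset.
RECORDS = [
    {"call_id": "c1", "call_type": "inbound", "domain_tag": "Medicare_inbound", "agent_accent": "US"},
    {"call_id": "c2", "call_type": "inbound", "domain_tag": "Medicare_inbound", "agent_accent": None},
    {"call_id": "c3", "call_type": "inbound", "domain_tag": "Medicare_inbound", "agent_accent": "IN"},
    {"call_id": "c4", "call_type": "outbound", "domain_tag": "Insurance_outbound", "agent_accent": "PH"},
    {"call_id": "c5", "call_type": "inbound", "domain_tag": "Automotive_inbound", "agent_accent": "IN"},
]

if __name__ == "__main__":
    balanced = stratified_sample(RECORDS, "domain_tag", per_group=1)
    print([r["call_id"] for r in balanced])
```

Capping the number of calls per `domain_tag` is one way to keep the dominant Medicare stratum from swamping an evaluation set. Fields marked "where available" may be missing, which is why the sketch routes empty values to an "unknown" bucket.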

5. Research Applications and Benchmarks

CallCenterEN’s scale, production diversity, and high transcript fidelity enable multiple research lines:

  • ASR Model Training and Evaluation: Researchers can refine noisy-channel, diarization-robust ASR models on real call dialogue. The released human-reviewed 0.1% split functions as a fixed test set for WER and confidence benchmarking.
  • Dialogue System Development: Dialogue modeling, turn-taking prediction, and agent-assist response generation can be advanced using a high-coverage corpus with clean speaker turns and real-world business context.
  • Intent and Slot Detection: Rich multi-domain coverage supports intent detection and slot-filling, especially for specialized support verticals. Per-turn and per-call metadata enables stratified performance evaluation.
  • PII Detection and De-Identification: Paired access to redacted and original text (where permitted under confidentiality) establishes a direct, reproducible benchmark for NER, privacy-preserving LLMs, and automated PII pipelines.
  • Call-Outcome Prediction: Human-annotated outcomes (resolved/unresolved) or last-turn-based heuristics facilitate downstream predictive analytics and QA model training.

Key baseline metrics: WER on the test set is 3.87%, and ASR confidence per conversation falls in the 86–98% interval.

6. Limitations and Future Extensions

Principal limitations reported by the dataset curators include:

  • Human QA covers only 0.1% of the corpus (roughly 92 of the 91,706 conversations), a sample too small to bound error rates tightly. Transcription or redaction errors may therefore persist undetected in the remaining 99.9%.
  • Audio is not released for public research, which restricts acoustic and multimodal studies to transcript-only pipelines; special arrangements are required for audio access.
  • The Medicare bias (67.1% of calls) may skew LLMs or fine-tuned NER/intent classifiers toward health- and insurance-specific phraseology.
  • Background noise, cross-talk, and other real-world audio artifacts are reflected in the transcripts, but only partially corrected during human review.
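To make the first and third limitations concrete, the following back-of-the-envelope sketch derives the approximate size of the reviewed subset from the figures quoted in this article. The exact number of reviewed conversations is not stated, so the rounded value is an estimate, and the per-domain split assumes the review sample was drawn uniformly at random so that it mirrors the 67.1% Medicare inbound share in expectation.

```python
TOTAL_CONVERSATIONS = 91_706    # corpus size quoted above
QA_FRACTION = 0.001             # 0.1% manually reviewed
MEDICARE_INBOUND_SHARE = 0.671  # share of Medicare inbound calls quoted above


def qa_sample_summary(total=TOTAL_CONVERSATIONS,
                      qa_fraction=QA_FRACTION,
                      medicare_share=MEDICARE_INBOUND_SHARE):
    """Estimate reviewed and unreviewed counts plus the expected domain split."""
    reviewed = round(total * qa_fraction)
    return {
        "reviewed": reviewed,
        "unreviewed": total - reviewed,
        # Expected values under uniform random sampling (an assumption).
        "expected_medicare_reviewed": reviewed * medicare_share,
        "expected_other_reviewed": reviewed * (1 - medicare_share),
    }


if __name__ == "__main__":
    for name, value in qa_sample_summary().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, only about 30 of the reviewed conversations would be expected to come from all non-Medicare domains combined, so quality estimates for any single minority domain rest on very few calls.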

Future dataset extensions proposed include larger-scale human QA, selective audio release under privacy controls, expanded coverage to European/multilingual call centers, and integration of fine-grained intent and sentiment annotations to support richer supervised tasks (Dao et al., 30 Jun 2025).

7. Significance in the Research Landscape

CallCenterEN represents a critical corpus filling the public data gap for English-language call center analytics. Its comprehensive coverage of real, privacy-compliant transcripts, robust per-word temporal metadata, and rich domain diversity directly address historic limitations in dialogue system training and ASR domain adaptation. The combination of scale, thoroughly engineered redaction, and rigorous metadata enables reproducible benchmarking and supports state-of-the-art research in intent detection, quality control, and privacy technology for conversational AI in customer support contexts (Dao et al., 30 Jun 2025).
