CAFE Benchmark Dataset
- CAFE Benchmark Dataset is a multilingual corpus featuring 37 hours of conversational speech with spontaneous code-switching in Algerian Arabic, French, and English.
- It comprises both manually annotated (CAFE-small) and pseudo-labeled (CAFE-large) segments, enabling robust benchmarking and detailed error analysis.
- Annotations mark code-switching points, overlapping speech, and non-lexical events, supporting research on low-resource dialects.
The CAFE (Code-switching Algerian-French-English) Benchmark Dataset is the first publicly released speech resource capturing spontaneous code-switching phenomena in Algerian dialect, French, and English. Sourced from in-vivo human-human conversations in real-world acoustic conditions, primarily via YouTube podcast episodes such as the Gusra Podcast, CAFE addresses distinct linguistic challenges associated with North African Arabic varieties and their sociolinguistic intricacies. The dataset comprises approximately 37 hours of conversational speech from around 100 speakers of varying regional dialects and sociolinguistic backgrounds, with approximately 2 hours 36 minutes manually annotated and released as CAFE-small, and the remaining 34.58 hours provided as pseudo-labeled transcriptions (CAFE-large). This release also includes gold-standard annotations for code-switching, overlapping speech, various non-lexical events, and explicit code-switching point marking. The corpus enables benchmarking with state-of-the-art Automatic Speech Recognition (ASR) systems, providing a standardized platform for research in low-resourced dialectal and multilingual ASR (Lachemat et al., 2024).
1. Corpus Architecture and Source Characteristics
CAFE contains ≈37 hours of spontaneous speech divided into two principal sets. CAFE-small consists of 2h36m of manually segmented and transcribed audio (170 segments, 35 speakers), while CAFE-large consists of ≈34.58 hours of pseudo-labeled content derived from Whisper Large-v3 transcriptions with Preprocessing-V2 (as described in Section IV.3 of the source). Speakers represent diverse Algerian regional and socioeconomic backgrounds, with dialect levels (L0–L4) classified per the ZAEBUC scale. The dataset exhibits a high code-mixing index (average CMI 0.254), reflecting frequent intra- and intersentential code-switching among Algerian Arabic (“Darja”), French, and English.
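The text above reports an average CMI without spelling out how it is computed. A minimal sketch follows, assuming the commonly used utterance-level definition (one minus the share of the dominant language among language-tagged tokens) on a 0-1 scale to match the reported 0.254. The tag names and the `neutral` label are hypothetical and not taken from the CAFE release.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral="other"):
    """Utterance-level code-mixing index on a 0-1 scale (assumed definition).

    lang_tags: one language tag per token, e.g. ["ar", "fr", "ar", "other"].
    Tokens tagged `neutral` (a hypothetical label) count as language-independent.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())  # number of language-independent tokens
    if n == u:  # nothing language-specific, so no mixing to measure
        return 0.0
    # Share of tokens that do NOT belong to the dominant language.
    return 1.0 - max(counts.values()) / (n - u)
```

A monolingual utterance scores 0, and the score grows as the non-dominant languages take up more of the utterance. A corpus-level figure would then be an average over utterances.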
Recordings originate from real-world environments and therefore capture noise, music, overlapping speech, laughter, coughs, and a variety of non-lexical vocalizations, enhancing ecological validity and modeling difficulty. High prevalence of regional phonological and lexical features, alongside significant French/English code-mixing and loanwords, distinguishes CAFE from existing code-switching corpora.
2. Annotation Regime and Event Taxonomy
CAFE-small leverages two complementary manual annotation modes:
- Raw transcription: Unconstrained, faithful word-for-word capture using Arabic script for dialect, Latin script for French/English, and numeric digits. No punctuation normalization is enforced.
- ZAEBUC-Spoken guidelines: Standardized markup, including explicit event labels (noise, {laugh}, {cough}, {cry}) and sequence delimiters (%, + prefix for morphological code switches, script tags for language shifts).
Code-switching points are marked inline through script changes and prefix conventions. Overlapping speech segments (multi-speaker overlap) are annotated both in the main corpus and isolated in a dedicated 28-minute analytic subset. Non-speech and acoustic events employ precise bracketed or braced labels.
Speech segmentation used pydub's split_on_silence function with a minimum silence length of 1000 ms and a silence threshold of –45 dBFS. Chunking targeted durations of 25–60 s (first pass) and 15–120 s (second pass), with up to 10 chunks per file. This approach prioritizes contextual integrity to support LLM-style ASR decoders.
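The segmentation step can be sketched as follows. The `split_on_silence` call uses the silence parameters quoted above, but the regrouping of pieces into 25-60 s chunks (`group_pieces`) is an assumed greedy strategy, not the authors' published procedure, and both function names are hypothetical.

```python
def group_pieces(durations_ms, min_ms=25_000, max_ms=60_000):
    """Greedily group consecutive pieces into chunks of roughly min_ms..max_ms.

    Returns lists of piece indices. A single piece longer than max_ms, or a
    short leftover at the end, is still emitted as its own chunk.
    """
    groups, current, total = [], [], 0
    for i, duration in enumerate(durations_ms):
        # Close the open chunk first if this piece would overshoot the cap.
        if current and total + duration > max_ms:
            groups.append(current)
            current, total = [], 0
        current.append(i)
        total += duration
        if total >= min_ms:
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)
    return groups


def split_file(path, max_chunks=10):
    """Split one recording on silence and regroup into chunks (needs pydub + ffmpeg)."""
    from pydub import AudioSegment
    from pydub.silence import split_on_silence

    audio = AudioSegment.from_file(path)
    # 1000 ms minimum silence length, -45 dBFS silence threshold.
    pieces = split_on_silence(audio, min_silence_len=1000, silence_thresh=-45)
    groups = group_pieces([len(p) for p in pieces])  # len() of a segment is in ms
    chunks = [sum((pieces[i] for i in g), AudioSegment.empty()) for g in groups]
    return chunks[:max_chunks]
```

A second pass with the wider 15-120 s window would call `group_pieces` again with `min_ms=15_000, max_ms=120_000`.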
3. Dataset Splits and Speaker Distribution
| Subset | Duration | Segments/Chunks | Speakers (F/M) | Annotation Mode | Transcription Quality |
|---|---|---|---|---|---|
| CAFE-small | 2h 36m (857MB) | 170 | 4 / 31 | Manual (dual-guideline) | Gold-standard |
| CAFE-large | ≈34.58h (~12GB) | ~2,000+* | 18 / 82 | Pseudo-label (Whisper) | Auto; manual review ongoing |
*Approximate segment count; varies by chunking strategy.
CAFE-small strictly conforms to both manual annotation regimes, with dialect balance (L3 and higher: 136 chunks). CAFE-large’s pseudo-labels are generated by PromptingWhisper Large-v3 with Preprocessing-V2; ongoing human corrections are planned for future updates.
4. ASR Benchmarking Methodology
CAFE benchmarking employs several advanced ASR pipelines:
- PromptingWhisper: Open-source variant of Whisper with prompt engineering. Large-v2 models use bilingual prompts, while large-v3 employs bi/trilingual prompting modes.
- WhisperOriginal: Official OpenAI Whisper large-v3 implementation with standard greedy decoding (T=0) and temperature-controlled strategies (T ∈ {0.0, 0.2, …, 1.0}), as well as fallback mechanisms (dynamic adjustment of temperature, beam size, and patience).
- Preprocessing Pipelines:
  - Preprocessing-V1: pyannote speaker-diarization plus overlapped-speech removal and non-speech excision >1.5 s
  - Preprocessing-V2: pyannote speaker-diarization only, removes non-speech >0.4 s but retains overlaps
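The Preprocessing-V2 idea (drop long non-speech gaps, keep overlapping turns) can be illustrated on plain time intervals. This is a sketch under assumptions: the speaker turns are given as `(start_s, end_s)` tuples standing in for diarization output, and `speech_regions` is a hypothetical helper rather than part of the released pipeline.

```python
def speech_regions(turns, max_gap=0.4):
    """Merge diarized speaker turns into the audio regions to keep.

    turns: iterable of (start_s, end_s) pairs; overlapping turns are allowed
    and are merged rather than removed, so overlapped speech is retained.
    Non-speech gaps longer than max_gap seconds separate regions and are
    therefore excluded; shorter gaps stay inside a region.
    """
    regions = []
    for start, end in sorted(turns):
        if regions and start - regions[-1][1] <= max_gap:
            # Overlap or short pause: extend the current region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(region) for region in regions]


# Two overlapping turns, a 0.2 s pause, then a 2 s silence that is dropped.
kept = speech_regions([(0.0, 2.0), (1.5, 3.0), (3.2, 4.0), (6.0, 7.0)])
```

A V1-style variant would additionally subtract the overlapped spans and use a 1.5 s gap threshold.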
Optimized results require temperature (T) tuning; best CAFE-small outcomes are achieved at T=0.2. WhisperOriginal uses a 30 s context window per chunk. No neural fine-tuning is performed beyond prompt engineering and decoding strategies.
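The temperature fallback described above can be sketched as a retry loop. Here `decode` is a hypothetical callable wrapping the ASR model, and the acceptance thresholds (compression ratio 2.4, average log-probability -1.0) are assumed values modeled on the defaults of the open-source Whisper implementation. The beam-size and patience adjustments are omitted.

```python
import zlib


def compression_ratio(text):
    """Raw size over zlib-compressed size; highly repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                         max_compression_ratio=2.4, min_avg_logprob=-1.0):
    """Try increasing temperatures until a hypothesis passes both checks.

    decode(t) -> (text, avg_logprob) is a hypothetical model wrapper.
    Returns (text, avg_logprob, temperature) of the first accepted result,
    or of the last attempt if none is accepted.
    """
    result = None
    for t in temperatures:
        text, avg_logprob = decode(t)
        result = (text, avg_logprob, t)
        too_repetitive = compression_ratio(text) > max_compression_ratio
        too_uncertain = avg_logprob < min_avg_logprob
        if not (too_repetitive or too_uncertain):
            break
    return result


def fake_decode(t):
    """Stand-in model: loops at T=0, produces a plausible line otherwise."""
    if t == 0.0:
        return ("la " * 100, -0.3)
    return ("salam, ça va bien merci", -0.4)


accepted = decode_with_fallback(fake_decode)
```

In this toy run the greedy output is rejected as repetitive and the T=0.2 hypothesis is accepted. Fixing `temperatures=(0.2,)` would correspond to the single-temperature setting reported as best on CAFE-small.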
5. Evaluation Metrics, Results, and Error Analysis
CAFE assessment employs three principal metrics:
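- Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ are the word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, and $N$ is the number of reference words.
- Character Error Rate (CER): $\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}$, the same ratio computed over characters.
- Mixed Error Rate (MER): $\mathrm{MER} = \frac{S_t + D_t + I_t}{N_t}$, the same ratio computed over tokens, where “tokens” generalize over word/subword units, covering dialectal morphologies.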
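WER and CER are both edit-distance ratios: the minimum number of substitutions, deletions, and insertions (the Levenshtein distance) divided by the reference length. A minimal sketch follows, assuming whitespace tokenization for words and counting spaces as characters for CER. The tokenization CAFE uses for MER is not specified here, so MER is represented only by the generic `error_rate` over any unit sequence.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by reference length; can exceed 1.0."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    return error_rate(list(ref), list(hyp))
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5.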
Primary results (CAFE-small, T=0.2; best configs):
| Model & Pipeline | WER | MER | CER |
|---|---|---|---|
| PromptingWhisper large-v2 | — | 0.735 | 0.643 |
| PromptingWhisper large-v3 | — | ~0.665 | ~0.660 |
| WhisperOriginal (no preprocess) | 0.526 | 0.333 | 0.345 |
| WhisperOriginal + Preprocessing-V2 | 0.538 | 0.310 | 0.329 |
Key error patterns include missing initial tokens when a segment begins with non-speech events (music, overlaps, laughter); manually removing the first 3 s of affected segments mitigates this (MER reduced from 0.97 to 0.35). Retaining overlapping speech occasionally yields better recognition than naive filtering. The Whisper fallback mechanism enhances robustness but is susceptible to hallucination in long transcriptions (>20 tokens). Elevated code-switching rates (CESAR up to >0.4) correlate with higher optimal decoding temperatures.
6. Recommendations and Prospective Applications
Optimal ASR on CAFE requires careful preprocessing, specifically Preprocessing-V2-style chunk initialization that removes leading non-speech while retaining speaker overlap. Chunk-level or CESAR-conditioned temperature tuning around T≈0.2 is recommended to balance determinism and diversity. Extending the context window and refining chunk segmentation can further reduce truncation-induced errors. Fine-tuning Whisper models on CAFE-small is suggested to address code-switching phenomena directly. A front-end dialect/LID module is proposed to guide multilingual transcription at the utterance level.
Prospective use-cases for the CAFE corpus include benchmarking multilingual and code-switching ASR systems, advancing dialect identification methodologies, training/fine-tuning speech recognition models on low-resource Algerian and North African dialects, and facilitating analysis of spontaneous conversational speech phenomena in noisy, naturally overlapping environments.
7. Data Accessibility, Licensing, and Citation
The CAFE-small (manually annotated) and CAFE-large (pseudo-labeled) datasets are publicly available for personal and classroom use without fee, with citation required for non-commercial distribution. Commercial or server-based redistribution requires ACM permission. A download link will be supplied on the project website following paper acceptance. Associated Python scripts and ASR-pipeline code are accessible via GitHub. For formal citation in academic work:
Lachemat H. E-O., Akli A., Oukas N., El Kheir Y., Haboussi S., and Chowdhury S. A. 2024. CAFE: A Novel Code-switching Dataset for Algerian Dialect, French and English. Proc. ACM Meas. Anal. Comput. Syst. 37, 4 (Aug. 2024), 24 pp. (Lachemat et al., 2024)
This resource aims to foster research progress in code-switching ASR and low-resource, multilingual speech processing.