
CAFE Benchmark: Multilingual ASR Dataset

Updated 19 December 2025
  • The CAFE benchmark dataset is a publicly available multilingual code-switching speech corpus capturing spontaneous conversations in Algerian dialectal Arabic (Darja), French, and English.
  • It comprises approximately 37 hours of audio split into a manually annotated subset and a pseudo-labeled subset, with detailed markings for segmentation, overlapping speech, and sociolinguistic events.
  • ASR evaluation with Whisper-based pipelines shows that diarization-based preprocessing and tuned decoding temperature yield the lowest error rates, addressing challenges in code-switching and dialectal variation.

The CAFE benchmark dataset is a publicly released speech corpus designed for the evaluation of automatic speech recognition (ASR) systems in multilingual and code-switching contexts, specifically involving Algerian dialectal Arabic (Darja), French, and English. Unique among available resources, CAFE captures spontaneous in vivo human–human conversational speech, exhibiting natural phenomena such as code-switching, overlapping utterances, interruptions, background noise, and sociolinguistic variation. Developed by Lachemat et al., it addresses the distinct linguistic challenges posed by North African Arabic dialects, notably in scenarios where lexical and morphological code-switching co-occur with substantial phonological variation, informal register, and non-lexical vocalizations (Lachemat et al., 20 Nov 2024).

1. Dataset Composition and Demographics

The CAFE dataset comprises approximately 37 hours of speech data, divided into two major subsets:

  • CAFE-small (2 hours 36 minutes): Fully manually annotated, including segmentation, transcription, explicit labeling of code-switch points, overlapping speech, environmental events (noise, laughter), and speaker/dialect metadata. This subset features 35 speakers (4 female, 31 male) and 170 annotated chunks stratified by dialectness: L0 (10), L1 (11), L2 (14), L3 (79), L4 (56).
  • CAFE-large (34 hours 35 minutes): Contains pseudo-labeled transcriptions derived via advanced ASR pipelines, with around 100 speakers sourced from public YouTube podcasts. Its demographic breakdown includes 18 female speakers and a majority male representation.

Regional coverage encompasses northern, central, and southern Algeria, including both urban (Algiers) and rural variants. This geographical breadth ensures documentation of major phonological variants (such as “q”→[k] and “q”→glottal stop) and heavy incorporation of French and English loanwords.

2. Recording, Sociolinguistic Contexts, and Annotation

All recordings are sourced from public YouTube podcasts (exemplified by Gusra Podcast), using a consistent technical standard: WAV format, 48 kHz sampling rate, 16-bit mono. The material consists exclusively of spontaneous conversational interactions, characterized by code-switching within and between utterances, extensive speaker overlap, back-channeling, interruptions, laughter, coughing, fillers, and fluctuating noise/music backgrounds.

Annotation employs a multi-stage protocol:

  • Segmentation: Automated silence-based chunking (pydub.split_on_silence; min silence 1000 ms, threshold –45 dB; default chunk length 25–60 s, fallback 15–120 s); a minimal sketch of this step follows the list.
  • Transcription: Two paradigm variants for CAFE-small—raw manual annotation and systematic application of ZAEBUC-Spoken rules [Hamed et al. 2024]. Standardized tag sets are used for marking noise (“[noise]”), laughter (“{laugh}”), repetitions, incomplete words, and interruptions. Arabic script annotates Darja/MSA; Latin script marks French and English.
  • Explicit code-switch tagging, time-aligned overlapping speech labeling, and event tagging are integral to both evaluation and analysis.
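
For concreteness, the silence-based chunking step can be approximated with pydub as in the sketch below. The file name, the keep_silence padding, and the chunk-merging heuristic are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of the silence-based chunking step described above.
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("episode.wav")  # assumed 48 kHz, 16-bit mono input

# Parameters quoted in the annotation protocol: 1000 ms minimum silence, -45 dB threshold.
raw_chunks = split_on_silence(
    audio,
    min_silence_len=1000,   # ms of silence that triggers a split
    silence_thresh=-45,     # dBFS level treated as silence
    keep_silence=200,       # small padding so words are not clipped (assumption)
)

def merge_to_window(chunks, min_len_ms=25_000, max_len_ms=60_000):
    """Merge adjacent chunks until they fall in the preferred 25-60 s window
    (the 15-120 s fallback would be a second pass with relaxed bounds)."""
    merged, current = [], AudioSegment.empty()
    for chunk in chunks:
        if len(current) > 0 and len(current) + len(chunk) > max_len_ms:
            merged.append(current)
            current = AudioSegment.empty()
        current += chunk
        if len(current) >= min_len_ms:
            merged.append(current)
            current = AudioSegment.empty()
    if len(current) > 0:
        merged.append(current)
    return merged

for i, chunk in enumerate(merge_to_window(raw_chunks)):
    chunk.export(f"chunk_{i:04d}.wav", format="wav")
```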

CAFE-small is currently positioned as the benchmark evaluation set, while CAFE-large is intended for semi-supervised training and future large-scale ASR adaptation.

3. Benchmarking Pipelines and ASR Evaluation

CAFE features a rigorous set of ASR benchmarking experiments using state-of-the-art models:

  • PromptingWhisper (large-v2)
  • Whisper large-v3 (PromptingWhisper prompting)
  • WhisperOriginal (OpenAI implementation)
  • WhisperOriginal with Preprocessing-V1 and V2

Data processing involves a multi-step workflow:

  1. YouTube crawl via youtube-dl; conversion to WAV.
  2. Acoustic chunking on silence.
  3. Preprocessing-V1: pyannote diarization + overlapped-speech detection, removal of segments with overlapped speech or >1.5 s non-speech.
  4. Preprocessing-V2: pyannote diarization, removal of non-speech >0.4 s only (overlaps retained); a sketch of both filtering rules follows this list.
  5. Whisper log-Mel feature extraction.
  6. Pseudo-label generation: CAFE-small via Whisper large-v3 default decoding, CAFE-large via diarization-based speaker segmentation plus Whisper large-v3.
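
The segment-filtering rules of Preprocessing-V1 and V2 can be sketched as below. The sketch assumes speech and overlapped-speech regions have already been produced upstream (e.g., by pyannote diarization and overlapped-speech-detection pipelines) and are represented as (start, end) tuples in seconds; the function names, the interval representation, and the chunk-level reading of the removal rules are assumptions for illustration.

```python
# Hedged sketch of the Preprocessing-V1 / V2 filtering rules on plain intervals.
from typing import List, Tuple

Region = Tuple[float, float]  # (start, end) in seconds

def overlap_duration(a: Region, b: Region) -> float:
    """Length of the intersection of two intervals, 0.0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def longest_non_speech(chunk: Region, speech: List[Region]) -> float:
    """Longest gap inside `chunk` not covered by any speech region."""
    start, end = chunk
    covered = sorted((max(s, start), min(e, end)) for s, e in speech if e > start and s < end)
    longest, cursor = 0.0, start
    for s, e in covered:
        longest = max(longest, s - cursor)
        cursor = max(cursor, e)
    return max(longest, end - cursor)

def keep_chunk(chunk: Region, speech: List[Region], overlaps: List[Region],
               variant: str = "V2") -> bool:
    """Apply the V1 or V2 removal rule to one audio chunk."""
    if variant == "V1":
        # V1: drop chunks containing any overlapped speech or >1.5 s of non-speech.
        has_overlap = any(overlap_duration(chunk, o) > 0.0 for o in overlaps)
        return not has_overlap and longest_non_speech(chunk, speech) <= 1.5
    # V2: keep overlaps, drop only chunks with >0.4 s of non-speech.
    return longest_non_speech(chunk, speech) <= 0.4

# Example: a 30 s chunk with speech on [0, 12] and [12.3, 30], no overlaps.
print(keep_chunk((0.0, 30.0), [(0.0, 12.0), (12.3, 30.0)], [], variant="V2"))  # True
```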

Decoding strategies incorporate language prompting (bilingual <|ar|><|fr|> and <|ar|><|en|>, multilingual <|ar|><|en|><|fr|> tokens), temperature-controlled greedy search (temperatures between 0.0 and 1.0), and fallback mechanisms adjusting temperature, beam size, and “best of” settings. Optional heuristics include beam search, length normalization, and LLM fusion.
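
A minimal decoding sketch using the openai-whisper package is shown below; the chunk path, the fallback temperature schedule, and the single-language hint are assumptions, and the multi-token language prompting used by PromptingWhisper (<|ar|><|fr|>, etc.) requires modifying the decoder prompt and is not reproduced here.

```python
# Minimal sketch of temperature-controlled decoding with fallback (openai-whisper).
import whisper

model = whisper.load_model("large-v3")

result = model.transcribe(
    "chunk_0001.wav",
    language="ar",                       # single-language hint; PromptingWhisper instead
                                         # injects multiple language tokens into the prompt
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule: retry at a higher
                                                 # temperature when decoding checks fail
    beam_size=5,                         # beam search at temperature 0.0
    best_of=5,                           # sampling candidates at nonzero temperatures
    condition_on_previous_text=False,    # reduce error propagation across segments
)
print(result["text"])
```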

4. Evaluation Metrics and Experimental Results

Primary metrics reported include:

  • Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N_{\mathrm{ref}}}$
  • Character Error Rate (CER): $\mathrm{CER} = \frac{S + D + I}{C_{\mathrm{ref}}}$
  • Mixed Error Rate (MER): $\mathrm{MER} = \frac{S + D + I}{N_{\mathrm{ref}} + N_{\mathrm{hyp}}}$

where $S$ is the number of substitutions, $D$ the number of deletions, $I$ the number of insertions, $N_{\mathrm{ref}}$ the number of reference tokens (words), $C_{\mathrm{ref}}$ the number of reference characters, and $N_{\mathrm{hyp}}$ the number of hypothesis tokens.

MER employs hybrid word–subword tokenization, enabling robust error quantification across code-switch boundaries and dialectal morphologies.
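
As a rough illustration of how these rates can be computed, the sketch below derives S, D, and I from a standard Levenshtein alignment; the whitespace and character tokenizers are placeholders, and the hybrid word–subword tokenization used for MER is not reproduced here.

```python
# Rough sketch of error-rate computation from a Levenshtein alignment.
def edit_ops(ref: list, hyp: list) -> tuple:
    """Return (substitutions, deletions, insertions) of a minimal edit path."""
    # dp[i][j] = (cost, S, D, I) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # i deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                sub = (dp[i - 1][j - 1][0] + 1, dp[i - 1][j - 1][1] + 1,
                       dp[i - 1][j - 1][2], dp[i - 1][j - 1][3])
                dele = (dp[i - 1][j][0] + 1, dp[i - 1][j][1],
                        dp[i - 1][j][2] + 1, dp[i - 1][j][3])
                ins = (dp[i][j - 1][0] + 1, dp[i][j - 1][1],
                       dp[i][j - 1][2], dp[i][j - 1][3] + 1)
                dp[i][j] = min(sub, dele, ins)
    return dp[-1][-1][1:]

def wer(ref: str, hyp: str) -> float:
    s, d, i = edit_ops(ref.split(), hyp.split())
    return (s + d + i) / max(1, len(ref.split()))

def cer(ref: str, hyp: str) -> float:
    s, d, i = edit_ops(list(ref), list(hyp))
    return (s + d + i) / max(1, len(ref))

print(wer("le bruit كان بزاف", "le bruit بزاف"))  # one deletion over four words -> 0.25
```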

Performance on CAFE-small (WER, MER, CER):

| Pipeline | Temp | WER | MER | CER |
|---|---|---|---|---|
| PromptingWhisper (large-v2) | 0.0 | – | 0.735 | 0.643 |
| PromptingWhisper (large-v3, bilingual+) | 0.0 | – | 0.665 | 0.660 |
| WhisperOriginal | 0.0 | 0.526 | 0.333 | 0.345 |
| WhisperOriginal | 0.2 | 0.529 | 0.335 | 0.352 |
| Preprocessing-V1 + WhisperOriginal | 0.0 | 0.538* | 0.355 | 0.365 |
| Preprocessing-V2 + WhisperOriginal | 0.0 | 0.531 | 0.316 | 0.339 |
| Preprocessing-V2 + WhisperOriginal | 0.2 | 0.538 | 0.310 | 0.329 |

*WER for Preprocessing-V1 is interpolated.

Notably, the PromptingWhisper pipelines show high MER values (0.66–0.74), suggesting that language-token prompting alone has minimal impact in complex code-switching contexts. The WhisperOriginal pipelines, especially with Preprocessing-V2 and temperature = 0.2, achieve the best results among the evaluated configurations (MER = 0.310, CER = 0.329, WER = 0.538).

Common error modes include mis-segmentation of non-speech events, language fallback to dominant segmental language, code-switch boundary misclassification, and loss of linguistic content through over-removal of overlapping speech (Preprocessing-V1).

5. Linguistic, Technical, and Methodological Challenges

CAFE exposes several key obstacles for ASR systems:

  • Highly phonetic nature of Darja, frequent vowel/consonant elision, rapid prosody.
  • Complex morphological code-switching, mixing French/English stems with Arabic affixes.
  • Prevalence of overlapping speech, non-lexical vocalizations, informal registers.
  • Low-resource dialect, with scarce public lexicons or pronunciation resources.

Handling these requires careful balancing of preprocessing, annotation precision, and model adaptation. A plausible implication is that future ASR architectures for Algerian dialect or similar environments must accommodate hybrid tokenization schemes and explicit event annotation protocols.

6. Best Practices and Recommendations for ASR on CAFE-like Datasets

Strong ASR performance on CAFE and analogous corpora is achieved through the following practices (a combined sketch follows the list):

  • Preserving overlapping speech segments while removing only longer non-speech stretches, so that conversational context is retained.
  • Utilizing preprocessing pipelines integrating diarization and minimal non-speech removal (Preprocessing-V2).
  • Hyperparameter tuning of decoding temperature (optimal at ≈ 0.2), leveraging WhisperOriginal’s fallback strategies.
  • Adopting MER as the principal evaluation metric, given its code-switch robustness.
  • Ensuring explicit annotation of events (noise, laughter, switch points, overlap).
  • Structuring releases with small manually annotated and large pseudo-labeled subsets for semi-supervised learning regimes.
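
Putting several of these recommendations together, a hedged end-to-end scoring loop might look like the sketch below: decode retained chunks with Whisper at temperature 0.2 and score with word- and character-level error rates via jiwer. The manifest path and format are assumptions for illustration, and MER with the benchmark's hybrid tokenization is not reproduced here.

```python
# Hedged end-to-end scoring loop: decode kept chunks, then compute WER/CER with jiwer.
import json
import jiwer
import whisper

model = whisper.load_model("large-v3")
refs, hyps = [], []

# manifest.json: [{"audio": "chunk_0001.wav", "text": "..."}, ...]  (assumed format)
for item in json.load(open("manifest.json", encoding="utf-8")):
    out = model.transcribe(item["audio"], temperature=0.2, best_of=5)
    refs.append(item["text"])
    hyps.append(out["text"])

print("WER:", jiwer.wer(refs, hyps))
print("CER:", jiwer.cer(refs, hyps))
```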

These practices reflect empirical findings from Lachemat et al., and offer a generalizable methodology for future code-switching ASR research in low-resource, sociolinguistically complex domains (Lachemat et al., 20 Nov 2024).
