AfriSpeech-MultiBench: ASR Benchmark
- The paper introduces a large-scale, multidomain benchmark built from 79.2 hours of harmonized data capturing over 100 African English accents.
- The paper evaluates 19 diverse ASR systems, revealing that fine-tuned compact models often outperform larger ones in accent robustness and domain-specific accuracy.
- The paper demonstrates significant WER variations across regions and application domains, emphasizing the need for localized adaptation and robust lexicon integration.
AfriSpeech-MultiBench is a verticalized, multidomain, multicountry evaluation suite for African-accented English Automatic Speech Recognition (ASR), and the first large-scale, domain-specific benchmarking resource to capture the continent's accent and application diversity. Designed for evaluating open, closed, unimodal, and multimodal speech recognition models, it offers a harmonized platform to assess ASR systems across more than 100 distinct African English accents, 11 countries, and seven application domains, supporting rigorous model selection and performance auditing for speech technologies in African contexts (Ashungafac et al., 18 Nov 2025).
1. Dataset Construction and Composition
AfriSpeech-MultiBench is built from a harmonized aggregation of seven primary datasets comprising both spontaneous and scripted speech drawn from public and private sources, totaling approximately 79.2 hours over 20,093 audio clips. It features 108 African English accent varieties, 11 countries of origin, and 859 individual speakers. Data coverage includes:
- Spontaneous speech: ≈60% (Dialog, Parliament, Med-Conv, Call Center)
- Read/scripted speech: ≈40% (AfriSpeech-200, AfriNames)
The source datasets, their approximate sizes (hours of the original corpora), and contents are summarized below:
| Corpus | Hours (source) | Description |
|---|---|---|
| AfriSpeech-200 | 200 | Read English, 120 accents (clinical + general) |
| AfriSpeech-Dialog | 7 | Spontaneous medical/nonmedical conversation |
| Parliamentary | 35.9 | Noisy, multi-speaker debates |
| Med-Conv-Nig | 25 | Simulated doctor–patient, medical jargon |
| AfriNames | 8.9 | Read African names, numbers, commands |
| AfriSpeech-Countries | 67 | Read & conversational, mixed regions |
| Afro-Call-Centers | n/a | Private, real-world customer–agent dialogues |
This design ensures representative coverage for critical verticals and accent/country variation. Seven "vertical" domains are defined: Medical, General, Legal, Finance, Call Center, Named Entities, and Noise Robustness (Ashungafac et al., 18 Nov 2025).
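The harmonization step can be illustrated with a small manifest-building sketch in Python; the schema (fields such as `corpus`, `domain`, `accent`, `country`, `duration_s`) and the example rows are hypothetical and do not reflect the benchmark's released metadata format:

```python
import pandas as pd

# Hypothetical per-clip manifest rows; the real benchmark's metadata layout may differ.
manifest = pd.DataFrame([
    {"corpus": "AfriSpeech-200", "domain": "General",        "accent": "Yoruba",  "country": "NG", "duration_s": 9.2},
    {"corpus": "Med-Conv-Nig",   "domain": "Medical",        "accent": "Igbo",    "country": "NG", "duration_s": 31.5},
    {"corpus": "Parliamentary",  "domain": "Legal",          "accent": "Swahili", "country": "KE", "duration_s": 44.0},
    {"corpus": "AfriNames",      "domain": "Named Entities", "accent": "Zulu",    "country": "ZA", "duration_s": 4.8},
])

# Summarize coverage the way the benchmark reports it: hours per domain and per country.
hours_by_domain = manifest.groupby("domain")["duration_s"].sum() / 3600
hours_by_country = manifest.groupby("country")["duration_s"].sum() / 3600
print(hours_by_domain.round(4))
print(hours_by_country.round(4))
```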
2. Model Evaluation Suite
Nineteen ASR systems are benchmarked, spanning:
- Open-source conformers: Parakeet-tdt/rnnt (0.6–1.1B), Canary-1B-flash
- OpenAI Whisper family: large-v3 (1.54B), Distil-Whisper-v3.5 (756M), CrisperWhisper (1.54B)
- Speech-augmented LLMs: IBM Granite-3.3B, Mistral Voxtral-Mini-3B, NVIDIA Canary-Qwen-2.5B, MS Phi-4 MM-Instruct (5.6B)
- Proprietary/Cloud APIs: Intron-Sahara V1/V2, GPT-4o Transcribe, Google Gemini 2.0/2.5 Flash, Google Chirp-V3, AWS Transcribe, Azure Speech
Architectural diversity spans FastConformer (CTC/RNN-T), transformer encoder-decoder, and multimodal LLM-head systems. All models except Intron-Sahara V1/V2 are evaluated zero-shot; Intron-Sahara models are fine-tuned on a union of AfriSpeech-200, Dialog, Med-Conv, AfriNames, and related speech corpora for accent-specific adaptation (Ashungafac et al., 18 Nov 2025).
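As a concrete illustration of the zero-shot setting, one of the benchmarked open models (Whisper large-v3) can be run through the Hugging Face `transformers` ASR pipeline; this is a sketch of the setup rather than the paper's exact evaluation harness, and `clip.wav` stands in for any 16 kHz mono benchmark clip:

```python
from transformers import pipeline

# Zero-shot transcription with one of the benchmarked open models (Whisper large-v3).
# Decoding uses default parameters, mirroring the benchmark's protocol.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # long-form audio is processed in 30 s chunks
)

# `clip.wav` is a placeholder path for a 16 kHz mono clip.
result = asr("clip.wav", generate_kwargs={"language": "english"})
print(result["text"])
```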
3. Evaluation Protocol and Metrics
Primary evaluation relies on Word Error Rate (WER), defined as

$$\mathrm{WER} = \frac{S + D + I}{N},$$

where $S$, $D$, and $I$ are the counts of substitution, deletion, and insertion errors, and $N$ is the number of words in the reference.
Additional diagnostics include:
- Named-entity error rate: WER on "Names" subsets (African names, number-rich utterances)
- Noise robustness: False-trigger rates (no-speech subset), WER on short/intervening silence segments
Data splits ensure no speaker overlap across train/test, with 41 accents held out in AfriSpeech-200 for zero-shot generalization assessment. Transcript normalization includes lowercasing, punctuation removal, filler-word removal, and number normalization. All systems use a standard 16 kHz mono pipeline with default decoding parameters (Ashungafac et al., 18 Nov 2025).
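A minimal scoring sketch, assuming the normalization rules above (number normalization and the paper's exact filler-word list are omitted) and a standard word-level edit-distance WER:

```python
import re

FILLERS = {"uh", "um", "erm", "mmm"}  # assumed filler list for illustration

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop filler words before scoring."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    words = [w for w in text.split() if w not in FILLERS]
    return " ".join(words)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (S + D + I) / N via word-level edit distance."""
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the patient has malaria", "the patient has, um, malaria"))  # 0.0 after normalization
```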
4. Benchmark Results
ASR performance on AfriSpeech-MultiBench reveals substantial domain and geographic variation:
- Global WER gap: Models achieve 1–3% WER on standard English corpora (LibriSpeech, TED-Lium, GigaSpeech), but 10–90% on AfriSpeech subsets.
- Regionally fine-tuned models: Intron-Sahara V2 achieves best-in-class accuracy (7.86% mean WER on robustness, 11.8% general, 15.3% medical, 7.9% noise, 4.3–12.4% on finance/names).
- Open models: Best Whisper/Conformer variants yield WERs of 10–25% on conversational data, but degrade to 35–50% on finance and names, and above 40% on structured prompts.
- Accent gradients: West African and North African accent subsets exhibit ≈30% mean WER, compared to ≈21–24% in East and Southern Africa across systems. GPT-4o and CrisperWhisper exceed 60% in challenging cases; Sahara V2 stays under 14% uniformly.
Results indicate that compact conformer models (0.6B) can rival or surpass much larger LLM-based architectures, undermining the notion that parameter count alone predicts accent-robustness. Fine-tuning on in-domain data substantially reduces error (Ashungafac et al., 18 Nov 2025).
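Breakdowns of this kind can be reproduced from per-utterance scores with a simple aggregation; the rows and column names below are illustrative placeholders, not values from the paper:

```python
import pandas as pd

# Hypothetical per-utterance results table: one row per (model, clip), with made-up WER values.
results = pd.DataFrame([
    {"model": "whisper-large-v3", "region": "West Africa", "domain": "Finance", "wer": 0.42},
    {"model": "whisper-large-v3", "region": "East Africa", "domain": "Medical", "wer": 0.23},
    {"model": "sahara-v2",        "region": "West Africa", "domain": "Finance", "wer": 0.09},
    {"model": "sahara-v2",        "region": "East Africa", "domain": "Medical", "wer": 0.13},
])

# Mean WER by model and region, and by model and domain, mirroring the benchmark's breakdowns.
print(results.pivot_table(index="model", columns="region", values="wer", aggfunc="mean"))
print(results.pivot_table(index="model", columns="domain", values="wer", aggfunc="mean"))
```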
5. Failure Modes and Hallucination Robustness
Frequent failure types include:
- Named-entity hallucinations: LLM-based systems often anglicize or hallucinate culturally mismatched named entities.
- Noise sensitivity: Widespread spurious insertions on silence—global false-trigger WER averages 58.9%, versus 0% for Sahara V2.
- Short-clip instability: Mean short-clip WER 45.6% globally; Sahara V2 achieves 16.0%.
- Partial utterance misalignment: Large LLM-based systems hallucinate content during intervening silence, while open-source ASR models sometimes insert "unk" tokens.
The Sahara V2 model's lexicon grounding mitigates many hallucination modes, especially for entity-rich or silence-heavy segments (Ashungafac et al., 18 Nov 2025).
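The silence/false-trigger behavior can be probed by feeding pure silence to a model and counting non-empty outputs; a minimal sketch, assuming the same Hugging Face Whisper pipeline as above (the paper's diagnostic additionally scores WER on no-speech segments, which is not reproduced here):

```python
import numpy as np
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def false_trigger_rate(n_clips: int = 20, clip_seconds: float = 5.0, sr: int = 16000) -> float:
    """Fraction of pure-silence clips for which the model emits any text (a hallucination)."""
    triggers = 0
    for _ in range(n_clips):
        silence = np.zeros(int(clip_seconds * sr), dtype=np.float32)
        text = asr({"raw": silence, "sampling_rate": sr})["text"].strip()
        if text:  # any output on silence counts as a false trigger
            triggers += 1
    return triggers / n_clips

print(f"false-trigger rate on silence: {false_trigger_rate():.2%}")
```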
6. Practical Deployment Recommendations
Key recommendations based on empirical findings:
- For mission-critical verticals (medical, legal, finance), regionally tuned ASR (e.g., Intron-Sahara V2) is required to achieve ≤15% WER and low real-time factor (RTF≈0.3).
- Budget-conscious scenarios: Lightweight conformers, such as Parakeet-tdt-0.6B or Distil-Whisper-v3.5, offer reasonable zero-shot accuracy (20–30% WER).
- Entity-sensitive use cases (name recording, financial commands): Integrate localized lexicons or fine-tune with relevant entity prompts, as most models still exceed 40% WER without in-domain adaptation.
- Cloud deployment: Substantial WER variation depending on accent mix; verify empirically (see the RTF sketch after this list) and consider cost/latency trade-offs.
- Voice-activated systems: Check for false-trigger robustness; only Sahara V2 achieved zero false-positives in silence tests (Ashungafac et al., 18 Nov 2025).
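Real-time factor (RTF = wall-clock transcription time divided by audio duration, so RTF≈0.3 means a clip is processed in roughly a third of its length) can be checked empirically before deployment; a minimal sketch, assuming the same transcription pipeline and a placeholder file path:

```python
import time
import librosa
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

def real_time_factor(path: str) -> float:
    """RTF = wall-clock transcription time divided by audio duration."""
    audio, sr = librosa.load(path, sr=16000, mono=True)  # resample to the 16 kHz mono pipeline
    duration = len(audio) / sr
    start = time.perf_counter()
    asr({"raw": audio, "sampling_rate": sr})
    elapsed = time.perf_counter() - start
    return elapsed / duration

print(f"RTF: {real_time_factor('clip.wav'):.2f}")  # 'clip.wav' is a placeholder path
```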
7. Significance and Implications
AfriSpeech-MultiBench exposes a 5×–10× WER gap between global English and African-accented speech, providing a critical platform for benchmark-driven development and adoption of inclusive ASR in Africa. The absence of consistent correlation between model size and accented-speech performance (with compact, regionally tuned models outperforming larger ones in many domains) suggests that localized linguistic adaptation is crucial. Persistent entity and silence hallucination failures highlight the need for robust lexicon integration and domain adaptation strategies.
By establishing the first end-to-end, multidomain ASR benchmark reflecting the challenges of African-accented English, AfriSpeech-MultiBench enables principled model selection, diagnostic error analysis, and targeted ASR enhancement for voice-driven applications in linguistically underserved regions (Ashungafac et al., 18 Nov 2025).