MENASpeechBank: MENA Speech Corpus
- MENASpeechBank is a comprehensive MENA speech corpus integrating real reference voice data with synthetic, persona-conditioned multi-turn dialogues.
- It offers precise control over speaker identity, persona attributes, and dialect variations, ensuring rigorous quality through low transcription error rates and detailed metadata.
- The resource supports AudioLLM research by providing rich annotations, benchmark scenarios, and robust evaluation methods for speech personalization and conversational grounding.
MENASpeechBank is a reference speech corpus and scalable synthetic data pipeline designed to address data scarcity for Audio LLMs (AudioLLMs) in the MENA (Middle East and North Africa) context, with explicit focus on multi-speaker, persona-grounded, and dialect-diverse speech-text data. It integrates a real reference voice bank of 124 speakers from 18 MENA countries and generates approximately 417,000 persona-conditioned, multi-turn dialog role-plays with precise control over speaker identity, persona attributes, and conversational scenario coverage (Ali et al., 3 Feb 2026).
1. Reference Voice Bank Composition
MENASpeechBank's foundation is an 18,000-utterance repository comprising 124 unique speakers, approximately balanced for gender (56% male) and spanning age strata from under 20 to 60+. The total speech duration is ∼26.4 hours, distributed as follows:
| Language/Variety | Speakers | Utterances | Avg Duration (s/utt) |
|---|---|---|---|
| Modern Standard Arabic | 101 | 10,777 | 6 |
| English | 29 | 5,548 | 6 |
| Gulf Arabic | 15 | 287 | 5 |
| Egyptian Arabic | 12 | 515 | 5 |
| North African Arabic | 10 | 168 | 4 |
| Levantine Arabic | 10 | 346 | 5 |
Data originate from in-house scripted recordings, QASR (MSA), and ADI-17 (dialectal). All utterances pass a strict post-filter on automatic-transcription word error rate (WER ≤ 0.05), computed with Fanar Transcription for Arabic and Whisper-Small for English. Each speaker's metadata (country, mother tongue, age, education, profession, and city) is stored in structured JSON for downstream dynamic persona synthesis.
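For illustration, a single speaker record might be structured as follows; the field names here are inferred from the attributes listed above and are not taken from the released schema:

```python
# Hypothetical speaker-metadata record; field names are inferred from the
# attributes described above, not copied from the released JSON schema.
speaker_record = {
    "speaker_id": "spk_017",
    "country": "Jordan",
    "mother_tongue": "Levantine Arabic",
    "age_bucket": "30-39",
    "gender": "female",
    "education": "bachelor",
    "profession": "pharmacist",
    "city": "Amman",
    "languages": ["Modern Standard Arabic", "English"],
}
```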
2. Persona Profile Construction
Persona enrichment leverages both real participant data and statistically controlled synthetic augmentation. The final persona pool consists of 469 entries (124 real, 345 synthetic), generated via the following stochastic schema:
- Age: sampled within predefined age buckets (under 20 to 60+)
- Gender: equiprobable categorical draw
- Country-consistent personal attributes (name, city, profession), uniform sampling within age buckets
- Marital status and household type: fixed uniform categories
World Values Survey (WVS) Wave 7 aggregates, mapped according to speaker country and age without explicit numeric score exposure, yield values-aligned persona fields such as trust in institutions, religiosity, and family orientation. Digital and AI usage attributes (device, connectivity, AI competence, use cases) are sampled categorically.
Big Five OCEAN personality vectors are produced by adding small Gaussian perturbations to fixed basepoint vectors. Deduplication is enforced by discarding any persona whose summary embedding (all-MiniLM-L6-v2) has cosine similarity ≥ 0.80 with an already-accepted persona; a sketch of this sampling-and-deduplication step follows.
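A minimal sketch, assuming sentence-transformers for embeddings and hypothetical basepoint/scale values for the OCEAN perturbation (the paper's exact numbers are not reproduced here):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical OCEAN basepoint and noise scale, for illustration only.
OCEAN_BASE = np.array([0.6, 0.55, 0.5, 0.6, 0.45])
OCEAN_SIGMA = 0.05

def sample_ocean() -> np.ndarray:
    """Small Gaussian perturbation of a fixed basepoint, clipped to [0, 1]."""
    return np.clip(OCEAN_BASE + rng.normal(0.0, OCEAN_SIGMA, size=5), 0.0, 1.0)

def deduplicate(summaries: list[str], threshold: float = 0.80) -> list[str]:
    """Greedily keep persona summaries whose embedding cosine similarity
    to every already-kept summary stays below the threshold."""
    kept, kept_embs = [], []
    for text, emb in zip(summaries, embedder.encode(summaries)):
        if all(cosine_similarity([emb], [e])[0, 0] < threshold for e in kept_embs):
            kept.append(text)
            kept_embs.append(emb)
    return kept
```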
A Persona Quality Index (PQI) evaluates both static schema fidelity ($\mathrm{PQI}_{\text{static}}$, 5 checks) and narrative richness ($\mathrm{PQI}_{\text{narr}}$, 7 checks), producing a total score $\mathrm{PQI} = \mathrm{PQI}_{\text{static}} + \mathrm{PQI}_{\text{narr}}$ out of 12. 406 of 469 personas attain a perfect PQI of 12; nearly all (468/469, 99.8%) reach PQI ≥ 10 (mean 11.85).
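As reconstructed above, PQI reduces to a count of passed checks; a minimal sketch:

```python
def persona_quality_index(static_checks: list[bool], narrative_checks: list[bool]) -> int:
    """PQI = passed static-schema checks (max 5) + passed narrative checks (max 7).

    A perfect persona scores 12; nearly all released personas score >= 10.
    """
    assert len(static_checks) == 5 and len(narrative_checks) == 7
    return sum(static_checks) + sum(narrative_checks)
```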
3. Conversational Scenario Taxonomy and Generation
A hierarchical scenario taxonomy is established, partitioned into:
- Task/Service domains: e.g., Finance → Banking, Health → Telemedicine, Public Services, etc.
- Knowledge/Topic domains: e.g., Culture & Society, Food & Drink, Business, Religion, Sports, Nature, Events.
For each taxonomy leaf node, GPT-4.1 generates topics; each topic then spawns concrete user-agent dialog scenarios, filtered for near-duplicates (embedding similarity ≥ 0.85). This pipeline results in 4,521 unique scenarios, e.g., "A young professional wants to learn about local volunteer groups and how to join community events."
Persona-to-scenario alignment computes embedding-based cosine similarity and Jaccard token similarity, yielding a hybrid score

$$s_{\text{hyb}} = \lambda\, s_{\cos} + (1 - \lambda)\, s_{\text{Jac}},$$

where $s_{\cos}$ is the cosine similarity of the all-MiniLM-L6-v2 persona and scenario embeddings, $s_{\text{Jac}}$ is the Jaccard index of the summary token sets, and $\lambda$ is a fixed mixing weight. Only pairs whose hybrid score exceeds a fixed threshold are retained, resulting in approximately 4,690 persona–scenario assignments for dialog generation.
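A sketch of this alignment score, with an illustrative equal weighting ($\lambda = 0.5$), since the paper's exact weight and retention threshold are not reproduced here:

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_score(persona_summary: str, scenario: str, lam: float = 0.5) -> float:
    """lam * embedding cosine + (1 - lam) * Jaccard of token sets.

    lam = 0.5 is an illustrative choice, not the paper's value.
    """
    cos = util.cos_sim(
        embedder.encode(persona_summary), embedder.encode(scenario)
    ).item()
    a, b = set(persona_summary.lower().split()), set(scenario.lower().split())
    jac = len(a & b) / len(a | b) if (a | b) else 0.0
    return lam * cos + (1.0 - lam) * jac
```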
4. Synthetic Data Pipeline
Dialogues are instantiated using GPT-4.1 in a system/user JSON-only prompt framework that includes detailed persona fields, scenario definition, and specified number of message turns (4–8 per dialog). In each conversation, the user role is explicitly cast in persona voice, with the assistant behaving as a cooperative agent.
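A minimal sketch of one generation call, assuming an OpenAI-compatible chat API; the prompt wording and output shape here are illustrative, not the paper's exact prompts:

```python
import json
from openai import OpenAI

client = OpenAI()

def generate_dialog(persona: dict, scenario: str, n_turns: int = 6) -> dict:
    """Ask the model to emit a persona-grounded dialog as strict JSON."""
    system = (
        "You write multi-turn role-play dialogs and reply with JSON only, "
        'shaped as {"turns": [{"role": "user"|"assistant", "text": ...}]}. '
        "The user speaks strictly in the given persona's voice; the assistant "
        "is a cooperative agent."
    )
    user = (
        f"Persona: {json.dumps(persona, ensure_ascii=False)}\n"
        f"Scenario: {scenario}\n"
        f"Turns: {n_turns}"
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```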
The user side of each generated dialogue is synthesized using XTTS-v2 (Coqui), a reference-based multi-speaker TTS/voice-cloning system. The pipeline consists of a speaker encoder (deriving a fixed-length speaker embedding $e$ from reference audio), a text-to-mel-spectrogram decoder (conditioned on $e$ and the utterance text), and a neural vocoder.
Training optimizes a loss combining a mel-spectrogram L1 reconstruction term $\mathcal{L}_{\text{mel}}$ with a speaker-similarity penalty $\mathcal{L}_{\text{spk}}$,

$$\mathcal{L} = \mathcal{L}_{\text{mel}} + \lambda\, \mathcal{L}_{\text{spk}},$$

where $\lambda$ is a balancing coefficient.
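Synthesizing one user turn with Coqui XTTS-v2 might look like the following sketch; the text and file paths are placeholders:

```python
from TTS.api import TTS

# Load the multilingual XTTS-v2 voice-cloning model (downloads on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the reference speaker's voice onto a generated user turn.
tts.tts_to_file(
    text="...",                     # one user turn from a generated dialog
    speaker_wav="ref_speaker.wav",  # placeholder path to a voice-bank utterance
    language="ar",
    file_path="user_turn.wav",
)
```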
The dataset ultimately comprises 416,497 multi-turn conversations (∼2.1M messages) and ∼417,000 synthesized user-turn utterances. Quality-assurance sampling (100 synthesized utterances per reference speaker) yields the following aggregate metrics:
| Metric | Value |
|---|---|
| WER (re-transcription vs. target text) | 10% |
| NISQA (predicted naturalness MOS) | 3.60/5 |
| SpkCos (speaker-embedding cosine to reference) | 0.49 |
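Two of these checks can be sketched as follows, using jiwer for WER and Resemblyzer for the speaker-embedding cosine as stand-ins for the paper's exact tooling:

```python
import numpy as np
from jiwer import wer
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_cosine(reference_wav: str, synthesized_wav: str) -> float:
    """Cosine similarity between speaker embeddings of two utterances."""
    ref = encoder.embed_utterance(preprocess_wav(reference_wav))
    syn = encoder.embed_utterance(preprocess_wav(synthesized_wav))
    return float(np.dot(ref, syn) / (np.linalg.norm(ref) * np.linalg.norm(syn)))

# WER between the target text and an ASR re-transcription of the synthesized audio.
error_rate = wer("target text of the turn", "asr transcription of the audio")
```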
5. Evaluation Methodology and Results
For controlled evaluation, 100 Modern Standard Arabic (MSA) conversations from the test split are re-recorded by 10 NDA-contracted annotators (human reference WER: 12%).
Model coverage includes audio-native (Gemini-2.5 Pro, GPT-audio), ASR-cascade (GPT-audio → GPT-5.2), and fully fine-tuned (Qwen2.5-Omni-7B pretrained on 10K hr ASR data, LoRA-tuned on 330K conversations) agents. Evaluations cover both assistant- and user-initiated chat paradigms.
Scoring uses an eight-dimensional LLM-as-Judge rubric (relevance, completeness, specificity, coherence, context tracking, calibration, tone match, safety). Main metrics:
- Average Rubric Score (ARS): mean fraction of rubric checks passed per turn
- Average Pass Rate (APR): fraction of turns jointly passing all required rubric checks (both metrics are sketched in code below)
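The distinction between the two metrics is made concrete in the following sketch, representing each turn as a list of boolean rubric-check outcomes:

```python
def ars(turn_checks: list[list[bool]]) -> float:
    """Average Rubric Score: mean fraction of rubric checks passed per turn."""
    return sum(sum(t) / len(t) for t in turn_checks) / len(turn_checks)

def apr(turn_checks: list[list[bool]]) -> float:
    """Average Pass Rate: fraction of turns passing *all* rubric checks."""
    return sum(all(t) for t in turn_checks) / len(turn_checks)

# Example: three turns scored against a 4-check rubric.
checks = [[True, True, True, True], [True, True, False, True], [True, True, True, True]]
print(ars(checks), apr(checks))  # 0.9166..., 0.6666...
```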
Key results:
| Model | APR Synthetic | APR Human | ARS Synthetic | ARS Human |
|---|---|---|---|---|
| Gemini-2.5 Pro | 0.980 | 0.970 | 0.994 | 0.996 |
| GPT-audio | 0.970 | 0.940 | 0.986 | 0.985 |
| GPT-audio + GPT-5.2 | 0.960 | 0.940 | 0.986 | 0.986 |
| Qwen2.5-Omni-7B FT | 0.806 | 0.809 | 0.944 | 0.944 |
| Arabic cascade (Fanar+Allam) | ≈0.50–0.64 | — | ≈0.75–0.85 | — |
Findings: The APR decrement between synthetic and human-speech test sets is modest (1–3 points), and ARS remains stable. Audio-native architectures outperform ASR → LLM cascades overall. Fine-tuning Qwen2.5-Omni-7B yields an APR gain of over 50 points relative to its open base model (Ali et al., 3 Feb 2026).
6. Impact, Applications, and Limitations
MENASpeechBank constitutes the first publicly released MENA-centric multi-speaker reference voice bank, with robust coverage across English, Modern Standard Arabic, and major regional dialects. It provides not only a large, balanced corpus for pre-training and evaluation but also rich annotations (persona profile, WVS dimensions, OCEAN vectors, AI usage profiles) for in-depth studies of speaker personalization, context-grounded interaction, and dialectal resilience.
Key application domains include:
- Turn-level and multi-turn dialog benchmark tasks for AudioLLMs
- Evaluation of long-context spoken dialogue memory (profile and context preservation)
- Speech-conditioned personalization: style, register, fine-grained persona adherence
- Robustness studies targeting accent and dialect variability
- Preference-based alignment and reranking by coherence/persona fidelity
However, limitations persist: dialectal balance is skewed toward MSA and English; synthetic speech may lack the naturalistic prosodic variation and disfluency patterns of spontaneous human speech; the persona schema enforces a binary gender and bounded age; and conversational contexts are bounded by predefined taxonomy and prompt design.
Future improvements are anticipated to expand dialectal coverage, refine alignment to WVS, further human-audit synthetic speech outputs for fidelity, and extend evaluation to code-switching, paralinguistic factors, far-field recording, and multi-party dialog phenomena.
7. Significance for AudioLLM Research
MENASpeechBank facilitates progress in conversational grounding, personalization, and accent/dialectal robustness for AudioLLMs through its unique integration of real voice data, principled persona construction, large-scale scenario taxonomy, and scalable synthesis pipeline. The resource directly addresses the acute lack of publicly available, richly annotated, persona-consistent, multi-dialectal spoken data for the MENA region. The open release of both the reference bank and the synthetic conversations is positioned to serve as a benchmark and development bed for future AudioLLM research and application in diverse, multilingual, and value-conditioned settings (Ali et al., 3 Feb 2026).