QuranMB.v1: Benchmark for Arabic Recitation
- QuranMB.v1 is a benchmark that standardizes mispronunciation detection in vowelized MSA with a focus on Qur’anic recitation by combining real and synthetic audio data.
- It employs a unified pipeline for data curation, phoneme transcription, controlled error simulation using phoneme confusion matrices, and hierarchical evaluation using SSL models and LSTM networks.
- Baseline evaluations reveal the difficulty of the task: the best model achieves an F1-score of 29.88% with a phoneme error rate of 16.42%, highlighting the need for more advanced methodologies.
QuranMB.v1 is a publicly available benchmark designed for the assessment of mispronunciation detection in Modern Standard Arabic (MSA) with a focus on Qur’anic recitation. It establishes a unified pipeline for data curation, phoneme transcription, error simulation, model training, and hierarchical evaluation. This benchmark enables standardized research and development in Arabic pronunciation assessment technologies, particularly targeting the rigorous requirements of Qur’anic language and recitation accuracy.
1. Curation and Structure of the Benchmark Dataset
QuranMB.v1 is constructed through an integration of real and synthetic Arabic speech corpora. The training data primarily derives from the CMV-Ar dataset, a subset (82.37 hours) of the Common Voice corpus, which is automatically vowelized and enhanced with Qur’anic recitation excerpts to guarantee vowelization accuracy. For augmentation, two controlled speech synthesis datasets are generated—one with canonical (error-free) pronunciations and another with injected mispronunciation patterns—using five male and two female TTS voices. This strategy yields both clean and error-annotated training data.
The test set comprises about 2.2 hours of audio, sampled from 98 selected Qur’anic verses, each recited by 18 native Arabic speakers. Each speaker was guided by a custom tool that prescribes where intentional mispronunciations should be placed; its text interface highlights the target regions for modification, ensuring consistency with the annotated expected errors.
Error simulation is systematically driven by confusion matrices of phoneme similarity, such as substituting /s/ for /S/ or /t/ for /T/. For every transcript, four characters or diacritics are randomly replaced based on these matrices. The procedure guarantees complete and known error annotation, foundational for both model supervision and evaluation.
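The sketch below illustrates this kind of confusion-driven substitution; the table entries, function name, and error-position bookkeeping are illustrative assumptions, not the benchmark's actual code.

```python
import random

# Illustrative confusion table (hypothetical entries): each canonical
# symbol maps to acoustically confusable substitutes.
CONFUSIONS = {
    "s": ["S"],        # /s/ vs. emphatic /S/
    "t": ["T"],        # /t/ vs. emphatic /T/
    "a": ["i", "u"],   # short-vowel (diacritic) confusions
}

def inject_errors(symbols, n_errors=4, rng=random):
    """Replace n_errors randomly chosen confusable symbols and return
    the corrupted sequence plus the ground-truth error positions."""
    candidates = [i for i, s in enumerate(symbols) if s in CONFUSIONS]
    positions = rng.sample(candidates, min(n_errors, len(candidates)))
    corrupted = list(symbols)
    for i in positions:
        corrupted[i] = rng.choice(CONFUSIONS[corrupted[i]])
    return corrupted, sorted(positions)

canonical = ["s", "a", "t", "a", "m", "u", "t", "i"]
corrupted, error_positions = inject_errors(canonical)
```

Because the substitution positions are recorded at injection time, every simulated utterance comes with exact error labels, which is what makes fully supervised training and evaluation possible.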
2. Phoneme Inventory and Transcription Protocol
The phoneme set for QuranMB.v1 is specifically tailored for vowelized MSA, integrating gemination and vowelization distinctions but purposely omitting contextual emphatic distinctions (e.g., ignoring the split between vowels following emphatic and non-emphatic consonants). The final inventory encodes 68 phonemes, with explicit handling of geminated forms (such as /b/ and /bb/).
Phoneme transcription is performed via Nawar Halabi’s phonetizer, which converts fully vowelized Arabic text into phoneme sequences at both training and inference time. This protocol is essential for aligning spoken audio with canonical transcriptions and accurately detecting mispronunciation events.
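As a rough sketch of the gemination handling described above (a toy stand-in, not the phonetizer's actual code), the following grapheme-to-phoneme pass collapses a consonant carrying shadda into its geminated phoneme; the mapping table covers only a few illustrative symbols.

```python
SHADDA = "\u0651"  # Arabic shadda (gemination mark)

# Tiny illustrative grapheme-to-phoneme table; the real phonetizer
# covers the full vowelized-MSA inventory of 68 phonemes.
G2P = {
    "\u0628": "b",  # baa
    "\u062A": "t",  # taa
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
}

def phonetize(text):
    """Map fully vowelized Arabic text to phonemes, emitting geminated
    forms (e.g. 'bb') when a consonant carries shadda."""
    phones = []
    for ch in text:
        if ch == SHADDA and phones:
            phones[-1] *= 2          # 'b' -> 'bb'
        elif ch in G2P:
            phones.append(G2P[ch])
    return phones

# baa + shadda + fatha -> ['bb', 'a']
print(phonetize("\u0628\u0651\u064E"))
```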
3. Model Architecture and Baseline Systems
The benchmark utilizes self-supervised learning (SSL) acoustic models for feature extraction, with four major variants investigated: monolingual English-pretrained wav2vec2, HuBERT, WavLM, and the multilingual mHuBERT pretrained across 147 languages. During training, SSL weights are frozen and features from transformer layers are aggregated (weighted sum) and then fed into a bidirectional LSTM network with two layers and 1024 units per layer.
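A compact PyTorch sketch of this architecture follows; the class name, encoder interface, and dimension arguments are assumptions for illustration, not the paper's released code.

```python
import torch
import torch.nn as nn

class SSLPhonemeRecognizer(nn.Module):
    """Frozen SSL encoder -> learned weighted sum over transformer
    layers -> 2-layer BiLSTM (1024 units) -> phoneme logits for CTC."""
    def __init__(self, ssl_encoder, n_layers, feat_dim, n_phonemes):
        super().__init__()
        self.ssl = ssl_encoder
        for p in self.ssl.parameters():
            p.requires_grad = False              # SSL weights stay frozen
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.lstm = nn.LSTM(feat_dim, 1024, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * 1024, n_phonemes + 1)  # +1 for CTC blank

    def forward(self, wav):
        with torch.no_grad():
            # Assumed interface: the encoder returns a list of per-layer
            # hidden states, each of shape (batch, time, feat_dim).
            hidden = self.ssl(wav)
        w = torch.softmax(self.layer_weights, dim=0)
        feats = (w[:, None, None, None] * torch.stack(hidden)).sum(dim=0)
        out, _ = self.lstm(feats)
        return self.head(out).log_softmax(dim=-1)  # per-frame log-probs
```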
Transcription decoding is performed with the Connectionist Temporal Classification (CTC) loss: the LSTM network predicts per-frame phoneme distributions directly from the SSL features, and a greedy CTC decoder reconstructs the final phoneme sequence for assessment.
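Greedy CTC decoding reduces to taking the per-frame argmax, collapsing repeated labels, and removing blanks, as in this minimal sketch (the blank index of 0 is an assumption):

```python
def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: (time, n_classes) array of per-frame scores.
    Returns the collapsed phoneme label sequence."""
    best_path = log_probs.argmax(-1).tolist()
    decoded, prev = [], blank
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(label)   # keep first of each repeated run
        prev = label
    return decoded
```

During training, the same per-frame log-probabilities would be scored against the canonical phoneme sequence with a CTC criterion such as torch.nn.CTCLoss.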
Model Evaluation Taxonomy
A hierarchical taxonomy is adopted for detection results:
- True Acceptance (TA): Correctly identified canonical pronunciations.
- False Acceptance (FA): Incorrectly accepted mispronounced sounds.
- False Rejection (FR): Incorrectly rejected canonical pronunciations.
- Correct Diagnosis (CD) / Error Diagnosis (ED): subdivisions of the correctly rejected mispronunciations (true rejections, TR), indicating whether the specific error was also identified correctly (CD) or misdiagnosed (ED).
Evaluation metrics include precision, recall, and F1-score. Writing $TR = CD + ED$ for the true rejections, these follow the standard mispronunciation-detection definitions:

$$\text{Precision} = \frac{TR}{TR + FR}, \qquad \text{Recall} = \frac{TR}{TR + FA}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
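As a sketch of how these counts combine (assuming the $TR = CD + ED$ decomposition above; the false-rejection rate is included as a commonly reported companion metric, not necessarily one of the paper's):

```python
def mdd_metrics(ta, fr, fa, cd, ed):
    """Precision, recall, F1, and false-rejection rate from the counts."""
    tr = cd + ed                                   # true rejections
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    frr = fr / (ta + fr) if ta + fr else 0.0       # canonical wrongly rejected
    return {"precision": precision, "recall": recall, "f1": f1, "frr": frr}
```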
4. Baseline Performance and Observed Results
Quantitative evaluation on QuranMB.v1 reveals significant challenges:
- The best performing baseline (mHuBERT, trained on both CMV-Ar and TTS data) achieves an F1-score of 29.88% and a phoneme error rate (PER) of about 16.42% (PER is sketched after this list).
- Multilingual models outperform their monolingual English-pretrained counterparts (wav2vec2, HuBERT, and WavLM), achieving lower PER and higher TA across both canonical and error-augmented data.
- Overall F1-scores remain below 30%, indicating the strictness and difficulty of fine-grained mispronunciation detection for MSA recitation in the context of Qur’anic standards.
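For reference, PER is the phoneme-level Levenshtein distance between the hypothesis and the canonical transcription, normalized by reference length; a minimal sketch:

```python
def phoneme_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / len(ref)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                 # all-deletions baseline
    for j in range(m + 1):
        d[0][j] = j                 # all-insertions baseline
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[n][m] / max(n, 1)

# phoneme_error_rate(["b", "a", "t"], ["b", "i", "t"]) -> 0.333...
```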
This outcome highlights both the inherent complexity of Arabic speech—especially with variable diacritics, dialectal influences, and recitation rules—and the limitations of current SSL-based methods in generalizing over subtle phonetic differences critical for Qur’anic accuracy.
5. Error Simulation and Evaluation Methodologies
A central aspect of QuranMB.v1 is its systematic error simulation based on confusion matrices from prior phoneme-similarity analyses (e.g., SpeechBlender). Errors are injected into the synthetic speech data through controlled substitutions and diacritic manipulations. Evaluation is fully supervised, leveraging both real and synthetic data with complete error ground truth.
The benchmark’s rigorous taxonomy allows for robust assessment beyond simple accuracy. By enumerating TA, FA, FR, CD, and ED, researchers gain fine-grained diagnostic insight into the models’ robustness—crucial for computer-aided pronunciation teaching (CAPT) and automated recitation evaluation.
6. Challenges in Arabic Pronunciation Assessment and Future Outlook
QuranMB.v1 exposes several key challenges:
- Arabic’s complex phonology, especially in the context of Qur’anic recitation, leads to subtle mispronunciations with potentially high impact on meaning.
- Native dialect interference, omission of diacritics, and speaker variability further complicate the task.
- The low baseline performance of state-of-the-art models demonstrates a need for improved architectures, more diverse datasets (including L2 speaker data), and advanced acoustic representations tailored to the unique requirements of the Arabic language and Qur’anic recitation.
By standardizing the benchmark, the paper provides a foundation for future research in Arabic CAPT, including the exploration of novel modeling approaches, richer evaluation methodologies, and integration of both synthetic and authentic recitation data.
7. Impact and Significance for Arabic Language Technology
The introduction of QuranMB.v1 enables reproducible and comparable research in MSA and Qur’anic recitation mispronunciation assessment. Its detailed documentation of data curation, phoneme annotation, error construction, and evaluation protocols sets a high standard for future work—encouraging the development of more sophisticated models and more nuanced diagnostic tools.
A plausible implication is that further refinement of the benchmark and expansion of the dataset to include broader accents, dialects, and L2 speaker varieties will accelerate advances not only in Qur’anic recitation teaching tools but also in general Arabic language technology and speech assessment systems.
QuranMB.v1 thus represents a key milestone in the construction of unified benchmarks for Arabic speech processing, providing the necessary groundwork for future studies in automatic, reliable, and fine-grained pronunciation assessment for high-stakes contexts such as Qur’anic recitation (Kheir et al., 9 Jun 2025).