Qur’anic Mispronunciation Benchmark (QuranMB.v1)

Updated 7 November 2025
  • The paper introduces a comprehensive benchmark that integrates phoneme-level annotation, controlled error simulation, and deep learning evaluation pipelines.
  • It combines real and synthetic data using rigorous linguistic protocols to deliver objective, repeatable performance metrics in Qur’anic recitation.
  • The benchmark strengthens diagnostic feedback in Tajweed pedagogy, with reported baselines reaching a 29.88% F1 score on phoneme-level mispronunciation detection and over 95% accuracy on rule-specific Tajweed classification.

The Qur’anic Mispronunciation Benchmark (QuranMB.v1) defines a new standard for automated assessment and research in Arabic pronunciation, placing particular emphasis on Qur’anic recitation errors and Tajweed rule violations. It operationalizes rigorous linguistic annotation protocols, phoneme-level test data, error-specific evaluation, and model benchmarking pipelines rooted in recent advances in speech recognition, deep learning, and linguistic knowledge engineering. This resource enables objective, repeatable, and fine-grained assessment of mispronunciation detection systems, addressing enduring challenges in Arabic CAPT (Computer-Aided Pronunciation Training) and Tajweed pedagogy.

1. Historical Motivation and Benchmark Origins

Traditional Qur’anic recitation pedagogy relies on expert supervision to guide learners in proper Tajweed application. Automated systems have long suffered from limited rule coverage, unbalanced and insufficiently labeled data, and an inability to produce actionable, diagnostic feedback (Al-Kharusi et al., 14 Oct 2025). The absence of standardized benchmarks has hampered the development and comparative evaluation of mispronunciation detection models. QuranMB.v1 directly responds to these needs, synthesizing methodological advances in dataset curation, error simulation, phoneme inventory definition, and evaluation metrics into a unified framework (Kheir et al., 9 Jun 2025).

2. Dataset Construction and Annotation Protocols

QuranMB.v1’s primary release consists of a fully documented, large-scale test set for mispronunciation detection in Modern Standard Arabic (MSA), using Qur’anic recitation as its domain. Test data comprises 98 verses read by 18 native speakers (14 female, 4 male; ~2.2 hours total) with controlled error induction guided by a custom interface. Each utterance is fully vowelized and annotated at the phoneme level, reflecting both canonical and errorful pronunciations. Error patterns are constructed by mapping letters and diacritics to common confusion pairs or deletion patterns, informed by contemporary phonetic studies and confusion matrices (Kheir et al., 9 Jun 2025). Annotation employs the Nawar Halabi phonetizer for consistent transcript-phoneme alignment. The benchmark pipeline integrates real and synthetic data sources, including TTS-generated errorful speech and filtered segments from CMV-Ar (82.37 hours: Mozilla Common Voice v12.0, plus Qur’anic passages).
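
To make the confusion-pair error induction concrete, below is a minimal Python sketch of substituting or deleting phonemes according to a confusion mapping; the `CONFUSION_PAIRS` table, phoneme symbols, and `error_rate` value are illustrative placeholders, not the benchmark's actual mapping or recording protocol.

```python
import random

# Illustrative confusion pairs (placeholders, not the benchmark's actual mapping):
# each canonical phoneme maps to plausible substitutes, or None for deletion.
CONFUSION_PAIRS = {
    "s":  ["S"],        # siin -> saad (emphatic confusion)
    "d":  ["D"],        # daal -> daad
    "a":  ["a:"],       # short vowel lengthened
    "t":  ["T", None],  # taa -> emphatic Taa, or deleted
}

def induce_errors(phonemes, error_rate=0.15, seed=0):
    """Return (errorful_sequence, labels); labels pair each canonical phoneme
    with its realized form ('<del>' marks a deletion)."""
    rng = random.Random(seed)
    errorful, labels = [], []
    for ph in phonemes:
        candidates = CONFUSION_PAIRS.get(ph)
        if candidates and rng.random() < error_rate:
            sub = rng.choice(candidates)
            if sub is None:            # deletion
                labels.append((ph, "<del>"))
                continue
            errorful.append(sub)       # substitution
            labels.append((ph, sub))
        else:
            errorful.append(ph)        # kept canonical
            labels.append((ph, ph))
    return errorful, labels

canonical = ["b", "i", "s", "m", "i", "l", "a:", "h"]
print(induce_errors(canonical))
```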

A parallel resource (“QuranMB.v1” in (Abdelfattah et al., 27 Aug 2025)) expands the scope via a 98% automated pipeline that scales to ~890 hours and 300,000 utterances, sourced from expert reciters, with high-precision segmentation at waqf points performed by a fine-tuned wav2vec2-BERT model. Each utterance is annotated in the Quran Phonetic Script (QPS), a domain-specific encoding for Tajweed-aware phonemes and 10 articulatory “sifat” attributes per segment, capturing comprehensive recitation features. Annotation accuracy is ensured by the “Tasmeea” algorithm, with manual review of only ~2% of the data.

3. Phoneme Inventory, Tajweed Encoding, and Error Simulation

The phoneme inventory underlying QuranMB.v1 employs 68 units, encoding all consonants, short/long vowels, and geminated forms. The QPS script (in (Abdelfattah et al., 27 Aug 2025)) adds a second level of granularity by recording Tajweed-relevant articulatory attributes (“sifat”) for each phoneme, covering hams/jahr, qalqala, tafkheem, ghunna, and other dimensions crucial to faithful Qur’anic recitation.
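
To illustrate the two-level encoding, the following is a hedged sketch of how a QPS-style segment with per-phoneme sifat attributes might be represented in code; the field names and attribute values are assumptions for illustration, as the official QPS schema is defined in (Abdelfattah et al., 27 Aug 2025).

```python
from dataclasses import dataclass, field

@dataclass
class QPSSegment:
    """One phoneme-level segment with Tajweed-aware articulatory attributes (sifat).
    Field names and values are illustrative, not the official QPS schema."""
    phoneme: str                 # e.g. "q" for qaaf
    start: float                 # segment start time (seconds)
    end: float                   # segment end time (seconds)
    sifat: dict = field(default_factory=dict)

segment = QPSSegment(
    phoneme="q",
    start=1.42,
    end=1.58,
    sifat={
        "hams_jahr": "jahr",     # voiced
        "qalqala": True,         # echoing release
        "tafkheem": True,        # emphatic/velarized
        "ghunna": False,         # no nasalization
    },
)
print(segment)
```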

To address data scarcity, synthetic mispronunciation samples are generated using controlled TTS pipelines and advanced data augmentation frameworks such as SpeechBlender (Kheir et al., 2022). SpeechBlender leverages region-level waveform masks and blending coefficients to interpolate between correctly and incorrectly pronounced phoneme segments, reconstructing a spectrum of plausible L2 error types. The generated augmented samples are labeled and integrated into both training and test suites, enhancing diversity and coverage of mispronunciation categories. The framework produces both strong errors (low candidate mixing) and subtle accenting (partial blends), outperforming naïve Cut/Paste methods (F1 +4.63% on AraVoiceL2).
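
As a rough illustration of the blending idea, the sketch below mixes a canonical waveform region with a mispronounced candidate segment using a single blending coefficient; the region indices, `alpha` value, and function names are simplifying assumptions, not the published SpeechBlender implementation.

```python
import numpy as np

def blend_segments(good_seg, bad_seg, alpha=0.5):
    """Linearly mix two waveform regions of (possibly) different lengths.
    alpha=1.0 keeps the canonical segment, alpha=0.0 keeps the errorful one."""
    n = min(len(good_seg), len(bad_seg))
    blended = alpha * good_seg[:n] + (1.0 - alpha) * bad_seg[:n]
    return blended.astype(np.float32)

def blend_into_utterance(utterance, region, bad_seg, alpha=0.3):
    """Replace the samples in `region` (start, end indices) of `utterance`
    with a blend of the original region and a mispronounced candidate segment."""
    start, end = region
    mixed = blend_segments(utterance[start:end], bad_seg, alpha)
    out = utterance.copy()
    out[start:start + len(mixed)] = mixed
    return out

# Toy usage with random audio standing in for real recordings.
sr = 16000
utt = np.random.randn(sr).astype(np.float32)              # 1 s "utterance"
bad = np.random.randn(int(0.1 * sr)).astype(np.float32)   # 0.1 s errorful phoneme
augmented = blend_into_utterance(utt, region=(8000, 9600), bad_seg=bad, alpha=0.3)
```

A low `alpha` approximates a strong substitution error, while values closer to 1.0 yield the subtle accenting effects described above.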

4. Evaluation Protocols and Metrics

QuranMB.v1 standardizes evaluation via task definitions and hierarchical metrics. Test samples are annotated for both canonical and errorful pronunciations; prediction outcomes are categorized as True Accept (TA), True Reject (TR), False Reject (FR), and False Accept (FA), with detected errors further scored as Correct Diagnosis (CD) or Error Diagnosis (ED). From these counts, precision, recall, F1-score, and Phoneme Error Rate (PER) are calculated:

$$\text{Precision} = \frac{TR}{TR + FR} \qquad \text{Recall} = \frac{TR}{TR + FA}$$

$$F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
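
The sketch below computes these detection metrics from raw TR/FR/FA counts and PER via phoneme-level edit distance; the counts and sequences in the example are illustrative only.

```python
def detection_metrics(tr, fr, fa):
    """Precision/recall/F1 for mispronunciation detection, where
    TR = true rejects (errors correctly flagged), FR = false rejects,
    FA = false accepts (missed errors)."""
    precision = tr / (tr + fr) if (tr + fr) else 0.0
    recall = tr / (tr + fa) if (tr + fa) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def phoneme_error_rate(ref, hyp):
    """PER = Levenshtein distance between phoneme sequences / reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(detection_metrics(tr=120, fr=60, fa=340))                       # illustrative counts
print(phoneme_error_rate(["b", "i", "s", "m"], ["b", "i", "z", "m"]))  # 0.25
```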

Models are benchmarked on both detection (correct/wrong) and classification (error type and location) tasks. Baseline systems use SSL encoder + BiLSTM + CTC architectures, compared across Wav2vec2, HuBERT, WavLM, and mHuBERT. Adding TTS-augmented data yields the best results with mHuBERT: an F1 of 29.88% and a PER of 16.42% (Kheir et al., 9 Jun 2025). In rule-specific DNN systems (EfficientNet-B0 + Squeeze-and-Excitation on mel-spectrograms), per-rule Tajweed accuracy reaches 95.35% (Al Mad), 99.34% (Ghunnah), and 97.01% (Ikhfaa) for multi-label binary classification, with weighted losses addressing class imbalance (Shaiakhmetov et al., 30 Mar 2025).
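
For orientation, here is a hedged PyTorch sketch of an SSL encoder + BiLSTM + CTC baseline of the kind described above; the checkpoint name, hidden sizes, and layer counts are illustrative assumptions rather than the benchmark's exact configuration.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

class MDDBaseline(nn.Module):
    """Sketch of an SSL encoder + BiLSTM + CTC phoneme recognizer;
    hyperparameters are illustrative, not the paper's settings."""
    def __init__(self, num_phonemes=68, hidden=256):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.bilstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden, num_layers=2,
            batch_first=True, bidirectional=True)
        # +1 output unit for the CTC blank symbol
        self.head = nn.Linear(2 * hidden, num_phonemes + 1)

    def forward(self, waveforms):
        feats = self.encoder(waveforms).last_hidden_state   # (B, T, H)
        seq, _ = self.bilstm(feats)                          # (B, T, 2*hidden)
        return self.head(seq).log_softmax(-1)                # CTC log-probs

model = MDDBaseline()
ctc_loss = nn.CTCLoss(blank=68)  # blank index matches the extra output unit
```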

5. Comparison to Prior Systems and Benchmarking Paradigms

Prior systems built on repurposed ASR architectures are critiqued for optimizing lexical recognition at the expense of phonetic sensitivity, and for shortcomings in diagnostic feedback, scalability, and demographic fairness (Al-Kharusi et al., 14 Oct 2025). Deep learning pipelines (e.g., EfficientNet, SSL encoders) exhibit high capacity for Tajweed error detection only when supported by balanced, expert-annotated data and task-specific optimization, as exemplified by the QDAT workflow (Shaiakhmetov et al., 30 Mar 2025). SVM-based approaches such as Smartajweed (Alagrami et al., 2020) employ filter-bank spectral features with frame-wise scoring and rule-specific binary classifiers, reaching 99% validation accuracy but requiring continual expert review and threshold adjustment.
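
As a rough sketch of the Smartajweed-style approach, the example below pools log-mel filter-bank features and trains a per-rule binary SVM; the feature settings, pooling strategy, and synthetic labels are assumptions for illustration, not the published system.

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def fbank_features(wav, sr=16000, n_mels=40):
    """Frame-wise log-mel filter-bank features, pooled into one utterance vector."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)            # (n_mels, frames)
    return logmel.mean(axis=1)                   # simple utterance-level pooling

# Toy data standing in for labeled recitation segments of one Tajweed rule
# (e.g. Ghunnah applied vs. not applied); audio and labels are synthetic placeholders.
rng = np.random.default_rng(0)
X = np.stack([fbank_features(rng.standard_normal(16000).astype(np.float32))
              for _ in range(20)])
y = np.array([0, 1] * 10)                        # 1 = rule applied correctly

clf = SVC(kernel="rbf", C=1.0).fit(X, y)         # one binary classifier per rule
print(clf.predict(X[:3]))
```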

The transition toward knowledge-centric, rule-driven architectures is outlined as both a technical and theological imperative (Al-Kharusi et al., 14 Oct 2025). Such systems encode Tajweed rules and articulation points (Makhraj) as computational objects, enabling granular localization of errors, anticipatory acoustic modeling, and interpretability. This approach is recommended for future iterations of QuranMB, with full-scale benchmarks to be updated with rule-engine integration, diverse demographic representation, and systematic expansion to L2 error phenomena.

6. Applications, Limitations, and Future Directions

QuranMB.v1 provides a reproducible reference for research and deployment of mispronunciation detection systems in both academic and educational settings. It enables rigorous model comparison, objective error analysis, and supports pedagogically actionable feedback. Interactive systems can build upon the robust per-class feedback mechanisms, multi-label annotation schemes, and Tajweed-aware attributes to deliver real-time diagnostics and adaptive learning modules.

Remaining challenges include the collection and annotation of authentic errorful recitations for Tajweed benchmarking, as current large-scale resources (e.g., (Abdelfattah et al., 27 Aug 2025)) focus primarily on expert “golden” reciters. Synthetic augmentation methods like SpeechBlender can mitigate some data scarcity, though further progress depends on longitudinal acquisition of L2 error data, robust phoneme inventory extension, and continued expert validation cycles. The critical review literature (Al-Kharusi et al., 14 Oct 2025) emphasizes the need to reconcile data-driven methods with explicit linguistic and theological rule bases for robust, equitable, and scalable evaluation tools.

7. Summary Table: Core Attributes of QuranMB.v1 (abridged)

| Aspect | Value / Description | Comments |
|---|---|---|
| Test set | 98 verses; 18 speakers; ~2.2 h; phoneme-annotated errors | Controlled error induction; interface-guided recording |
| Train data | 82.37 h CMV-Ar + 52 h TTS (canonical/errorful) | Combines real and synthetic sources |
| Phoneme inventory | 68 units (MSA); QPS: 43 Tajweed-aware units + 10 sifat attributes | Enables fine-grained Tajweed error detection |
| Model baselines | SSL encoder + BiLSTM + CTC; EfficientNet-B0 + SE for Tajweed rules | mHuBERT achieves best F1/PER with added TTS data |
| Key metrics | F1 = 29.88%; PER = 16.42%; Tajweed rule accuracy > 95% (DNN) | Indicates headroom for future improvement |
| Resource links | https://huggingface.co/IqraEval; https://obadx.github.io/prepare-quran-dataset/ | Public access; reproducible recipes |

QuranMB.v1 and its associated pipelines form the foundation for reproducible, diagnostically informative, and linguistically rich benchmarking of Qur’anic mispronunciation detection technologies, directly addressing longstanding deficits in the field and catalyzing further research, refinement, and application.
