Qur'anic Mispronunciation Benchmark (QuranMB.v1)
- QuranMB.v1 is a comprehensive benchmark that standardizes data curation, annotation, and evaluation for identifying segmental and suprasegmental mispronunciations in Quranic recitation.
- Baseline systems evaluated on the benchmark use self-supervised learning, multi-level CTC modeling, and synthetic mispronunciation generation to assess adherence to Tajweed rules.
- The framework enables reproducible system comparisons, supports open-source research, and provides a common basis for work on automated pronunciation assessment and Quranic recitation training.
The Qur’anic Mispronunciation Benchmark (QuranMB.v1) is a standardized, publicly available test set and evaluation framework designed to rigorously assess automatic mispronunciation detection in Quranic recitation. QuranMB.v1 provides a unified protocol for data, annotation, phonetic representation, and evaluation, grounding the measurement of automated pronunciation assessment systems in the highly constrained and phonetically detailed context of Quranic Arabic, including its rules of Tajweed. The benchmark integrates recent advances in data curation, machine learning, and computational orthography, and focuses on the precise identification of segmental and suprasegmental errors relevant to both native and non-native reciters.
1. Data Collection, Annotation, and Processing Pipeline
QuranMB.v1 construction follows a comprehensive multi-stage pipeline that addresses intrinsic linguistic and data scarcity challenges in Arabic mispronunciation detection (Kheir et al., 9 Jun 2025, Abdelfattah et al., 27 Aug 2025). The process involves:
- Dataset Curation: Primary data are drawn from controlled recordings where native speakers recite 98 selected Quranic verses in Modern Standard Arabic (MSA) style, deliberately injecting specified pronunciation errors. This ensures coverage of a wide range of authentic mispronunciation phenomena. In the large-scale variant, over 850 hours of expert reciter audio with automatic and verified segmentation (at pause points or waqf) is included, with each recording annotated for Tajweed correctness and segmented at linguistically meaningful boundaries (Abdelfattah et al., 27 Aug 2025).
- Synthetic Mispronunciation Generation: Where naturally occurring error data are sparse, synthetic corpora are generated by Text-to-Speech (TTS) systems that produce both canonical and mispronounced renderings. Mispronunciation is controlled either by automated modification of the transcription (substituting or deleting specific phonemes/diacritics according to a confusion matrix) or via TTS systems fitted with custom pronunciation lexicons (Kheir et al., 9 Jun 2025); a simplified sketch of the confusion-matrix approach appears after this list.
- Segmentation and Transcription: Segmenters based on fine-tuned Wav2Vec2-BERT models classify and segment the audio at precise waqf points, followed by automatic transcription (using Whisper or Tarteel ASR adaptations for Quranic speech). Automated verification (the Tasmeea algorithm) achieves a 98% automation rate in aligning text and audio, relying on sliding-window matching and edit-distance scoring to minimize manual correction (Abdelfattah et al., 27 Aug 2025); a simplified alignment sketch follows the pipeline summary below.
- Phonetic Conversion: Phoneme sequences are derived by applying advanced phonetizers, which map fully vowelized text to a standardized inventory, with coverage for diphones, gemination, and MSA prosody (Kheir et al., 9 Jun 2025).
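As an illustration of the confusion-matrix-driven error injection described above, the following Python sketch perturbs a canonical phoneme sequence by substituting or deleting phonemes. The symbols, confusion pairs, and rates are illustrative placeholders, not the benchmark's actual inventory or lexicon.

```python
import random

# Illustrative confusion table: canonical phoneme -> plausible confusions.
# These pairs are placeholders; the benchmark derives its table from
# attested error patterns (emphatic/plain, long/short contrasts, etc.).
CONFUSIONS = {
    "S":  ["s"],    # emphatic /sˤ/ realized as plain /s/
    "D":  ["d"],    # emphatic /dˤ/ realized as plain /d/
    "H":  ["h"],    # /ħ/ realized as /h/
    "aa": ["a"],    # long vowel shortened
}

def inject_errors(phonemes, sub_rate=0.15, delete_rate=0.05, seed=0):
    """Return (perturbed_sequence, per-position labels).

    Labels mark each canonical position as 'correct', 'sub', or 'del',
    so the perturbed transcript can be rendered by a TTS system while the
    labels serve as ground truth for mispronunciation detection.
    """
    rng = random.Random(seed)
    out, labels = [], []
    for p in phonemes:
        r = rng.random()
        if p in CONFUSIONS and r < sub_rate:
            out.append(rng.choice(CONFUSIONS[p]))   # substitution
            labels.append("sub")
        elif r > 1.0 - delete_rate:
            labels.append("del")                    # deletion: emit nothing
        else:
            out.append(p)
            labels.append("correct")
    return out, labels

canonical = ["b", "i", "s", "m", "i", "l", "l", "aa", "h"]
perturbed, labels = inject_errors(canonical)
```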
This pipeline ensures high-fidelity alignment between recited audio, canonical text, and ground-truth phonetic targets, supporting fine-grained error localization and assessment.
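The verification idea behind Tasmeea, sliding a hypothesis window along the canonical text and scoring it with edit distance, can be pictured with the following simplified sketch. It is an assumed, minimal rendering of the technique, not the released implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def best_window(asr_tokens, canonical_tokens, slack=3):
    """Find the canonical span that best matches an ASR hypothesis.

    Scans windows of roughly the hypothesis length (+/- slack tokens) over
    the canonical token stream and returns (start, end, normalized_distance).
    A low normalized distance lets the segment be auto-verified; a high one
    flags it for manual review.
    """
    best = None
    n = len(asr_tokens)
    for start in range(len(canonical_tokens)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(canonical_tokens):
                break
            d = edit_distance(asr_tokens, canonical_tokens[start:end]) / max(n, 1)
            if best is None or d < best[2]:
                best = (start, end, d)
    return best
```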
2. Phonetic Representation and Tajweed Encoding
Central to the benchmark is the adoption of a computational phonetic script specifically tailored for Quranic and MSA Arabic recitation (Abdelfattah et al., 27 Aug 2025, Martínez, 16 May 2025). Notable characteristics include:
- Quran Phonetic Script (QPS): A multi-level system, comprising:
- Phoneme Level: Encodes all Arabic consonants, short and long vowels, as well as gemination and diacritic features specific to Quranic Arabic.
- Sifat Level: Ten distinct parallel tracks, each encoding one articulation characteristic (e.g., hams/jahr, shidda/rakhawa, tafkheem/tarqeeq), crucial for evaluating Tajweed application; a data-structure sketch follows this list.
- Contemporary Quranic Orthography (CQO) and Tajweed Layer: The benchmark interfaces directly with digital CQO, which combines the consonantal skeleton (QCT) with an overlay of diacritic and Tajweed marks. Computational modules strip, reapply, or transform this layer via sets of regular expressions and linguistic rules, facilitating dynamic generation or transformation of recitation targets for alignment and comparison (Martínez, 16 May 2025); an illustrative stripping sketch appears after this section's summary.
- Phonetic Set Size and Specifics: The canonical MSA benchmark employs a 68-phoneme inventory, which includes explicit geminate representation and, where appropriate, phonetic distinctions such as plain versus emphatic vowels (though these may be collapsed for certain tasks) (Kheir et al., 9 Jun 2025).
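To make the two-level QPS encoding concrete, the sketch below pairs a phoneme symbol with parallel sifat attributes. The attribute vocabularies and the example entry are illustrative assumptions, not the script's actual symbol set.

```python
from dataclasses import dataclass
from typing import Literal

# Three of the ten parallel sifat tracks; each track draws on a small closed
# vocabulary. Names follow traditional Tajweed terminology; the value sets
# are simplified placeholders.
Hams     = Literal["hams", "jahr"]            # voicelessness vs. voicing
Shidda   = Literal["shadeed", "between", "rikhw"]
Tafkheem = Literal["mufakham", "muraqqaq"]    # emphatic vs. plain
# ... the remaining seven tracks would be declared the same way.

@dataclass
class QPSPhoneme:
    """One step of the Quran Phonetic Script: a phoneme-level symbol
    plus its parallel articulation attributes (sifat)."""
    symbol: str          # phoneme-level token, e.g. a geminated consonant
    hams: Hams
    shidda: Shidda
    tafkheem: Tafkheem
    # ... seven further sifat fields in the full script

# Illustrative entry for an emphatic, voiceless fricative (symbol is a placeholder).
example = QPSPhoneme(symbol="S", hams="hams", shidda="rikhw", tafkheem="mufakham")
```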
This approach enables rigorous mapping between the orthographic, phonetic, and auditory realities of recitation, essential for detecting both surface and latent pronunciation errors.
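The stripping of the diacritic/Tajweed layer can be illustrated with a regular expression over the Arabic combining-mark ranges of Unicode. This is a minimal sketch of the general technique, not the rule set shipped with the CQO tooling.

```python
import re

# Arabic tashkeel (U+064B-U+065F), superscript alef (U+0670), and the
# Quranic annotation signs (U+06D6-U+06ED) that carry Tajweed marks.
TAJWEED_LAYER = re.compile(r"[\u064B-\u065F\u0670\u06D6-\u06ED]")

def strip_tajweed_layer(text: str) -> str:
    """Reduce fully annotated CQO text to a bare consonantal (QCT-like) form."""
    return TAJWEED_LAYER.sub("", text)

annotated = "بِسْمِ اللَّهِ"                   # CQO text with diacritics
skeleton = strip_tajweed_layer(annotated)   # -> "بسم الله"
```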
3. Baseline Architectures and Modeling Approaches
The evaluation of QuranMB.v1 relies on state-of-the-art modeling paradigms for sequence-to-sequence mispronunciation detection. Leading approaches include:
- Self-Supervised Learning (SSL) Frontends: Models such as Wav2Vec2, HuBERT, WavLM, and the multilingual mHuBERT serve as acoustic feature extractors, providing robust representations learned with little labeled data (Kheir et al., 9 Jun 2025). The features from these encoders feed into bidirectional LSTM heads for sequence modeling.
- Multi-Level CTC Modeling: The QPS-based systems employ multi-task CTC (Connectionist Temporal Classification) heads, one for the phoneme track and one for each sifat track. The total loss is computed as the weighted sum

  $$\mathcal{L}_{\text{total}} = w_0\,\mathcal{L}_{\text{phon}} + \sum_{i=1}^{10} w_i\,\mathcal{L}_{\text{sifa}_i},$$

  where $\mathcal{L}_{\text{phon}}$ is the phoneme-level CTC loss and $\mathcal{L}_{\text{sifa}_1}$ through $\mathcal{L}_{\text{sifa}_{10}}$ are the CTC losses for the ten sifat tracks; the weights $w_i$ compensate for differences in vocabulary size and learning difficulty across tracks (Abdelfattah et al., 27 Aug 2025). A minimal sketch of this multi-head objective follows this list.
- Monolingual vs. Multilingual Models: mHuBERT, trained on 147 languages, consistently outperforms monolingual SSL architectures, delivering true acceptance rates up to 87% and F1-scores approaching 30% when trained on combined real and synthetic data (Kheir et al., 9 Jun 2025).
- Auxiliary Systems: The pipeline includes specialized ASR models for segmentation and transcription, as well as custom alignment (Tasmeea) and verification algorithms (Abdelfattah et al., 27 Aug 2025).
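A minimal PyTorch sketch of the weighted multi-head CTC objective is given below. The encoder stands in for the SSL frontend, and the head dimensions, weights, and argument layout are illustrative assumptions rather than the published architecture.

```python
import torch.nn as nn

class MultiLevelCTC(nn.Module):
    """One CTC head for the phoneme track plus one per sifat track.

    The SSL encoder (e.g. a wav2vec 2.0 / HuBERT variant) is abstracted
    as any module mapping waveforms to frame-level features.
    """
    def __init__(self, encoder, feat_dim, phoneme_vocab, sifat_vocabs, weights):
        super().__init__()
        self.encoder = encoder
        # Index 0 is the phoneme head; indices 1..10 are the sifat heads.
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, phoneme_vocab)]
            + [nn.Linear(feat_dim, v) for v in sifat_vocabs]
        )
        self.weights = weights                          # e.g. [1.0] + [0.1] * 10
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, wave, targets, target_lens, feat_lens):
        """targets and target_lens hold one entry per track."""
        feats = self.encoder(wave)                      # (batch, time, feat_dim)
        total = 0.0
        for k, head in enumerate(self.heads):
            log_probs = head(feats).log_softmax(-1).transpose(0, 1)  # (T, B, V)
            total = total + self.weights[k] * self.ctc(
                log_probs, targets[k], feat_lens, target_lens[k]
            )
        return total
```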
This modeling ensemble supports simultaneous detection and localization of mispronunciation across segmental and suprasegmental layers.
4. Performance, Metrics, and Baseline Results
QuranMB.v1 employs the following metrics for system evaluation:
| Metric | Use | Benchmark Value (Best Reported) |
|---|---|---|
| Phoneme Error Rate (PER) | Segmental mispronunciation assessment | 0.16% (multi-level CTC on QPS, (Abdelfattah et al., 27 Aug 2025)) |
| F1-Score | Precision/recall tradeoff for error detection | ~29–30% (mHuBERT, (Kheir et al., 9 Jun 2025)) |
| True Acceptance Rate | Proportion of correctly pronounced phones accepted as correct | Up to ~87% (Kheir et al., 9 Jun 2025) |
Phoneme Error Rate is computed as the proportion of substitutions, deletions, and insertions necessary to align predicted and reference phoneme sequences. Additional layers of evaluation consider per-attribute error rates (for each sifat), and the system is capable of diagnosing error types such as omitted diacritics, incorrect gemination, or Tajweed misapplication.
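Concretely, this PER computation reduces to a Levenshtein alignment between reference and predicted phoneme sequences, as in the following sketch (the example sequences are hypothetical).

```python
def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = reference[i - 1] != hypothesis[j - 1]
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[n][m] / max(n, 1)

# Hypothetical example: one substitution over nine reference phonemes.
ref = ["b", "i", "s", "m", "i", "l", "l", "aa", "h"]
hyp = ["b", "i", "s", "m", "i", "l", "l", "a", "h"]
assert abs(phoneme_error_rate(ref, hyp) - 1 / 9) < 1e-9
```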
Reported results show that phoneme-level performance can reach PERs as low as 0.16% in expert reciter test sets when leveraging the multi-level CTC architecture with extensive ground-truth annotation (Abdelfattah et al., 27 Aug 2025). However, for the more challenging case of controlled mispronunciation detection in diverse speakers, F1-scores remain below 30%, pointing to the intrinsic complexity of the task and motivating further research (Kheir et al., 9 Jun 2025).
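The F1 and true-acceptance figures follow the hierarchical bookkeeping that is standard in mispronunciation-detection evaluation; the sketch below illustrates that tallying under the usual definitions, with hypothetical per-phone labels, and is assumed rather than confirmed to match the benchmark's exact scoring script.

```python
def mdd_scores(annotations, decisions):
    """Standard MDD tallies over aligned phones.

    annotations: True where the phone was actually mispronounced.
    decisions:   True where the system flagged the phone as mispronounced.
    """
    ta = sum(not a and not d for a, d in zip(annotations, decisions))  # true acceptance
    fr = sum(not a and d for a, d in zip(annotations, decisions))      # false rejection
    fa = sum(a and not d for a, d in zip(annotations, decisions))      # false acceptance
    tr = sum(a and d for a, d in zip(annotations, decisions))          # true rejection
    precision = tr / (tr + fr) if tr + fr else 0.0
    recall = tr / (tr + fa) if tr + fa else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    tar = ta / (ta + fr) if ta + fr else 0.0                           # true acceptance rate
    return {"TAR": tar, "precision": precision, "recall": recall, "F1": f1}

# Hypothetical decisions over ten phones, two of which were mispronounced.
annotated = [False, False, True, False, False, True, False, False, False, False]
predicted = [False, True,  True, False, False, False, False, False, False, False]
print(mdd_scores(annotated, predicted))
```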
5. Challenges, Limitations, and Future Directions
Several technical and scientific challenges are highlighted by the QuranMB.v1 literature:
- Data Scarcity and Diversity: There remains a lack of large-scale, naturally occurring error datasets, especially from non-native or L2 reciters. Synthetic augmentation mitigates this but may under-represent subtle, real-world mispronunciations (Kheir et al., 9 Jun 2025).
- Annotation Complexity: Fine-grained, multi-level (phoneme, sifat, suprasegmental) annotation requires robust, partially automated pipelines and significant verification effort, as in the Tasmeea post-processing algorithm (Abdelfattah et al., 27 Aug 2025).
- Modeling Limitations: Despite high accuracy on expert reciter benchmarks, error detection for "in-the-wild" recitation with variable mispronunciation presents challenges, including low precision in F1 metrics and the difficulty of capturing prosodic and context-sensitive errors.
- Script and Orthography Variations: Applying the benchmark to diverse recitation traditions or new manuscript encodings requires rigorous alignment strategies, given the highly codified CQO/Tajweed overlay (Martínez, 16 May 2025).
Planned improvements include extending authentic error collection (particularly from L2 learners), refining multi-tier annotation (possibly incorporating dynamic context-aware phoneme sets), and integrating advanced augmentation or transfer learning techniques to boost generalizability (Kheir et al., 9 Jun 2025).
6. Applications and Impact
The QuranMB.v1 benchmark is positioned as a foundational resource for Arabic speech assessment research and practical applications:
- Automated Pronunciation Training: It supports the development of intelligent tutoring systems and mobile applications that deliver precise, Tajweed-aware feedback for learners lacking access to expert instructors (Abdelfattah et al., 27 Aug 2025, Shaiakhmetov et al., 30 Mar 2025).
- Linguistic and Manuscript Studies: The computational methods for Tajweed layer manipulation and CQO alignment enable detailed scholarship into prosodic and phonetic variation across Qur’anic manuscripts and recitation traditions (Martínez, 16 May 2025).
- Model Evaluation and Comparison: QuranMB.v1 establishes public baselines and protocols that allow direct, reproducible comparison across future mispronunciation detection systems, promoting progress and standardization in the field (Kheir et al., 9 Jun 2025).
- Open-Source Ecosystem: All data, code, and models are released under open licenses, stimulating community-driven improvement, benchmarking, and technology transfer for both academic and educational sectors (Abdelfattah et al., 27 Aug 2025).
7. Historical and Research Context
QuranMB.v1 synthesizes decades of progress in Arabic ASR, pronunciation modeling, and digital orthography research:
- Early Methods: Previous systems relied on HMMs, explicit multi-variant phonetic dictionaries, and Bayesian decoding to accommodate pronunciation variability in Arabic speech recognition (Yekache et al., 2012).
- Machine Learning Advancements: The evolution to SVMs, ensemble methods, and tailored deep learning models (LSTM, CNN-GRU, EfficientNet-B0) provided incremental improvements in Tajweed rule recognition and segmental error detection (Alagrami et al., 2020, Harere et al., 2023, Shaiakhmetov et al., 30 Mar 2025).
- Scriptural Encoding and Digital Philology: CQO and the associated computational tools for Tajweed layer manipulation represent a bridge between historical orthographic practices and high-precision computational analysis (Martínez, 16 May 2025).
- Benchmark Standardization: QuranMB.v1 is the first large-scale, rigorously annotated, and publicly available framework supporting phoneme-level and attribute-level mispronunciation evaluation in Quranic Arabic, setting new standards for research in computational pronunciation assessment (Kheir et al., 9 Jun 2025, Abdelfattah et al., 27 Aug 2025).
This benchmark represents a pivotal advance, operationalizing both the academic rigor and practical frameworks required for future research and technology in Quranic speech processing.