Tasmeea Algorithm for Quranic Transcript Verification
- The Tasmeea algorithm is a transcript verification system that validates ASR-generated transcripts of Quranic recitations by normalizing the text and applying edit-distance matching against the canonical Quran.
- It integrates into an automated pipeline processing over 850 hours of audio, ensuring reliable transcript alignment and minimal errors for downstream analyses.
- Its sliding-window method with configurable acceptance thresholds contributes directly to the low phoneme error rates of downstream models, strengthening pronunciation error detection systems.
The Tasmeea algorithm is a transcript verification system developed to ensure the integrity and alignment of speech transcriptions with canonical reference texts in large-scale, automated data pipelines for Quranic recitation. It is integral to the construction of high-quality datasets used for automatic speech recognition (ASR) and pronunciation error detection, especially where the fidelity of text-audio alignment directly impacts supervised learning efficacy and the reliability of downstream linguistic analyses (Abdelfattah et al., 27 Aug 2025).
1. Design and Purpose of the Tasmeea Algorithm
The Tasmeea algorithm was explicitly developed for transcript verification within a Quranic recitation data processing framework. Its primary objective is to confirm that ASR-generated transcriptions for segmented recitation audio precisely correspond to the canonical content of the Quran. The algorithm normalizes each segment's text—removing diacritics and extraneous spaces—then performs a sliding window alignment between the normalized candidate output and the digitized Quran text (such as the Tanzil digital edition).
Matching is quantified with an edit-distance metric. For each windowed comparison, the edit distance is computed and converted into a matching ratio. The best-scoring window determines both the localized transcript match and the acceptance decision: a segment is accepted only if its ratio meets a configurable threshold (e.g., acceptance_ratio ≥ 0.5), which filters out misalignments and transcription errors. This enforces strict quality control before a segment is included in the final dataset or used as training supervision.
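To illustrate with hypothetical numbers (not figures reported in the source): a normalized segment of 40 characters whose best-matching canonical window requires 4 edits yields a matching ratio of 1 − 4/40 = 0.9 and is accepted, whereas more than 20 edits over the same span would push the ratio below 0.5 and the segment would be rejected or flagged.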
2. Integration in Automated Data Collection Pipelines
The algorithm operates within a 98% automated pipeline designed for assembling Quranic recitation datasets at scale. The pipeline includes:
- Recording expert recitations (yielding 850+ hours and ~300,000 annotated utterances)
- Segmenting the audio by waqf (pause markers), using a fine-tuned wav2vec2-BERT model
- Transcribing segments with the Tarteel ASR system
- Employing the Tasmeea algorithm to verify and align each transcript
A segment's transcript progresses only if the Tasmeea alignment procedure verifies a strong correspondence with the canonical script. Segments that fail this check are flagged for correction or excluded, ensuring that misalignments, missing text spans, and other transcription anomalies are filtered out before the data is used for training or evaluation.
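As an illustration only, the gating role Tasmeea plays in such a pipeline can be sketched as follows; the `Segment` container, the `gate_segments` helper, and the callable interface are hypothetical stand-ins, not the pipeline's published API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    audio_path: str   # one waqf-delimited audio segment
    transcript: str   # ASR hypothesis for that segment (e.g., from Tarteel ASR)

def gate_segments(segments, canonical_text, verify, threshold=0.5):
    """Keep only segments whose transcript passes Tasmeea-style verification.

    `verify` is any callable with the shape of the tasmeea_verify sketch in
    Section 3 below: it returns (accepted, start_index, matching_ratio) for a
    transcript compared against the canonical Quran text.
    """
    kept, flagged = [], []
    for seg in segments:
        accepted, start, ratio = verify(seg.transcript, canonical_text, threshold)
        (kept if accepted else flagged).append((seg, start, ratio))
    return kept, flagged
```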
3. Algorithmic Details and Acceptance Criteria
The normalization and alignment procedure proceeds in the following steps:
1. Normalize the segment text by removing spaces and diacritics.
2. For each window of the canonical text with length equal to the normalized segment, compute the edit distance between the segment and the window.
3. Convert the edit distance to a matching ratio, typically matching_ratio = 1 - (edit_distance / window_length).
4. Accept the best-scoring window if its matching ratio meets the acceptance threshold (e.g., ≥ 0.5).
The use of overlapping windows and configurable penalties allows for robust matching, accommodating moderate ASR errors, regional recitation variants, or canonical text discrepancies. The precise thresholds and penalties can be tuned for dataset-specific requirements, balancing recall and precision for the verification task.
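A minimal Python sketch of this procedure, assuming a simple Unicode-based normalization and a fixed-length window, is shown below; the function names and the diacritic-stripping rule are illustrative and not taken from the released implementation.

```python
import unicodedata

def normalize(text: str) -> str:
    """Strip diacritics (Unicode combining marks) and whitespace."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.category(ch).startswith("M") and not ch.isspace()
    )

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tasmeea_verify(segment: str, canonical: str, threshold: float = 0.5):
    """Slide a window of the segment's length over the normalized canonical
    text; return (accepted, start_index, best_matching_ratio)."""
    seg, ref = normalize(segment), normalize(canonical)
    if not seg:
        return False, None, 0.0
    best_ratio, best_start = 0.0, None
    for start in range(len(ref) - len(seg) + 1):
        window = ref[start:start + len(seg)]
        ratio = 1.0 - edit_distance(seg, window) / len(window)
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    return best_ratio >= threshold, best_start, best_ratio
```

The released implementation may use overlapping windows of varying length, language-specific normalization, and additional penalties; the sketch only captures the windowed edit-distance acceptance logic described above.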
4. Role within ASR-based Pronunciation Error Detection
Tasmeea's output is foundational for the subsequent ASR-based pronunciation error detection system. Only transcripts with verified, high-confidence alignment are passed to the error detection stage. These serve as trusted ground truth for training and evaluating models that encode phonetic and articulation (sifat) properties according to Tajweed (classical Quranic recitation rules).
The robustness of the alignment directly impacts detection of both phoneme-level and articulation feature errors. Any systematic misalignment or inclusion of erroneous transcriptions would compromise models' error detection sensitivity and specificity, making Tasmeea's acceptance filter critical for downstream reliability.
5. Relationship to the Quran Phonetic Script (QPS)
The verified transcript segments are subsequently transformed into the Quran Phonetic Script (QPS), a domain-specific script tailored for Quranic recitation. QPS contains:
- A phoneme level: encoding Arabic letters, vowel lengths, and recitation marks
- A sifat level: representing ten classical articulation attributes (e.g., hams/jahr, shidda/rakhawa, tafkheem/tarqeeq)
The QPS provides necessary linguistic granularity for both supervised learning and detailed pronunciation assessment. By ensuring that segmentation and transcription are validated by Tasmeea before QPS conversion, the system maintains tight coupling between the audio signal, the underlying canonical text, and the phonetic/articulatory representations required for Tajweed-compliant error analysis.
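As a sketch only, a two-level QPS label for one verified segment could be represented as follows; the field names and the listed attribute identifiers are illustrative assumptions rather than the QPS specification.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers standing in for the ten sifat attribute levels;
# the actual inventory and value sets are defined by the QPS itself.
SIFAT_ATTRIBUTES = (
    "hams_or_jahr",
    "shidda_or_rakhawa",
    "tafkheem_or_tarqeeq",
    # ... remaining attributes omitted in this sketch
)

@dataclass
class QPSLabel:
    phonemes: list[str]                                   # phoneme-level sequence
    sifat: dict[str, list[str]] = field(default_factory=dict)
    # maps each sifat attribute to a per-phoneme value sequence, so every
    # sequence in `sifat` should have the same length as `phonemes`

    def is_aligned(self) -> bool:
        """Check that every sifat sequence lines up with the phoneme sequence."""
        return all(len(v) == len(self.phonemes) for v in self.sifat.values())
```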
6. Impact on Model Performance and Dataset Quality
Empirical results reported for the multi-level CTC model show an extremely low average Phoneme Error Rate (PER) of 0.16% on the test set, attributed in part to Tasmeea's strict verification regime. The multi-level CTC architecture employs eleven parallel heads (one for the phoneme level and one for each of the ten sifat attributes), combining their outputs via a weighted average loss of the form
L_total = (w_phoneme · L_phoneme + Σ_{s=1..10} w_s · L_s) / (w_phoneme + Σ_{s=1..10} w_s),
where s indexes the sifat attribute levels.
Such performance rests on verified ground truth, underscoring Tasmeea's impact: high-confidence alignments reduce label noise, allowing deep learning models to reach higher accuracy in phonetically and articulatorily rich settings.
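A schematic PyTorch sketch of such a weighted multi-head CTC objective is given below; the head naming, weighting scheme, and tensor shapes are assumptions for illustration, not the paper's exact configuration.

```python
import torch.nn as nn

def multilevel_ctc_loss(head_log_probs, targets, input_lengths, target_lengths, weights):
    """Weighted average of per-head CTC losses over the phoneme head and ten sifat heads.

    head_log_probs: dict mapping head name -> log-probabilities of shape (T, N, C_head)
    targets, target_lengths: dicts of per-head label tensors and label lengths
    input_lengths: shared frame lengths of shape (N,)
    weights: dict of scalar weights per head (e.g., a larger weight on the phoneme head)
    """
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    total = 0.0
    for name, log_probs in head_log_probs.items():
        total = total + weights[name] * ctc(
            log_probs, targets[name], input_lengths, target_lengths[name]
        )
    return total / sum(weights.values())  # normalize to a weighted average
```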
7. Availability and Research Significance
All code, datasets, and trained models—including the implementation of the Tasmeea algorithm—are released under an open-source license at https://obadx.github.io/prepare-quran-dataset/. This resource enables broad adoption and further research, whether for scaling to other languages, adapting to alternative recitation traditions, or extending into related tasks such as dialectal ASR and fine-grained articulatory analysis.
A plausible implication is that the Tasmeea algorithm, while designed for Quranic recitation, may generalize to other domains requiring matching between noisy, automatically-produced transcriptions and canonical reference texts, especially in low-resource or highly constrained linguistic contexts.