Automatic Diacritic Restoration
- Automatic Diacritic Restoration (ADR) is the process of recovering missing tonal and orthographic diacritics to restore complete linguistic information in text.
- Recent work leverages large, diverse multi-source corpora, reducing word error rates from 58% to around 20% in applications like news and text-to-speech.
- Neural models, including RNN-based and Transformer architectures enhanced with FastText embeddings, effectively address ambiguity and improve restoration accuracy.
Diacritic restoration performance refers to the empirical accuracy, error rates, and computational characteristics of models that recover missing orthographic and tonal diacritics in text—transforming undiacritized input into linguistically complete output. This task is central to information retrieval, text-to-speech, ASR, and NLP pipelines for languages such as Yorùbá, where diacritics are vital for disambiguation and morphophonological interpretation but are often omitted in digital writing due to technical and sociolinguistic factors. Advances in corpus creation, neural modeling, evaluation, and real-world deployment have defined the current landscape of diacritic restoration, with gains tightly coupled to both methodological choices and data scale.
1. Data Regimes and Evaluation Sets
Performance in diacritic restoration is fundamentally shaped by dataset composition and variety. Early works relied on major religious and literary corpora, but recent efforts have expanded to general-purpose, multi-source aggregates. Specifically, a consolidated Yorùbá parallel corpus now spans approximately 13.3 million words across thirteen diverse sources, including spoken–written comparisons, colloquial blogs, modern translations, traditional proverbs, and OCR-corrected fiction. This diversity is essential for modeling lexical, stylistic, and orthographic variability in real-world use.
For evaluation, a held-out slice of the Global Voices Yorùbá news corpus (≈28,308 words) serves as a primary modern, journalistic, and multi-purpose test set. Robust evaluation requires such coverage to reflect contemporary, out-of-domain usage and to measure model generalization.
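Because diacritized text is the scarce resource, training pairs for this task are typically derived by stripping diacritics from clean text, yielding (undiacritized source, diacritized target) sentences. The sketch below is a minimal illustration of that preprocessing step using Python's Unicode decomposition; the example sentence and the choice to strip all combining marks (tone accents and underdots alike) are assumptions for illustration, not details of the released corpus pipeline.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone accents, underdots) from Yorùbá text.

    NFD decomposition separates base letters from combining marks, so
    dropping characters in the 'Mn' (nonspacing mark) category yields the
    undiacritized form used as model input.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def make_parallel_pairs(diacritized_sentences):
    """Yield (undiacritized source, diacritized target) training pairs."""
    for target in diacritized_sentences:
        yield strip_diacritics(target), target

if __name__ == "__main__":
    # Hypothetical sentence; real data comes from the thirteen sources above.
    for src, tgt in make_parallel_pairs(["Báwo ni ọjà ṣe rí lónìí?"]):
        print("SRC:", src)
        print("TGT:", tgt)
```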
2. Model Architectures and Training Paradigms
State-of-the-art systems for Yorùbá ADR use sequence-to-sequence neural architectures, reframing diacritic restoration as a translation problem. Two principal architectures anchor current approaches:
- RNN-based Soft-Attention Seq2Seq: Employs 2-layer LSTM encoder–decoders (hidden size 500, embedding dim 500) with Bahdanau-style soft attention. The context vector at each decoder step $t$ is computed as $c_t = \sum_i \alpha_{t,i} h_i$, with weights $\alpha_{t,i} = \mathrm{softmax}_i\big(\mathrm{score}(s_{t-1}, h_i)\big)$, where $h_i$ are encoder states and $s_{t-1}$ is the previous decoder state. The scoring function may be a dot product or an additive MLP (a minimal sketch follows this list).
- Transformer: Consists of six encoder and six decoder layers (model dimension 512, feed-forward dim 2048, eight heads per layer). Encoders use multi-head scaled dot-product self-attention and positional encoding. In practice, the Transformer provides stronger modeling of long-range dependencies and tone sandhi, with superior performance on ambiguous or discontiguous morphemes.
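To make the attention computation above concrete, here is a minimal NumPy sketch of the soft-attention context vector with the two scoring variants mentioned (dot-product and additive MLP). The toy dimensions and random weights are illustrative only and do not reflect OpenNMT-py internals.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_context(decoder_state, encoder_states, score="dot",
                      W_a=None, U_a=None, v_a=None):
    """Compute the soft-attention context vector c_t.

    decoder_state : (d,)   previous decoder hidden state s_{t-1}
    encoder_states: (T, d) encoder hidden states h_1..h_T
    score         : "dot" for dot-product, "additive" for a Bahdanau-style MLP
    """
    if score == "dot":
        # score(s, h_i) = s . h_i
        scores = encoder_states @ decoder_state
    else:
        # score(s, h_i) = v_a^T tanh(W_a s + U_a h_i)
        scores = np.tanh(decoder_state @ W_a + encoder_states @ U_a) @ v_a
    alphas = softmax(scores)                 # attention weights alpha_{t,i}
    return alphas @ encoder_states, alphas   # c_t = sum_i alpha_{t,i} h_i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, T = 8, 5                              # toy sizes; the paper's models use 500
    h = rng.normal(size=(T, d))
    s = rng.normal(size=d)
    c_dot, a_dot = attention_context(s, h, score="dot")
    W, U, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
    c_add, a_add = attention_context(s, h, score="additive", W_a=W, U_a=U, v_a=v)
    print(a_dot.round(3), a_add.round(3))
```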
Both architectures are implemented in OpenNMT-py, trained on GPUs (NVIDIA V100), and are further evaluated with and without pre-trained FastText input embeddings.
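Pre-trained FastText vectors are commonly distributed as a plain-text .vec file (one token per line: the word followed by its vector components). The following sketch, under the assumption of that standard format and a hypothetical vocabulary and file path, shows how such vectors could be mapped onto an embedding matrix before training; it is not the OpenNMT-py embedding-loading code.

```python
import numpy as np

def load_fasttext_vec(path, vocab, dim=300):
    """Build an embedding matrix for `vocab` from a FastText .vec text file.

    Tokens absent from the .vec file keep a small random initialization,
    mirroring the usual practice of fine-tuning unseen entries during training.
    """
    rng = np.random.default_rng(0)
    emb = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    index = {w: i for i, w in enumerate(vocab)}
    with open(path, encoding="utf-8") as f:
        _header = f.readline()               # header line: "<num_words> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if word in index and len(values) == dim:
                emb[index[word]] = np.asarray(values, dtype=np.float32)
    return emb

# Hypothetical usage (file path and vocabulary are assumptions):
# vocab = ["ọjà", "oja", "báwo", "bawo"]
# matrix = load_fasttext_vec("yo.vec", vocab)
```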
3. Metrics for Diacritic Restoration
Three key evaluation metrics are employed:
- BLEU score: Measures n-gram overlap between system output and gold reference.
- Perplexity: Exponential of negative mean log-likelihood of correct outputs.
- Word Error Rate (WER): Fraction of words with any diacritic error, $\mathrm{WER} = \frac{S + D + I}{N}$, where S = substitutions, D = deletions, I = insertions, and N = number of words in the reference.
On earlier datasets, the Diacritization Error Rate (DER) is used instead: $\mathrm{DER} = \frac{E}{C}$, where E is the total number of incorrectly diacritized characters and C is the number of characters that should bear diacritics.
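As a worked illustration of these metrics, the sketch below computes WER from word-level edit-distance counts (so that S + D + I matches the formula above) and a simplified DER that assumes the hypothesis is character-aligned with the reference, which holds when the model changes only diacritics. It is a didactic approximation, not the paper's evaluation script.

```python
import unicodedata

def word_edit_distance(ref_words, hyp_words):
    """Minimum S + D + I needed to turn the hypothesis into the reference."""
    R, H = len(ref_words), len(hyp_words)
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i                                 # deletions only
    for j in range(H + 1):
        dp[0][j] = j                                 # insertions only
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[R][H]

def wer(reference, hypothesis):
    """WER = (S + D + I) / N over whitespace-tokenized words."""
    ref_words = reference.split()
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def der(reference, hypothesis):
    """Simplified DER: errors on reference characters that bear diacritics."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    should = errors = 0
    for r, h in zip(ref, hyp):
        if unicodedata.decomposition(r):             # r carries a diacritic
            should += 1
            if r != h:
                errors += 1
    return errors / should if should else 0.0

if __name__ == "__main__":
    gold = "ọjà náà tóbi"     # hypothetical reference
    pred = "ọja náà tóbi"     # one word mis-restored
    print(f"WER={wer(gold, pred):.3f}  DER={der(gold, pred):.3f}")
```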
4. Empirical Results and Ablation Studies
The following table summarizes the principal results for several model and data configurations on the held-out Global Voices test set (BLEU, Perplexity, WER):
| Model | BLEU | Perplexity | WER (%) |
|---|---|---|---|
| Soft-attn baseline (Orife 2018) | 26.53 | 1.34 | 58.17 |
| + Language ID corpus | 42.52 | 1.69 | 33.03 |
| + Interview text | 42.23 | 1.59 | 32.58 |
| + All new text (excl. JW300) | 43.39 | 1.60 | 31.87 |
| + All new text (full 13 sources) | 59.55 | 1.44 | 20.40 |
| + All new text + FastText | 58.87 | 1.39 | 21.33 |
| Transformer + All new text | 59.05 | 1.40 | 23.10 |
| Transformer + All new text + FastText | 59.80 | 1.43 | 22.42 |
Key findings:
- Adding diverse sources reduces WER from ≈58% (baseline) to ≈20% (full 13-source corpus).
- The JW300 corpus (≈11.5 million words) is pivotal; omitting it costs ≈16 BLEU points and raises WER by ≈11 percentage points (31.87% vs. 20.40%). This suggests data volume and domain coverage are more critical than incremental model sophistication.
- The Transformer architecture slightly outperforms RNN models when sufficiently trained, especially on rare token and code-switch examples.
- FastText input embeddings provide a marginal BLEU improvement (+0.8) in Transformers but mixed effects on WER.
No fine-grained token-level, per-diacritic, or confusion-matrix analysis is reported; conclusions rest on aggregate accuracy and error rates.
5. Error Analysis and Diagnostic Insights
- JW300 ablation directly shows the relationship between corpus size/domain and diacritic restoration performance; omitting it introduces significant error spikes.
- No breakdown is provided for tone marks vs. underdots or other diacritic types; however, observed error patterns include graceful fallback to undiacritized tokens when the model is uncertain, with recovery on subsequent tokens (plausibly indicating robust use of local context).
- Model robustness is indicated by WER staying below 24% across all full-data configurations, even when training and test material come from disjoint textual domains.
6. Strengths, Limitations, and Practical Implications
Strengths:
- Large corpus size and genre diversity lead to dramatic error reductions (WER from 58% to 20%) with even baseline neural architectures.
- The volume of training text dominates over model micro-architecture once a baseline level of expressivity is achieved.
- Transformer models display slight advantages in generalization to rare and code-switched forms.
Limitations:
- No per-class or token-level error is reported.
- All evaluation is single-reference, meaning that for highly tonal or morphologically ambiguous languages like Yorùbá—where multiple outputs may be linguistically valid—BLEU/word accuracy may understate functional adequacy.
- Performance in formal/legal, medical, or noisy social media domains is unreported, raising possible concerns about domain adaptation.
Practical impact:
- Off-the-shelf Transformer architectures trained on ≈13 million words now restore ≈80% of words correctly in modern news text, supporting improved search indexing, robust text-to-speech, and setting a new bar for downstream Yorùbá NLP.
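As a hypothetical illustration of the search-indexing benefit, the sketch below indexes restored text under diacritic-folded keys so that queries typed without diacritics (the common case) still match the intended documents. The restored document text is supplied directly where a trained ADR model would normally produce it; none of these names come from a released API.

```python
import unicodedata
from collections import defaultdict

def fold(text: str) -> str:
    """Diacritic-insensitive key: lowercase and strip combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

class DiacriticAwareIndex:
    """Toy inverted index over ADR-restored text; queries may omit diacritics."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, restored_text):
        # restored_text would normally come from the trained ADR model;
        # here it is supplied directly for illustration.
        for token in restored_text.split():
            self.postings[fold(token)].add(doc_id)

    def search(self, query):
        hits = [self.postings.get(fold(tok), set()) for tok in query.split()]
        return set.intersection(*hits) if hits else set()

if __name__ == "__main__":
    index = DiacriticAwareIndex()
    index.add(1, "ọjà náà tóbi")          # hypothetical restored document
    print(index.search("oja"))             # undiacritized query still matches: {1}
```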
7. Future Directions
Key open avenues include:
- Introduction of contextualized character/subword models (e.g., BERT or similar) to close the remaining error gap for ambiguous, OOV, or highly tonal cases.
- Fine-grained annotation (by diacritic type or tonal profile) to enable multi-task learning and focused error analysis.
- Systematic multi-reference evaluation protocols and user studies to calibrate perceived readability and acceptability for native speakers, especially in creative or non-standard genres.
- Deployment of lightweight distillations for on-device or mobile applications.
- Release and growth of open-source data, pre-trained models, and code to support rapid iterative research and real-world adoption.
Overall, modern Yorùbá diacritic restoration has attained substantial accuracy improvements via aggressive data scale-up and robust neural architectures, yet remains constrained by reference granularity, domain mismatch, and coverage of low-frequency diacritic patterns. Continuous corpus augmentation, diagnostic error analysis, and innovations in neural modeling are required to close the remaining performance gap for both applied and research use cases.