Diacritic Restoration Performance

Updated 24 November 2025
  • Diacritic restoration performance measures how accurately computational techniques reinsert omitted diacritical marks, which support pronunciation, semantic disambiguation, and overall text clarity.
  • Methodologies leverage seq2seq RNN and Transformer models trained on extensive, diverse corpora to translate undiacritized text into fully diacritized output.
  • Empirical results demonstrate that increasing dataset diversity can reduce Word Error Rate by more than 30 percentage points (from ≈58% to ≈20%) and substantially boost BLEU scores, underscoring the importance of varied training data.

Diacritic restoration performance refers to the empirical effectiveness and efficiency with which computational models restore omitted diacritical marks in written language. These marks are critical for morphological disambiguation, accurate pronunciation, and robust downstream performance across a wide range of computational linguistics, speech, and text-processing tasks. The following sections provide a comprehensive, technical synthesis of experimental practices, evaluation metrics, methodological advances, dataset strategies, and the main insights from recent state-of-the-art studies, with a particular emphasis on Yoruba-language research as detailed in recent literature.

1. Corpus Development and Characteristics

The Yoruba writing system is characterized by extensive use of both tonal and orthographic diacritics. However, electronic and contemporary texts overwhelmingly omit these marks due to technical and educational constraints. Recent efforts have focused on corpus development: aggregating, OCR-processing, manually correcting, and integrating a heterogeneous body of texts (religious corpora such as JW300 with ≈11.5M words and various Bible translations, colloquial blogs, spoken and written interview transcripts, language-technology corpora, lexica, and literary texts), yielding a master corpus of ≈13.3 million words from over a dozen distinct sources. This breadth supports robust training and permits evaluation on a held-out, contemporary general-purpose test set (Global Voices Yoruba news, ≈28,308 words), specifically selected to reflect modern, journalistic, and colloquial language.

A notable finding is that in such corpora, approximately 85% of Yoruba words contain at least one diacritic, and 32% of the unique non-diacritized word types are ambiguous (i.e., have more than one observed diacritized form in the data), with the mean lexical ambiguity ratio LexDif ≃ 1.47.
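
These statistics can be recomputed directly from diacritized text. The following minimal Python sketch (with hypothetical function names, not taken from the released code) treats LexDif as the mean number of distinct diacritized forms per undiacritized word type, consistent with the description above.

```python
# Minimal sketch: corpus-level diacritic statistics from an iterable of word tokens.
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Drop all combining marks (Unicode category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def corpus_statistics(tokens):
    """tokens: an iterable of diacritized word tokens from the corpus."""
    n_tokens = 0
    n_with_diacritic = 0
    forms = defaultdict(set)            # undiacritized type -> observed diacritized forms
    for tok in tokens:
        tok = unicodedata.normalize("NFC", tok)
        bare = strip_diacritics(tok)
        n_tokens += 1
        if bare != tok:                 # token carries at least one diacritic
            n_with_diacritic += 1
        forms[bare].add(tok)
    n_types = len(forms)
    ambiguous = sum(1 for f in forms.values() if len(f) > 1)
    lexdif = sum(len(f) for f in forms.values()) / n_types   # mean forms per type
    return {
        "pct_tokens_with_diacritic": 100.0 * n_with_diacritic / n_tokens,
        "pct_ambiguous_types": 100.0 * ambiguous / n_types,
        "lexdif": lexdif,
    }
```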

2. Model Architectures and Training Protocols

Recent approaches recast Automatic Diacritic Restoration (ADR) as a sequence-to-sequence (seq2seq) neural machine translation problem, mapping undiacritized to fully diacritized text. Two primary model classes have been thoroughly evaluated:

  • Soft-attention RNN-based seq2seq models: These employ two-layer LSTM encoder and decoder stacks with hidden and embedding dimensions of 500, soft-attention mechanisms (Bahdanau et al.), dropout (0.3), and Adam optimization (β₁=0.9, β₂=0.999, initial lr = 1e-3). Training is run for 20 epochs with batch size 64 and gradient clipping of 5.0.
  • Transformer models: Following Vaswani et al., these use six encoder and six decoder layers, model dimension 512, position-wise feed-forward inner dimension 2048, 8 attention heads, and a dropout of 0.1. Training involves batch sizes of 4096 tokens and label smoothing (0.1), employing Adam with warmup. Optionally, input embeddings are initialized from pretrained FastText vectors trained on massive Yoruba text corpora.
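
For illustration, the hyperparameters listed above correspond to module configurations such as the following PyTorch sketch; it is not the authors' training code, and the vocabulary size is a placeholder assumption.

```python
# Illustrative only: PyTorch modules instantiated with the reported hyperparameters.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000  # placeholder; the real value depends on the corpus vocabulary

# RNN baseline: 500-dim embeddings, two-layer LSTM encoder and decoder, dropout 0.3.
embedding = nn.Embedding(VOCAB_SIZE, 500)
encoder = nn.LSTM(input_size=500, hidden_size=500, num_layers=2,
                  dropout=0.3, batch_first=True)
decoder = nn.LSTM(input_size=500, hidden_size=500, num_layers=2,
                  dropout=0.3, batch_first=True)
# A Bahdanau-style soft-attention layer over encoder states would sit between them.

# Transformer: 6 encoder / 6 decoder layers, d_model=512, 8 heads, FFN 2048, dropout 0.1.
transformer = nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             dim_feedforward=2048, dropout=0.1,
                             batch_first=True)

# Adam with the reported betas; the RNN used lr = 1e-3, the Transformer a warmup
# schedule (omitted here). Label smoothing would be applied in the loss function.
optimizer = torch.optim.Adam(transformer.parameters(), lr=1e-3, betas=(0.9, 0.999))
```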

All preprocessing involves Unicode normalization, stripping of all NonSpacingMark codepoints, and construction of parallel (undiacritized→diacritized) examples.
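
A minimal sketch of this step, using Python's standard unicodedata module (the function name and example input are illustrative):

```python
# Sketch of the preprocessing described above: NFC-normalize the diacritized target
# and strip all NonSpacingMark (Mn) codepoints to synthesize the undiacritized source.
import unicodedata

def make_parallel_example(diacritized_line: str) -> tuple[str, str]:
    target = unicodedata.normalize("NFC", diacritized_line.strip())
    decomposed = unicodedata.normalize("NFD", target)
    source = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return source, target

src, tgt = make_parallel_example("Èdè Yorùbá")  # illustrative input
# src == "Ede Yoruba", tgt == "Èdè Yorùbá"
```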

3. Evaluation Metrics

Multiple orthogonal metrics are used to evaluate diacritic restoration performance:

  • Word Error Rate (WER):

\textrm{WER} = \frac{S + D + I}{N_{\mathrm{ref}}}

where S is the number of word substitutions, D the number of deletions, I the number of insertions, and N_{\mathrm{ref}} the number of reference words.

  • Diacritization Error Rate (DER):

\textrm{DER} = \frac{N_{\mathrm{err}}}{N_{\mathrm{dia}}} \times 100\%

where N_{\mathrm{err}} is the total number of incorrectly restored diacritic characters and N_{\mathrm{dia}} the total number of positions that should carry a diacritic.

  • BLEU score: measures n-gram overlap between system output and the gold-standard diacritized reference.
  • Model perplexity: a measure of predictive uncertainty, computed on the model’s own outputs.

No finer-grained per-diacritic confusion analysis, token-level precision/recall, or multi-reference evaluation was performed; all reported evaluations are single-reference.
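
For concreteness, the WER and DER definitions above can be computed as in the following sketch. It is not the scorer used in the studies; in particular, the DER function assumes the system output preserves the undiacritized character skeleton so that characters align one-to-one.

```python
# Sketch of the metrics defined above. WER uses a standard word-level Levenshtein
# alignment; DER compares characters position by position under the alignment
# assumption noted in the text.
import unicodedata

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits (S + D + I) to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def has_mark(ch: str) -> bool:
    return any(unicodedata.category(c) == "Mn" for c in unicodedata.normalize("NFD", ch))

def diacritization_error_rate(reference: str, hypothesis: str) -> float:
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    positions = [(r, h) for r, h in zip(ref, hyp) if has_mark(r)]  # N_dia positions
    if not positions:
        return 0.0
    errors = sum(1 for r, h in positions if r != h)                # N_err
    return 100.0 * errors / len(positions)
```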

4. Experimental Results

Incrementally aggregating more diverse training texts, especially high-coverage religious corpora such as JW300, yields substantial improvements. Core empirical findings, drawn from the held-out Global Voices test set, include:

| Model/Data | BLEU | Perplexity | WER (%) |
|---|---|---|---|
| Soft-attention (baseline) | 26.53 | 1.34 | 58.17 |
| + Language ID corpus | 42.52 | 1.69 | 33.03 |
| + Interview text | 42.23 | 1.59 | 32.58 |
| + All new text (excl. JW300) | 43.39 | 1.60 | 31.87 |
| + All new text (full 13 sources) | 59.55 | 1.44 | 20.40 |
| + All new text + FastText | 58.87 | 1.39 | 21.33 |
| Transformer + All new text (13 sources) | 59.05 | 1.40 | 23.10 |
| Transformer + All new text + FastText | 59.80 | 1.43 | 22.42 |

The ablation excluding JW300 shows a loss of roughly 16 BLEU points and an increase of roughly 11 percentage points in WER, underlining the dominance of large, diverse corpora. Transformer models match or slightly surpass the RNN models in BLEU, while the best RNN configuration retains a small WER advantage; Transformers also demonstrate marginally superior handling of rare tokens and code-switching. With comprehensive domain coverage, both architectures restore ≈80% of words correctly in contemporary, out-of-domain news text.

No confidence intervals or formal significance tests were reported. The larger observed differences (on the order of 16 BLEU points and 11 percentage points of WER between the weakest and strongest data configurations) are well above typical run-to-run variation in NLP evaluation, whereas the 1–2 point gaps between the best RNN and Transformer configurations should be interpreted more cautiously.

5. Error Analysis and Model Behaviors

Ablation and error analysis emphasize that:

  • Data volume and diversity are stronger determinants of absolute performance than incremental architectural refinement: augmenting the training set reduces WER from ≈58% to ≈20%.
  • Exclusion of JW300 leads to consistent performance drops across both RNN and Transformer models.
  • No token-level or per-diacritic error matrix has been reported; anecdotal inspection suggests the Transformer model tends to "fall back" to emitting the undiacritized input token when uncertain, and often recovers on subsequent tokens from surrounding context.
  • There is no reported in-depth breakdown between types of diacritic errors (e.g., tonal vs. underdot vs. overdot), but errors cluster around ambiguous or OOV forms.
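
Such a breakdown could in principle be derived from word-aligned system output and reference text, as in the following sketch; it coarsely attributes every reference mark in a mismatched word to the error, so it is an approximation rather than a reconstruction of the authors' analysis.

```python
# Sketch of a per-diacritic error breakdown (not reported in the studies): for each
# mismatched word pair, count the reference's combining marks as tone marks or underdots.
import unicodedata
from collections import Counter

TONE_MARKS = {"\u0300", "\u0301"}  # combining grave and acute accents (Yoruba tones)
UNDERDOT = "\u0323"                # combining dot below (as in ọ, ẹ, ṣ)

def classify_errors(ref_words, hyp_words):
    counts = Counter()
    for ref, hyp in zip(ref_words, hyp_words):
        if unicodedata.normalize("NFC", ref) == unicodedata.normalize("NFC", hyp):
            continue
        for ch in unicodedata.normalize("NFD", ref):
            if unicodedata.category(ch) != "Mn":
                continue
            if ch in TONE_MARKS:
                counts["tone"] += 1
            elif ch == UNDERDOT:
                counts["underdot"] += 1
            else:
                counts["other"] += 1
    return counts
```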

6. Implications, Challenges, and Recommendations

Researchers conclude that off-the-shelf Transformer models, once exposed to sufficient and diverse training data, can achieve near-parity with RNN models and provide ≈80% word-level restoration accuracy in contemporary Yoruba. The dominant determinant is coverage of diverse linguistic phenomena, rather than architectural sophistication.

Limitations include:

  • Absence of token-level scores, per-category confusion matrices, and multi-reference or expert human evaluation, all of which are particularly important for a highly variable tonal language.
  • Single-reference BLEU/WER can underestimate acceptability in cases where many valid diacritized forms exist.
  • Generalization to domain-specific (legal, medical, social media) contexts remains untested.

Recommended future directions include techniques such as:

  • Inclusion of contextualized subword or character-level embeddings (analogous to BERT) for further gains.
  • Introduction of specialized, fine-grained annotation distinguishing error classes (e.g., tone vs. dot).
  • Multi-reference evaluations and human readability studies to fully capture utility and model limitations (see the sketch after this list).
  • Deployment of model distillation and lightweight on-device architectures for resource-constrained applications.
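
As a concrete starting point for the multi-reference evaluation suggested above, the following sketch uses the sacrebleu package and assumes multiple acceptable diacritized reference files are available; the file names are hypothetical.

```python
# Multi-reference BLEU with sacrebleu, assuming several acceptable diacritized
# reference files exist (file names below are placeholders).
import sacrebleu

def read_lines(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

hypotheses = read_lines("pred.detok.txt")   # system outputs, one sentence per line
refs_a = read_lines("reference_a.txt")      # gold diacritization
refs_b = read_lines("reference_b.txt")      # an alternative acceptable diacritization

score = sacrebleu.corpus_bleu(hypotheses, [refs_a, refs_b])
print(f"Multi-reference BLEU: {score.score:.2f}")
```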

7. Code, Data Availability, and Open Science

All models, data splits, and source code referenced in these studies are publicly released through repositories such as https://github.com/Niger-Volta-LTI/yoruba-adr and https://github.com/Niger-Volta-LTI/yoruba-text. This open-science approach facilitates reproduction, benchmarking, and further downstream application in Yoruba language technology, including improved spell-checking, TTS pipelines, and NLP search.
