Diacritic Error Rate (DER)

Updated 5 December 2025
  • Diacritic Error Rate (DER) is defined as the ratio of diacritic mismatches to total diacritic positions in a reference text, providing a fine-grained measure of restoration accuracy.
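  • It is computed by aligning predicted and reference texts at the character level, tallying insertions, deletions, and substitutions of diacritics, and normalizing by the number of reference diacritic positions (or, in edit-distance variants, by the total character count).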
  • Applications span TTS, ASR, and NLP, with benchmarks in languages like Yorùbá, Arabic, Sanskrit, and Romanian guiding model improvements and evaluation protocols.

Diacritic Error Rate (DER) quantifies the accuracy of automatic diacritic restoration by measuring the proportion of diacritic marks that are incorrectly inserted, deleted, or substituted relative to a gold-standard reference. As a fine-grained, character-level metric, DER isolates performance on sub-token orthographic phenomena (e.g., vowels, tone, pitch accent) essential to lexical disambiguation, pronunciation, and downstream natural language processing. It is the primary performance measure in diacritization research for a range of languages including Yorùbá, Arabic, Sanskrit, and Romanian.

1. Formal Definitions and Variants

DER is defined as the normalized count of diacritic errors compared to the total number of reference diacritic positions. Let $N$ be the total number of character positions in the reference text at which a diacritic appears, and $E$ the number of mismatches due to missing, incorrect, or spurious diacritics. The standard formula is:

$$\mathrm{DER} = \frac{E}{N}$$

Alternatively, when distinguishing error types, namely insertions ($I$), deletions ($D$), and substitutions ($S$):

$$\mathrm{DER} = \frac{I + D + S}{N}$$
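The error-type formula can be illustrated with a minimal Python sketch. It assumes that prediction and reference share the same base characters, so positions align one to one, and that diacritics are Unicode combining marks exposed by NFD decomposition. The names `split_marks` and `der_by_type` are hypothetical helpers, not taken from any cited system.

```python
import unicodedata


def split_marks(text):
    """Return a list of (base_char, combining_marks) pairs via NFD decomposition."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)  # attach the mark to the preceding base
        else:
            pairs.append((ch, ""))
    return pairs


def der_by_type(reference, prediction):
    """Count insertions, deletions and substitutions of diacritics per position."""
    ref = split_marks(reference)
    hyp = split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base characters differ; an edit-distance variant is needed")
    ins = dele = sub = n = 0
    for (_, r), (_, h) in zip(ref, hyp):
        if r:
            n += 1          # N counts reference positions that carry a diacritic
        if r == h:
            continue
        if not r:
            ins += 1        # spurious diacritic in the prediction
        elif not h:
            dele += 1       # diacritic missing from the prediction
        else:
            sub += 1        # wrong diacritic
    # DER is undefined when the reference has no diacritics at all
    der = (ins + dele + sub) / n if n else float("nan")
    return {"I": ins, "D": dele, "S": sub, "N": n, "DER": der}


if __name__ == "__main__":
    # One of two reference diacritics is dropped: D = 1, N = 2, DER = 0.5
    print(der_by_type("ọmọ", "omọ"))
```

Because insertions are counted against a denominator that only includes reference diacritic positions, this variant can exceed 1.0 when a system over-generates marks.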

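In Romanian LLM evaluation, DER is expressed as the sum of character-level Levenshtein distances between each predicted text $\hat{s}_i$ and its reference $s_i$, divided by the total reference character count. In this formulation $N$ denotes the number of evaluated texts rather than the number of diacritic positions: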

$$\mathrm{DER} = \frac{\sum_{i=1}^{N} \mathrm{Lev}(\hat{s}_i, s_i)}{\sum_{i=1}^{N} |s_i|}$$
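A minimal sketch of this edit-distance formulation follows. The NFC normalization step is an assumption made here so that each letter with its diacritic counts as a single character; the source does not state which normalization the Romanian evaluation used. The function names `levenshtein` and `corpus_der` are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Character-level edit distance with unit costs (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def corpus_der(predictions, references):
    """Sum of Levenshtein distances divided by total reference length."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must be paired")
    preds = [unicodedata.normalize("NFC", p) for p in predictions]
    refs = [unicodedata.normalize("NFC", r) for r in references]
    total_edits = sum(levenshtein(p, r) for p, r in zip(preds, refs))
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references are empty")
    return total_edits / total_chars


if __name__ == "__main__":
    # 3 edits over 6 reference characters
    print(corpus_der(["si", "pana"], ["și", "până"]))  # 0.5
```

Since the denominator is the full character count rather than the number of diacritic positions, values from this variant are much smaller than position-based DER for the same output.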

Language- and task-specific implementations provide further granularity. For example, in Yorùbá, DER counts only reference positions carrying diacritics, ignoring errors at positions where the gold standard has none, while for Arabic, variants include or exclude case endings and treat "no-diacritic" as a valid class.
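The effect of the denominator choice can be sketched in Python. This is an illustration under stated assumptions, not any paper's scoring script: the hypothetical flag `count_bare_slots` switches between scoring only reference positions that carry a diacritic and scoring every letter, with "no diacritic" treated as a valid class. The inputs are toy strings.

```python
import unicodedata


def split_marks(text):
    """Return (base_char, combining_marks) pairs via NFD decomposition."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def der(reference, prediction, count_bare_slots=False):
    """Position-based DER with a selectable set of scored positions."""
    ref = split_marks(reference)
    hyp = split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base characters differ")
    errors = slots = 0
    for (base, r), (_, h) in zip(ref, hyp):
        # Default: score only positions where the gold standard has a diacritic.
        # With count_bare_slots=True: score every letter, bare or not.
        if r or (count_bare_slots and base.isalpha()):
            slots += 1
            errors += (r != h)
    if slots == 0:
        raise ValueError("no scorable positions")
    return errors / slots


if __name__ == "__main__":
    ref, hyp = "ọmo", "omó"                      # toy strings, not real words
    print(der(ref, hyp))                         # 1.0: the one gold diacritic is missed
    print(der(ref, hyp, count_bare_slots=True))  # 2 errors over 3 letters
```

The same output scores very differently under the two conventions, which is why the convention should be reported with any DER figure.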

2. Practical Computation and Evaluation Protocols

DER is generally computed at the character level by aligning reference and predicted sequences and summing mismatches at all diacritic positions. Common computation steps include:

  • Stripping diacritics to form undiacritized input and aligning output and gold-labeled reference at the character level.
  • For each diacritic-bearing character position, incrementing $E$ if the system and reference differ, whether by omission, commission, or substitution.
  • For edit-distance formulations, Levenshtein alignment isolates $I$, $D$, and $S$.
  • In some contexts, only positions in the gold reference with a diacritic are included, not all possible character slots.

Evaluation practices may further exclude punctuation, digits, or positions with ambiguous diacritization, and normalization steps (e.g., Unicode decomposition/recomposition, orthographic conventions) are often essential for robust scoring.
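The normalization and filtering steps can be sketched with Python's standard `unicodedata` module. The choice of NFC and the exact set of excluded categories (punctuation, numbers, separators) are assumptions for illustration; `scorable_chars` is a hypothetical helper.

```python
import unicodedata


def scorable_chars(text):
    """NFC-normalize, then drop punctuation, digits and whitespace."""
    kept = []
    for ch in unicodedata.normalize("NFC", text):
        category = unicodedata.category(ch)
        # P* = punctuation, N* = numbers, Z* = separators (spaces)
        if category.startswith(("P", "N", "Z")):
            continue
        kept.append(ch)
    return kept


if __name__ == "__main__":
    composed = "\u00e2"      # precomposed a-circumflex
    decomposed = "a\u0302"   # 'a' followed by a combining circumflex
    print(composed == decomposed)  # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print("".join(scorable_chars("Da, în 2025!")))  # Daîn
```

Without the normalization step, visually identical predictions and references can be counted as mismatches purely because one side is composed and the other decomposed.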

3. Reported DER Results Across Languages and Models

DER enables direct comparison of diacritization systems and model architectures.

| Language | Best DER (%) | Model/Reference | Notes |
|---|---|---|---|
| Yorùbá | 4.6 | Transformer Self-attention (Orife, 2018) | Character-level, ~100k words |
| Arabic | 2.88 | Shakkala neural system (Fadel et al., 2019) | Excluding case endings |
| Sanskrit | 6.85 | ByT5 fine-tuning (P et al., 28 Nov 2025) | Accent (pitch-mark) aware |
| Romanian | 0.054 | GPT-4o, 3-shot prompt (Nadas et al., 17 Nov 2025) | Decimal percentage; CL char-level |
| Vietnamese | 1.47 | BERT diacritizer (Náplava et al., 2021) | Alpha-word accuracy |

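A plausible implication is that model architecture, training set size and domain, and prompt complexity (in LLMs) are decisive factors: in the results above, the lowest DERs come from self-attention or multi-shot-prompted models. The figures are not strictly comparable across languages, because the underlying DER definitions and exclusions differ (for example, case endings are excluded for Arabic). For languages with high orthographic ambiguity (Yorùbá, Arabic), attaining sub-5% DER enables use in sensitive tasks such as TTS and ASR.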

4. Error Analysis and Model Diagnostics

DER supports detailed analysis of error patterns, such as:

  • Systematic confusions between closely related diacritic marks (e.g., e/ẹ, o/ọ in Yorùbá, â/î in Romanian, udātta/svarita in Sanskrit).
  • Tone diacritic confusion clusters in verbs or sandhi contexts (Yorùbá).
  • Over-generation of diacritics ("hallucination") in inadequately prompted LLMs.
  • Inconsistent or noisy gold-label diacritization, leading to apparent DER inflation and necessitating data cleaning via embeddings or manual correction.

Attention visualizations and manual error audits further contextualize DER results, revealing when multiple valid restorations exist or gold annotations are themselves erroneous.
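Confusion patterns of the kind listed above can be tallied with a short Python sketch. It assumes one-to-one alignment of base characters and uses hypothetical helper names; it is a diagnostic aid, not a method from the cited papers.

```python
import unicodedata
from collections import Counter


def split_marks(text):
    """Return (base_char, combining_marks) pairs via NFD decomposition."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_confusions(reference, prediction):
    """Count (gold_char, predicted_char) pairs at positions where diacritics differ."""
    ref = split_marks(reference)
    hyp = split_marks(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base characters differ")
    confusions = Counter()
    for (base, r), (_, h) in zip(ref, hyp):
        if r != h:
            gold = unicodedata.normalize("NFC", base + r)
            pred = unicodedata.normalize("NFC", base + h)
            confusions[(gold, pred)] += 1
    return confusions


if __name__ == "__main__":
    counts = diacritic_confusions("ẹsẹ", "ese")
    for (gold, pred), n in counts.most_common():
        print(f"{gold} -> {pred}: {n}")
```

Sorting the counter with `most_common()` surfaces the dominant confusion pairs first, which is usually where targeted data cleaning or rule-based post-processing pays off.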

5. Advantages, Limitations, and Complementary Metrics

Advantages:

  • DER directly measures the fidelity of diacritic restoration, providing sub-word orthographic granularity overlooked by token-based metrics.
  • It enables targeted model improvement—diagnosing specific diacritic or tonal weaknesses.
  • DER robustly captures per-character phenomena crucial to applications in TTS, ASR, and morphosyntactic disambiguation.

Limitations:

  • DER assigns equal weight to all diacritics, disregarding their frequency or linguistic importance.
  • It does not reflect word-level correctness: a single wrong diacritic makes the whole word incorrect, yet it counts as only one error among all diacritic positions.
  • Annotation conventions and dataset inconsistencies can lead to unfair penalization.
  • DER may not reflect downstream impacts such as semantic or syntactic error propagation, and ignores higher-order phonological or metrical correctness (Sanskrit).

As a result, DER is best reported alongside word error rate (WER), character error rate (CER), or language-specific metrics (e.g., CWER, CEER in Arabic), and comprehensive manual or qualitative analyses.
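The gap between character-level and word-level views can be shown with a small Python sketch that reports both numbers for the same output. It makes two simplifying assumptions: every diacritic-bearing letter is a single precomposed code point after NFC (which holds for Romanian), and prediction and reference have the same number of whitespace-separated words. The word-level figure here is a plain mismatch rate, a simplified stand-in for WER; both function names are hypothetical.

```python
import unicodedata


def _nfc(text):
    return unicodedata.normalize("NFC", text)


def _is_marked(ch):
    """True if the character decomposes into a base plus combining marks."""
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", ch))


def der_precomposed(reference, prediction):
    """DER over reference positions that carry a diacritic."""
    ref, hyp = _nfc(reference), _nfc(prediction)
    if len(ref) != len(hyp):
        raise ValueError("texts must have equal length after NFC")
    slots = [(r, h) for r, h in zip(ref, hyp) if _is_marked(r)]
    if not slots:
        raise ValueError("reference has no diacritics")
    return sum(r != h for r, h in slots) / len(slots)


def word_mismatch_rate(reference, prediction):
    """Fraction of words that are not identical to the reference word."""
    ref_words, hyp_words = _nfc(reference).split(), _nfc(prediction).split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("texts must have the same number of words")
    return sum(r != h for r, h in zip(ref_words, hyp_words)) / len(ref_words)


if __name__ == "__main__":
    ref, hyp = "în țară", "în tară"
    print(der_precomposed(ref, hyp))     # 1 of 3 diacritics wrong
    print(word_mismatch_rate(ref, hyp))  # 1 of 2 words wrong: 0.5
```

One missed diacritic costs a third of the DER but half of the words in this example, which is the kind of divergence that motivates reporting both.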

6. Impact on Downstream Tasks and Recommendations

Sub-5% DER is considered sufficient for high-quality speech and NLP applications in languages with systematic diacritization (Yorùbá, Romanian, Arabic). Achieving low DER enables:

  • Accurate TTS synthesis and ASR transcripts where diacritics or tonal marks are lexically and phonetically meaningful.
  • Precise POS-tagging, parsing, and information retrieval in morphologically rich languages.
  • Adoption in language tools (e.g., keyboard extensions, browser plugins) that can automate over 90% of manual diacritic restoration tasks.
  • Use in digital philology and OCR for precise transcription of heritage texts (Sanskrit, Rigveda).

Recommendations for lowering DER include expanding corpus diversity, cleaning inconsistent annotations, adopting sub-word or byte-level encoding, fine-tuning or prompting with representative orthographic examples for LLMs, and hybridizing with rule-based validators.

7. Perspectives and Future Directions

DER remains the core evaluation metric driving diacritic restoration research. High-performing neural architectures (Transformer, BiLSTM, LLMs) and open-resource initiatives have gradually advanced DER benchmarks. For future progress, focus areas include:

  • Corpus expansion into diverse and colloquial domains to address generalization deficits.
  • Error-log-driven refinement and minimal rule-based post-processing to tackle orthographic idiosyncrasies overlooked even by top models.
  • Further integration of multi-modal or cross-attentional inputs (ASR hypotheses, acoustics) in speech-linked diacritization.
  • Community standards for annotation conventions and public error corpora to promote reproducibility and fair benchmarking.

Continued reductions in DER, and a nuanced understanding of its relation to other measures and downstream task utility, will be pivotal for robust diacritic-aware language technology.
