Ge'ez Morphological Synthesizer

Updated 1 October 2025

A Ge'ez Morphological Synthesizer is a computational system that generates complex word forms from root inputs by modeling the language's rich inflectional and derivational morphology.
It employs rule-based techniques and a hybrid statistical tokenization approach, effectively addressing data scarcity and preserving linguistic integrity.
Evaluation on 1,102 verbs shows 97.4% accuracy, highlighting its practical utility in NLP tasks, cultural preservation, and digital manuscript processing.

A Ge'ez Morphological Synthesizer is a computational system for generating surface word forms from root or lemma inputs, respecting the inflectional and derivational morphology characteristic of the Ge'ez language. Ge'ez, an ancient Semitic language and script foundational to the cultures of Ethiopia and Eritrea, exemplifies rich morphological structure with complex verb paradigms and extensive affixation. Development of a synthesizer is complicated by severe data scarcity and resource limitations unique to low-resource, morphologically rich languages. Contemporary research addresses these challenges through precisely engineered rule-based approaches, hybrid statistical-linguistic tokenization, and unsupervised learning for paradigm completion.

1. Morphological Structure and Complexity

Ge'ez demonstrates intricate morphological patterns, particularly in verbal inflection. Each root may produce numerous surface forms through derivational and inflectional processes, including six principal verb forms: perfective, indicative, infinitive, subjunctive, jussive, and gerundive. Each of these forms is further divided into five stem classes; every stem admits additional inflection by the application of subject marker suffixes (SMS) and object marker suffixes (OMS) (Gebremariam et al., 24 Sep 2025).

Surface word synthesis proceeds in three phases:

Stem Formation: Chooses among stem classes arranged as a matrix with rows for TAM (Tense-Aspect-Mood) and columns for stem type.
TAM Formation: Appends SMS encoding person/number features to the stem.
PNG Formation: Optionally adds OMS to encode object agreement features.

Word formation applies ordered rule sets, e.g.,

Stem + SMS = surface word
Stem + SMS + OMS = surface word

Affix concatenation is subject to boundary conditions, especially for stems ending in guttural or semivowel characters, which require orthography-sensitive assimilation and modification.
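The three phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published system: the stem matrix, the marker tables, and the function name `synthesize` are hypothetical, the entries are toy Latin transliterations rather than items from the cited dataset, and the boundary adjustments for guttural or semivowel stems are deliberately left out.

```python
from typing import Optional

# Hypothetical stem matrix: keys pair a TAM row with a stem-type column.
# The entries are toy transliterations used only for illustration.
STEM_MATRIX = {
    ("perfective", "basic"): "qatal",
    ("perfective", "causative"): "aqtal",
}

# Hypothetical subject marker suffixes (SMS) and object marker suffixes (OMS).
SMS = {"1sg": "ku", "2sg.m": "ka", "1pl": "na"}
OMS = {"1sg": "ni", "2sg.m": "ka"}


def synthesize(tam: str, stem_type: str, subject: str,
               obj: Optional[str] = None) -> str:
    """Apply the ordered rules Stem + SMS (+ OMS) by plain concatenation."""
    stem = STEM_MATRIX[(tam, stem_type)]   # phase 1: pick the stem cell
    word = stem + SMS[subject]             # phase 2: append the subject marker
    if obj is not None:
        word += OMS[obj]                   # phase 3: optional object marker
    # Boundary-sensitive spelling changes are intentionally omitted here.
    return word


if __name__ == "__main__":
    print(synthesize("perfective", "basic", "1sg"))
    print(synthesize("perfective", "basic", "1sg", obj="2sg.m"))
```

Keeping the stem choice in a lookup table separates lexical knowledge from the affixation order, so new stems extend the table without touching the rule sequence.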

2. Rule-Based Morphological Synthesis

In the absence of extensive annotated resources, the two-level morphology (TLM) paradigm is applied to encode both morphotactics and orthographic alternations. Synthesis rules model the permissible sequence:

[Prefix] + [Prefix Circumfixes] + [Stem] + [Suffix Circumfixes] + [SMS] + [OMS]

Expert knowledge informs manually constructed rules governing morpheme order and context-sensitive spelling changes at affix-stem boundaries. For example, rules are crafted to address glottal assimilation during stem/SMS concatenation.
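A rough sketch of how a fixed slot order and boundary-sensitive spelling rules might interact is given below. It is an approximation under assumptions: the rules are applied as sequential regular-expression rewrites over a string in which "+" marks morpheme boundaries, which is simpler than a true two-level system where constraints hold in parallel between lexical and surface forms. The slot names, the single placeholder rule, and the example strings are hypothetical and do not reproduce any rule from the cited work.

```python
import re
from typing import Dict, List, Tuple

# Slot order taken from the template in the text.
SLOT_ORDER = ["prefix", "prefix_circumfix", "stem",
              "suffix_circumfix", "sms", "oms"]

# Hypothetical boundary rules as (pattern, replacement) pairs.
# Placeholder: collapse an identical guttural-like symbol repeated across "+".
BOUNDARY_RULES: List[Tuple[str, str]] = [
    (r"([h'])\+\1", r"\1"),
]


def realize(slots: Dict[str, str]) -> str:
    """Order the filled slots, apply boundary rewrites, drop boundary marks."""
    unknown = set(slots) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slots: {sorted(unknown)}")
    # Morphotactics: empty or missing slots are skipped, order is fixed.
    lexical = "+".join(slots[name] for name in SLOT_ORDER if slots.get(name))
    # Orthographic alternations: rewrite at morpheme boundaries, in rule order.
    for pattern, replacement in BOUNDARY_RULES:
        lexical = re.sub(pattern, replacement, lexical)
    return lexical.replace("+", "")


if __name__ == "__main__":
    print(realize({"stem": "qatal", "sms": "ku", "oms": "ka"}))
    print(realize({"stem": "nasseh", "sms": "hu"}))
```

Because the input is a dictionary keyed by slot name, callers cannot produce an illegal morpheme order; only the rule list needs expert maintenance.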

A dataset of 1,102 verbs, sourced from canonical Ge'ez texts and grammar authorities, underpins the synthesizer's coverage (Gebremariam et al., 24 Sep 2025). Evaluation by both manual error counting and automatic calculation yielded 97.4% accuracy overall (99.6% for regular verbs), demonstrating robust handling of both regular and irregular morphophonemic patterns. The rule-based system outperformed baseline models (cf. Abeshu, 2013) by explicitly modeling the interplay of stem, subject marker suffix, and object marker suffix.

3. Morphology-Aware Tokenization and Hybrid Vocabularies

Statistical subword segmentation methods, such as Byte Pair Encoding (BPE), often fragment morphemes in Ge'ez script languages, undermining morphological integrity. The MoVoC (“Morpheme-aware Subword Vocabulary Construction”) methodology couples supervised morphological analysis (via tools like HornMorpho and expert annotation) with statistical tokenization to produce hybrid vocabularies (Teklehaymanot et al., 10 Sep 2025).

Algorithmic steps are as follows:

Pre-tokenize corpus by morpheme segmentation.
Extract morphemes or generate via rule-based analyzer/manual annotation.
Train BPE subword vocabularies on segmented corpus, parameterized by vocabulary size and a hyperparameter controlling morpheme-to-BPE proportion.
Merge top-ranked frequent morphemes with BPE tokens into a combined set.

Formally, the final vocabulary is $V_{\mathrm{MoVoC}} = V_{\mathrm{BPE},am} \cup V_{\mathrm{BPE},ti} \cup V_{\mathrm{morpheme},am} \cup V_{\mathrm{morpheme},ti}$
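The union above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the per-language size budget, the reading of the proportion hyperparameter as the fraction of each language's budget given to morphemes, and the toy token strings. The language keys `am` and `ti` simply mirror the subscripts in the formula, and the BPE tokens are assumed to arrive already ranked from a separately trained tokenizer.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def top_morphemes(segmented_corpus: Iterable[List[str]], k: int) -> List[str]:
    """Return the k most frequent morphemes (ties broken alphabetically)."""
    counts = Counter(m for word in segmented_corpus for m in word)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [morpheme for morpheme, _ in ranked[:k]]


def build_hybrid_vocab(bpe_ranked: Dict[str, List[str]],
                       segmented: Dict[str, List[List[str]]],
                       per_lang_size: int,
                       morpheme_ratio: float) -> Set[str]:
    """Union of per-language BPE tokens and top-ranked morphemes."""
    if not 0.0 <= morpheme_ratio <= 1.0:
        raise ValueError("morpheme_ratio must lie in [0, 1]")
    n_morph = int(per_lang_size * morpheme_ratio)  # morpheme share of budget
    n_bpe = per_lang_size - n_morph                # remainder goes to BPE
    vocab: Set[str] = set()
    for lang, ranked_tokens in bpe_ranked.items():
        vocab.update(ranked_tokens[:n_bpe])
        vocab.update(top_morphemes(segmented[lang], n_morph))
    return vocab


if __name__ == "__main__":
    # Toy placeholder data, not real corpus statistics.
    bpe = {"am": ["be", "t", "ye"], "ti": ["ge", "za", "na"]}
    seg = {"am": [["ye", "bet"], ["bet", "och"]],
           "ti": [["nay", "geza"], ["geza", "tat"]]}
    print(sorted(build_hybrid_vocab(bpe, seg, per_lang_size=4,
                                    morpheme_ratio=0.5)))
```

Because the result is a set union, a morpheme that BPE also discovered counts once, so the final vocabulary can be smaller than the summed budgets.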

MoVoC-Tok, the associated tokenizer, respects morpheme boundaries during BPE merges via constraints. For a word $w = (c_1, \ldots, c_m)$ with gold-standard morpheme boundaries $M_i$:

$\max_{V} \sum \log P(\mathrm{BPE}(w; V, M_i))$ subject to: $(a \cup b)$ does not cross any $M_i$.
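One simple way to honor the constraint is to count and merge symbol pairs only inside a morpheme, so that no merged token can span a boundary. The sketch below does this with a greedy, frequency-based BPE trainer. It is an illustrative stand-in, not MoVoC-Tok itself: it does not optimize the likelihood objective, and the function names and toy input are assumptions.

```python
from collections import Counter
from typing import List, Tuple

Morpheme = List[str]      # a morpheme as a list of current symbols
Word = List[Morpheme]     # a word as a list of morphemes


def count_pairs(corpus: List[Word]) -> Counter:
    """Count adjacent symbol pairs, never pairing across morphemes."""
    pairs: Counter = Counter()
    for word in corpus:
        for morpheme in word:
            for a, b in zip(morpheme, morpheme[1:]):
                pairs[(a, b)] += 1
    return pairs


def apply_merge(morpheme: Morpheme, pair: Tuple[str, str]) -> Morpheme:
    """Replace each left-to-right occurrence of the pair with one symbol."""
    a, b = pair
    out, i = [], 0
    while i < len(morpheme):
        if i + 1 < len(morpheme) and morpheme[i] == a and morpheme[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(morpheme[i])
            i += 1
    return out


def train_constrained_bpe(segmented_words: List[List[str]], num_merges: int):
    """Greedy BPE over words that are pre-split into morpheme strings."""
    corpus: List[Word] = [[list(m) for m in word] for word in segmented_words]
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break  # every morpheme is already a single token
        # Most frequent pair; ties are broken lexicographically.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [[apply_merge(m, best) for m in word] for word in corpus]
    return merges, corpus


if __name__ == "__main__":
    merges, corpus = train_constrained_bpe([["ab", "ab"], ["ab", "c"]], 5)
    print(merges, corpus)
```

In the toy input, an unconstrained trainer would also see the pair (b, a) straddling the two morphemes of the first word; here that pair is never counted, so it can never be merged.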

Intrinsic evaluation via Morpheme Boundary Precision, MorphoScore, and Rényi entropy confirms improved segmentation consistency and linguistic faithfulness over standard approaches.
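Two of these metrics are easy to sketch. The definitions below are common formulations assumed for illustration and may differ in detail from those used in the cited evaluation: boundary precision is taken as the share of predicted token boundaries that coincide with gold morpheme boundaries, and Rényi entropy of order $\alpha$ is computed over the token frequency distribution as $H_\alpha = \frac{1}{1-\alpha} \log \sum_i p_i^\alpha$. The default `alpha=2.0`, the handling of empty predictions, and the toy segmentations are arbitrary choices.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def boundaries(segments: Sequence[str]) -> Set[int]:
    """Character offsets of internal boundaries (word end excluded)."""
    offsets, position = set(), 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_precision(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted boundaries that are also gold boundaries."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    if not pred_b:
        return 1.0  # assumption: no predicted boundary means none is wrong
    return len(pred_b & gold_b) / len(pred_b)


def renyi_entropy(tokens: List[str], alpha: float = 2.0) -> float:
    """Rényi entropy (natural log) of the empirical token distribution."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if math.isclose(alpha, 1.0):
        # The limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


if __name__ == "__main__":
    print(boundary_precision(["qat", "al", "ku"], ["qatal", "ku"]))
    print(renyi_entropy(["a", "b", "c", "d"]))
```

In the first call, one of the two predicted boundaries falls inside the gold stem, which is exactly the kind of morpheme fragmentation the precision score penalizes.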

4. Unsupervised Morphological Paradigm Completion

Contemporary advances enable the induction of morphological structure from raw text corpora, dispensing with annotated paradigms (Wiemerslage et al., 2022). Unsupervised paradigm completion in Ge'ez can proceed as follows:

Clustering: Raw word forms are grouped into paradigm clusters via statistical word form similarity and longest common substring (LCS) computation per cluster, yielding abstract paradigms (e.g., $X_0$, $X_0$+ed, $X_0$+ing, $X_0$+s).
Slot Alignment & Latent POS Induction: Clusters are assigned latent POS tags by expectation maximization on the Bayesian probability $P(k, c_i) = P(k) \prod_j P(f_j \mid k)$, where $k$ is a latent tag and $f_j$ are inflected forms. A similarity metric combining cosine similarity of fastText embeddings and Jaccard similarity, $\mathrm{sim}(a, a') = \cos(a, a') \cdot (1 - J(a, a'))$, refines slot grouping.
Slot Prediction: Contextual character-level Transformer models are trained to predict morphological slot and tag for each token, further referencing slot alignment outputs for paradigm completion.
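The clustering and similarity steps can be illustrated with a small sketch that reuses the English example from the text. The function names are hypothetical. For the similarity score, this text does not specify which sets the Jaccard term compares, so the function takes the vectors and sets as caller-supplied arguments rather than guessing. A contiguous common substring is also a simplification that may fit concatenative affixation better than root-and-pattern morphology.

```python
import math
from typing import List, Sequence, Set


def longest_common_substring(words: Sequence[str]) -> str:
    """Longest contiguous substring shared by every word (brute force)."""
    if not words:
        return ""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def abstract_paradigm(words: Sequence[str]) -> List[str]:
    """Rewrite each form as affixes around the shared base X_0."""
    base = longest_common_substring(words)
    if not base:
        return list(words)
    forms = []
    for word in words:
        index = word.find(base)
        prefix, suffix = word[:index], word[index + len(base):]
        form = "X_0"
        if prefix:
            form = prefix + "+" + form
        if suffix:
            form = form + "+" + suffix
        forms.append(form)
    return forms


def slot_similarity(vec_a: Sequence[float], vec_b: Sequence[float],
                    set_a: Set[str], set_b: Set[str]) -> float:
    """sim = cos(a, a') * (1 - J(a, a')) on caller-supplied inputs."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = (math.sqrt(sum(x * x for x in vec_a))
            * math.sqrt(sum(y * y for y in vec_b)))
    cosine = dot / norm if norm else 0.0
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 0.0
    return cosine * (1.0 - jaccard)


if __name__ == "__main__":
    print(abstract_paradigm(["walk", "walked", "walking", "walks"]))
    # → ['X_0', 'X_0+ed', 'X_0+ing', 'X_0+s']
```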

This method scales to Ge'ez by assembling raw text corpora and applying unsupervised learning, mitigating annotated data scarcity. A plausible implication is that cross-lingual transfer and domain-mixed data augmentation may further enhance synthesis robustness.

5. Evaluation, Datasets, and NLP Integration

Ge'ez morphological synthesizer evaluation combines manual expert validation and automatic statistical analysis. In the rule-based system, 26,867 surface forms (from 1,102 verbs) yielded 668 errors, primarily among irregular verbs, for 97.4% overall accuracy (Gebremariam et al., 24 Sep 2025). Boundary-aware tokenization achieves higher MorphoScore and boundary precision, with reduced Rényi entropy (Teklehaymanot et al., 10 Sep 2025).

Resources supporting synthesis comprise curated datasets: morpheme-annotated corpora for Ge'ez, Tigrinya, Amharic, and Tigre, generated via expert annotation and HornMorpho. Open access to these datasets and code on GitHub enables reproducibility and subsequent research.

Morphological synthesizers are further deployable in:

Information Retrieval (surface form recognition)
Spell/grammar checking
Machine Translation
Lexicography (cross-lingual dictionaries)

6. Limitations and Future Research Directions

Current synthesizer architectures are constrained by:

Limited annotated resources in Ge'ez and related languages.
Algorithmic complexity owing to hybrid segmentation methods.
Incomplete morphological coverage, e.g., imperfect modeling of irregular verbs and exceptional morphophonemic phenomena.

These constraints point toward a comprehensive Ge'ez synthesizer that covers all morphological categories, supported by further rule refinement and corpus enhancement. Suggested future work includes:

Extending synthesizer coverage beyond verbs to full morphology.
Integrating context-enriched rules and additional linguistic feature modeling.
Fusing synthesizers with broader NLP systems, e.g., IR or MT engines.
Improving cross-lingual interoperability and leveraging domain-mixed data.

Research continues into unsupervised paradigm discovery, multimodal synthesis from digitized manuscripts or speech, and evaluation standards balancing linguistic fidelity with downstream performance.

7. Significance for Linguistics and Cultural Preservation

The Ge'ez morphological synthesizer plays a pivotal role in cultural and linguistic heritage documentation. Accurate synthesis supports academic inquiry, digital manuscript preservation, and revitalization of liturgical and historical corpora central to Ethiopian and Eritrean identity. By establishing foundational computational methods and resources for Ge'ez, these works contribute directly to the advancement of NLP for low-resource, morphologically complex languages and offer methodological templates applicable to similar linguistic contexts.