- The paper presents a novel lemmatization approach using community-curated lexical resources to address the linguistic diversity of Romansh.
- The paper employs rule-based heuristics and a specialized tokenizer, attaining around 80% coverage across a varied corpus of Romansh texts.
- The paper demonstrates a dictionary-based variety identification method with up to 100% accuracy on longer texts, enabling advanced Romansh NLP applications.
rumlem: A Dictionary-Based Lemmatizer and Variety Identifier for Romansh
Introduction
The paper "rumlem: A Dictionary-Based Lemmatizer for Romansh" (2604.11233) introduces a lemmatization and variety identification tool tailored to Romansh, a minority Romance language with highly divergent regional varieties. The work is motivated by the lack of computational linguistic resources for Romansh and by the substantial divergence among its varieties, which calls for variety-aware NLP tools. The system provides lemmatization and morphological annotation for all five idioms of Romansh plus the standard form, Rumantsch Grischun (RG), drawing on comprehensive, community-curated lexical databases.
System Architecture and Linguistic Coverage
rumlem utilizes the Pledari Grond dictionary resource, with individual lexical datasets for Sursilvan, Sutsilvan, Surmiran, Puter, Vallader, and RG. The system preprocesses these resources to extract lemma mappings, POS tags, inflectional features, and German translations. Dedicated rule-based heuristics supplement incomplete annotations, producing a resource of over 725,000 unique word forms mapped to approximately 178,000 lemmas.
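The form-to-lemma resource described above can be pictured as an index from surface forms to sets of analyses. The following is a minimal sketch, not the paper's implementation: the row layout `(form, lemma, POS, features, German)`, the names `Analysis` and `build_index`, and the toy rows are all assumptions made for illustration and are not taken from Pledari Grond.

```python
from collections import defaultdict
from typing import NamedTuple


class Analysis(NamedTuple):
    """One candidate reading of a surface form (hypothetical schema)."""
    lemma: str
    pos: str
    features: str
    german: str


def build_index(entries):
    """Map each lower-cased surface form to the set of analyses it can have."""
    index = defaultdict(set)
    for form, lemma, pos, features, german in entries:
        # Using a set means rows that collapse to the same form and analysis
        # after lower-casing are stored only once.
        index[form.lower()].add(Analysis(lemma, pos, features, german))
    return index


# Toy rows for illustration only; not verified against any dictionary.
TOY_ENTRIES = [
    ("chasa", "chasa", "NOUN", "Number=Sing", "Haus"),
    ("chasas", "chasa", "NOUN", "Number=Plur", "Haus"),
    ("Chasas", "chasa", "NOUN", "Number=Plur", "Haus"),
]

index = build_index(TOY_ENTRIES)
n_forms = len(index)
n_lemmas = len({a.lemma for analyses in index.values() for a in analyses})
print(n_forms, "forms,", n_lemmas, "lemmas")
```

Counting unique keys and unique lemmas in such an index is one way the "word forms" and "lemmas" totals of a resource like this could be derived.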
Preprocessing recognizes recurring patterns in the dictionary entries and normalizes them, which is needed because the entries encode morphology in highly heterogeneous ways, as is common for minority and low-resource languages. Test cases and manual gold-standard annotations are used to check this normalization and keep the output consistent.
rumlem processes each Romansh text via these steps:
- Tokenization: a modified Italian Moses tokenizer, adapted to Romansh-specific orthography and contractions.
- Lemmatization: a dictionary lookup that returns all candidate lemmas and morphosyntactic features for each form.
- Variety assignment: the proportion of recognized forms is computed against each idiom's lexicon to determine the most likely variety; alternatively, the user can specify the variety explicitly.
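The first two steps above can be sketched as follows. This is an illustration under stated assumptions: the regular-expression tokenizer is a simple stand-in and not the modified Moses tokenizer the paper uses, and `tokenize`, `lemmatize`, and `TOY_LEXICON` are hypothetical names with toy content.

```python
import re

# Stand-in tokenizer (NOT the Moses tokenizer): an elided prefix such as "l'"
# becomes its own token, then plain words, then single punctuation marks.
TOKEN_RE = re.compile(r"\w+'|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, elision, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def lemmatize(tokens, lexicon):
    """Pair every token with all candidate (lemma, POS) analyses.

    Unknown tokens get an empty candidate list, and no attempt is made to
    choose between candidates, mirroring a context-free dictionary lookup.
    """
    return [(tok, lexicon.get(tok.lower(), [])) for tok in tokens]


# Toy lexicon for illustration only; not taken from the real resource.
TOY_LEXICON = {
    "la": [("la", "DET"), ("la", "PRON")],
    "chasa": [("chasa", "NOUN")],
    "chasas": [("chasa", "NOUN")],
}

result = lemmatize(tokenize("La chasa, las chasas."), TOY_LEXICON)
for token, candidates in result:
    print(token, candidates)
```

In this toy run, "La" receives two candidate analyses and "las" receives none, which shows both the ambiguity and the coverage gaps a lookup-only approach has to live with.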
The evaluation is based on 30,000 Romansh texts covering a wide range of genres and lengths, from short speech transcripts to long-form narrative texts. rumlem attains an average coverage of about 80%: it assigns lemmas and morphosyntactic analyses to 77–84% of tokens in a typical Romansh text (punctuation excluded). Coverage plateaus for longer texts and for more standardized varieties, but it is reduced by:
- Out-of-vocabulary items (e.g., proper nouns, neologisms).
- Tokens involving contractions or orthographic variation not encapsulated in lexicons.
- Unannotated forms due to dictionary gaps.
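Coverage as defined above (the share of non-punctuation tokens that receive at least one analysis) can be computed as in the sketch below. The helper names `is_punctuation` and `coverage`, the Unicode-category test for punctuation, and the example tokens are assumptions for illustration; the paper's exact punctuation filter is not specified here.

```python
import unicodedata


def is_punctuation(token):
    """True if every character belongs to a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def coverage(tokens, known_forms):
    """Fraction of non-punctuation tokens found in the set of known forms."""
    words = [t for t in tokens if not is_punctuation(t)]
    if not words:
        return 0.0  # avoid division by zero on empty or punctuation-only input
    hits = sum(1 for t in words if t.lower() in known_forms)
    return hits / len(words)


# Toy example: the proper noun "Gian" is out of vocabulary.
tokens = ["La", "chasa", "da", "Gian", "."]
known = {"la", "chasa", "da"}
score = coverage(tokens, known)
print(score)  # 3 of 4 word tokens are known -> 0.75
```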
The transparent, dictionary-based design yields high precision, but it inherits annotation errors from the underlying dictionaries and cannot use context for disambiguation. Ambiguous inflected forms therefore receive multiple analyses per token.
Variety and Language Identification
A significant contribution of rumlem is dictionary-based variety identification. For each input text, rumlem computes the proportion of tokens matched by each variety's lexicon and predicts the variety with 95% accuracy overall. On texts over 300 tokens, variety identification accuracy reaches 100%. These results are robust across genres and text lengths, and rumlem outperforms prior SVM-based classifiers on shorter or out-of-domain data, where those classifiers show F1 scores as low as 0.7.
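The match-proportion idea can be sketched in a few lines. This is an illustrative reading of the description, not the released code: `match_proportions`, `identify_variety`, and the two tiny lexicons are hypothetical, and the word lists are toy data that has not been checked against the real idiom dictionaries.

```python
def match_proportions(tokens, lexicons):
    """For each variety, return the share of tokens found in its lexicon."""
    words = [t.lower() for t in tokens]
    if not words:
        return {variety: 0.0 for variety in lexicons}
    return {
        variety: sum(w in forms for w in words) / len(words)
        for variety, forms in lexicons.items()
    }


def identify_variety(tokens, lexicons):
    """Pick the variety with the highest match proportion.

    Ties are broken by dictionary order, which is an arbitrary choice made
    for this sketch.
    """
    scores = match_proportions(tokens, lexicons)
    best = max(scores, key=scores.get)
    return best, scores


# Toy lexicons for illustration only.
TOY_LEXICONS = {
    "sursilvan": {"jeu", "casa", "la"},
    "vallader": {"eu", "chasa", "la"},
}

best, scores = identify_variety(["eu", "vez", "la", "chasa"], TOY_LEXICONS)
print(best, scores)
```

With only four tokens, a single shared or unknown word shifts the proportions noticeably, which is consistent with accuracy improving on longer texts.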
rumlem also functions as a language identifier that separates Romansh from other Romance languages. In experiments with 5,000 texts across French, Italian, Catalan, Romanian, and Romansh, the distributions of match ratios for Romansh and non-Romansh texts are well separated. An empirically determined threshold on the variety-specific match ratios yields near-perfect language identification, with misclassifications arising only from highly noisy or code-switched input.
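A threshold of this kind could be chosen and applied as in the sketch below. The paper's actual threshold value and selection procedure are not given here, so the midpoint rule, the function names `pick_threshold` and `is_romansh`, and all numbers are assumptions for illustration only.

```python
def pick_threshold(romansh_scores, other_scores):
    """Return the midpoint between the two score distributions.

    Each score is the best match ratio a text obtains over all variety
    lexicons. Returns None when the distributions overlap, in which case a
    single cut-off cannot separate them perfectly.
    """
    lowest_romansh = min(romansh_scores)
    highest_other = max(other_scores)
    if lowest_romansh <= highest_other:
        return None
    return (lowest_romansh + highest_other) / 2


def is_romansh(variety_scores, threshold):
    """Classify a text as Romansh if its best variety score reaches the threshold."""
    return max(variety_scores.values(), default=0.0) >= threshold


# Made-up development scores, not results from the paper.
threshold = pick_threshold([0.75, 0.9], [0.25, 0.1])
print(threshold)
print(is_romansh({"sursilvan": 0.8, "vallader": 0.3}, threshold))
print(is_romansh({"sursilvan": 0.2, "vallader": 0.1}, threshold))
```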
Limitations
Several inherent limitations are identified:
- Incompleteness of lexicons restricts coverage, particularly for dialectal, rare, or newly coined forms.
- Ambiguity is unresolved in the absence of contextual disambiguation, as is typical for context-agnostic systems.
- The reliance on dictionary quality and update frequency underscores the need for continued resource development.
- Licensing restrictions for Vallader and Puter lexicons limit downstream system release and open-source integration.
Implications and Future Directions
This work provides an essential building block for Romansh NLP, enabling downstream tasks such as POS tagging, named entity recognition, and the development of variety-annotated corpora. The methodology of leveraging high-coverage, community-maintained lexical resources is generalizable to other minority languages: it demonstrates that dictionary-based systems, when supported by high-quality resources, can yield competitive performance, especially when annotated corpora and treebanks are unavailable.
Future advances are likely to emerge by hybridizing dictionary-based approaches with statistical or neural sequence models capable of contextual disambiguation and OOV generalization. The open-source release (except for certain idioms) encourages further contribution and adaptation.
Additionally, variety identification derived from lemmatizer statistics may be extended to more fine-grained or hierarchical classification that captures sub-variety or register distinctions. Integration with digital spell-checking and machine translation systems for Romansh is a direct application, especially because rumlem's output is transparent and auditable, a critical property for minority languages with sensitive sociolinguistic status.
Conclusion
rumlem constitutes a rigorously engineered, high-precision lemmatizer and variety identifier specifically optimized for Romansh and its major varieties. Its reliance on robust morphological lexicons yields high coverage and accuracy for both lemmatization and text variety attribution. While constrained by dictionary completeness and context-agnostic analysis, the system addresses critical resource and infrastructure gaps in Romansh computational linguistics and supplies a foundation for future language technology development in low-resource, high-variation environments.