
RUMLEM: A Dictionary-Based Lemmatizer for Romansh

Published 13 Apr 2026 in cs.CL | (2604.11233v1)

Abstract: Lemmatization -- the task of mapping an inflected word form to its dictionary form -- is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77-84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30'000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.

Summary

  • The paper presents a novel lemmatization approach using community-curated lexical resources to address the linguistic diversity of Romansh.
  • The paper employs rule-based heuristics and a specialized tokenizer, attaining around 80% coverage across a varied corpus of Romansh texts.
  • The paper demonstrates a dictionary-based variety identification method with up to 100% accuracy on longer texts, enabling advanced Romansh NLP applications.

rumlem: A Dictionary-Based Lemmatizer and Variety Identifier for Romansh

Introduction

The paper "rumlem: A Dictionary-Based Lemmatizer for Romansh" (2604.11233) introduces a lemmatization and variety identification tool tailored to Romansh, a minority Romance language with highly divergent regional varieties. The work is motivated by the scarcity of computational linguistic tools for Romansh and by the diversity among its varieties, which calls for variety-aware NLP applications. The system provides lemmatization and morphological annotation for all five idioms of Romansh plus the standard form, Rumantsch Grischun (RG), drawing on comprehensive, community-curated lexical databases.

System Architecture and Linguistic Coverage

rumlem utilizes the Pledari Grond dictionary resource, with individual lexical datasets for Sursilvan, Sutsilvan, Surmiran, Puter, Vallader, and RG. The system preprocesses these resources to extract lemma mappings, POS tags, inflectional features, and German translations. Dedicated rule-based heuristics supplement incomplete annotations, producing a resource of over 725,000 unique word forms mapped to approximately 178,000 lemmas.

Preprocessing involves pattern recognition and normalization of dictionary entries, handling highly heterogeneous morphological structures common in minority and low-resource languages. Test cases and manual gold-standard annotations ensure rigorous corpus normalization, contributing to high output consistency.
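The form-to-lemma resource described above can be pictured as an inverted index from inflected forms to their analyses. The following is a minimal sketch under stated assumptions: the entry layout (`lemma`, `pos`, `forms`, `de`) and the function name `build_form_index` are hypothetical and are not taken from the Pledari Grond data format or from rumlem itself, and the toy entry is illustrative only.

```python
from collections import defaultdict

def build_form_index(entries):
    """Invert lemma-centred entries into a form -> list-of-analyses index.

    Each entry is assumed (hypothetically) to look like:
    {"lemma": str, "pos": str, "forms": {form: features}, "de": str}
    """
    index = defaultdict(list)
    for entry in entries:
        for form, feats in entry["forms"].items():
            index[form.lower()].append({
                "lemma": entry["lemma"],
                "pos": entry["pos"],
                "feats": feats,
                "de": entry.get("de"),  # German translation, if present
            })
    return dict(index)

# Toy entry for illustration only.
toy_entries = [
    {"lemma": "chasa", "pos": "NOUN", "de": "Haus",
     "forms": {"chasa": "Number=Sing", "chasas": "Number=Plur"}},
]
form_index = build_form_index(toy_entries)
```

Keeping a list of analyses per form, rather than a single lemma, preserves homographs that belong to different lemmas or carry different features.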

rumlem processes each Romansh text via these steps:

  1. Tokenization using a modified Italian Moses tokenizer, optimized for Romansh-specific orthography and contractions.
  2. Lemmatization employing a dictionary lookup that returns all candidate lemmas and morphosyntactic features for each form.
  3. Variety Assignment by matching the proportion of recognized forms to each idiom's lexicon, determining the most likely variety, or accepting an explicit user-specified variety.
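The three steps above can be sketched as follows. This is a simplified illustration, not rumlem's actual API: the regex tokenizer is a crude stand-in for the modified Moses tokenizer, and `TOY_LEXICONS`, `tokenize`, `match_ratio`, and `analyze` are hypothetical names with placeholder data.

```python
import re

# Hypothetical toy lexicons: variety -> {word form -> list of (lemma, POS)}.
# Entries are illustrative placeholders, not copied from the real databases.
TOY_LEXICONS = {
    "variety_a": {"chasa": [("chasa", "NOUN")], "chasas": [("chasa", "NOUN")]},
    "variety_b": {"casa": [("casa", "NOUN")], "casas": [("casa", "NOUN")]},
}

def tokenize(text):
    """Crude stand-in for the tokenizer: lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def match_ratio(tokens, lexicon):
    """Share of tokens found in one variety's lexicon."""
    if not tokens:
        return 0.0
    return sum(tok in lexicon for tok in tokens) / len(tokens)

def analyze(text, lexicons, variety=None):
    """Tokenize, pick a variety (unless one is given), and look up each token."""
    tokens = tokenize(text)
    if variety is None:
        # Step 3: choose the variety whose lexicon recognizes the most tokens.
        variety = max(lexicons, key=lambda v: match_ratio(tokens, lexicons[v]))
    lexicon = lexicons[variety]
    # Step 2: return all candidate analyses; unknown tokens get an empty list.
    return variety, [(tok, lexicon.get(tok, [])) for tok in tokens]
```

Calling `analyze("Las chasas", TOY_LEXICONS)` selects `variety_a` here because only that toy lexicon contains `chasas`; passing `variety="variety_b"` skips detection and leaves the token unanalyzed.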

Lemmatization Performance

The evaluation is based on 30,000 Romansh texts covering a wide spectrum of genres and lengths, from short speech transcripts to long-form narrative texts. rumlem attains a coverage of roughly 80%: it assigns lemmas and morphosyntactic analyses to 77–84% of the tokens in a typical Romansh text (punctuation excluded). Coverage plateaus for longer texts and more standardized varieties, and is reduced by:

  • Out-of-vocabulary items (e.g., proper nouns, neologisms).
  • Tokens involving contractions or orthographic variation not covered by the lexicons.
  • Unannotated forms due to dictionary gaps.
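Coverage as used above is the share of non-punctuation tokens that receive at least one analysis. A minimal sketch of that arithmetic, assuming a hypothetical set of known forms and a simple ASCII punctuation filter (the summary does not specify how rumlem excludes punctuation):

```python
import string

def coverage(tokens, known_forms):
    """Fraction of non-punctuation tokens found in the set of known forms."""
    words = [t for t in tokens
             if not all(ch in string.punctuation for ch in t)]
    if not words:
        return 0.0
    hits = sum(1 for t in words if t.lower() in known_forms)
    return hits / len(words)

# Toy example: the final token is punctuation and is ignored,
# and one of the four remaining tokens is out of vocabulary.
toy_tokens = ["La", "chasa", "es", "Xyzzy", "."]
toy_forms = {"la", "chasa", "es"}
toy_coverage = coverage(toy_tokens, toy_forms)
```

In this toy case three of four word tokens are recognized; the unrecognized one stands in for an out-of-vocabulary item such as a proper noun.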

The transparent, dictionary-based approach favors high precision, though it inherits any annotation errors in the lexicons and cannot use context to disambiguate. An ambiguous inflected form therefore receives multiple analyses, and the choice among them is left to downstream processing.

Variety and Language Identification

A significant contribution of rumlem is dictionary-based variety identification. For each input text, rumlem computes the proportion of tokens matched by each variety's lexicon and predicts the variety with 95% accuracy overall. On texts over 300 tokens, variety identification accuracy converges to 100%. These results are robust across genres and text lengths, and rumlem outperforms prior SVM-based classifiers on shorter or out-of-domain data, where those classifiers reach F1 scores as low as 0.7.
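The dependence of accuracy on text length can be checked by bucketing predictions by token count. The sketch below is an illustrative evaluation helper, not the paper's evaluation code: the function name, the record layout, and the 50-token edge are assumptions, and only the 300-token boundary comes from the text above.

```python
from collections import defaultdict

def accuracy_by_length(records, edges=(50, 300)):
    """Accuracy per length bucket.

    records: iterable of (n_tokens, gold_variety, predicted_variety).
    With the default edges, bucket 0 holds texts of up to 50 tokens,
    bucket 1 holds 51-300 tokens, and bucket 2 holds more than 300.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for n_tokens, gold, predicted in records:
        bucket = sum(n_tokens > edge for edge in edges)
        totals[bucket] += 1
        correct[bucket] += int(gold == predicted)
    return {b: correct[b] / totals[b] for b in sorted(totals)}
```

Reporting accuracy per bucket rather than as a single number makes it visible whether errors concentrate in short texts, where few tokens are available to separate the lexicons.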

rumlem further functions as a Romansh vs. non-Romansh language identifier. In experiments with 5,000 texts across French, Italian, Catalan, Romanian, and Romansh, the match-ratio distributions for Romansh and non-Romansh texts are well separated. An empirically determined threshold on variety-specific match ratios yields near-perfect language identification, with misclassifications arising solely from highly noisy or code-switched input.
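A threshold rule of this kind can be sketched as follows. The names `best_match_ratio` and `is_romansh` are hypothetical, and the default threshold of 0.5 is a placeholder: the empirically determined value is not given in this summary.

```python
def best_match_ratio(tokens, lexicons):
    """Highest share of tokens recognized by any single variety lexicon."""
    if not tokens:
        return 0.0
    return max(
        sum(tok in forms for tok in tokens) / len(tokens)
        for forms in lexicons.values()
    )

def is_romansh(tokens, lexicons, threshold=0.5):
    """Label a text as Romansh if its best match ratio reaches the threshold.

    The threshold value here is an assumed placeholder, not the paper's.
    """
    return best_match_ratio(tokens, lexicons) >= threshold
```

Because related Romance languages share some word forms with Romansh, a threshold well above zero is needed; in practice it would be tuned on held-out texts from both classes.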

Limitations

Several inherent limitations are identified:

  • Incompleteness of lexicons restricts coverage, particularly for dialectal, rare, or newly coined forms.
  • Ambiguity is unresolved in the absence of contextual disambiguation, as is typical for context-agnostic systems.
  • The reliance on dictionary quality and update frequency underscores the need for continued resource development.
  • Licensing restrictions for Vallader and Puter lexicons limit downstream system release and open-source integration.

Implications and Future Directions

This work provides an essential building block for Romansh NLP, enabling downstream tasks such as POS tagging, named entity recognition, and the development of variety-annotated corpora. The approach of building on high-coverage, community-maintained lexical resources generalizes to other minority languages: it shows that dictionary-based systems, when supported by high-quality resources, can yield competitive performance, especially when annotated corpora and treebanks are unavailable.

Future advances are likely to emerge by hybridizing dictionary-based approaches with statistical or neural sequence models capable of contextual disambiguation and OOV generalization. The open-source release (except for certain idioms) encourages further contribution and adaptation.

Additionally, variety identification derived from lemmatizer statistics may be extended to more fine-grained or hierarchical classification that draws on sub-variety or register distinctions. Integration with digital spell-checking and machine translation systems for Romansh is a direct application, especially since the system is transparent and auditable, which is a critical property for minority languages with a sensitive sociolinguistic status.

Conclusion

rumlem constitutes a rigorously engineered, high-precision lemmatizer and variety identifier specifically optimized for Romansh and its major varieties. Its reliance on robust morphological lexicons yields high coverage and accuracy for both lemmatization and text variety attribution. While constrained by dictionary completeness and context-agnostic analysis, the system addresses critical resource and infrastructure gaps in Romansh computational linguistics and supplies a foundation for future language technology development in low-resource, high-variation environments.
