Deep Learning Lemmatizer

Updated 28 November 2025
  • The paper demonstrates that deep learning lemmatizers convert inflected forms to canonical lemmas using neural architectures like edit-script classifiers, seq2seq, and joint models.
  • Core methods leverage contextual embeddings and character-level features to address morphological ambiguity and enhance performance in complex language settings.
  • Hybrid approaches combine neural and rule-based components to improve robustness, particularly in low-resource or morphologically rich languages.

A deep learning based lemmatizer is an NLP system that transforms inflected word forms into their canonical dictionary form (lemma) using neural architectures. This class of lemmatizers replaces traditional rule-based and lexicon-driven approaches with end-to-end differentiable models, typically leveraging subword or character-level representations, contextual encodings, and data-driven transformation mechanisms such as edit scripts or sequence generation. Deep learning lemmatizers are fundamental for text normalization in morphologically complex languages and underlie modern parsing, information retrieval, and cross-lingual NLP systems.

1. Neural Model Architectures for Lemmatization

Deep learning based lemmatizers are generally classified by their core architecture and the representation of context and morphological information.

Edit-action classifiers ("edit-script" lemmatizers):

Many recent systems, including those in (Toporkov et al., 2023) and (Toporkov et al., 8 Oct 2025), cast lemmatization as a classification problem over a set of minimal edit scripts (SES) that transform a word form into its lemma. The input is a word in sentential context (token, optionally enriched with character-level features and morphosyntactic tags); the output is an SES label, deterministically decoded into the lemma.
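
The following is a minimal, self-contained sketch of the SES idea: a shortest edit script is extracted from a (form, lemma) pair with Python's difflib and later applied deterministically to recover the lemma. The label encoding and helper names are illustrative and not taken from the cited systems.

```python
import difflib

def extract_ses(form: str, lemma: str) -> str:
    """Serialize a shortest edit script (SES) turning `form` into `lemma`."""
    ops = difflib.SequenceMatcher(None, form, lemma).get_opcodes()
    script = []
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            script.append(f"keep:{i2 - i1}")
        elif tag == "delete":
            script.append(f"del:{i2 - i1}")
        elif tag == "insert":
            script.append(f"ins:{lemma[j1:j2]}")
        else:  # replace
            script.append(f"sub:{i2 - i1}:{lemma[j1:j2]}")
    return "|".join(script)

def apply_ses(form: str, script: str) -> str:
    """Deterministically decode an SES label back into the lemma."""
    out, pos = [], 0
    for op in script.split("|"):
        parts = op.split(":")
        if parts[0] == "keep":
            n = int(parts[1]); out.append(form[pos:pos + n]); pos += n
        elif parts[0] == "del":
            pos += int(parts[1])
        elif parts[0] == "ins":
            out.append(parts[1] if len(parts) > 1 else "")
        else:  # sub
            n = int(parts[1]); out.append(parts[2] if len(parts) > 2 else ""); pos += n
    return "".join(out)

# A classifier predicts the script label; decoding back to the lemma is exact.
assert apply_ses("running", extract_ses("running", "run")) == "run"
assert apply_ses("mice", extract_ses("mice", "mouse")) == "mouse"
```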

  • The input pipeline typically comprises:
    • Pretrained contextual embeddings (e.g., XLM-RoBERTa, mBERT, or language-specific BERT variants)
    • Optionally, a BiLSTM over characters for sub-token representations
    • Optionally, UPOS or UniMorph tag embeddings
  • The architecture is either a Transformer-based token classifier or a concatenated embedding passed through a BiLSTM before the classification head.

Sequence-to-sequence models:

Alternate approaches (e.g., (Kanerva et al., 2019, Karwatowski et al., 2022)) use encoder-decoder models (BiLSTM+Attention, or Transformer-based T5) to generate the lemma as a character or subword sequence, optionally conditioned on morphosyntactic features or local/global context. These models are prominent for languages with high morphological ambiguity and for settings where edit operations may not capture all systematic variations.
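
As a concrete illustration of the input/output side of such models, the sketch below serializes a token, its morphosyntactic features, and optional neighbouring words into a character-level source sequence, with the lemma's characters as the target. The special tokens and the Finnish example are assumptions for illustration, not the exact format used by any cited system.

```python
def build_seq2seq_example(token: str, upos: str, feats: str,
                          left_ctx: str = "", right_ctx: str = "") -> str:
    """Serialize one source sequence for a character-level seq2seq lemmatizer.

    The inflected form is split into characters; morphosyntactic features and
    (optionally) neighbouring words are appended as extra symbols.  The
    special tokens below are illustrative, not a fixed standard.
    """
    parts = [" ".join(token), "<upos>", upos]
    if feats:
        parts += ["<feats>", feats]
    if left_ctx or right_ctx:
        parts += ["<ctx>", left_ctx, "<w>", right_ctx]
    return " ".join(p for p in parts if p)

# Source: characters of the inflected form plus UPOS/FEATS and local context;
# target: characters of the lemma, generated one symbol at a time.
src = build_seq2seq_example("koirilla", "NOUN", "Case=Ade|Number=Plur",
                            left_ctx="niillä", right_ctx="oli")
tgt = " ".join("koira")
print(src)   # k o i r i l l a <upos> NOUN <feats> Case=Ade|Number=Plur <ctx> niillä <w> oli
print(tgt)   # k o i r a
```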

Joint multi-task models:

Integrated architectures extend the above to simultaneously predict morphological tags and lemmata, leveraging the mutual benefit between tag disambiguation and lemma generation (Kondratyuk et al., 2018, Malaviya et al., 2019, Kestemont et al., 2016). A common strategy is to share lower-level encoders (character and context embeddings) with separate decoding heads for PoS-tagging and lemmatization.
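
A schematic PyTorch sketch of this sharing pattern is shown below: a character-level BiLSTM encoder is shared, with separate linear heads for UPOS tags and SES classes. Dimensions and the encoder choice are illustrative rather than drawn from any specific paper.

```python
import torch
import torch.nn as nn

class JointTaggerLemmatizer(nn.Module):
    """Shared encoder with two heads: one for POS tags, one for SES classes."""

    def __init__(self, n_chars: int, n_upos: int, n_ses: int,
                 char_dim: int = 64, hidden: int = 256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # Shared word encoder: BiLSTM over the characters of each word.
        self.encoder = nn.LSTM(char_dim, hidden // 2, batch_first=True,
                               bidirectional=True)
        self.upos_head = nn.Linear(hidden, n_upos)   # tagging head
        self.ses_head = nn.Linear(hidden, n_ses)     # lemmatization head

    def forward(self, char_ids: torch.Tensor):
        # char_ids: (batch_of_words, max_word_len)
        emb = self.char_emb(char_ids)
        _, (h_n, _) = self.encoder(emb)
        # Concatenate final forward/backward states into one word vector.
        word_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.upos_head(word_repr), self.ses_head(word_repr)

model = JointTaggerLemmatizer(n_chars=100, n_upos=18, n_ses=500)
dummy = torch.randint(1, 100, (8, 12))          # 8 words, 12 characters each
upos_logits, ses_logits = model(dummy)
print(upos_logits.shape, ses_logits.shape)      # torch.Size([8, 18]) torch.Size([8, 500])
```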

Hybrid models:

Hybrid lemmatizers combine neural modules with dictionary lookup and/or handcrafted post-processing, especially for agglutinative or data-scarce languages (Berkecz et al., 2023, Dorkin et al., 29 Dec 2024). A neural model typically acts as a fallback (classification, seq2seq, or cross-encoder disambiguation) in the case of dictionary ambiguity or OOV forms.
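
The control flow of such a hybrid can be summarized as in the sketch below; `lexicon` and `neural_disambiguate` are hypothetical stand-ins for a real dictionary and neural component.

```python
from typing import Callable, Dict, List

def hybrid_lemmatize(token: str,
                     context: List[str],
                     lexicon: Dict[str, List[str]],
                     neural_disambiguate: Callable[[str, List[str], List[str]], str]) -> str:
    """Dictionary-first lemmatization with a neural fallback.

    Schematic control flow only: `lexicon` maps surface forms to candidate
    lemmas, and `neural_disambiguate` stands in for any neural component
    (classifier, seq2seq, or cross-encoder re-ranker).
    """
    candidates = lexicon.get(token.lower(), [])
    if len(candidates) == 1:
        return candidates[0]                    # unambiguous dictionary hit
    if len(candidates) > 1:
        # Ambiguous entry: let the neural model pick among the candidates.
        return neural_disambiguate(token, context, candidates)
    # OOV form: fall back to open-vocabulary neural generation/classification.
    return neural_disambiguate(token, context, [])

# Toy usage with a trivial stand-in for the neural component.
lexicon = {"saw": ["see", "saw"], "dogs": ["dog"]}
fake_neural = lambda tok, ctx, cands: (cands[0] if cands else tok)
print(hybrid_lemmatize("dogs", ["the", "dogs", "bark"], lexicon, fake_neural))  # dog
print(hybrid_lemmatize("saw", ["I", "saw", "it"], lexicon, fake_neural))        # see
```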

2. Representation of Context and Morphological Features

While surrounding context is essential for resolving morphological ambiguity, deep lemmatizers differ in how they represent and utilize context:

  • Contextual embeddings:

Transformer-based models implicitly encode both left and right context for each token position, rendering explicit local context windows largely redundant (Toporkov et al., 2023). Sequence tagging approaches process entire sentences, extracting per-token contextual representations $h_t = \mathrm{Transformer}(x_1, \dots, x_n)_t$ (a minimal extraction sketch follows this list).

  • Tag-based ("tag context") approaches:

Some models concatenate explicit morphosyntactic tag features (UPOS, XPOS, FEATS) alongside the input token or its subword/character sequence (Kanerva et al., 2019). This practice, however, sees diminishing returns: in (Toporkov et al., 2023), gains from including fine-grained tags over basic UPOS in high-resource settings are usually below 0.3 percentage points and become non-significant for most languages.

  • Sliding window and short context:

Earlier BiLSTM or attention-based sequence-to-sequence models (e.g., Lematus, LemMED (Makazhanov et al., 2020)) encode a fixed-width window of adjacent words or characters, but empirical evidence suggests modern wide-context encoders outperform localized context, except for very resource-constrained or non-transformer settings (Makazhanov et al., 2020, Bergmanis et al., 2019).
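
As a minimal sketch of the contextual-embedding extraction described in the first item above, the snippet below uses Hugging Face transformers to obtain one pooled vector $h_t$ per pre-tokenized word from XLM-RoBERTa; mean-pooling over subwords is one common choice, not a prescription from the cited papers.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Extract one contextual vector per pre-tokenized word by mean-pooling its
# subword states, as a sequence-tagging lemmatizer would consume them.
name = "xlm-roberta-base"   # any multilingual encoder with a fast tokenizer
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

words = ["The", "mice", "were", "running"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]        # (n_subwords, dim)

word_vectors = []
word_ids = enc.word_ids(0)                            # subword -> word index
for w_idx in range(len(words)):
    piece_rows = [i for i, wid in enumerate(word_ids) if wid == w_idx]
    word_vectors.append(hidden[piece_rows].mean(dim=0))  # pooled vector h_t

print(len(word_vectors), word_vectors[0].shape)       # 4 torch.Size([768])
```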

Summary Table: Morphological feature utilization

| Method       | Tag usage                 | Main empirical finding                             |
|--------------|---------------------------|----------------------------------------------------|
| Edit-script  | UPOS / fine-grained FEATS | Tags matter little with a transformer base         |
| Seq2seq      | FEATS + context window    | Useful in low-resource or non-transformer settings |
| Joint models | Predicted tags            | Joint learning aids low-resource lemmatization     |

The overall evidence indicates that transformer-based encoders implicitly capture fine-grained morphosyntactic information sufficient for disambiguation in most practical contexts (Toporkov et al., 2023).

3. Training Procedures and Optimization

Input Construction:

Tokens are embedded via transformer subword tokenization; optionally, character-level BiLSTM/CNNs enhance rare form encoding. Morphological tag embeddings are concatenated if used.
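
A schematic sketch of this input construction is given below: a precomputed contextual vector is concatenated with a character-BiLSTM summary and a UPOS embedding. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenFeatureEncoder(nn.Module):
    """Concatenate contextual, character-level, and tag features for one token.

    The contextual vector is assumed to come from a pretrained transformer;
    the char-BiLSTM and UPOS embedding are the optional extras described above.
    """

    def __init__(self, ctx_dim=768, n_chars=100, char_dim=32,
                 char_hidden=64, n_upos=18, upos_dim=16):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden // 2,
                                 batch_first=True, bidirectional=True)
        self.upos_emb = nn.Embedding(n_upos, upos_dim)
        self.out_dim = ctx_dim + char_hidden + upos_dim

    def forward(self, ctx_vec, char_ids, upos_id):
        # ctx_vec: (batch, ctx_dim); char_ids: (batch, max_len); upos_id: (batch,)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_vec = torch.cat([h_n[0], h_n[1]], dim=-1)       # (batch, char_hidden)
        return torch.cat([ctx_vec, char_vec, self.upos_emb(upos_id)], dim=-1)

enc = TokenFeatureEncoder()
feats = enc(torch.randn(4, 768), torch.randint(1, 100, (4, 10)), torch.randint(0, 18, (4,)))
print(feats.shape)   # torch.Size([4, 848])
```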

Loss Functions:

  • Edit-classification models minimize cross-entropy over the SES/edit-class labels per token (see the loss sketch after this list):

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{|\text{SES}|} \mathbf{1}\{y_t = i\} \log p(y_t = i)$$

  • Seq2seq models minimize character-level sequence cross-entropy, optionally with label smoothing or multi-task terms if integrated with tagging (Malaviya et al., 2019).
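
For the edit-classification loss above, a minimal PyTorch equivalent is sketched below; padded positions are excluded via ignore_index, a masking convention not spelled out in the formula.

```python
import torch
import torch.nn.functional as F

# Per-token cross-entropy over SES classes, averaged only over real tokens
# (padding positions carry the label -100 and are ignored).
n_ses = 500
logits = torch.randn(2, 12, n_ses)                 # (batch, seq_len, |SES|)
labels = torch.randint(0, n_ses, (2, 12))
labels[1, 9:] = -100                               # padded positions

loss = F.cross_entropy(logits.view(-1, n_ses), labels.view(-1),
                       ignore_index=-100)
print(float(loss))
```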

Training Regimes:

  • Batch sizes: 16–32 sentences (edit classifiers), 64 tokens (seq2seq, (Kanerva et al., 2019)).
  • Optimizers: AdamW or Adam (β₁=0.9, β₂=0.999), weight decay 0.01, LR 1e-5–5e-5 (transformer fine-tuning), with learning-rate decay and early stopping on dev accuracy; a configuration sketch follows this list.
  • Epochs: 5–25 (edit-classifiers), up to 50 (seq2seq/character-based); convergence usually by epoch 10–30.
  • Data: Universal Dependencies/Unimorph (up to 100+ languages), typically split into in-domain (train/dev) and out-of-domain (test) corpora to probe real-world generalization (Toporkov et al., 2023, Kanerva et al., 2019).
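
A configuration sketch in the ranges quoted above is given below; the placeholder model, warmup fraction, and step counts are illustrative assumptions.

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Typical fine-tuning hyperparameters for a transformer-based lemmatizer.
model = nn.Linear(768, 500)          # stand-in for the full lemmatizer
optimizer = AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.999),
                  weight_decay=0.01)

epochs, steps_per_epoch = 20, 1_000  # illustrative values
total_steps = epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps)

# Inside the training loop: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```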

Data Augmentation:

Synthetic (form, lemma) pairs generated by morphological transducers or drawn from lexica are commonly mixed into the gold training data, which is particularly effective in low-resource settings (Kanerva et al., 2019); see Section 5 for details.

4. Empirical Findings, Evaluation, and Comparison

Token-level vs. Sentence-level evaluation:

  • Word-level metrics ("token accuracy") can mask compounding failures: 99%+ in-domain token accuracy can correspond to much lower sentence-level correctness.
  • Sentence accuracy (all tokens in a sentence lemmatized correctly) better reflects robustness, especially out-of-domain, where degradation is more pronounced (e.g., a 3–5pp drop for English/Spanish and 8–12pp for Basque/Turkish) (Toporkov et al., 2023); a metric sketch follows this list.
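
Both metrics can be computed as in the following sketch; the function name and data layout are illustrative.

```python
from typing import List

def word_and_sentence_accuracy(gold: List[List[str]], pred: List[List[str]]):
    """Compute token-level and sentence-level lemma accuracy.

    `gold` and `pred` are parallel lists of sentences, each a list of lemmas.
    """
    tokens = correct_tokens = correct_sents = 0
    for g_sent, p_sent in zip(gold, pred):
        hits = sum(g == p for g, p in zip(g_sent, p_sent))
        tokens += len(g_sent)
        correct_tokens += hits
        correct_sents += int(hits == len(g_sent))   # every lemma must match
    return correct_tokens / tokens, correct_sents / len(gold)

gold = [["the", "mouse", "run"], ["I", "see", "it"]]
pred = [["the", "mouse", "run"], ["I", "saw", "it"]]
print(word_and_sentence_accuracy(gold, pred))   # (0.833..., 0.5)
```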

Generalization and domain adaptation:

  • Transformer-based lemmatizers with minimal or no explicit morphology outperform or match gold-tag-based approaches on 4/6 typologically diverse languages (Toporkov et al., 2023).
  • In out-of-domain or low-resource settings, supervised encoder models (XLM-RoBERTa) remain competitive, but are now rivaled or surpassed by large instruction-tuned LLMs using few-shot in-context prompting, with best results for direct in-context lemma generation in 10 of 12 languages evaluated (Toporkov et al., 8 Oct 2025).

Error characteristics:

  • Most systems struggle with OOV and ambiguous tokens in unseen domains.
  • Hybrid models find rare or irregular forms challenging but achieve near-oracle coverage by combining morphological analyzers with neural disambiguators (Dorkin et al., 29 Dec 2024).

Recent SOTA and Best Practices Table

| Model                      | Word Acc. | Sentence Acc. | Notes                                                          |
|----------------------------|-----------|---------------|----------------------------------------------------------------|
| XLM-RoBERTa fine-tuned     | 0.92      | 0.34          | Supervised, SIGMORPHON benchmark (Toporkov et al., 8 Oct 2025) |
| Mistral-2407 (4-shot)      | 0.91      | 0.32          | LLM, in-context, cross-lingual                                 |
| Claude-3.7-Sonnet (4-shot) | 0.93      | 0.39          | LLM, in-context                                                |
| T5-large (Polish)          | 0.91      | n/a           | Contextual Polish lemmatization (Karwatowski et al., 2022)     |
| HuSpaCy hybrid (XLM-R)     | 0.99      | n/a           | Hybrid, Hungarian                                              |
| GliLem (Estonian)          | 0.98      | n/a           | Hybrid, BERT + open vocabulary (Dorkin et al., 29 Dec 2024)    |

5. Enhancements: Hybridization, Augmentation, and External Resources

Hybrid architectures:

  • Systems such as HuSpaCy and GliLem combine high-coverage dictionaries or rule-based analyzers with neural classifiers. This architecture addresses lexical ambiguity and OOV coverage, yielding consistent error reductions of 0.3–0.6% over purely neural models in Hungarian (Berkecz et al., 2023) and nearly an order-of-magnitude error reduction in Estonian relative to morphological-analyzer baselines (Dorkin et al., 29 Dec 2024).

External resources:

  • Dual-encoder seq2seq lemmatizers can incorporate external lemma candidates from FSTs or lexica, exposing them as second attention heads to the decoder (Milintsevich et al., 2021).
  • Data augmentation with synthetic forms generated by morphological transducers provides significant gains in low-resource settings, with 19–24% relative error reduction over purely data-driven approaches (Kanerva et al., 2019); see the augmentation sketch below.
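
A minimal sketch of such augmentation is given below: synthetic (form, lemma) pairs sampled from lemma paradigms (e.g., produced by a morphological transducer) are mixed into the gold data. The sampling scheme is an illustrative assumption, not the procedure of the cited work.

```python
import random
from typing import Dict, List, Tuple

def augment_with_lexicon(gold_pairs: List[Tuple[str, str]],
                         paradigms: Dict[str, List[str]],
                         n_synthetic: int,
                         seed: int = 0) -> List[Tuple[str, str]]:
    """Mix synthetic (form, lemma) pairs from a lexicon/transducer into gold data.

    `paradigms` maps each lemma to the inflected forms it can generate, e.g.
    the output of a morphological transducer run over a word list.
    """
    rng = random.Random(seed)
    lemmas = list(paradigms)
    synthetic = []
    for _ in range(n_synthetic):
        lemma = rng.choice(lemmas)
        synthetic.append((rng.choice(paradigms[lemma]), lemma))
    return gold_pairs + synthetic

paradigms = {"run": ["run", "runs", "ran", "running"],
             "mouse": ["mouse", "mice"]}
train = augment_with_lexicon([("dogs", "dog")], paradigms, n_synthetic=3)
print(train)
```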

Key empirical finding:

Hybrid and externally-augmented models are most beneficial for morphologically rich, ambiguous, or low-resource languages, particularly in high-ambiguity or cross-domain deployment scenarios.

6. Key Findings and Recommended Practices

  • Fine-tuning large pretrained transformers on (token, context) data achieves state-of-the-art lemmatization across diverse languages and domains.
  • Explicit use of detailed morphological features rarely improves out-of-domain generalization and may even degrade robustness when the feature set differs between train/test (Toporkov et al., 2023).
  • Large LLMs (Mistral, Claude) now represent a genuinely practical, competitive approach for zero-shot or few-shot lemmatization by in-context generation, especially where supervised data is unavailable (Toporkov et al., 8 Oct 2025).
  • Word-level metrics should not be overinterpreted; always report sentence accuracy and analyze by SES type or error category for realistic assessment.
  • A recommended architecture recipe for robust lemmatization:

    1. Leverage contextual transformer embeddings (preferably XLM-RoBERTa or equiv.)
    2. Use SES classification or character-level seq2seq as the main mechanism
    3. Add lightweight tag embeddings if UPOS tags are robustly available
    4. Integrate hand-crafted rules or dictionary fallbacks in agglutinative/low-resource settings as needed
    5. Evaluate both in- and out-of-domain, on both word and sentence accuracy

7. Future Directions

  • Extension to truly multilingual and zero-shot lemmatization, exploiting language-agnostic representations and transfer (cross-PUD, cross-lingual training, in-context LLMs) (Toporkov et al., 8 Oct 2025).

  • Improved domain adaptation and OOV handling, especially to handle annotation drift and rare/irregular inflection patterns.
  • Enhanced integration of external lexica and analyzers at runtime, allowing end-to-end differentiable selection as in dual-encoder or cross-encoder frameworks (Milintsevich et al., 2021, Dorkin et al., 29 Dec 2024).
  • Fine-grained analysis and mitigation of cascading error modes in downstream applications (e.g., lexical search, information retrieval) as evidenced by measurable but modest improvements in recall@k for downstream pipelines (Dorkin et al., 29 Dec 2024).

In conclusion, deep learning based lemmatizers have evolved from simple character-level RNNs to highly effective hybrid and transformer-based systems, displacing most rule-based and dictionary-only methods. Empirical evidence consistently prioritizes contextualized transformer embeddings and SES or seq2seq transformation layers as core building blocks, with optional hybridization or augmentation enhancing performance in linguistically challenging or data-scarce regimes (Toporkov et al., 2023, Toporkov et al., 8 Oct 2025, Milintsevich et al., 2021, Berkecz et al., 2023, Dorkin et al., 29 Dec 2024).
