- The paper introduces a transformer-based multi-head attention architecture that effectively handles Russian morphological tagging.
- The paper employs subtoken aggregation via byte-pair encoding and intra-word attention to robustly manage out-of-vocabulary words.
- The paper achieves 98–99% accuracy on key grammatical features using only 48M parameters, providing a lightweight alternative to larger BERT systems.
Multi-Head Attention Architecture for Open-Vocabulary Russian Morphological Tagging
Introduction
This paper introduces a morphological tagging system for Russian built on a multi-head attention (MHA) architecture, with an emphasis on open-vocabulary operation: the model efficiently handles previously unseen or fictional words, a necessity for robust natural language understanding in Russian. It discards both RNNs and heavy transformer pretraining (such as BERT), offering markedly reduced computational demands, fast training, and competitive or superior accuracy on core grammatical tagging tasks.
Architectural Overview and Data Pipeline
The primary innovation is a multi-stage architecture that processes linguistic segments (sentences), where each word is decomposed into subtokens via Byte Pair Encoding (BPE). This forms the basis for a pipeline facilitating open-vocabulary support. The model then leverages hierarchical encoding: within-word token attentions produce compositional word vectors, which are subsequently passed through transformer encoder blocks. Morphological classification is performed per word using a feedforward neural classifier.
Figure 1: The data processing pipeline, including BPE-based segmentation, within-word attention, transformer encoding, and final classification.
Tokenization is not restricted to dictionary words; words are partitioned into subtokens, ensuring the model's vocabulary dynamically accommodates neologisms and out-of-vocabulary terms. The design opts for BPE over n-gram approaches owing to its flexibility and its marginal influence on downstream tagging accuracy.
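As a rough illustration of how BPE yields open-vocabulary coverage (this is a generic greedy BPE encoder, not the paper's actual tokenizer; the merge table and example word are invented for demonstration), a word is split into characters and learned merge rules are applied by priority until none remain. Any string, including a neologism, thus decomposes into known subtokens:

```python
def bpe_encode(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge rule.

    merges maps a pair of adjacent subtokens to its merge rank
    (lower rank = learned earlier = higher priority).
    """
    tokens = list(word)  # start from individual characters
    while True:
        best = None  # (rank, position) of the best applicable merge
        for i in range(len(tokens) - 1):
            rank = merges.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no learned merge applies; done
        _, i = best
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens
```

With a toy merge table learned from Russian text, an in-vocabulary stem and a suffix are recovered even for a word never seen whole, which is exactly the property the open-vocabulary design relies on.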
Word Representation and Attention Mechanism
Each word is capped at six subtokens, sufficient to cover 98% of cases in the datasets. Within each word, dot-product attention scores the importance of each subtoken, and the subtoken embeddings are aggregated into a word vector via a weighted sum whose weights are produced by a learned scoring network. This hierarchical attention lifts the representation to the word level, where standard positional encoding is applied before processing by four transformer encoder blocks. The resulting representations feed a classifier that predicts grammatical categories.
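The within-word aggregation can be sketched as follows. This is a simplified illustration, not the paper's exact parameterization: the learned scoring network is reduced to a single query vector `w_query`, and scaled dot-product scores are softmax-normalized into subtoken weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def aggregate_word(subtok_embs, w_query):
    """Collapse up to six subtoken embeddings into one word vector.

    subtok_embs: (n_subtokens, d) array, n_subtokens <= 6
    w_query:     (d,) learned query scoring each subtoken's importance
    """
    d = subtok_embs.shape[1]
    scores = subtok_embs @ w_query / np.sqrt(d)  # scaled dot-product
    weights = softmax(scores)                    # subtoken importances
    return weights @ subtok_embs                 # weighted sum -> word vector
```

With a zero query the weights are uniform and the word vector is the mean of its subtokens; a trained query instead concentrates weight on the morphologically informative subtokens (e.g. inflectional suffixes).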
Notably, the system avoids RNN-related drawbacks, namely sequential processing bottlenecks and context imbalance at sequence boundaries, and it obviates pretraining on large unlabeled corpora.
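The standard positional encoding applied at the word level can be sketched as the usual sinusoidal scheme from the transformer literature; this generic formulation is an assumption for illustration, since the summary does not spell out the exact variant used:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(n_positions)[:, None]            # (n_positions, 1)
    i = np.arange(d_model)[None, :]                  # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
```

The encoding is added to each word vector before the four transformer encoder blocks, giving the otherwise order-agnostic attention layers access to word positions within the sentence.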
Datasets and Experimental Details
The model is trained and evaluated on the union of two major treebanks: UD SynTagRus and UD Taiga. This merged dataset encompasses over 2.9 million words for training, with test sets capturing a wide range of Russian grammatical and lexical phenomena. Model training is conducted on a single consumer-grade GPU (NVIDIA RTX 4090), with convergence achieved within 8–12 hours using approximately 48 million parameters.
Numerical Results and Comparative Analysis
Experiments assess accuracy, recall, precision, and F1 per morphological category. The architecture attains 98–99% accuracy on most grammatical categories. Overall, the mean accuracy across categories approaches 99.05%. Full-category accuracy per word reaches approximately 95.36% for a selected canonical set, comparable to, and on some categories exceeding, results from heavyweight BERT-based taggers, and substantially surpassing RNN- or GRU-based systems. For instance:
- UPOS accuracy: 98.46% (vs. 98.33% for state-of-the-art RNN-based models)
- Average model size: 48M parameters (vs. >400M for BERT-based models)
- Training runtime: under 12 hours on a single consumer GPU
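Per-category and full-tag (all categories correct per word) accuracy, the two headline metrics above, can be computed by straightforward tallying. The dict-based tag representation here is an assumed format for illustration, not the paper's data structure:

```python
def per_category_accuracy(gold, pred):
    """gold/pred: one dict per word, mapping category name -> value."""
    totals, correct = {}, {}
    for g, p in zip(gold, pred):
        for cat, val in g.items():
            totals[cat] = totals.get(cat, 0) + 1
            if p.get(cat) == val:
                correct[cat] = correct.get(cat, 0) + 1
    return {cat: correct.get(cat, 0) / totals[cat] for cat in totals}

def full_tag_accuracy(gold, pred):
    """Fraction of words whose every gold category is predicted correctly."""
    hits = sum(all(p.get(c) == v for c, v in g.items())
               for g, p in zip(gold, pred))
    return hits / len(gold)
```

Full-tag accuracy is necessarily lower than any single-category accuracy, which is why the 95.36% per-word figure sits below the 98–99% per-category range.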
The model demonstrates clear efficiency gains: smaller parameter footprint, reduced hardware requirements, and faster training and inference, without sacrificing accuracy for core categories.
Implications and Future Directions
The results have significant implications for scalable, resource-efficient NLP systems addressing morphologically rich languages. In practical deployments, the model's open-vocabulary handling is critical for real-time applications (e.g., voice assistants, machine translation, or digital linguistic tools) operating on dynamic or out-of-domain text where dictionary closure is infeasible.
Theoretically, the findings suggest that full transformer-scale pretraining is not strictly necessary for high-accuracy morphological tagging when appropriate subtokenization and local attention mechanisms are exploited. Settings that prioritize deployment cost or rapid adaptation to new domains stand to benefit most.
Potential extensions involve integrating richer n-gram features, applying the architecture to abbreviated or non-standard forms (where context weighting becomes pivotal), and extending the model to related morphological analysis tasks in similarly complex languages.
Conclusion
The presented system leverages multi-head attention without full transformer pretraining, achieves near state-of-the-art accuracy in Russian morphological tagging, and is operationally efficient on commodity hardware. Its explicit subtoken design promotes robust open-dictionary handling. The work suggests a promising direction for practical, deployable linguistic models for morphologically complex languages, balancing accuracy and efficiency, and lays a foundation for further exploration of lightweight yet expressive encoder architectures for sequence labeling.