Comparability of per‑word all‑category accuracy with unspecified category sets

Determine whether the 97.6% per‑word “all categories” accuracy reported for the DeepPavlov BERT‑based Russian morphological tagger is directly comparable to the 95.3559% per‑word “all categories” accuracy reported for the Multi‑head attention–based tagger on the category set {upos, Mood, VerbForm, Person, Animacy, Degree, Variant, Number, Gender, NumForm, Case, Tense, Voice}. The difficulty is twofold: the DeepPavlov evaluation does not specify which morphological categories it covers, and per‑word accuracy varies substantially with the chosen category set.

Background

The paper reports 95.3559% accuracy for predicting all categories of a word using the proposed Multi‑head attention (MHA) architecture on Russian, evaluated on the set {upos, Mood, VerbForm, Person, Animacy, Degree, Variant, Number, Gender, NumForm, Case, Tense, Voice}. It notes that a BERT‑based approach (DeepPavlov) reports 97.6% for a similar task.

However, the authors emphasize that different choices of evaluated categories can significantly change the metric (they observe a drop to 84.861% when testing the full list of categories), and they state that the DeepPavlov work does not specify which categories were included. Consequently, they explicitly question whether the two figures are directly comparable.
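To illustrate why the category set matters, the following is a minimal sketch of how a per‑word “all categories” accuracy could be computed. It assumes that a word counts as correct only if every evaluated category matches the gold annotation; neither paper's evaluation code is reproduced here. The function name `per_word_accuracy`, the dictionary representation of tags, and the toy data are all hypothetical and are not taken from either work.

```python
from typing import Dict, Iterable, List

# Hypothetical representation: one dict per word, category name -> value.
Tags = Dict[str, str]


def per_word_accuracy(gold: List[Tags], pred: List[Tags],
                      categories: Iterable[str]) -> float:
    """Share of words whose prediction matches gold on every listed category.

    A category absent from both the gold and the predicted dict counts as a match.
    """
    cats = list(categories)
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("empty evaluation set")
    correct = sum(
        all(g.get(c) == p.get(c) for c in cats)
        for g, p in zip(gold, pred)
    )
    return correct / len(gold)


# Toy, made-up annotations (assumption: illustration only, not real tagger output).
gold = [
    {"upos": "NOUN", "Case": "Nom", "Number": "Sing"},
    {"upos": "VERB", "Tense": "Past", "Number": "Sing", "Aspect": "Perf"},
    {"upos": "ADJ", "Case": "Gen", "Number": "Plur"},
    {"upos": "NOUN", "Case": "Acc", "Number": "Sing"},
]
pred = [
    {"upos": "NOUN", "Case": "Nom", "Number": "Sing"},                     # all correct
    {"upos": "VERB", "Tense": "Past", "Number": "Sing", "Aspect": "Imp"},  # wrong Aspect
    {"upos": "ADJ", "Case": "Gen", "Number": "Plur"},                      # all correct
    {"upos": "NOUN", "Case": "Nom", "Number": "Sing"},                     # wrong Case
]

narrow = ["upos", "Number", "Tense"]
medium = narrow + ["Case"]
wide = medium + ["Aspect"]

for name, cats in [("narrow", narrow), ("medium", medium), ("wide", wide)]:
    print(name, per_word_accuracy(gold, pred, cats))
```

On this toy data the same predictions score 1.0, 0.75, and 0.5 as the evaluated set grows, because each added category gives every word one more way to be counted as wrong. Under this definition, two accuracy figures computed over different category sets measure different things, even on identical predictions.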

References

“Though it is clearly better than obtained by proposed architecture, we are unable to be sure that this results can be comparable, because the authors do not give the exact list of the categories being analyzed.”

A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary (2604.02926 - Skibin et al., 3 Apr 2026) in Results, paragraph discussing per‑word “predicting all categories of a word” (immediately after Table: Quality of recognizing all grammatical categories)