Comparability of sentence‑level exact‑match accuracy across studies with unclear category lists

Determine whether the 22% sentence‑level exact‑match accuracy, obtained for the multi‑head attention–based tagger when all morphological categories in the Russian dataset are evaluated, is comparable to sentence‑level results reported in other work, given that those studies do not clearly specify which morphological categories their evaluations include.

Background

The authors evaluate sentence‑level performance by requiring all categories of all words in a sentence to be predicted correctly. When all categories present in the dataset are included, they obtain approximately 22% fully correct sentences.

They explicitly note that sentence‑level exact match is highly sensitive to the set of evaluated categories and that other work does not clearly report which categories are used, leaving the comparability of their 22% figure unresolved.
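To make the sensitivity concrete, the following is a minimal sketch of a sentence‑level exact‑match metric with a configurable category set. It is not the authors' evaluation code: the function name `sentence_exact_match`, the toy data in `GOLD` and `PRED`, and the category labels (`POS`, `Case`, `Tense`) are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical representation: each token maps category name -> value.
Token = Dict[str, str]
Sentence = List[Token]


def sentence_exact_match(
    gold: List[Sentence],
    pred: List[Sentence],
    categories: Optional[Set[str]] = None,
) -> float:
    """Fraction of sentences in which every evaluated category of every
    token is predicted correctly.

    If `categories` is None, every category present in either the gold
    or the predicted token is evaluated.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    if not gold:
        return 0.0

    correct = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("token counts differ within a sentence")
        sentence_ok = True
        for gold_tok, pred_tok in zip(gold_sent, pred_sent):
            cats = categories if categories is not None else set(gold_tok) | set(pred_tok)
            # A category missing on one side counts as a mismatch.
            if any(gold_tok.get(c) != pred_tok.get(c) for c in cats):
                sentence_ok = False
                break
        correct += sentence_ok
    return correct / len(gold)


# Toy data (illustrative only): the second sentence has one wrong Case value.
GOLD = [
    [{"POS": "NOUN", "Case": "Nom"}, {"POS": "VERB", "Tense": "Past"}],
    [{"POS": "ADJ", "Case": "Gen"}, {"POS": "NOUN", "Case": "Gen"}],
]
PRED = [
    [{"POS": "NOUN", "Case": "Nom"}, {"POS": "VERB", "Tense": "Past"}],
    [{"POS": "ADJ", "Case": "Acc"}, {"POS": "NOUN", "Case": "Gen"}],
]

all_categories = sentence_exact_match(GOLD, PRED)
pos_only = sentence_exact_match(GOLD, PRED, categories={"POS"})
```

On this toy data the same predictions score differently depending on the evaluated set: restricting evaluation to `POS` hides the `Case` error that the all‑category evaluation penalizes. This is why a sentence‑level figure is hard to compare across studies unless the category list is reported alongside it.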

References

If we take into account absolutely all categories in the dataset, the result shows that approximately every fifth sentence (22%) can be identified fully correctly, but we still unaware could this result be compared with other results due to unclear list of categories.

A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary (2604.02926 - Skibin et al., 3 Apr 2026) in Results, final paragraph of the section