RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark
The paper "RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark" presents RussianGLUE, a benchmark analogous to SuperGLUE but specifically tailored for the Russian language. This development acknowledges the growing significance of assessing NLP models beyond English, emphasizing linguistic diversity and promoting comprehensive diagnostics of LLMs.
Methodology and Benchmark Design
The authors have developed a benchmark of nine tasks aligned with the SuperGLUE methodology. The tasks evaluate various capabilities of NLP models, including natural language inference, commonsense reasoning, and the ability to perform logical operations independently of the text's subject matter or lexicon. Adapting SuperGLUE's framework to Russian raises language-specific challenges, so several tasks were created from scratch to accommodate linguistic categories particular to Russian.
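As a rough illustration of what such a task instance looks like, a textual-entailment-style example can be thought of as a premise, a hypothesis, and a label. The record below is a minimal sketch; the field names and values are illustrative and not taken from the released data.

```python
import json

# Hypothetical layout of an NLI-style (textual entailment) task instance.
# Field names are illustrative; the released RussianSuperGLUE data may differ.
example = {
    "premise": "Кошка спит на диване.",               # "The cat is sleeping on the sofa."
    "hypothesis": "На диване находится животное.",    # "There is an animal on the sofa."
    "label": "entailment",
}

print(json.dumps(example, ensure_ascii=False, indent=2))
```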
Evaluation and Baselines
To establish baselines, the paper evaluates two pretrained language models: Multilingual BERT (MultiBERT) and a Russian-specific BERT variant (RuBERT). The results are nuanced: RuBERT performs better on tasks involving textual entailment and reading comprehension, and notably reaches an accuracy of 0.894 on the word-in-context disambiguation task, above the reported human performance of 0.747. MultiBERT's performance, though slightly lower, remains competitive, especially on tasks requiring commonsense reasoning. A TF-IDF baseline highlights the substantial gap between traditional techniques and modern transformer models.
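To make that gap concrete, a minimal TF-IDF baseline for a sentence-pair classification task can be built with scikit-learn. This is a sketch under the assumption that premise and hypothesis are simply concatenated; the training pairs below are toy stand-ins, not the benchmark data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: premise/hypothesis pairs joined into a single string.
# Real RussianSuperGLUE tasks would be loaded from the released splits instead.
train_texts = [
    "Кошка спит на диване. [SEP] На диване находится животное.",
    "Идёт дождь. [SEP] На улице солнечно.",
]
train_labels = [1, 0]  # 1 = entailment, 0 = not entailment

# TF-IDF features over word n-grams fed into a linear classifier.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)

test_texts = ["Собака лает во дворе. [SEP] Во дворе есть животное."]
print(baseline.predict(test_texts))
```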
Human Evaluation
The paper also reports human performance benchmarks, obtained by crowdsourcing annotations for a sample of each task's test set. This ensures that model outputs are compared not only against baselines but also against human performance, giving a more complete picture of model capabilities.
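One common way to turn such crowd judgments into a human baseline is to aggregate worker answers per item by majority vote and score the aggregated answers against the gold labels. The sketch below assumes this aggregation scheme; the paper's exact procedure may differ, and the annotations shown are invented.

```python
from collections import Counter

# Hypothetical crowd annotations: item id -> list of worker labels.
crowd_labels = {
    "item_1": ["entailment", "entailment", "not_entailment"],
    "item_2": ["not_entailment", "not_entailment", "not_entailment"],
}
gold = {"item_1": "entailment", "item_2": "entailment"}

def majority_vote(labels):
    """Return the most frequent label among the worker answers."""
    return Counter(labels).most_common(1)[0][0]

aggregated = {item: majority_vote(labels) for item, labels in crowd_labels.items()}
human_accuracy = sum(aggregated[i] == gold[i] for i in gold) / len(gold)
print(f"Estimated human accuracy: {human_accuracy:.2f}")
```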
Linguistic Insights and Diagnostic Evaluation
The diagnostic set designed for RussianSuperGLUE, professionally translated from its English counterpart, allows cross-linguistic evaluation of linguistic phenomena within the same experimental framework. The diagnostic evaluation reveals discrepancies in how models handle linguistic features across languages, suggesting that current architectures and pretraining may cater more effectively to English syntax than to Russian.
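SuperGLUE-style broad-coverage diagnostics are typically scored per linguistic phenomenon with the Matthews correlation coefficient, which makes the same category breakdown comparable across the English and Russian sets. The sketch below assumes that scoring convention; the category tags, gold labels, and predictions are made up for illustration.

```python
from collections import defaultdict
from sklearn.metrics import matthews_corrcoef

# Made-up diagnostic items: (linguistic phenomenon, gold label, model prediction).
items = [
    ("negation", 1, 1),
    ("negation", 0, 1),
    ("quantifiers", 1, 1),
    ("quantifiers", 0, 0),
]

# Group gold labels and predictions by phenomenon.
by_category = defaultdict(lambda: ([], []))
for category, gold, pred in items:
    by_category[category][0].append(gold)
    by_category[category][1].append(pred)

# Matthews correlation per phenomenon enables a category-level comparison
# between the English and Russian versions of the diagnostic set.
for category, (gold, pred) in by_category.items():
    print(category, matthews_corrcoef(gold, pred))
```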
Implications and Future Research
RussianSuperGLUE is positioned to drive advances in NLP for Russian by providing a standardized framework for evaluating language models. The paper suggests future research directions including multilingual diagnostics, increased task complexity, and extending the benchmark with additional components such as sequence-to-sequence tasks and knowledge graphs. Introducing industrial performance metrics could also facilitate practical deployment of these models in applications that require resource-efficient solutions.
Overall, RussianSuperGLUE fills a significant gap in the evaluation landscape by providing a comprehensive and culturally nuanced benchmark for NLP models in Russian. The implications are far-reaching, potentially influencing future developments in multilingual NLP and strengthening the linguistic capabilities of models built for less widely studied languages.