MERA: A Comprehensive LLM Evaluation in Russian (2401.04531v3)

Published 9 Jan 2024 in cs.CL and cs.AI

Abstract: Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of LMs. As the models' size increases, LMs demonstrate enhancements in measurable aspects and the development of new qualitative features. However, despite researchers' attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce an open Multimodal Evaluation of Russian-language Architectures (MERA), a new instruction benchmark for evaluating foundation models oriented towards the Russian language. The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains and is designed as a black-box test to ensure the exclusion of data leakage. The paper introduces a methodology to evaluate FMs and LMs in zero- and few-shot fixed instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find that they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential societal drawbacks.

Summary

  • The paper introduces MERA, a benchmark that evaluates Russian LLMs across 21 diverse tasks in zero- and few-shot settings.
  • The paper details an instruction-based evaluation methodology, using log-likelihood scoring and greedy generation, to assess task performance and ethical biases, and designed to extend to other modalities.
  • The paper provides an open-source code base and public leaderboard, establishing robust human and baseline evaluations for transparent model assessment.

The paper introduces MERA, a benchmark designed to evaluate LLMs specifically for the Russian language, catering to the growing interest and applications of Foundation Models (FMs) in natural language processing for non-English languages. MERA aims to provide a comprehensive evaluation platform that assesses the capabilities, shortcomings, and associated risks of these models by offering structured and standardized testing protocols.

Key Contributions of MERA:

  • Multimodal Evaluation Framework: MERA includes a diverse set of 21 evaluation tasks covering 11 skill domains, tailored specifically to Russian. The tasks are designed in zero- and few-shot instruction settings, allowing for broad applicability across different types of LLMs.
  • Comprehensive Task Set: Tasks in MERA span areas from natural language understanding to ethics, encompassing problem-solving, exam-based scenarios, and diagnostics for potential ethical biases. Specific tasks include MathLogicQA, MultiQ, PARus, RCB, ruModAr, ruMultiAr, ruOpenBookQA, and others.
  • Evaluation Methodology: The benchmark is designed as a black-box test: answers for the test sets are not released publicly, and tasks are scored either by comparing the log-likelihoods of the answer options or by greedy generation, depending on the task type (see the scoring sketch after this list). This design reduces the risk of data leakage and yields reliable, reproducible results.
  • Open-Source Code Base and Leaderboard: MERA includes an open-source code repository and a submission system with a public leaderboard for ongoing model evaluation, fostering transparency and driving forward the research community's ability to track progress in LLM capabilities.
  • Strong Baselines and Human Evaluation: The authors provide a range of baseline evaluations using publicly available models and build robust human baselines through rigorous evaluation processes involving expert annotators for selected tasks.
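
As a concrete illustration of the scoring described above, the snippet below sketches log-likelihood scoring for a multiple-choice item using the Hugging Face transformers library. It is a minimal sketch, not the MERA harness code: the model name, prompt, and answer options are placeholders chosen only for the example.

```python
# Minimal sketch of log-likelihood scoring for a multiple-choice task.
# Not the official MERA harness code; model, prompt, and options are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ai-forever/rugpt3small_based_on_gpt2"  # placeholder baseline model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    # Positions in log_probs that predict the option tokens.
    positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    option_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[0, pos, tok].item() for pos, tok in zip(positions, option_tokens))

prompt = "Вопрос: 2 + 2 * 2 = ? Ответ:"  # toy item ("Question: 2 + 2 * 2 = ? Answer:"), not a real MERA sample
options = [" 6", " 8", " 4"]
scores = [option_logprob(prompt, o) for o in options]
print(options[scores.index(max(scores))])  # option with the highest log-likelihood wins
```

For generative tasks this option comparison is replaced by greedy decoding of an answer string, which is then compared with the reference answer using the task's metric.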

Implementation Specifics:

  • Instruction Format and Adaptation: Tasks in MERA are presented in a fixed instruction format to ensure consistency across tasks (and, eventually, modalities) and to test models' ability to follow instructions in varied contexts; a prompt-construction sketch follows this list.
  • Extensibility and Future Scope: The framework is designed to accommodate new datasets and modalities beyond text, with multimedia data such as images and audio planned for subsequent releases.
  • Diagnostic Use: Diagnostic tasks are integral to MERA, released with public answers to ensure transparency and act as tools for further analysis of ethical biases and potential societal impacts of LLMs.
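
To make the instruction setting concrete, here is a minimal sketch of how a fixed-instruction prompt with optional few-shot examples could be assembled. The instruction wording, field names, and toy samples are illustrative placeholders, not MERA's official templates.

```python
# Sketch of building a fixed-instruction prompt with k few-shot examples (k = 0 gives zero-shot).
# The instruction text and data fields are placeholders, not MERA's official templates.

INSTRUCTION = (
    # "Read the problem and choose the correct answer option. Answer with the option letter only."
    "Прочитайте задачу и выберите правильный вариант ответа. "
    "Ответьте только буквой варианта.\n"
    "Задача: {question}\nВарианты: {options}\nОтвет:"
)

def render(sample: dict) -> str:
    """Fill the fixed instruction template with one task sample."""
    return INSTRUCTION.format(question=sample["question"],
                              options="; ".join(sample["options"]))

def build_prompt(test_sample: dict, few_shot=()) -> str:
    """Prepend solved demonstrations (few-shot) to the unsolved test sample."""
    demos = [render(s) + " " + s["answer"] for s in few_shot]
    return "\n\n".join([*demos, render(test_sample)])

# Toy usage: one demonstration followed by the test item.
shots = [{"question": "2 + 2 = ?", "options": ["A) 3", "B) 4"], "answer": "B"}]
test = {"question": "3 * 3 = ?", "options": ["A) 9", "B) 6"]}
print(build_prompt(test, shots))
```

Keeping the instruction fixed per task, rather than sampling paraphrases, helps keep scores comparable across models submitted to the leaderboard.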

Results and Observations:

  • The baseline results show that contemporary open-source LLMs fall below human-level performance, indicating the difficulty of the MERA tasks, which span domains such as ethics, knowledge assessment, and reasoning.
  • Noteworthy performance disparities appear across different tasks, with models like Mistral and Llama-2 showing superior results relative to others in select arithmetic and reasoning domains.
  • The diagnostic tasks underscore the continuing need for in-depth model evaluations related to ethical biases, suggesting that ongoing refinement in methodology can aid significantly in responsible AI development.

In conclusion, MERA represents a significant step towards understanding and evaluating LLMs in the Russian language, providing tools and metrics necessary to assess ongoing progress and identify risks associated with large-scale model deployment.
