MERA Multi Benchmark
- MERA Multi is an open multimodal benchmark suite that systematically evaluates Russian large multimodal models using a unified taxonomy and 18 tasks across four modalities.
- It employs explicit construction protocols, unified prompt designs, and quantitative metrics like Exact Match and Judge Score to assess model performance.
- The methodology ensures replicability across languages and robust data protection, and serves as a central reference for future multimodal benchmark development.
MERA Multi refers to the open multimodal evaluation suite and benchmark methodology (also referenced as "AnonymBench") introduced for the systematic, instruction-based assessment of Russian-language large multimodal models (MLLMs). It encompasses a universal taxonomy of multimodal abilities, a set of 18 rigorously constructed evaluation tasks covering four modalities (text, image, audio, video), unified prompt and metric protocols, and a fully documented methodology for data protection and benchmarking. The design of MERA Multi offers replicability for other typologically diverse languages and serves as a central reference for both model evaluation and the construction of future multimodal benchmarks (Chervyakov et al., 19 Nov 2025).
1. Taxonomy of Multimodal Abilities
MERA Multi establishes a universal, strictly hierarchical taxonomy to systematize the evaluation of multimodal reasoning and perception. The taxonomy is a tree that groups capabilities under three top-level branches:
- Perception: Fine-grained single-instance perception (e.g., object recognition, localization), cross-instance event recognition, and textual grounding (such as OCR or diagram reading).
- Knowledge: Everyday factual knowledge and advanced domain knowledge (e.g., numeracy, science, cultural facts).
- Reasoning: Inductive (attribute inference, scene parsing), deductive (causal/analogical), abductive (hypothetical/counterfactual), quantitative (counting, mathematical logic), and other forms (problem decomposition, critical thinking).
Each node is annotated with a modality mask indicating which of the four modalities it applies to, supporting precise sub-task mapping and cross-modal probing (Chervyakov et al., 19 Nov 2025).
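A minimal sketch of how such a taxonomy with per-node modality masks can be represented, assuming a bit-flag encoding; the class names and node labels are illustrative, not the paper's exact inventory.

```python
# Minimal sketch of the hierarchical taxonomy with per-node modality masks.
# Class and node names are illustrative; the exact node inventory is in the paper.
from dataclasses import dataclass, field
from enum import Flag, auto

class Modality(Flag):
    """Bit flags for the four modalities covered by MERA Multi."""
    TEXT = auto()
    IMAGE = auto()
    AUDIO = auto()
    VIDEO = auto()

@dataclass
class TaxonomyNode:
    """One node of the ability tree, annotated with a modality mask."""
    name: str
    modalities: Modality
    children: list["TaxonomyNode"] = field(default_factory=list)

# Illustrative fragment of the Perception branch.
perception = TaxonomyNode(
    "Perception",
    Modality.TEXT | Modality.IMAGE | Modality.AUDIO | Modality.VIDEO,
    children=[
        TaxonomyNode("Textual grounding (OCR, diagrams)", Modality.TEXT | Modality.IMAGE),
        TaxonomyNode("Cross-instance event recognition", Modality.IMAGE | Modality.AUDIO | Modality.VIDEO),
    ],
)

# A cross-modal probe (e.g., image+text OCR) maps to nodes whose mask contains both bits.
needed = Modality.IMAGE | Modality.TEXT
ocr_nodes = [n for n in perception.children if (n.modalities & needed) == needed]
```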
2. Construction and Documentation of Evaluation Tasks
MERA Multi extends this taxonomy into 18 specific tasks spanning all core modalities and their intersections. Datasets are constructed from scratch, with explicit measures to ensure linguistic, cultural, and domain relevance for Russian. Construction protocols strictly enforce:
- Data Provenance: Private tasks—such as RealVQA (image–text VQA)—use crowdsourced collection under NDA, majority vote annotation with 5-way overlap, and explicit test/dev separation. Public tasks—such as ruMathVQA—leverage expert curation.
- Unified Input/Output Specification: Each task provides explicit input schemas (e.g., byte-encoded images plus Unicode-normalized questions) and output formats (typically single-token answers in normalized text).
- Cultural and Linguistic Adaptation: Annotation and prompt schemes match Russian educational, scientific, and social contexts.
A representative example (RealVQA) includes 773 samples with no train set, and 5-way majority-voted test/dev splits. Prompt templates are rotated across 10 variants, combining reasoning request, answer format specification, and task cues (Chervyakov et al., 19 Nov 2025).
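A minimal sketch of the unified input/output specification, assuming a simple dataclass record; the field names and the choice of NFC normalization are illustrative assumptions, not the benchmark's published schema.

```python
# Illustrative record layout and normalization for an image-text VQA task such as
# RealVQA. Field names are assumptions made for this sketch, not the actual schema.
import unicodedata
from dataclasses import dataclass

@dataclass
class VQASample:
    image: bytes          # byte-encoded image, per the unified input specification
    question: str         # Unicode-normalized Russian question
    answer: str           # gold answer in normalized text, typically a single token
    prompt_variant: int   # which of the 10 rotated prompt templates is applied

def normalize_text(raw: str) -> str:
    """The kind of Unicode normalization the input specification calls for (NFC assumed)."""
    return unicodedata.normalize("NFC", raw.strip())
```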
3. Unified Prompting and Quantitative Evaluation Protocols
All tasks adhere to a standardized "block" prompt protocol, which can be abstracted as:
```
[AttentionHook]
[TaskDescription]
[InputDescription]
[ProcessingInstruction]
[ContextIfAny]
[Question]
[AnswerOptions?]
[ReasoningRequest?]
[AnswerFormatInstruction]
ANSWER:
```
Each prompt is instantiated in 10 distinct wording and layout styles to reduce overfitting and measure reasoning under stylistic variation.
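A minimal sketch of how one prompt instance could be assembled from these blocks; the block order follows the protocol above, while the separator logic and placeholder strings are illustrative assumptions rather than the benchmark's actual wording.

```python
# Sketch of the block prompt protocol: each prompt is assembled from the ordered
# blocks shown above, and ten wording/layout variants are rotated per task.
# Separator logic and placeholder texts are illustrative, not the paper's wording.
PROMPT_BLOCKS = [
    "AttentionHook", "TaskDescription", "InputDescription", "ProcessingInstruction",
    "ContextIfAny", "Question", "AnswerOptions", "ReasoningRequest",
    "AnswerFormatInstruction",
]

def build_prompt(blocks: dict[str, str], variant: int) -> str:
    """Assemble one prompt instance; `variant` selects one of the 10 rotated styles."""
    sep = "\n" if variant % 2 == 0 else "\n\n"   # toy stand-in for layout variation
    body = sep.join(blocks[name] for name in PROMPT_BLOCKS if blocks.get(name))
    return body + "\nANSWER:"

prompt = build_prompt(
    {
        "AttentionHook": "Look carefully at the image.",
        "TaskDescription": "Answer the question about the image.",
        "Question": "How many people are in the photo?",
        "AnswerFormatInstruction": "Reply with a single word.",
    },
    variant=3,
)
```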
Metrics are strictly defined:
- Exact Match (EM): $\mathrm{EM} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\hat{y}_i = y_i\right]$, the share of predictions that exactly match the gold answer after text normalization.
- Judge Score (JS): LLM-based binary correctness assessment, $\mathrm{JS} = \frac{1}{N}\sum_{i=1}^{N} J(\hat{y}_i, y_i)$ with $J(\hat{y}_i, y_i) \in \{0, 1\}$.
- Final Score (FS): the per-task score, computed with the task's designated metric (EM or JS).
- Total Benchmark Score (T): combines coverage and average accuracy across modalities, $T = \bar{A} \cdot C$, where $\bar{A}$ is the mean per-task accuracy on attempted tasks (reported as "Attempted" in the leaderboard) and $C$ is the coverage across the 18 tasks (Chervyakov et al., 19 Nov 2025).
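A minimal sketch of how these metrics can be computed, assuming illustrative function names; the total-score relation is consistent with the leaderboard in Section 4 (e.g., $0.430 \times 0.333 \approx 0.143$ for GPT-4.1).

```python
# Sketch of the scoring pipeline (function names are illustrative). The total-score
# formula T = mean accuracy on attempted tasks * coverage reproduces the Section 4
# table, e.g. 0.430 * 0.333 ≈ 0.143 for GPT-4.1.
from typing import Optional

def exact_match(preds: list[str], golds: list[str]) -> float:
    """EM: share of predictions equal to the gold answer after simple normalization."""
    norm = lambda s: s.strip().lower()
    return sum(norm(p) == norm(g) for p, g in zip(preds, golds)) / len(golds)

def judge_score(verdicts: list[int]) -> float:
    """JS: mean of binary correctness verdicts produced by an LLM judge."""
    return sum(verdicts) / len(verdicts)

def total_score(per_task_scores: dict[str, Optional[float]], n_tasks: int = 18) -> float:
    """T = mean score over attempted tasks * coverage (fraction of the 18 tasks attempted)."""
    attempted = {t: s for t, s in per_task_scores.items() if s is not None}
    if not attempted:
        return 0.0
    mean_acc = sum(attempted.values()) / len(attempted)
    coverage = len(attempted) / n_tasks
    return mean_acc * coverage
```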
4. Baseline Results and Coverage Analysis
Benchmarked models include both closed-source and open-source architectures, scored using the above metrics. The table below presents selected results for top-scoring models:
| Model | Total Score | Attempted | Coverage |
|---|---|---|---|
| Qwen3-Omni-30B-A3B-Inst | 0.434 | 0.523 | 0.828 |
| Qwen2.5-Omni-7B | 0.302 | 0.302 | 1.000 |
| Qwen2.5-VL-72B-Inst | 0.257 | 0.386 | 0.667 |
| GPT-4.1 | 0.143 | 0.430 | 0.333 |
Open-source models exhibit broader coverage (especially on video and audio tasks) than closed-source models, though GPT-4.1 excels on image-only tasks. Specialist architectures (e.g., ultravox) dominate audio but lag on other modalities (Chervyakov et al., 19 Nov 2025).
5. Robustness, Leakage Prevention, and Licensing
MERA Multi incorporates robust methodologies to ensure benchmark validity:
- Watermarking: Imperceptible audio watermarks (AudioSeal), semi-transparent image/video overlays. For all modalities, statistical testing confirms ≤5% impact on JS at 95% CI.
- Leakage Detection: A multimodal SMIA approach computes semantic and length differences for neighbor samples and uses supervised binary classification to detect performance artifacts due to overlap with model training data (see the sketch after this list). Reported AUC–ROC: 88.7% (image), 88.4% (video), 81.3% (audio).
- Licensing: Private data are released exclusively under a non-commercial, evaluation-only license prohibiting any model training or fine-tuning (Chervyakov et al., 19 Nov 2025).
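A hedged sketch of the leakage-detection step referenced above: per-sample features built from semantic and length differences to nearest-neighbor samples feed a supervised binary classifier evaluated with AUC–ROC. The feature construction, the logistic-regression model, and the scikit-learn usage are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch of the leakage-detection step: per-sample features from semantic
# and length differences to neighbor samples, then a supervised binary classifier
# scored with AUC-ROC. Feature design and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def neighbor_features(emb: np.ndarray, lengths: np.ndarray, k: int = 5) -> np.ndarray:
    """Per sample: mean cosine distance and mean length gap to its k nearest neighbors."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    sims = (emb @ emb.T) / (norms * norms.T + 1e-9)      # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)                      # exclude self-matches
    nn_idx = np.argsort(-sims, axis=1)[:, :k]            # indices of k nearest neighbors
    sem_dist = 1.0 - np.take_along_axis(sims, nn_idx, axis=1).mean(axis=1)
    len_gap = np.abs(lengths[:, None] - lengths[nn_idx]).mean(axis=1)
    return np.column_stack([sem_dist, len_gap])

def leakage_auc(emb: np.ndarray, lengths: np.ndarray, seen: np.ndarray) -> float:
    """Train a binary 'seen in training data?' classifier and report AUC-ROC."""
    X = neighbor_features(emb, lengths)
    X_tr, X_te, y_tr, y_te = train_test_split(X, seen, test_size=0.3, stratify=seen, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```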
6. Protocol for Extension to Other Languages and Domains
MERA Multi's methodology is designed for replicability:
- Task Mapping and Adaptation: A flowchart-driven protocol ensures that target-language and cultural requirements are analyzed first, tasks are mapped to taxonomy nodes, and new private data are collected via secure pipelines.
- Prompt and Metric Reuse: All prompt templates and LLM judge architectures are reused, with only natural language translation required.
- Cultural Fidelity and Quality: 5-way annotator overlap and native expert engagement ensure high-quality adaptation; the multimodal SMIA leakage analysis (Section 5) is recommended for each new domain.
- Leaderboard Model: Open tracking of results encourages community benchmarking.
The complete construction protocol is documented as a flowchart (typeset in LaTeX), together with recommended practices for language and domain transfer (Chervyakov et al., 19 Nov 2025).
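To make the reuse story concrete, the following is a minimal configuration sketch assuming a simple dataclass representation; every field name, value, and the Ukrainian target-language example is an illustrative assumption rather than part of the published protocol.

```python
# Hedged sketch of a new-language adaptation config: prompt templates and the LLM
# judge are reused, only natural-language strings are translated. All field names
# and values are illustrative assumptions, not the published protocol.
from dataclasses import dataclass

@dataclass
class AdaptationConfig:
    target_language: str                   # e.g., another Slavic language
    taxonomy_nodes: list[str]              # ability nodes the new tasks must cover
    reuse_prompt_templates: bool = True    # templates kept, wording translated
    reuse_llm_judge: bool = True           # same judge architecture, translated rubric
    annotator_overlap: int = 5             # 5-way overlap for private data collection
    run_leakage_analysis: bool = True      # SMIA-style check on the new domain

config = AdaptationConfig(
    target_language="uk",                  # hypothetical Ukrainian adaptation
    taxonomy_nodes=["Perception/Textual grounding", "Reasoning/Quantitative"],
)
```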
7. Significance and Impact
MERA Multi provides the first comprehensive, culturally and linguistically grounded evaluation suite for Russian multimodal models and introduces replicable best practices applicable across languages within the Slavic family and beyond. It enables rigorous, instruction-based multimodal benchmarking, critical for the broader understanding of architecture limitations, robustness, and real-world applicability in under-resourced languages. By publishing both methodology and metrics, MERA Multi sets a technical baseline for future evaluations and facilitates the principled construction of further multimodal benchmarks (Chervyakov et al., 19 Nov 2025).