Marco-Bench-MIF: Multilingual LLM Instruction Benchmark

Updated 17 July 2025
  • Marco-Bench-MIF is a multilingual benchmark designed to evaluate LLM instruction-following and cultural adaptation across 30 languages.
  • It employs a hybrid localization process combining machine translation with human post-editing to ensure semantic equivalence and fairness.
  • The benchmark reveals significant performance gaps between high-resource and low-resource languages, guiding improvements in multilingual LLM development.

Marco-Bench-MIF is a carefully curated multilingual benchmark designed to evaluate the instruction-following capabilities of LLMs with high cross-linguistic and cross-cultural fidelity. It expands upon previous efforts in instruction-based evaluation by introducing a rigorously localized extension of the IFEval dataset, specifically constructed to expose both the strengths and limitations of LLMs across 30 languages in realistic, non-English settings (2507.11882).

1. Dataset Construction and Localization

Marco-Bench-MIF is generated through a hybrid localization process that combines automated machine translation with human post-editing, bilingual verification, and cultural adaptation. The development pipeline addresses key linguistic constraints—such as the inapplicability of capitalization requirements to Chinese and other non-Latin scripts—and systematically substitutes culturally specific references (e.g., replacing U.S. company names with region-local entities) to preserve the intent and challenge of each instruction prompt.

The preservation of instruction constraints (keywords, formatting, compositional logic) is a central challenge due to variation in grammar, morphology, and orthography across languages. To enforce fairness and semantic equivalence, manual verification and post-hoc correction ensure that subtle translation artifacts and cultural mismatches are resolved, preventing test-set leakage or bias. The resulting benchmark captures nuance and diversity unmatched by datasets produced through naive automated translation.
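
The hybrid pipeline described above can be pictured as a per-prompt localization record that keeps the original English instruction, its machine-translated draft, the human-verified final form, and any culturally substituted entities, while dropping constraints that do not apply to the target script. The sketch below illustrates that reading; the class, field names, and the specific script list are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Scripts for which capitalization-based constraints are inapplicable
# (illustrative subset; the paper notes Chinese and other non-Latin scripts).
NON_CASED_SCRIPTS = {"zh", "ja", "ko", "th", "ar", "he"}

@dataclass
class LocalizedPrompt:
    """One benchmark item as it moves through the hybrid localization pipeline."""
    lang: str                     # target language code, e.g. "zh"
    source_en: str                # original English instruction (IFEval-style)
    mt_draft: str                 # raw machine-translation output
    localized: str = ""           # human post-edited, culturally adapted text
    substitutions: dict = field(default_factory=dict)  # e.g. {"Amazon": "Taobao"}
    constraints: list = field(default_factory=list)    # e.g. ["keywords", "capitalization"]

    def drop_inapplicable_constraints(self) -> list:
        """Remove constraints that make no sense for the target script,
        mirroring the note that capitalization rules do not transfer to
        Chinese and other non-Latin scripts."""
        if self.lang in NON_CASED_SCRIPTS:
            return [c for c in self.constraints if c != "capitalization"]
        return list(self.constraints)

# Example usage (hypothetical content):
item = LocalizedPrompt(
    lang="zh",
    source_en="Write a product review of an Amazon gadget in ALL CAPS.",
    mt_draft="...",  # MT output would go here before human post-editing
    substitutions={"Amazon": "Taobao"},
    constraints=["keywords", "capitalization"],
)
print(item.drop_inapplicable_constraints())  # -> ['keywords']
```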

2. Coverage, Evaluation Protocol, and Metrics

Marco-Bench-MIF covers 30 languages, each reflecting a different degree of resource richness and typological diversity. The evaluation protocol is designed to probe both prompt-level and instruction-level adherence, using two main scoring metrics:

  • Strict Evaluation:

$$E_{\text{strict}}(r, i) = \begin{cases} 1, & \text{if response } r \text{ exactly satisfies instruction } i \\ 0, & \text{otherwise} \end{cases}$$

This metric judges only perfectly compliant outputs as correct, exposing subtle weaknesses in constraint following and formatting.

  • Loose Evaluation:

$$E_{\text{loose}}(r, i) = \max_{\tau \in T} \left\{ E_{\text{strict}}(\tau(r), i) \right\}$$

where $T$ is a set of normalization operations (e.g., removal of markdown or neutralization of superficial text structure). This metric accounts for superficial or formatting deviations that do not affect substantive correctness.

The multi-level evaluation enables granular analysis—not only in aggregate accuracy per language/model, but also in revealing specific failure points for individual instruction types.
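
The two metrics can be made concrete with a short sketch. The checker interface, the particular normalization operations standing in for $T$, and the function names below are illustrative assumptions; only the strict/loose relationship itself follows the definitions above.

```python
import re
from typing import Callable, Iterable

# An instruction checker returns True iff the (possibly normalized) response
# satisfies the instruction exactly, i.e. it implements E_strict for one instruction.
Checker = Callable[[str], bool]

def e_strict(response: str, checker: Checker) -> int:
    """Strict evaluation: 1 only if the raw response satisfies the instruction."""
    return int(checker(response))

# Stand-ins for T: normalization operations applied before re-checking.
# These transforms (markdown stripping, whitespace squeezing, quote removal)
# are examples of "superficial" edits, not the benchmark's official set.
NORMALIZERS: Iterable[Callable[[str], str]] = (
    lambda r: r,                               # identity, so loose >= strict
    lambda r: re.sub(r"[*_`#>]", "", r),       # strip common markdown marks
    lambda r: re.sub(r"\s+", " ", r).strip(),  # collapse whitespace
    lambda r: r.strip('"\''),                  # drop surrounding quotes
)

def e_loose(response: str, checker: Checker) -> int:
    """Loose evaluation: max of E_strict over normalized variants of the response."""
    return max(e_strict(tau(response), checker) for tau in NORMALIZERS)

# Example: an instruction requiring the response to end with the keyword "DONE".
ends_with_done: Checker = lambda r: r.endswith("DONE")
resp = "**Here is my answer ... DONE**"
print(e_strict(resp, ends_with_done))  # 0: markdown breaks exact compliance
print(e_loose(resp, ends_with_done))   # 1: after stripping markdown, it complies
```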

3. Multilingual Analysis and Key Findings

Comprehensive evaluation of more than 20 LLMs on Marco-Bench-MIF reveals several central phenomena:

  • Resource-Specific Performance Gaps: There is a consistent 25–35% accuracy gap between high-resource and low-resource languages. High-resource languages (such as English, French, Chinese) benefit from larger pretraining corpora, model exposure, and vocabulary coverage, whereas low-resource languages often exhibit dramatically reduced compliance, especially for compositional and precise instructions (see the sketch after this list).
  • Impact of Model Scale: Performance varies by 45–60% across different model sizes. Larger models tend to generalize better across languages, but persistent script-specific and structural challenges remain, especially in right-to-left and non-Latin script languages.
  • Localization vs. Naive Translation: Benchmarks using only machine-translated (non-localized) data systematically underestimate model capability by 7–22%. This indicates that machine translation fails to capture critical cultural and syntactic nuances, resulting in artificially low scores and misleading assessments of LLM multilingual proficiency.
  • Preservation of Constraints: A recurring difficulty for LLMs is the preservation of keyword consistency and simultaneous satisfaction of multiple compositional constraints. Minor errors in translation of a single keyword or omission of a required formatting element lead to significant drops in strict evaluation scores, accentuating the challenge of robust instruction following in typologically distant languages.
  • Script and Cultural Effects: Languages with non-Latin alphabets or distinctive punctuation (e.g., Chinese, Arabic, Hebrew) present unique formatting challenges. Tasks trivial in English may become nontrivial or ambiguous due to script or orthographic convention.
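
The resource-gap figures above are comparisons of aggregate accuracy across language groups. A small sketch of that comparison, assuming per-language accuracies are already computed; the groupings and values below are made up for illustration and are not results from the paper.

```python
from statistics import mean

# Hypothetical per-language strict accuracies (illustrative values only).
accuracy = {
    "en": 0.82, "fr": 0.78, "zh": 0.76,   # treated here as high-resource
    "yo": 0.48, "km": 0.45, "am": 0.50,   # treated here as low-resource
}
HIGH = {"en", "fr", "zh"}
LOW = {"yo", "km", "am"}

high_avg = mean(accuracy[lang] for lang in HIGH)
low_avg = mean(accuracy[lang] for lang in LOW)

# The paper reports gaps of roughly 25-35% between the two groups.
print(f"high-resource avg: {high_avg:.2f}")
print(f"low-resource avg:  {low_avg:.2f}")
print(f"gap: {(high_avg - low_avg) * 100:.1f} percentage points")
```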

4. Benchmark Structure and Availability

The Marco-Bench-MIF dataset and full evaluation scripts are openly available at https://github.com/AIDC-AI/Marco-Bench-MIF. The benchmark is structured to be extensible, enabling future addition of further languages and dialects. Each language set is accompanied by both human-localized and automatically translated variants, facilitating direct comparison and error analysis.

Prompt suites are partitioned into localized and non-localized (machine-translated) settings, allowing for systematic ablation studies and fairness checks across linguistic boundaries.
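
A sketch of how the localized and machine-translated variants could be loaded and compared side by side for such an ablation. The directory layout, file names, and record schema here are assumptions for illustration and should be checked against the repository linked above.

```python
import json
from pathlib import Path

# Assumed layout (illustrative; verify against the actual repository):
#   marco_bench_mif/<lang>/localized.jsonl
#   marco_bench_mif/<lang>/machine_translated.jsonl
# Each line is assumed to hold one prompt record with its instruction constraints.

def load_split(root: Path, lang: str, variant: str) -> list[dict]:
    """Load one language/variant split as a list of prompt records."""
    path = root / lang / f"{variant}.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def compare_variants(root: Path, lang: str, evaluate) -> dict:
    """Run the same evaluation on localized vs. machine-translated prompts,
    mirroring the ablation described above. `evaluate` maps a list of
    records to an accuracy in [0, 1]."""
    localized = load_split(root, lang, "localized")
    translated = load_split(root, lang, "machine_translated")
    return {
        "localized_acc": evaluate(localized),
        "mt_acc": evaluate(translated),
    }

# Example usage with a stand-in evaluator:
# scores = compare_variants(Path("marco_bench_mif"), "ar", evaluate=my_model_eval)
# print(scores)  # the paper reports MT-only data underestimating accuracy by 7-22%
```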

5. Implications for Multilingual Model Development

Marco-Bench-MIF demonstrates that instruction-following ability in LLMs is not universally robust across languages, even for models with large parameter counts and sizable pretraining corpora. The observed performance gaps and error breakdowns provide direct evidence that model designers must confront both data resource gaps and the limits of direct transfer from English-centric architectures.

Preserving prompt-level constraint logic, maintaining compositional instruction fidelity, and correctly handling culturally adapted references are all necessary for LLMs intended for global deployment. Failing to explicitly localize benchmarks, or evaluating only on machine-translated data, can substantially misrepresent model capability and inhibit fair progress measurement.

6. Identified Challenges and Prospects for Future Work

Key open challenges emerging from the benchmark include:

  • Expansion to More Languages and Modalities: Incorporating typologically distant or minority languages (e.g., those with distinct scripts or rare structures) is essential for comprehensive multilingual evaluation.
  • Dynamic/Interactive Evaluation: Static prompt-response evaluation may not reflect realistic user interactions. Interactive or incremental prompt refinement could provide deeper diagnostic power.
  • Instruction Engineering and Data Collection: Developing methods for systematically generating culturally and linguistically robust instructions and gold solutions remains an active research area.
  • Architectural Innovations: Advances in modeling—such as mechanisms specifically for compositional constraint tracking or script-aware tokenization—may be warranted to close the observed performance gaps.

A plausible implication is that instruction-following, especially under localization and compositional constraint, should be a central consideration in both the construction of evaluation suites and the training objectives for next-generation LLMs.

7. Summary Table: Core Properties

| Property | Description | Source |
|---|---|---|
| Number of Languages | 30 | (2507.11882) |
| Localization Approach | Hybrid pipeline (machine translation with human/cultural adaptation and verification) | (2507.11882) |
| Model Coverage | 20+ LLMs evaluated (varying in scale and capabilities) | (2507.11882) |
| Accuracy Gap | 25–35% between high- and low-resource languages | (2507.11882) |
| Localization Effect | 7–22% underestimation with machine-translated vs. localized variants | (2507.11882) |
| Main Challenges | Keyword consistency, compositional constraints, script-specific adaptation | (2507.11882) |

Marco-Bench-MIF represents a significant advance in the multilingual evaluation of LLMs by combining cultural localization with linguistic breadth. It draws attention to crucial gaps in model robustness across languages and scripts, establishing an essential resource for both research and model development in instruction-following AI.

References

1. arXiv:2507.11882