
MGSM Benchmark: Multilingual Grade School Math

Updated 25 November 2025
  • MGSM is a benchmark that assesses multilingual mathematical reasoning using 250 grade-school word problems across 10 diverse languages.
  • It employs meticulous human translation and multi-tiered quality controls along with varied chain-of-thought prompting strategies.
  • Results reveal significant improvements in translation fidelity and reduced performance gaps in multilingual LLM evaluations after refinements.

The Multilingual Grade School Math (MGSM) benchmark is a standardized evaluation suite designed to assess the mathematical reasoning abilities of LLMs across a diverse set of natural languages. Originating as a direct extension of the English-only GSM8K benchmark, MGSM enables systematic investigation of cross-lingual chain-of-thought (CoT) reasoning, with particular attention to both high-resource and underrepresented languages. MGSM is now a widely used resource for both static and functional multilingual evaluation and has catalyzed methodological development in translation fidelity, quality assurance, and model robustness for multilingual math reasoning (Shi et al., 2022, Chen et al., 2023, Peter et al., 7 Nov 2025, Ojewale et al., 25 Jun 2025).

1. Benchmark Composition and Multilingual Coverage

MGSM comprises 250 grade-school math word problems, originally sampled from the GSM8K test split. Each problem requires between two and eight reasoning steps and is manually translated from English into ten typologically diverse languages (Chinese, German, French, Spanish, Russian, Japanese, Bengali, Swahili, Telugu, and Thai), yielding parallel versions in eleven languages overall (Shi et al., 2022). These span eight language families, covering a range of scripts (Latin, Cyrillic, Han, Kana, Devanagari, Thai, etc.) and resource levels in LLM pretraining corpora, with corpus frequencies ranging from ~0.78% for Chinese to ~0.0002% for Telugu in PaLM’s pretraining data (Shi et al., 2022). MGSM’s design explicitly covers both widely represented and “low-resource” languages, facilitating assessment of linguistic generalization in arithmetic and reasoning tasks.

2. Translation Methodology and Quality Control

Original MGSM translations were produced by professional native speakers (1–5 per language), each with ≥2 years of experience and contractually barred from using machine translation (MT) (Shi et al., 2022). The workflow included translator selection, assignment, and multi-tiered human verification: random rechecking by a secondary translator, plus n-gram overlap analysis against known MT output to detect unduly literal or auto-generated content. Any high-overlap case underwent manual review, and only verified human translations entered the released dataset (Shi et al., 2022).
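The n-gram overlap screen described above can be sketched as follows. This is an illustrative reconstruction, not the exact procedure of Shi et al. (2022); the n-gram order, whitespace tokenization, and flagging threshold are all assumptions:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple]:
    """Return all n-grams of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(human: str, mt: str, n: int = 4) -> float:
    """Fraction of the human translation's n-grams that also appear
    in the machine-translation output for the same item."""
    h = ngrams(human.lower().split(), n)
    m = set(ngrams(mt.lower().split(), n))
    if not h:
        return 0.0
    return sum(1 for g in h if g in m) / len(h)

def flag_for_review(human: str, mt: str, threshold: float = 0.6) -> bool:
    """Flag translations whose overlap with MT output exceeds the
    (assumed) threshold, triggering secondary manual review."""
    return ngram_overlap(human, mt) >= threshold
```

A high score does not prove MT usage; it only routes the item to a human reviewer, which matches the workflow described above.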

However, subsequent analyses revealed that even professional translation yields non-trivial semantic errors (e.g., “5 less” rendered as “5 times less” in German), arithmetic ambiguities (“round up” vs. “round to nearest”), and context inversions (weekday mismatches). Such errors introduced significant artifactual gaps into reported cross-lingual model performance, motivating automatic QA pipelines: majority-voting schemes across high-performing LLMs flag ambiguous or error-prone items, triggering targeted back-translation and manual correction (Peter et al., 7 Nov 2025). This refinement led to the release of an MGSM-Rev2 variant with demonstrably improved fairness in multilingual evaluation.

3. Benchmark Format, Prompting Strategies, and Evaluation Protocols

MGSM offers arithmetic and multi-step word problems involving diverse operations: addition, subtraction, multiplication, division, fractions, unit conversions, and rate problems (Shi et al., 2022). Model evaluation leverages multiple CoT prompting schemes:

  • Direct: No intermediate CoT steps.
  • Native-CoT: CoT in the problem’s target language.
  • EN-CoT: CoT in English with native-language question.
  • Translate-EN: Machine-translated question and English CoT (Shi et al., 2022).

Few-shot settings include “Native-Exemplars” (in-language), “English-Exemplars” (cross-lingual), and “Multilingual-Exemplars” (one per high-resource language). Each output is scored via strict numeric accuracy: Accuracy = (number of correct answers / 250) × 100%. Statistical reporting uses error rates and 95% confidence intervals under the normal approximation (Shi et al., 2022).
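The scoring rule and the normal-approximation confidence interval can be written out directly; this is a minimal sketch of the arithmetic, not the benchmark's official scoring script:

```python
import math

def mgsm_accuracy(correct: int, total: int = 250):
    """Strict numeric accuracy in percent, with a 95% confidence
    interval under the normal approximation to the binomial
    (z = 1.96), clipped to [0, 100]."""
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1 - p) / total)
    return (p * 100,
            (max(0.0, p - half_width) * 100,
             min(1.0, p + half_width) * 100))
```

For example, 156 correct answers out of 250 gives 62.4%, the PaLM-540B Translate-EN English score reported below.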

In parallel, larger-scale instruction corpora such as MGSM8KInstruct translate the GSM8K training split (7,473 items per language) via prompted LLMs, applying formula-consistency checks and filtering out runs with formula errors to support robust SFT (Chen et al., 2023).

4. Results, Cross-Lingual Gaps, and Data Quality Revisions

Initial MGSM results showed model scale-dependent emergence of multilingual math reasoning: PaLM-540B and GPT-3-davinci-class models approached or exceeded 50% accuracy in both high- and low-resource languages (e.g., PaLM-540B achieves 51.2% on Swahili and 57.2% on German; see Table below) (Shi et al., 2022). EN-CoT and Translate-EN strategies consistently outperformed direct and native-only CoT, with “Translate-EN” yielding peak scores (English: 62.4%, German: 57.2%, Bengali: 53.2%, Swahili: 51.2%) (Shi et al., 2022).

| Language | PaLM-540B, Translate-EN (%) | ChatGPT 2-shot (%) | MathOctopus-13B-C (%) |
|----------|-----------------------------|--------------------|-----------------------|
| English  | 62.4 | 67.2 | 51.6 |
| German   | 57.2 | n/a  | 46.0 |
| French   | 55.2 | n/a  | 51.2 |
| Spanish  | 60.0 | n/a  | n/a  |
| Bengali  | 53.2 | n/a  | n/a  |
| Swahili  | 51.2 | 40.0 | 46.0 |
| Chinese  | 55.6 | 52.8 | 48.8 |

(n/a = not reported in the cited sources.)

However, critical re-examination revealed that much of the observed English-to-other-language performance gap (often 15–40 p.p. as originally reported) was an artifact of translation errors and inconsistent answer extraction scripts. After automatic QA and standardization, the maximal en↔L2 gap contracted to <6 p.p. for all strong models, and in some cases to <2 p.p., with low-resource languages such as Bengali and French gaining +16 to +29 p.p. in measured accuracy (Peter et al., 7 Nov 2025). This finding substantially revises earlier conclusions about LLMs’ cross-lingual mathematical reasoning limits.

5. Extensions: MGSM8KInstruct, Supervised Fine-Tuning, and the MathOctopus Family

To address training data scarcity for xMR (multilingual math reasoning), MGSM8KInstruct extends the original benchmark: all 7,473 GSM8K training examples and their solution chains are automatically translated to ten target languages using LLMs with in-prompt exemplars and strict token-preservation rules (Chen et al., 2023). A rigorous pipeline ensures arithmetic and formula consistency, discarding translation runs with repeated formula errors.
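A formula-consistency filter of the kind described can be sketched as follows. The regex and the exact-match comparison policy are assumptions for illustration; Chen et al. (2023) describe their own pipeline:

```python
import re

# Matches simple arithmetic chains such as "48 / 2 = 24" that CoT
# solutions typically contain; the pattern is an illustrative assumption.
FORMULA_RE = re.compile(
    r"\d+(?:\.\d+)?(?:\s*[+\-*/]\s*\d+(?:\.\d+)?)+\s*=\s*\d+(?:\.\d+)?"
)

def formulas(text: str) -> list[str]:
    """Extract arithmetic formulas, stripping whitespace so that
    language-independent math can be compared across translations."""
    return [re.sub(r"\s+", "", m) for m in FORMULA_RE.findall(text)]

def translation_consistent(source_cot: str, translated_cot: str) -> bool:
    """Accept a translated solution only if it preserves every formula
    of the English source exactly and in order (token preservation)."""
    return formulas(source_cot) == formulas(translated_cot)
```

Runs whose translations repeatedly fail this check would be discarded, as described above.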

Using MGSM8KInstruct for supervised fine-tuning (SFT), the MathOctopus models are trained with two key strategies:

  • Parallel Training: Both question and CoT in the same target language, optimizing in-domain MGSM performance.
  • Cross Training: English question, target-language CoT, yielding better out-of-domain and monolingual English performance (Chen et al., 2023).
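Constructing the SFT pairs for the two strategies reduces to choosing which language each side of the pair comes from. The field names below are illustrative assumptions, not the actual MathOctopus data format:

```python
def make_sft_example(q_en: str, cot_en: str,
                     q_l2: str, cot_l2: str,
                     strategy: str) -> dict:
    """Build one supervised fine-tuning pair under the two MathOctopus
    training strategies described above."""
    if strategy == "parallel":
        # Question and chain-of-thought both in the target language.
        return {"input": q_l2, "target": cot_l2}
    if strategy == "cross":
        # English question, target-language chain-of-thought.
        return {"input": q_en, "target": cot_l2}
    raise ValueError(f"unknown strategy: {strategy}")
```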

MathOctopus-13B (cross-trained) achieves 47.6% on MGSM, surpassing ChatGPT’s 46.3%. Notably, cross-lingual SFT can significantly boost monolingual English performance (e.g., MathOctopus-7B gains +8.4 p.p. over monolingual SFT on GSM8K English) (Chen et al., 2023). Multilingual rejection fine-tuning (xRFT) adds correct alternative reasoning paths per sample, yielding modest (1–2 p.p.) in-domain gains, but potential out-of-domain losses if overused.
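The rejection step at the heart of xRFT, keeping only sampled reasoning paths whose final answer matches the gold answer, can be sketched as below; the de-duplication and the injected answer extractor are simplifying assumptions:

```python
def xrft_augment(sampled_paths: list[str], gold: str, extract) -> list[str]:
    """Keep sampled reasoning paths whose extracted final answer matches
    the gold answer, de-duplicated in order of appearance, as extra SFT
    targets. `extract` maps a reasoning path to its final answer."""
    kept, seen = [], set()
    for path in sampled_paths:
        if path not in seen and extract(path) == gold:
            seen.add(path)
            kept.append(path)
    return kept
```

In a real pipeline, `extract` would be the same answer-extraction routine used at evaluation time, keeping training and scoring consistent.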

6. Practical Usage: Evaluation Pitfalls and Recommendations

Recent studies document that uncorrected translation artifacts and non-standardized answer extraction have led to erroneous or exaggerated conclusions regarding cross-lingual capability gaps in LLMs (Peter et al., 7 Nov 2025). For robust MGSM evaluation:

  • Employ semi-automatic QA pipelines: flag suspect translations where a majority of strong LLMs fail to match the English reference; verify flagged items manually or via back-translation.
  • Use standardized answer extraction scripts—such as the “last-number” approach—to ensure numerically consistent comparisons across languages and scripts. Pseudocode implementing this extractor is explicitly detailed in (Peter et al., 7 Nov 2025).
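Peter et al. give their own pseudocode for the extractor; as an illustration only, a minimal last-number extractor might look like the sketch below. The regex and normalization choices are assumptions (note that Python's `\d` already matches non-Latin decimal digits such as Bengali numerals, which multilingual scoring requires):

```python
import re

# Optionally signed number with thousands separators and a decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_last_number(text: str):
    """Return the last number-like token in a model response with
    thousands separators stripped, or None if no number is present."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def is_correct(response: str, gold: str) -> bool:
    """Compare the extracted last number against the gold answer,
    numerically where possible, else as strings."""
    pred = extract_last_number(response)
    if pred is None:
        return False
    try:
        return float(pred) == float(gold)
    except ValueError:
        return pred == gold
```

Using one such extractor for every language removes a major source of the cross-lingual scoring inconsistencies described above.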

Broader recommendations include publishing both raw and post-QA-corrected benchmark data, maintaining transparency for answer parsing code, and standardizing evaluation protocol to comport with best practices from the machine translation community (e.g., SacreBLEU analogs) (Peter et al., 7 Nov 2025).

7. Broader Impact and Future Directions

MGSM and its derivatives have become canonical resources for probing LLM multilingual generalization in arithmetic, with demonstrated utility for downstream reasoning (XCOPA, XL-WiC), model pretraining, and instruction-tuning research (Shi et al., 2022, Chen et al., 2023). The co-evolution of MGSM with functional and symbolic evaluation suites (e.g., CL-GSM-Symbolic) has revealed limitations of purely static multilingual benchmarks, especially for low-resource settings (Ojewale et al., 25 Jun 2025). Ongoing and future work aims to expand both linguistic coverage (e.g., inclusion of Arabic, Hindi, Yoruba) and scale (70B+ backbone models), as well as integrate RLHF-style refinements and systematic error correction pipelines (Chen et al., 2023, Ojewale et al., 25 Jun 2025).

MGSM’s trajectory illustrates the importance of data quality, evaluation methodology, and linguistic diversity in progressing towards truly global, robust, and reliable LLM reasoning capabilities.
