
MGSM8KInstruct: Multilingual Math Tuning

Updated 2 April 2026
  • MGSM8KInstruct is a multilingual math instruction-tuning dataset comprising parallel problem–solution pairs across 10 languages.
  • It is built using automated translation and rigorous formula verification to ensure high mathematical fidelity with error rates below 1%.
  • The dataset supports supervised fine-tuning and cross-training approaches that significantly enhance LLM performance in both in-domain and cross-lingual tasks.

MGSM8KInstruct is a large-scale instruction-tuning corpus specifically constructed for multilingual mathematical reasoning with LLMs. Developed to address the scarcity of diverse, high-quality datasets for multilingual mathematical tasks, MGSM8KInstruct comprises parallel problem–solution pairs spanning ten languages, enabling both supervised fine-tuning (SFT) and the development of advanced multilingual math reasoning models. Its introduction has directly advanced the methodological frontier in multilingual math LLMs, supporting both improved in-domain accuracy and transferability across languages (Chen et al., 2023).

1. Dataset Composition and Construction

MGSM8KInstruct is derived from the English GSM8K training set (7,473 items), augmented by translating each problem–solution pair into nine additional languages using ChatGPT. The resulting dataset contains approximately 73,600 pairs distributed as follows:

| Language | Pairs | Notable issues |
| --- | --- | --- |
| English | 7,473 | Reference |
| Swahili | 7,472 | Low-resource: numeral/formula checks |
| Chinese | 7,466 | Non-Latin script, formula integrity |
| Bengali | 6,539 | 12% dropped for formula mismatch |
| German | 7,466 | |
| Spanish | 7,470 | |
| French | 7,469 | |
| Japanese | 7,471 | Non-Latin script |
| Russian | 7,361 | Minor manual calibration |
| Thai | 7,473 | Non-Latin script |
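
As a quick consistency check on the table, the per-language counts can be summed and compared against the English reference. The snippet below is only an arithmetic sketch; the dictionary name `PAIR_COUNTS` and the `drop_rate` helper are illustrative and not part of the dataset release.

```python
# Pair counts per language, copied from the table above.
PAIR_COUNTS = {
    "English": 7473, "Swahili": 7472, "Chinese": 7466, "Bengali": 6539,
    "German": 7466, "Spanish": 7470, "French": 7469, "Japanese": 7471,
    "Russian": 7361, "Thai": 7473,
}

total = sum(PAIR_COUNTS.values())
reference = PAIR_COUNTS["English"]

# Share of English items with no surviving translation, per language.
drop_rate = {
    lang: (reference - n) / reference for lang, n in PAIR_COUNTS.items()
}

print(total)                                 # 73660
print(round(100 * drop_rate["Bengali"], 1))  # 12.5
```

The total of 73,660 agrees with the "approximately 73,600" figure, and the Bengali drop rate of about 12.5% matches the roughly 12% noted in the table.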

Translation emphasizes the preservation of all Arabic numerals and formula fragments (e.g., “<<12/60=0.2>>”) and involves an automated post-check comparing formulas in translated solutions to the English originals. Instances with repeated formula misalignment are discarded, keeping translation error below 1% (Chen et al., 2023).
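
The automated post-check described above could look roughly like the following. This is a minimal sketch, assuming that formula fragments always appear between `<<` and `>>` and that a translation passes only when its fragments equal the English ones in the same order; the function names and the example sentences are made up for illustration and do not come from the original pipeline.

```python
import re

# Matches GSM8K-style calculator annotations such as <<12/60=0.2>>.
FORMULA_RE = re.compile(r"<<(.*?)>>")

def extract_formulas(solution: str) -> list[str]:
    """Return formula fragments in order, with internal whitespace removed."""
    return [re.sub(r"\s+", "", f) for f in FORMULA_RE.findall(solution)]

def formulas_match(english: str, translated: str) -> bool:
    """True when both solutions carry the same fragments in the same order."""
    return extract_formulas(english) == extract_formulas(translated)

# Illustrative (invented) solutions: one faithful translation, one corrupted.
english = "He works <<12/60=0.2>>0.2 hours and earns <<0.2*50=10>>10 dollars."
good = "Il travaille <<12/60=0.2>>0,2 heure et gagne <<0.2*50=10>>10 dollars."
bad = "Il travaille <<12/60=0.2>>0,2 heure et gagne <<0.2*50=100>>100 dollars."

print(formulas_match(english, good))  # True
print(formulas_match(english, bad))   # False
```

Comparing only the bracketed fragments keeps the check independent of the target language, since the surrounding prose is expected to differ while the arithmetic is not.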

Each item is encapsulated in a fixed instruction–response schema:

{
  "instruction": "translated word problem",
  "output": "LaTeX-style chain-of-thought solution"
}
The full question fills the "instruction" field, and the stepwise solution in the target language, with LaTeX-style mathematical formatting, fills the "output" field.
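
To make the schema concrete, the sketch below wraps one pair in the two-field record and serializes it as a single JSON line. The helper `make_record`, the sample problem, and the choice of one record per line are assumptions made for illustration; the source does not specify the on-disk file layout.

```python
import json

def make_record(question: str, solution: str) -> dict[str, str]:
    """Wrap one translated problem-solution pair in the two-field schema."""
    return {"instruction": question.strip(), "output": solution.strip()}

# Invented sample item, not taken from the dataset.
record = make_record(
    "  A shop sells 3 pens for 6 dollars. How much does 1 pen cost?  ",
    "One pen costs 6/3 = <<6/3=2>>2 dollars.",
)

# ensure_ascii=False keeps non-Latin scripts readable in the serialized text.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Passing `ensure_ascii=False` matters for languages such as Thai, Bengali, or Chinese, where the default escaping would turn every character into a `\uXXXX` sequence.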

2. Instruction-Generation Protocol and Quality Control

The translation protocol is designed for maximal mathematical fidelity:

  • ChatGPT translation prompts require: preservation of numerals, unchanged (but recalculated) formula fragments within “<<…=…>>”, and mimicry of two in-prompt worked examples.
  • Each translation undergoes an automated formula check. If five consecutive formula fragments in a language fail to align, that item is dropped.
  • Manual spot-checking verifies that translation errors remain below 1%.
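
One literal reading of the drop rule above is a run-length test over per-fragment alignment results. The sketch below assumes the alignment outcome of each fragment is already available as a boolean; the input format and the name `should_drop` are hypothetical, and the original pipeline may apply the threshold differently.

```python
MAX_CONSECUTIVE_FAILURES = 5  # threshold taken from the rule above

def should_drop(aligned: list[bool], limit: int = MAX_CONSECUTIVE_FAILURES) -> bool:
    """Return True once `limit` misaligned fragments occur in a row.

    `aligned[i]` is True when the i-th translated fragment matches its
    English counterpart (a hypothetical input format).
    """
    run = 0
    for ok in aligned:
        # A matching fragment resets the run; a mismatch extends it.
        run = 0 if ok else run + 1
        if run >= limit:
            return True
    return False

print(should_drop([False] * 5))                         # True
print(should_drop([False] * 4 + [True] + [False] * 4))  # False
```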

Special attention is given to low-resource and non-Latin-script languages; numerals and embedded formulae are carefully checked for every language to ensure consistency with the English original. In Russian and Bengali, persistent mistranslation of proper names results in aggressive culling to maintain quality (Chen et al., 2023).

3. Integration into Supervised Fine-Tuning and Model Training

MGSM8KInstruct is central to SFT for multilingual mathematical reasoning models such as MathOctopus. Key aspects include:

  • Models: LLaMA-2 (7B, 13B), LLaMA-1 (33B), and variants.
  • Training regimes:
    • Parallel-training: both question and solution in the same language.
    • Cross-training: question in English, solution in the foreign language.
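
The difference between the two regimes comes down to which language the question is drawn from. The following sketch assumes each item is available as an English question plus a target-language question and solution; the function name `build_sft_pair` and the Spanish sample are illustrative only.

```python
def build_sft_pair(english_q: str, target_q: str, target_solution: str,
                   regime: str) -> dict[str, str]:
    """Assemble one training pair under the chosen regime.

    parallel: question and solution both in the target language.
    cross:    English question, target-language solution.
    """
    if regime == "parallel":
        question = target_q
    elif regime == "cross":
        question = english_q
    else:
        raise ValueError(f"unknown regime: {regime!r}")
    return {"instruction": question, "output": target_solution}

# Invented sample item in English and Spanish.
en_q = "Ana has 2 apples and buys 3 more. How many apples does she have?"
es_q = "Ana tiene 2 manzanas y compra 3 más. ¿Cuántas manzanas tiene?"
es_sol = "Ana tiene 2+3 = <<2+3=5>>5 manzanas."

parallel_pair = build_sft_pair(en_q, es_q, es_sol, "parallel")
cross_pair = build_sft_pair(en_q, es_q, es_sol, "cross")
```

In both regimes the solution stays in the target language, so the two resulting pairs differ only in their `"instruction"` field.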

Experiments adopt prompt templates mirroring the instruction–response format used in the dataset. Standard hyperparameters include a learning rate of 2 × 10^{-5}, three epochs, a maximum length of 512 tokens, and batch sizes of 8, 4, and 2 for the 7B, 13B, and 33B models, respectively.
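
The hyperparameters listed above can be collected in a small configuration object. This is a sketch only: the class `SFTConfig` and its field names are hypothetical and not tied to any particular training framework, and the text does not say whether the batch sizes are per device or global.

```python
from dataclasses import dataclass

# Batch sizes per model scale, as reported above.
BATCH_SIZE_BY_SCALE = {"7B": 8, "13B": 4, "33B": 2}

@dataclass(frozen=True)
class SFTConfig:
    """Hypothetical container for the reported SFT settings."""
    model_scale: str
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_length: int = 512  # maximum sequence length in tokens

    @property
    def batch_size(self) -> int:
        # Larger models use smaller batches in the reported setup.
        return BATCH_SIZE_BY_SCALE[self.model_scale]

config = SFTConfig(model_scale="13B")
print(config.batch_size)  # 4
```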

4. Empirical Impact on Multilingual Mathematical Reasoning

MGSM8KInstruct enables significant improvements over monolingual baselines for mathematical reasoning across a suite of benchmarks. On the in-domain MGSM test set (250 items per language), MathOctopus-7B achieves:

  • 32.2% accuracy with parallel-training,
  • 40.0% with cross-training, compared to 22.6% by vanilla LLaMA-2.

On the original English GSM8K, MathOctopus-7B achieves 50.8% accuracy with cross-training (up from 42.4%) and 49.3% with parallel-training; similar gains carry over to the 13B and 33B scales. MathOctopus-13B (cross-training) reaches 47.6% on MGSM and 56.6% on GSM8K, outperforming reported ChatGPT two-shot baselines (Chen et al., 2023).
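
The 7B figures quoted above translate into the following absolute gains in percentage points. The snippet is plain arithmetic over the reported numbers; the dictionary layout is illustrative, and the text does not specify the exact configuration behind the 42.4% GSM8K baseline, so it is labeled only as "baseline" here.

```python
# Accuracies (%) quoted above for the 7B scale.
MGSM = {"baseline": 22.6, "parallel": 32.2, "cross": 40.0}
GSM8K = {"baseline": 42.4, "parallel": 49.3, "cross": 50.8}

def gains(scores: dict[str, float]) -> dict[str, float]:
    """Absolute gain over the baseline, in percentage points."""
    base = scores["baseline"]
    return {k: round(v - base, 1) for k, v in scores.items() if k != "baseline"}

print(gains(MGSM))   # {'parallel': 9.6, 'cross': 17.4}
print(gains(GSM8K))  # {'parallel': 6.9, 'cross': 8.4}
```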

Further, the corpus substantially boosts both in-domain (parallel) and out-of-domain (cross) generalization. Even partial multilingual SFT (three-language subset) improves held-out language performance, with effects modulated by typological proximity.

5. Methodological Discoveries and Observations

Empirical investigation with MGSM8KInstruct reveals several principles:

  • Multilingual SFT, even when performed on entirely synthetic data, also improves monolingual performance: MathOctopus-7B rises from 42.4% to 50.8% English accuracy after multilingual SFT.
  • Cross-training yields superior out-of-domain generalization, including on benchmarks such as MSVAMP, compared to parallel-training which excels in strictly in-domain, parallel-format evaluation.
  • Augmenting SFT with multilingual rejection sampling (xRFT) slightly diversifies provable solution paths but provides only marginal 1–2% lifts in-domain and may reduce cross-lingual robustness.
  • The benefit is robust to ablation: multilingual SFT on a limited language subset still improves performance on languages outside the training set.

A plausible implication is that aligned, parallel mathematical corpora created via systematic translation strategies offer a powerful route for equipping LLMs for robust multilingual mathematical reasoning beyond monolingual-only approaches.

6. Applications in Preference Optimization Frameworks

MGSM8KInstruct serves as the primary SFT baseline and in-domain training set for preference-based alignment recipes such as MAPO (Multilingual Alignment-as-Preference Optimization). In MAPO, SFT on MGSM8KInstruct precedes preference optimization using translation-based consistency signals, and MathOctopus-13B with MAPO-DPO achieves 58.0% on MGSM, a gain of 6.6 points over the SFT baseline (She et al., 2024). The dataset's scale and multilingual alignment also support transfer to non-English mathematical reasoning and facilitate preference pair construction for DPO and PPO training on multilingual math tasks.
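
Preference pair construction for DPO can be sketched generically as follows. This is not MAPO's actual procedure: the scores stand in for whatever consistency signal is used, and the function name, the input format, and the `min_gap` parameter are all hypothetical.

```python
from itertools import combinations

def build_preference_pairs(responses: list[tuple[str, float]],
                           min_gap: float = 0.0) -> list[dict[str, str]]:
    """Turn scored responses to one prompt into (chosen, rejected) pairs.

    `responses` is a list of (text, score) tuples; a higher score means the
    response ranks better under the (hypothetical) scoring signal.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(responses, 2):
        if abs(score_a - score_b) <= min_gap:
            continue  # skip ties and near-ties: no clear preference
        if score_a > score_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs

# Placeholder responses with made-up scores.
scored = [("solution A", 0.9), ("solution B", 0.4), ("solution C", 0.4)]
pairs = build_preference_pairs(scored)
print(len(pairs))  # 2
```

Skipping tied scores avoids training on pairs where the signal expresses no preference, which is why B and C above yield no pair.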

7. Significance and Future Directions

MGSM8KInstruct validates the construction of synthetic, formula-checked instruction-tuning corpora for scaling multilingual math reasoning in LLMs. The observed monolingual and cross-lingual gains underscore the utility of parallel data in mathematical domains where annotated resources are scarce. The pipeline established for MGSM8KInstruct—translation with embedded formula verification and consistent instruction schema—provides a blueprint for extending instruction tuning to additional domains and language families, supporting the development of scalable, high-fidelity multilingual LLMs for scientific and technical reasoning (Chen et al., 2023).
