MGSM8KInstruct: Multilingual Math Tuning
- MGSM8KInstruct is a multilingual math instruction-tuning dataset comprising parallel problem–solution pairs across 10 languages.
- It is built using automated translation and rigorous formula verification to ensure high mathematical fidelity with error rates below 1%.
- The dataset supports supervised fine-tuning and cross-training approaches that significantly enhance LLM performance in both in-domain and cross-lingual tasks.
MGSM8KInstruct is a large-scale instruction-tuning corpus constructed for multilingual mathematical reasoning with large language models (LLMs). Developed to address the scarcity of diverse, high-quality datasets for multilingual mathematical tasks, it comprises parallel problem–solution pairs spanning ten languages and supports supervised fine-tuning (SFT) of multilingual math reasoning models. Models trained on it show both improved in-domain accuracy and better transfer across languages (Chen et al., 2023).
1. Dataset Composition and Construction
MGSM8KInstruct is derived from the English GSM8K training set (7,473 items), augmented by translating each problem–solution pair into nine additional languages using ChatGPT. The resulting dataset contains approximately 73,600 pairs distributed as follows:
| Language | Pairs | Notable Issues |
|---|---|---|
| English | 7,473 | Reference |
| Swahili | 7,472 | Low-resource: numeral/formula checks |
| Chinese | 7,466 | Non-Latin script, formula integrity |
| Bengali | 6,539 | 12% dropped for formula mismatch |
| German | 7,466 | — |
| Spanish | 7,470 | — |
| French | 7,469 | — |
| Japanese | 7,471 | Non-Latin script |
| Russian | 7,361 | Minor manual calibration |
| Thai | 7,473 | Non-Latin script |
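The per-language counts in the table can be tallied to check the overall size and to see how much of the English reference set survives filtering in each language. The snippet below is only a sanity check on the table's figures; the function name `retention` is introduced here for illustration.

```python
# Pair counts copied from the table above.
PAIR_COUNTS = {
    "English": 7473, "Swahili": 7472, "Chinese": 7466, "Bengali": 6539,
    "German": 7466, "Spanish": 7470, "French": 7469, "Japanese": 7471,
    "Russian": 7361, "Thai": 7473,
}


def retention(counts, reference="English"):
    """Fraction of the reference-language items kept for each language."""
    base = counts[reference]
    return {lang: n / base for lang, n in counts.items()}


total_pairs = sum(PAIR_COUNTS.values())
kept = retention(PAIR_COUNTS)

print(total_pairs)                # 73660
print(round(kept["Bengali"], 3))  # 0.875
```

The total of 73,660 agrees with the "approximately 73,600" figure, and the Bengali retention of about 87.5% matches the roughly 12% drop noted in the table.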
Translation emphasizes the preservation of all Arabic numerals and formula fragments (e.g., “<<12/60=0.2>>”) and involves an automated post-check comparing formulas in translated solutions to the English originals. Instances with repeated formula misalignment are discarded, keeping translation error below 1% (Chen et al., 2023).
Each item is encapsulated in a fixed instruction–response schema:
    {
      "instruction": "translated word problem",
      "output": "LaTeX-style chain-of-thought solution"
    }
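As a minimal sketch of how records in this schema might be built and serialized, the following uses only the two field names given above. The helper names `make_record` and `to_jsonl`, the JSON Lines layout, and the German example text are illustrative assumptions, not details taken from the dataset release.

```python
import json


def make_record(problem: str, solution: str) -> dict:
    """Wrap one problem-solution pair in the instruction/output schema."""
    return {"instruction": problem, "output": solution}


def to_jsonl(records) -> str:
    """Serialize records one per line; ensure_ascii=False keeps non-Latin text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Invented example item (not from the dataset).
example = make_record(
    "Ein Laden verkauft 12 Äpfel pro Stunde. Wie viele Äpfel verkauft er in 5 Stunden?",
    "Der Laden verkauft 12*5=<<12*5=60>>60 Äpfel.",
)
line = to_jsonl([example])
print(line)
```

Keeping `ensure_ascii=False` matters for languages such as Thai, Bengali, or Chinese, where escaped output would be hard to inspect by eye.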
2. Instruction-Generation Protocol and Quality Control
The translation protocol is designed for maximal mathematical fidelity:
- ChatGPT translation prompts require: preservation of numerals, unchanged (but recalculated) formula fragments within “<<…=…>>”, and mimicry of two in-prompt worked examples.
- Each translation undergoes an automated formula check. If five consecutive formula fragments in a language fail to align, that item is dropped.
- Manual spot-checking verifies that translation errors remain below 1%.
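The automated formula check can be sketched as follows: extract every `<<…>>` fragment from the English and translated solutions and require the two sequences to be identical. This is a single-pass illustration under that assumption; the retry-and-drop logic and the exact matching rules of the original pipeline are not modeled, and the function names are hypothetical.

```python
import re

# Matches the content of each <<...>> formula fragment.
FORMULA = re.compile(r"<<(.*?)>>")


def formulas(solution: str) -> list:
    """Return the contents of every <<...>> fragment with whitespace removed."""
    return [re.sub(r"\s+", "", m) for m in FORMULA.findall(solution)]


def formulas_match(english: str, translated: str) -> bool:
    """True when both solutions contain the same formulas in the same order."""
    return formulas(english) == formulas(translated)


def filter_translations(pairs):
    """Keep only (english, translated) solution pairs whose formulas align."""
    return [(en, tr) for en, tr in pairs if formulas_match(en, tr)]


# Invented example sentences for illustration.
en = "She reads for 12/60=<<12/60=0.2>>0.2 hours."
good = "Elle lit pendant 12/60=<<12/60=0.2>>0.2 heure."
bad = "Elle lit pendant 12/6=<<12/6=2>>2 heures."

print(formulas_match(en, good))
print(formulas_match(en, bad))
```

Comparing ordered lists rather than sets means a translation that reorders or omits a reasoning step is also flagged.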
Special attention is given to low-resource and non-Latin-script languages; numerals and embedded formulae are carefully checked for every language to ensure consistency with the English original. In Russian and Bengali, persistent mistranslation of proper names results in aggressive culling to maintain quality (Chen et al., 2023).
3. Integration into Supervised Fine-Tuning and Model Training
MGSM8KInstruct is central to SFT for multilingual mathematical reasoning models such as MathOctopus. Key aspects include:
- Models: LLaMA-2 (7B, 13B), LLaMA-1 (33B), and variants.
- Training regimes:
- Parallel-training: both question and solution in the same language.
- Cross-training: question in English, solution in the foreign language.
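The two regimes differ only in how questions and solutions are paired. A minimal sketch, assuming each problem is available as a mapping from language code to a `(question, solution)` tuple (a hypothetical layout chosen for this example):

```python
def parallel_examples(aligned):
    """Parallel-training: question and solution are in the same language."""
    return [{"instruction": q, "output": s} for q, s in aligned.values()]


def cross_examples(aligned, source_lang="en"):
    """Cross-training: English question paired with each foreign-language solution.

    Assumption: the English-to-English pair is left out, following the
    description literally; the original recipe may handle it differently.
    """
    english_question = aligned[source_lang][0]
    return [
        {"instruction": english_question, "output": solution}
        for lang, (_, solution) in aligned.items()
        if lang != source_lang
    ]


# Invented aligned item in three languages.
aligned_item = {
    "en": ("What is 2 plus 3?", "2+3=<<2+3=5>>5"),
    "es": ("¿Cuánto es 2 más 3?", "2+3=<<2+3=5>>5 (respuesta)"),
    "de": ("Was ist 2 plus 3?", "2+3=<<2+3=5>>5 (Antwort)"),
}

print(len(parallel_examples(aligned_item)))
print(len(cross_examples(aligned_item)))
```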
Experiments adopt prompt templates mirroring instruction–response formats used in the dataset. Standard hyperparameters include a learning rate of , three epochs, maximum length 512 tokens, and batch sizes (8/4/2) for 7B/13B/33B models, respectively.
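The stated settings can be collected in a small configuration object. This is a sketch only: the class name `SftSettings` and the helper `config_for` are hypothetical, the learning rate is left unset because its value is not given in the text, and whether the batch sizes are per-device or global is likewise not specified.

```python
from dataclasses import dataclass
from typing import Optional

# Batch sizes for the 7B/13B/33B models as listed in the text.
BATCH_SIZE_BY_MODEL = {"7B": 8, "13B": 4, "33B": 2}


@dataclass(frozen=True)
class SftSettings:
    model_size: str
    batch_size: int
    num_epochs: int = 3                    # three epochs
    max_length: int = 512                  # maximum sequence length in tokens
    learning_rate: Optional[float] = None  # value not given in the text


def config_for(model_size: str) -> SftSettings:
    """Look up the batch size for a model scale and fill in the shared settings."""
    return SftSettings(model_size=model_size, batch_size=BATCH_SIZE_BY_MODEL[model_size])


print(config_for("13B"))
```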
4. Empirical Impact on Multilingual Mathematical Reasoning
MGSM8KInstruct enables significant improvements over monolingual baselines for mathematical reasoning across a suite of benchmarks. On the in-domain MGSM test set (250 items per language), MathOctopus-7B achieves:
- 32.2% accuracy with parallel-training,
- 40.0% with cross-training, compared to 22.6% by vanilla LLaMA-2.
On the original English GSM8K, MathOctopus-7B reaches 50.8% accuracy with cross-training (up from 42.4%) and 49.3% with parallel-training; similar gains carry over to the 13B and 33B scales. MathOctopus-13B with cross-training reaches 47.6% on MGSM and 56.6% on GSM8K, outperforming reported ChatGPT two-shot baselines (Chen et al., 2023).
Further, the corpus substantially boosts both in-domain (parallel) and out-of-domain (cross) generalization. Even partial multilingual SFT (three-language subset) improves held-out language performance, with effects modulated by typological proximity.
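The 7B figures reported in this section can be restated as percentage-point gains over the respective baselines. The snippet only rearranges numbers quoted above; the dictionary layout and the name `point_gains` are introduced for illustration, and "baseline" refers to whichever reference model the text compares against on each benchmark.

```python
# Accuracies (%) for the 7B scale, as quoted in this section.
REPORTED_7B = {
    "MGSM": {"baseline": 22.6, "parallel": 32.2, "cross": 40.0},
    "GSM8K": {"baseline": 42.4, "parallel": 49.3, "cross": 50.8},
}


def point_gains(scores):
    """Percentage-point gain of each training regime over the baseline."""
    base = scores["baseline"]
    return {
        regime: round(acc - base, 1)
        for regime, acc in scores.items()
        if regime != "baseline"
    }


for benchmark, scores in REPORTED_7B.items():
    print(benchmark, point_gains(scores))
```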
5. Methodological Discoveries and Observations
Empirical investigation with MGSM8KInstruct reveals several principles:
- Multilingual SFT, even when performed on entirely synthetic (machine-translated) data, also improves monolingual performance: MathOctopus-7B rises from 42.4% to 50.8% accuracy on English GSM8K after multilingual SFT.
- Cross-training yields superior out-of-domain generalization, including on benchmarks such as MSVAMP, compared to parallel-training which excels in strictly in-domain, parallel-format evaluation.
- Augmenting SFT with multilingual rejection sampling (xRFT) slightly diversifies provable solution paths but provides only marginal 1–2% lifts in-domain and may reduce cross-lingual robustness.
- The benefit is robust to ablation: multilingual SFT on only a subset of the languages still improves performance on languages outside the training set.
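The rejection-sampling idea behind xRFT can be illustrated generically: sample several candidate solutions, keep those whose final answer is correct, and discard candidates that repeat a reasoning path already kept. The sketch below is an assumption-laden illustration, not the xRFT procedure itself. Identifying a path by its sequence of `<<…>>` formulas and reading the answer as the last number in the text are both choices made for this example.

```python
import re

FORMULA = re.compile(r"<<(.*?)>>")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def reasoning_path(solution: str) -> tuple:
    """Identify a solution by its ordered formulas (assumed notion of 'path')."""
    return tuple(re.sub(r"\s+", "", f) for f in FORMULA.findall(solution))


def last_number(solution: str):
    """Naive answer extractor: the last number appearing in the text, or None."""
    numbers = NUMBER.findall(solution)
    return numbers[-1] if numbers else None


def rejection_sample(candidates, reference_answer, extract_answer=last_number):
    """Keep correct candidates, at most one per distinct reasoning path."""
    kept, seen = [], set()
    for solution in candidates:
        if extract_answer(solution) != reference_answer:
            continue  # reject: wrong final answer
        path = reasoning_path(solution)
        if path in seen:
            continue  # reject: duplicates a path already kept
        seen.add(path)
        kept.append(solution)
    return kept


# Invented sampled solutions for a problem whose answer is 10.
samples = [
    "2+3=<<2+3=5>>5, then 5*2=<<5*2=10>>10",
    "2+3=<<2+3=5>>5, then 5*2=<<5*2=10>>10",
    "3+2=<<3+2=5>>5 and 5+5=<<5+5=10>>10",
    "2+3=<<2+3=6>>6 so the answer is 12",
]
print(len(rejection_sample(samples, "10")))
```

Deduplicating by formula sequence is what makes the retained set more diverse than plain correctness filtering.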
A plausible implication is that aligned, parallel mathematical corpora created via systematic translation strategies offer a powerful route for equipping LLMs for robust multilingual mathematical reasoning beyond monolingual-only approaches.
6. Applications in Preference Optimization Frameworks
MGSM8KInstruct serves as the primary SFT baseline and in-domain training set for preference-based alignment recipes such as MAPO (Multilingual Alignment-as-Preference Optimization). In MAPO, SFT on MGSM8KInstruct precedes preference optimization using translation-based consistency signals: MathOctopus-13B with MAPO-DPO achieves 58.0% on MGSM, a gain of 6.6 points over the SFT baseline (She et al., 2024). The dataset's scale and multilingual alignment also support transfer to non-English mathematical reasoning and facilitate preference pair construction for DPO and PPO training on multilingual math tasks.
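Preference pair construction for DPO can be sketched as follows, assuming each sampled response already carries a scalar consistency score. How MAPO computes that score is not described here, so the scores are treated as given inputs; the function name, the `min_margin` threshold, and the example values are hypothetical. The `prompt`/`chosen`/`rejected` layout is the one commonly used for DPO training data.

```python
from itertools import combinations


def build_preference_pairs(prompt, scored_responses, min_margin=0.1):
    """Turn (response, score) tuples into chosen/rejected pairs.

    Pairs whose scores differ by less than min_margin are skipped, since
    such a preference would be weakly supported.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < min_margin:
            continue
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Invented responses and scores for illustration.
scored = [("solution A", 0.9), ("solution B", 0.5), ("solution C", 0.45)]
preference_pairs = build_preference_pairs("translated word problem", scored)
print(len(preference_pairs))
```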
7. Significance and Future Directions
MGSM8KInstruct validates the construction of synthetic, formula-checked instruction-tuning corpora for scaling multilingual math reasoning in LLMs. The observed monolingual and cross-lingual gains underscore the utility of parallel data in mathematical domains where annotated resources are scarce. The pipeline established for MGSM8KInstruct—translation with embedded formula verification and consistent instruction schema—provides a blueprint for extending instruction tuning to additional domains and language families, supporting the development of scalable, high-fidelity multilingual LLMs for scientific and technical reasoning (Chen et al., 2023).