
Language Models are Multilingual Chain-of-Thought Reasoners (2210.03057v1)

Published 6 Oct 2022 in cs.CL, cs.AI, and cs.LG

Abstract: We evaluate the reasoning abilities of LLMs in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of LLMs extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.

A Detailed Evaluation of Multilingual Chain-of-Thought Reasoning with LLMs

The paper "Language Models are Multilingual Chain-of-Thought Reasoners" presents a comprehensive study of the cross-linguistic reasoning capabilities of advanced LLMs such as GPT-3 and PaLM. By introducing the Multilingual Grade School Math (MGSM) benchmark, the authors create a novel evaluation framework for gauging the arithmetic reasoning of these models in a multilingual context. The work addresses both model performance and transferability, testing not only English-language abilities but extending the evaluation to ten typologically diverse languages, including underrepresented ones like Bengali and Swahili.

The core contribution is the MGSM benchmark, an extension of GSM8K consisting of 250 grade-school math problems manually translated into ten languages. This carefully curated dataset is specifically designed to test multistep reasoning in a multilingual context, offering a unique contribution to the field. The experiments compare several prompting setups across languages and model scales: Direct (answer only), native chain-of-thought (CoT) in the problem's language, English CoT, and Translate-EN, which translates the problem into English before solving.
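To make the differences between these prompting setups concrete, the sketch below assembles minimal prompts for the Direct, native-CoT, and English-CoT variants. The German example problem and the template wording are illustrative assumptions, not the paper's actual few-shot exemplars, which use multiple worked examples per prompt.

```python
# Illustrative sketch of three MGSM prompting setups.
# The example problem and template phrasing are assumptions for
# demonstration; the paper's prompts contain few-shot exemplars.

problem_de = ("Roger hat 5 Tennisbälle. Er kauft 2 Dosen mit je "
              "3 Bällen. Wie viele Bälle hat er jetzt?")

def direct_prompt(problem: str) -> str:
    # Direct: ask for the answer with no intermediate reasoning.
    return f"{problem}\nAntwort:"

def native_cot_prompt(problem: str) -> str:
    # Native CoT: elicit step-by-step reasoning in the problem's language.
    return f"{problem}\nDenken wir Schritt für Schritt."

def en_cot_prompt(problem: str) -> str:
    # English CoT: the problem stays in its source language, but the
    # reasoning is elicited in English.
    return f"{problem}\nLet's think step by step."

for build in (direct_prompt, native_cot_prompt, en_cot_prompt):
    print(build(problem_de))
    print("---")
```

Each function returns a complete prompt string; in the paper's experiments, the analogous prompts are sent to the model and the final numeric answer is extracted from the completion.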

Key Findings and Numerical Results

The paper outlines several pivotal findings. First, systematic experiments show that multistep reasoning emerges with increasing model scale, underscoring the importance of computational resources for developing reasoning capabilities in LLMs. PaLM-540B, with a parameter count substantially higher than that of GPT-3, demonstrates superior performance across all languages, achieving an average problem-solving rate of 55% with the Translate-EN strategy.
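The Translate-EN strategy can be understood as a two-stage pipeline: machine-translate the problem into English, then solve it with English chain-of-thought prompting. The sketch below illustrates that structure; `translate_to_english` and `solve_with_model` are stand-in stubs, not the MT system or model API the paper actually used.

```python
# Hedged sketch of the Translate-EN pipeline. Both helper functions
# are stubs standing in for a machine-translation system and an LLM
# call; neither is a real API from the paper.

def translate_to_english(problem: str) -> str:
    # Stub: a real pipeline would call an MT system here.
    lookup = {"Wie viel ist 5 plus 6?": "What is 5 plus 6?"}
    return lookup.get(problem, problem)

def solve_with_model(prompt: str) -> str:
    # Stub: a real pipeline would send the prompt to the model.
    return "Let's think step by step. 5 + 6 = 11. The answer is 11."

def translate_en(problem: str) -> str:
    # Stage 1: translate the problem into English.
    english = translate_to_english(problem)
    # Stage 2: solve with English chain-of-thought prompting.
    prompt = f"{english}\nLet's think step by step."
    return solve_with_model(prompt)

print(translate_en("Wie viel ist 5 plus 6?"))
```

The design point is that translation and reasoning are decoupled: any language can be routed through the same English reasoning path, which is why this setup transfers well to lower-resource languages.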

Interestingly, the language frequency within pre-training datasets was not strongly correlated with performance, contradicting some previous literature that suggested a direct relationship. For instance, PaLM's accuracy on underrepresented languages like Bengali and Swahili was comparable to that on higher-resourced languages, highlighting the potential of large models for equitable multilingual implementation.

Further insights emerge from the use of intermediate reasoning steps (CoT). English CoT consistently yields better or near-par results compared to CoT in the native problem language, suggesting English's role as a robust intermediary language for cross-linguistic transfer in reasoning tasks.

The extrapolation of these findings to other tasks, such as commonsense reasoning with XCOPA and the word-in-context semantic judgment task XL-WiC, validates the efficacy of multilingual CoT prompting. PaLM-540B sets a new state of the art with 89.9% accuracy on XCOPA, significantly outperforming previous best models trained with extensive supervised datasets.

Implications for Future AI Developments

These results imply substantial theoretical and practical implications. Theoretically, they suggest a promising direction for multilingual NLP research by leveraging advanced LLMs' scaling capabilities and cross-linguistic transfer skills. Practically, the performance gains in low-resource language settings with relatively minimal language exposure indicate an avenue for more inclusive AI systems, providing sophisticated language understanding and reasoning capabilities across a broad language spectrum.

This research opens several directions for future exploration, such as further scaling of models, optimizing CoT strategies, and applying cross-linguistic reasoning in real-world applications. The continued refinement and expansion of benchmarks like MGSM will be crucial for connecting linguistic diversity with progress in machine reasoning.

In conclusion, the authors present a methodical and detailed examination of multilingual reasoning, advancing both academic understanding and practical approaches in AI LLM research. Their work informs future efforts in creating more universal and accessible AI language systems, prioritizing both computational innovation and linguistic inclusivity.

Authors (12)
  1. Freda Shi (16 papers)
  2. Mirac Suzgun (23 papers)
  3. Markus Freitag (49 papers)
  4. Xuezhi Wang (64 papers)
  5. Suraj Srivats (1 paper)
  6. Soroush Vosoughi (90 papers)
  7. Hyung Won Chung (30 papers)
  8. Yi Tay (94 papers)
  9. Sebastian Ruder (93 papers)
  10. Denny Zhou (65 papers)
  11. Dipanjan Das (42 papers)
  12. Jason Wei (49 papers)
Citations (250)