Eliciting Better Multilingual Structured Reasoning from LLMs through Code (2403.02567v2)

Published 5 Mar 2024 in cs.CL and cs.AI

Abstract: The development of large language models (LLMs) has shown progress on reasoning, though studies have largely considered either English or simple reasoning tasks. To address this, we introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We then propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners. First, at training time, we augment a code dataset with multilingual comments using machine translation while keeping program code as-is. Second, at inference time, we bridge the gap between training and inference by employing a prompt structure that incorporates step-by-step code primitives to derive new facts and find a solution. Our methods show improved multilingual performance on xSTREET, most notably on the scientific commonsense reasoning subtask. Furthermore, the models show no regression on non-reasoning tasks, thus demonstrating our techniques maintain general-purpose abilities.
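The training-time method described in the abstract can be pictured with a small sketch. The code below is a minimal illustration, not the paper's released pipeline: it assumes line-level '#' comments and a machine-translation callable named translate (a hypothetical stand-in for any MT system), and it localizes only the natural-language comments while leaving the program tokens untouched, which is the augmentation the abstract describes.

    # Sketch: translate only the natural-language comments in a code snippet,
    # keeping the program code as-is (the training-time augmentation idea).
    # NOTE: assumes line-level '#' comments; '#' inside string literals is not handled.
    def translate_comments(source: str, translate) -> str:
        out = []
        for line in source.splitlines():
            code, sep, comment = line.partition("#")
            if sep and comment.strip():
                # Replace the comment text with its translation; the code stays untouched.
                out.append(f"{code}{sep} {translate(comment.strip())}")
            else:
                out.append(line)
        return "\n".join(out)

    # Toy usage with a placeholder "translator"; a real pipeline would call an MT model.
    snippet = "total = sum(xs)  # add up all the values\nprint(total)"
    print(translate_comments(snippet, lambda s: "[es] " + s))

A real pipeline would swap the placeholder lambda for an actual MT model and would also need to handle block comments and docstrings; the key property preserved here is that only the comments change across languages.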

Authors (5)
  1. Bryan Li (17 papers)
  2. Tamer Alkhouli (7 papers)
  3. Daniele Bonadiman (10 papers)
  4. Nikolaos Pappas (188 papers)
  5. Saab Mansour (32 papers)
Citations (2)