Dual Instruction Tuning with Large Language Models for Mathematical Reasoning (2403.18295v1)

Published 27 Mar 2024 in cs.CL

Abstract: Recent advancements highlight the success of instruction tuning with LLMs utilizing Chain-of-Thought (CoT) data for mathematical reasoning tasks. Even with fine-tuned LLMs, challenges persist, such as incorrect, missing, and redundant steps in CoT generation that lead to inaccurate answer predictions. To alleviate this problem, we propose a dual instruction tuning strategy to meticulously model mathematical reasoning from both forward and reverse directions. This involves introducing the Intermediate Reasoning State Prediction task (forward reasoning) and the Instruction Reconstruction task (reverse reasoning) to enhance the LLMs' understanding and execution of instructions. Training instances for these tasks are constructed from existing mathematical instruction tuning datasets. Subsequently, LLMs undergo multi-task fine-tuning using both the existing mathematical instructions and the newly created data. Comprehensive experiments validate the effectiveness and domain generalization of the dual instruction tuning strategy across various mathematical reasoning tasks.
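
The abstract does not give the concrete prompt formats or data-construction code, so the sketch below shows only one plausible way the two auxiliary tasks could be instantiated from existing (question, CoT steps, answer) examples and mixed with the original instructions for multi-task fine-tuning. All class, function, and field names (MathInstance, make_forward_instance, make_reverse_instance, build_multitask_data) are hypothetical, not the authors' implementation.

```python
# Illustrative sketch (assumed formats, not the paper's actual code) of how
# dual-instruction training instances might be derived from CoT data.
from dataclasses import dataclass
from typing import List
import random


@dataclass
class MathInstance:
    question: str        # original math word problem (the instruction)
    steps: List[str]     # chain-of-thought reasoning steps
    answer: str          # final answer


def make_forward_instance(ex: MathInstance, k: int) -> dict:
    """Intermediate Reasoning State Prediction (forward reasoning):
    given the question and the first k steps, predict the next step."""
    return {
        "input": ex.question + "\n" + "\n".join(ex.steps[:k]),
        "target": ex.steps[k],
    }


def make_reverse_instance(ex: MathInstance) -> dict:
    """Instruction Reconstruction (reverse reasoning):
    given the reasoning steps and answer, reconstruct the instruction."""
    return {
        "input": "\n".join(ex.steps) + "\nAnswer: " + ex.answer,
        "target": ex.question,
    }


def build_multitask_data(data: List[MathInstance]) -> List[dict]:
    """Mix the original CoT instances with the two auxiliary tasks,
    yielding a single pool for multi-task fine-tuning."""
    mixed = []
    for ex in data:
        # original instruction-tuning instance: question -> full CoT + answer
        mixed.append({
            "input": ex.question,
            "target": "\n".join(ex.steps) + "\nAnswer: " + ex.answer,
        })
        if len(ex.steps) > 1:
            # sample a cut point so the model predicts an intermediate state
            mixed.append(make_forward_instance(ex, random.randrange(1, len(ex.steps))))
        mixed.append(make_reverse_instance(ex))
    random.shuffle(mixed)
    return mixed
```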

Authors (2)
  1. Yongwei Zhou (8 papers)
  2. Tiejun Zhao (70 papers)
Citations (3)