How Do Humans Write Code? Large Models Do It the Same Way Too (2402.15729v3)

Published 24 Feb 2024 in cs.AI, cs.CL, and cs.PL

Abstract: Program-of-Thought (PoT) has replaced natural-language Chain-of-Thought (CoT) as the most popular method for LLM mathematical reasoning tasks, using external tool calls to circumvent computational errors. However, our evaluation of the GPT-4 and Llama series reveals that PoT introduces more reasoning errors, such as incorrect formulas or flawed logic, than CoT. To address this issue, we propose Human-Think Language (HTL), which leverages a suite of strategies to integrate PoT and CoT: (1) a new generation paradigm that uses full CoT reasoning to control code generation; (2) Focus Attention, which directs model attention to the CoT reasoning during PoT to generate more logical code; (3) reinforcement learning that uses the accuracy of both CoT and PoT responses as rewards, preventing repetitive reasoning steps when LLMs solve difficult math problems. Our method achieves an average improvement of 6.5% on the Llama-Base model and 4.3% on the Mistral-Base model across eight mathematical calculation datasets. It also shows significant effectiveness on five out-of-domain datasets by controlling the model's information flow, exhibiting strong transferability. Additionally, HTL shows its most significant improvement on a non-mathematical natural language inference task, contributing to a unified reasoning-task framework.
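The Focus Attention strategy summarized above can be illustrated as an attention-masking rule. The sketch below is a minimal, assumption-laden rendering: the function name `focus_attention_mask`, the `local_window` parameter, and the treatment of everything after the CoT span as code (PoT) tokens are illustrative choices, not the paper's actual implementation.

```python
def focus_attention_mask(seq_len, cot_span, local_window=4):
    """Boolean attention mask: mask[i][j] is True if position i may attend to j.

    Positions at or after cot_span[1] are treated as PoT (code) tokens and are
    restricted to the CoT reasoning span plus a small local window of nearby
    code context. CoT positions keep ordinary causal attention.
    """
    # causal base: each position attends only to itself and earlier positions
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    start, end = cot_span
    for i in range(end, seq_len):                  # PoT (code-generation) positions
        for j in range(seq_len):
            in_cot = start <= j < end              # the CoT reasoning tokens
            in_local = i - local_window <= j <= i  # nearby code context
            mask[i][j] = mask[i][j] and (in_cot or in_local)
    return mask

mask = focus_attention_mask(10, cot_span=(0, 5), local_window=2)
# the PoT token at position 8 attends to CoT tokens 0..4 and local tokens 6..8
assert [j for j in range(10) if mask[8][j]] == [0, 1, 2, 3, 4, 6, 7, 8]
```

In a real model this mask would be applied inside the attention layers (e.g. as an additive `-inf` bias before the softmax); the point of the sketch is only the information-flow constraint: code tokens see the reasoning, not arbitrary earlier text.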

Authors (5)
  1. Long Li (113 papers)
  2. Xuzheng He (6 papers)
  3. Haozhe Wang (64 papers)
  4. Linlin Wang (35 papers)
  5. Liang He (202 papers)
Citations (1)