
When Do Program-of-Thoughts Work for Reasoning? (2308.15452v6)

Published 29 Aug 2023 in cs.CL, cs.AI, cs.LG, and cs.SE

Abstract: In the realm of embodied artificial intelligence, the reasoning capabilities of LLMs play a pivotal role. Although there are effective methods like program-of-thought prompting for LLMs, which use programming languages to tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose the complexity-impacted reasoning score (CIRS), which combines structural and logical attributes to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find that not all code data of varying complexity can be learned or understood by LLMs. An optimal level of complexity is critical to the improvement of reasoning abilities by program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and code data filtering for code generation tasks. Extensive results demonstrate the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.

An Analysis of the "When Do Program-of-Thoughts Work for Reasoning?" Paper

This paper investigates the reasoning capabilities of LLMs in the context of embodied artificial intelligence, specifically through the lens of program-of-thought prompting. The authors address an under-explored question: how code data affects the reasoning capabilities of LLMs. By introducing the Complexity-Impacted Reasoning Score (CIRS), the paper provides a metric that correlates code with reasoning ability by integrating structural and logical attributes of code.

Key Contributions

The primary contribution of this work is the Complexity-Impacted Reasoning Score (CIRS). This metric evaluates code-based reasoning steps by using abstract syntax trees (ASTs) to encode structural information and Halstead's and McCabe's complexity measures to assess logical complexity. Integrating these elements gives a nuanced view of which levels of code complexity are most beneficial for reasoning tasks in LLMs; a minimal sketch of such a score follows.
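
The Python sketch below illustrates the flavor of such a metric; it is not the paper's implementation. The function names (structural_score, cyclomatic_complexity, halstead_difficulty, cirs_like_score), the set of branch nodes used for the cyclomatic count, the simplified Halstead difficulty, and the equal weighting of the structural and logical terms are all assumptions made for illustration.

```python
import ast

def structural_score(tree: ast.AST) -> int:
    """Size of the abstract syntax tree as a rough structural measure."""
    return sum(1 for _ in ast.walk(tree))

def cyclomatic_complexity(tree: ast.AST) -> int:
    """McCabe-style count: 1 plus the number of branching constructs."""
    branch_nodes = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)
    return 1 + sum(isinstance(node, branch_nodes) for node in ast.walk(tree))

def halstead_difficulty(tree: ast.AST) -> float:
    """Simplified Halstead difficulty: (n1 / 2) * (N2 / n2)."""
    operators, operands = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.operator, ast.cmpop, ast.boolop, ast.unaryop)):
            operators.append(type(node).__name__)
        elif isinstance(node, ast.Name):
            operands.append(node.id)
        elif isinstance(node, ast.Constant):
            operands.append(repr(node.value))
    n1 = len(set(operators))          # distinct operators
    n2 = max(len(set(operands)), 1)   # distinct operands (avoid division by zero)
    N2 = len(operands)                # total operand occurrences
    return (n1 / 2) * (N2 / n2)

def cirs_like_score(code: str) -> float:
    """Combine structural and logical complexity (equal weighting assumed)."""
    tree = ast.parse(code)
    logical = cyclomatic_complexity(tree) * halstead_difficulty(tree)
    return 0.5 * structural_score(tree) + 0.5 * logical

print(cirs_like_score("total = sum(x * 2 for x in range(10))\nprint(total)"))
```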

Additionally, the authors propose an auto-synthesizing and stratifying algorithm that uses this score to generate instructions for mathematical reasoning and to filter code data for code generation tasks. This approach makes it possible to study the impact of code complexity on LLM performance systematically; a stratification step along these lines is sketched after this paragraph.
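
As a rough illustration of the stratification idea, the sketch below buckets already-scored code samples into terciles and keeps the middle band. The stratify_by_score helper, the tercile cutoffs, the made-up scores, and the choice to retain only the medium stratum are assumptions for illustration, not the paper's algorithm.

```python
import statistics

def stratify_by_score(samples: list[tuple[str, float]]) -> dict[str, list[str]]:
    """Split (code, score) pairs into low/medium/high complexity strata by terciles."""
    scores = [score for _, score in samples]
    lo, hi = statistics.quantiles(scores, n=3)  # the two tercile boundaries
    strata = {"low": [], "medium": [], "high": []}
    for code, score in samples:
        if score <= lo:
            strata["low"].append(code)
        elif score >= hi:
            strata["high"].append(code)
        else:
            strata["medium"].append(code)
    return strata

# Keep the medium band, assuming mid-range complexity is the "optimal level"
# the paper identifies (scores here are invented for the example).
pairs = [
    ("a = 1", 2.0),
    ("for i in range(3): a += i", 6.5),
    ("def f(): return 42", 4.0),
    ("while True: break", 5.0),
    ("import os", 1.5),
    ("x = [i * i for i in range(9)]", 7.0),
]
print(stratify_by_score(pairs)["medium"])
```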

Empirical Analysis

The paper presents a rigorous empirical analysis of how code data of different complexity affects reasoning abilities. The findings reveal that LLMs do not learn uniformly from code of all complexity levels: code with an optimal level of complexity, neither too simple nor too complex, yields the most effective enhancement of LLM reasoning capabilities.

The empirical results also suggest that reasoning ability improves as model parameters grow, consistent with established scaling trends. However, current LLM architectures still struggle when reasoning over complex symbolic knowledge, pointing to an area for future work in model design.

Implications and Future Directions

The implications of this paper are twofold. Practically, it offers a methodology to enhance the reasoning skills of LLMs through well-curated code data, enabling more effective program-of-thought prompting methods. Theoretically, it provides insights into the relationship between code complexity and reasoning ability, suggesting that future developments in AI might benefit from exploring architectures that natively understand complex, structured data more effectively.

This research opens avenues for further exploration in the design and application of LLMs, particularly in environments requiring intricate reasoning capabilities. Future work could extend these findings to other domains such as commonsense reasoning, integrating advanced model architectures or leveraging external tools to support complex reasoning tasks.

In conclusion, this paper contributes a detailed framework for understanding and optimizing the reasoning capabilities of LLMs through program-of-thought prompting. Its findings offer valuable insights into the importance of data complexity in model training, guiding future research towards enhanced reasoning methodologies in artificial intelligence.

References (53)
  1. Falcon-40B: an open large language model with state-of-the-art performance.
  2. PaLM 2 Technical Report. arXiv:2305.10403.
  3. CodeKGC: Code Language Model for Generative Knowledge Graph Construction. CoRR, abs/2304.09048.
  4. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  5. Chaudhary, S. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
  6. Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.
  7. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. CoRR, abs/2211.12588.
  8. KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. In Laforest, F.; Troncy, R.; Simperl, E.; Agarwal, D.; Gionis, A.; Herman, I.; and Médini, L., eds., WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, 2778–2788. ACM.
  9. Tele-Knowledge Pre-training for Fault Analysis. arXiv:2210.11298.
  10. Binding Language Models in Symbolic Languages. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  11. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  12. Training Verifiers to Solve Math Word Problems. CoRR, abs/2110.14168.
  13. Conklin, J. 2005. A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives complete edition.
  14. Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance. CoRR, abs/2305.17306.
  15. Specializing Smaller Language Models towards Multi-Step Reasoning. CoRR, abs/2301.12726.
  16. Complexity-Based Prompting for Multi-step Reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  17. PAL: Program-aided Language Models. CoRR, abs/2211.10435.
  18. Large Language Models Are Not Abstract Reasoners. CoRR, abs/2305.19555.
  19. Haladyna, T. M. 1997. Writing Test Items to Evaluate Higher Order Thinking. ERIC.
  20. Halstead, M. H. 1977. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc.
  21. Measuring Mathematical Problem Solving With the MATH Dataset. In Vanschoren, J.; and Yeung, S., eds., Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
  22. Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models. CoRR, abs/2305.18507.
  23. Towards Reasoning in Large Language Models: A Survey. CoRR, abs/2212.10403.
  24. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models. CoRR, abs/2307.05973.
  25. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Liu, K.; Kulic, D.; and Ichnowski, J., eds., Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, 1769–1782. PMLR.
  26. MathPrompter: Mathematical Reasoning using Large Language Models. CoRR, abs/2303.05398.
  27. CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. CoRR, abs/2305.05711.
  28. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Barzilay, R.; and Kan, M., eds., Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 158–167. Association for Computational Linguistics.
  29. The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code. CoRR, abs/2305.19213.
  30. Language Models of Code are Few-Shot Commonsense Learners. CoRR, abs/2210.07128.
  31. McCabe, T. J. 1976. A Complexity Measure. IEEE Trans. Software Eng., 2(4): 308–320.
  32. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 975–984. Association for Computational Linguistics.
  33. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. CoRR, abs/2306.02707.
  34. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
  35. Are NLP Models really able to Solve Simple Math Word Problems? In Toutanova, K.; Rumshisky, A.; Zettlemoyer, L.; Hakkani-Tür, D.; Beltagy, I.; Bethard, S.; Cotterell, R.; Chakraborty, T.; and Zhou, Y., eds., Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, 2080–2094. Association for Computational Linguistics.
  36. Why think step-by-step? Reasoning emerges from the locality of experience. CoRR, abs/2304.03843.
  37. Reasoning with Language Model Prompting: A Survey. In ACL. The Association for Computational Linguistics.
  38. Solving General Arithmetic Word Problems. In Màrquez, L.; Callison-Burch, C.; Su, J.; Pighin, D.; and Marton, Y., eds., Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 1743–1752. The Association for Computational Linguistics.
  39. The Right Tool for the Job: Matching Model and Instance Complexities. arXiv:2004.07453.
  40. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. CoRR, abs/2210.09261.
  41. LLaMA: Open and Efficient Foundation Language Models. CoRR, abs/2302.13971.
  42. Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions? CoRR, abs/2305.12096.
  43. Voyager: An Open-Ended Embodied Agent with Large Language Models. CoRR, abs/2305.16291.
  44. Making Large Language Models Better Reasoners with Alignment. arXiv:2309.02144.
  45. Code4Struct: Code Generation for Few-Shot Structured Prediction from Natural Language. CoRR, abs/2210.12810.
  46. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
  47. Measuring Association Between Labels and Free-Text Rationales. In Moens, M.; Huang, X.; Specia, L.; and Yih, S. W., eds., Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, 10266–10284. Association for Computational Linguistics.
  48. Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding. CoRR, abs/2305.00633.
  49. LogicSolver: Towards Interpretable Math Word Problem Solving with Logical Prompt-enhanced Learning. In Goldberg, Y.; Kozareva, Z.; and Zhang, Y., eds., Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 1–13. Association for Computational Linguistics.
  50. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. arXiv:2308.01825.
  51. The Impact of Symbolic Representations on In-context Learning for Few-shot Reasoning. CoRR, abs/2212.08686.
  52. A Survey of Large Language Models. CoRR, abs/2303.18223.
  53. PaD: Program-aided Distillation Specializes Large Models in Reasoning. CoRR, abs/2305.13888.
Authors (6)
  1. Zhen Bi (67 papers)
  2. Ningyu Zhang (148 papers)
  3. Yinuo Jiang (3 papers)
  4. Shumin Deng (65 papers)
  5. Guozhou Zheng (6 papers)
  6. Huajun Chen (198 papers)
Citations (15)