Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs (2401.10065v3)

Published 18 Jan 2024 in cs.CL

Abstract: Reasoning is a fundamental component of language understanding. Recent prompting techniques, such as chain of thought, have consistently improved LLMs' performance on various reasoning tasks. Nevertheless, there is still little understanding of what triggers reasoning abilities in LLMs in the inference stage. In this paper, we introduce code prompting, a chain of prompts that transforms a natural language problem into code and directly prompts the LLM using the generated code without resorting to external code execution. We hypothesize that code prompts can elicit certain reasoning capabilities of LLMs trained on text and code and utilize the proposed method to improve conditional reasoning, the ability to infer different conclusions depending on the fulfillment of certain conditions. We find that code prompting exhibits a high-performance boost for multiple LLMs (up to 22.52 percentage points on GPT 3.5, 7.75 on Mixtral, and 16.78 on Mistral) across multiple conditional reasoning datasets. We then conduct comprehensive experiments to understand how code prompts trigger reasoning abilities and which capabilities are elicited in the underlying models. Our analysis of GPT 3.5 reveals that the code formatting of the input problem is essential for performance improvement. Furthermore, code prompts improve sample efficiency of in-context learning and facilitate state tracking of variables or entities.

Overview of Code Prompting

The paper investigates a novel approach to enhancing the conditional reasoning abilities of text+code LLMs such as GPT 3.5. In a process termed 'code prompting,' a natural language task is first transformed into code, and the generated code is then used to prompt the LLM directly, without ever being executed. The method leverages the model's ability to understand both textual and code inputs, aiming for performance improvements on tasks that require conditional reasoning.
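
To make the pipeline concrete, the sketch below shows one plausible way to wire the two prompts together. It is a minimal illustration, not the authors' implementation: `complete` is a hypothetical stand-in for whatever text+code LLM client is used, and the prompt wording is invented for this summary rather than taken from the paper.

```python
# Minimal sketch of the two-step code-prompting pipeline (illustrative only).

def complete(prompt: str) -> str:
    """Placeholder for any text+code LLM completion call (hypothetical)."""
    raise NotImplementedError("plug in an actual LLM client here")

def code_prompting(document: str, question: str) -> str:
    # Step 1: ask the model to rewrite the natural-language problem as code,
    # keeping every original sentence as a comment next to the logic it encodes.
    transform_prompt = (
        "Rewrite the following problem as Python code. Keep each original "
        "sentence as a comment above the code that expresses it.\n\n"
        f"Document:\n{document}\n\nQuestion: {question}\n"
    )
    code_version = complete(transform_prompt)

    # Step 2: prompt the model with the generated code (never executed)
    # and ask it to reason over that code to answer the question.
    answer_prompt = (
        f"{code_version}\n\n"
        f"# Reason over the code above and answer the question.\n"
        f"# Question: {question}\n# Answer:"
    )
    return complete(answer_prompt)
```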

Experimental Findings

The experiments show a clear performance improvement when code prompts are used in place of traditional text prompts on reasoning tasks, quantified as a gain of between 2.6 and 7.7 points across the ConditionalQA and BoardgameQA datasets. Notably, code prompts do more than translate text into code: they retain the original natural language as comments within the generated code, which proves crucial for understanding the problem.
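
As a constructed illustration (not an example from the paper), a code prompt for a ConditionalQA-style question might look like the following: each source sentence survives as a comment, and the conditions it states become explicit variables and if-statements that the LLM reads but never runs.

```python
# Constructed example of a generated code prompt (illustrative only).

# "You can claim the childcare grant if you are a full-time student."
is_full_time_student = True
# "You must also have at least one child under 15."
has_child_under_15 = None  # unknown: the scenario does not say

# Question: "Can I claim the childcare grant?"
if is_full_time_student and has_child_under_15:
    answer = "yes"
elif has_child_under_15 is None:
    answer = "yes, conditional on having a child under 15"
else:
    answer = "no"
```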

Investigation into Code Prompt Efficacy

The transformation requires that the generated code not only take the structural form of code but also remain semantically close to the original problem text; it is this alignment between the logic expressed in the code and the semantics of the text that unlocks the improved reasoning. A further finding concerns sample efficiency: code prompts require fewer in-context demonstrations to guide the LLM toward correct reasoning, which makes them particularly advantageous in resource-constrained scenarios.
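
The sample-efficiency point can be pictured with a small helper that assembles a k-shot prompt from solved demonstrations; the function and its demonstration format are assumptions made for this sketch, not part of the paper's code. The reported finding is simply that k can be smaller when the demonstrations are code rather than text.

```python
# Hypothetical helper for building a k-shot code prompt (illustrative only).

def build_few_shot_prompt(demonstrations, new_problem_code, k=1):
    """Concatenate k solved code demonstrations ahead of the new problem."""
    shots = "\n\n".join(
        f"{demo['code']}\n# Answer: {demo['answer']}"
        for demo in demonstrations[:k]
    )
    return f"{shots}\n\n{new_problem_code}\n# Answer:"
```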

Implications and Future Potential

The technique also improves the LLM's ability to track the state of variables and key entities throughout a reasoning task, which suggests an intrinsic advantage for problems involving stateful or conditional information. Looking ahead, the researchers intend to investigate the application of the approach to other reasoning types and other models, potentially broadening its utility across a wider range of LLM applications.
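
One intuition for the state-tracking benefit, sketched here as a constructed example rather than one from the paper: when every rule becomes an assignment, the current state of each entity is spelled out in the prompt instead of being implicit in prose.

```python
# Constructed illustration of explicit state tracking in a code prompt.

# "Alice starts in the kitchen."
alice_location = "kitchen"
# "Alice walks to the garden."
alice_location = "garden"
# "If Alice is in the garden, she picks up the key."
alice_has_key = alice_location == "garden"

# Question: "Does Alice have the key?"
# The value of alice_has_key is explicit at this point in the prompt, so the
# model can read the state off directly instead of inferring it from prose.
```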

The method's main limitation is the need for an intermediate transformation step, which increases the overall processing cost. Because the transformation itself is simple, it holds promise for further optimization, such as outsourcing the task to a smaller, specialized model. Despite this overhead, the research presents a compelling case for code prompting as a way to elevate the reasoning faculties of LLMs in conditional reasoning scenarios.

References (29)
  1. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. Transactions on Machine Learning Research.
  2. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  3. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  4. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  5. PAL: Program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
  6. Towards leveraging LLMs for conditional QA. arXiv preprint arXiv:2312.01143.
  7. Mistral 7B. arXiv preprint arXiv:2310.06825.
  8. BoardgameQA: A dataset for natural language reasoning with contradictory information. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, pages 1–23.
  9. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 284–294, Melbourne, Australia. Association for Computational Linguistics.
  10. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
  11. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.
  12. LogiQA 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962.
  13. Rainier: Reinforced knowledge introspector for commonsense question answering. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8938–8958, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  14. Generated knowledge prompting for commonsense reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3154–3169, Dublin, Ireland. Association for Computational Linguistics.
  15. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 3622–3628. International Joint Conferences on Artificial Intelligence Organization. Main track.
  16. The magic of IF: Investigating causal reasoning abilities in large language models of code. In Findings of the Association for Computational Linguistics: ACL 2023, pages 9009–9022, Toronto, Canada. Association for Computational Linguistics.
  17. Faithful chain-of-thought reasoning. arXiv preprint arXiv:2301.13379.
  18. Language models of code are few-shot commonsense learners. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1384–1403, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  19. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
  20. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  21. Fact-checking complex claims with program-guided reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6981–7004, Toronto, Canada. Association for Computational Linguistics.
  22. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
  23. Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2087–2097, Brussels, Belgium. Association for Computational Linguistics.
  24. CLUTRR: A diagnostic benchmark for inductive reasoning from text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4506–4515, Hong Kong, China. Association for Computational Linguistics.
  25. ConditionalQA: A complex reading comprehension dataset with conditional answers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3627–3637, Dublin, Ireland. Association for Computational Linguistics.
  26. Do long-range language models actually use long-range context? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 807–822, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  27. Elaboration-generating commonsense question answering at scale. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1619–1635, Toronto, Canada. Association for Computational Linguistics.
  28. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  29. SatLM: Satisfiability-aided language models using declarative prompting. In Proceedings of NeurIPS, pages 1–33.
Authors (5)
  1. Haritz Puerto (11 papers)
  2. Martin Tutek (10 papers)
  3. Somak Aditya (25 papers)
  4. Xiaodan Zhu (94 papers)
  5. Iryna Gurevych (264 papers)
Citations (7)