Chain of Code: Reasoning with a Language Model-Augmented Code Emulator (2312.04474v4)

Published 7 Dec 2023 in cs.CL, cs.AI, cs.LG, and cs.RO

Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".

Introduction

Large language models (LMs) have demonstrated a remarkable ability to solve complex reasoning problems across domains such as mathematics and science. Approaches like Chain of Thought (CoT), which break a complex question into a sequence of intermediate reasoning steps, have advanced LM capabilities on semantic reasoning tasks. However, LMs often struggle with questions that require a mix of semantic understanding and precise numeric or symbolic reasoning. This gap has been partially addressed by prompting LMs, particularly those trained on code, to write and execute programs. While effective for arithmetic tasks, this approach remains challenging for broader semantic tasks that are difficult to express in executable code, such as detecting sarcasm. This paper introduces "Chain of Code" (CoC), a method designed to enhance LM code-driven reasoning by combining the structured approach of code with the ability to "think in code": generating pseudocode and either executing code snippets or simulating their execution.

Chain of Code Methodology

Chain of Code follows a two-step process. First, the LM is prompted to generate reasoning steps in the form of code, which may mix executable code, pseudocode, and natural language. Second, the generated code is executed: a code interpreter runs whatever it can, and an LM-augmented code emulator (an "LMulator") steps in when a line is not executable. The LMulator simulates the result of a code snippet that a standard interpreter cannot run directly, and this simulation can itself incorporate chain-of-thought reasoning to determine outputs, effectively letting the LM emulate the interpreter's behavior where needed.
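For concreteness, here is a minimal sketch of the kind of program an LM might generate under CoC for the sarcasm-counting example from the abstract. The surrounding code and the example essay string are illustrative assumptions, not taken from the paper; detect_sarcasm is deliberately left undefined, since under CoC that call is exactly what gets handed off to the LMulator.

```python
# Sketch of a CoC-style program (illustrative, not from the paper).
# Plain Python handles the counting; the semantic sub-task is expressed as a
# call to an undefined helper that the LMulator simulates at run time.
essay = "Oh great, another meeting. I just love spending my afternoons this way."
sarcasm_count = 0
for sentence in essay.split("."):
    if sentence.strip() == "":
        continue
    # No executable definition exists for detect_sarcasm; running this line
    # raises a NameError, which CoC catches and hands to the LM to emulate.
    if detect_sarcasm(sentence):
        sarcasm_count += 1
answer = sarcasm_count
```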

Implementing Chain of Code

Implementing CoC involves a simple but effective system built on Python's try and except blocks, along with a program state that manages execution flow. The code is evaluated line by line. If a line is executable, it runs and updates the program state; if not, the LM predicts its effect from the program context, including the previously run code and the state history, and generates the next program state. Through this mechanism, the LM and code execution become tightly integrated, enabling complex reasoning that combines the computational precision of code with the nuanced understanding of LMs.
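The loop described above can be sketched as follows. This is an assumption-based illustration rather than the authors' implementation: query_lm is a hypothetical stand-in for whatever LM call predicts the next program state, and for simplicity each statement is assumed to fit on a single line (real control flow spanning multiple lines would need to be grouped before execution).

```python
# Minimal sketch of the CoC execute-or-emulate loop (illustrative assumptions,
# not the reference implementation).

def query_lm(code_line, history, state):
    """Hypothetical LMulator call: given the line that failed, the code run so
    far, and the current program state, return the predicted updated state."""
    raise NotImplementedError("plug an LM backend in here")

def run_chain_of_code(program):
    state = {}      # shared program state: variable name -> value
    history = []    # lines executed or emulated so far
    for line in program.splitlines():
        try:
            # Executable lines run in the real Python interpreter and update
            # the program state directly.
            exec(line, state)
        except Exception:
            # Pseudocode or undefined helpers fail here; the LMulator predicts
            # the program state that executing this line would have produced.
            state = query_lm(line, "\n".join(history), state)
        history.append(line)
    return state
```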

Experimental Evaluation

Chain of Code has been evaluated on a variety of benchmarks, including BIG-Bench Hard, where it outperforms other prevalent reasoning methods and sets a new state of the art with an 84% success rate, 12% above Chain of Thought. The approach scales well across both large and small models, suggesting its benefits are not tied to model size. CoC also serves as a general-purpose reasoner, handling cross-task prompts whose in-context examples differ from the problem at hand, which points toward versatile reasoning applications. Comparisons with instruction-tuned models further confirm its robustness against the latest LM capabilities. Additionally, the method shows promise in robotics, for tasks that intertwine semantic reasoning, algorithmic reasoning, and interaction with coding APIs.

Authors (10)
  1. Chengshu Li (32 papers)
  2. Jacky Liang (21 papers)
  3. Andy Zeng (54 papers)
  4. Xinyun Chen (80 papers)
  5. Karol Hausman (56 papers)
  6. Dorsa Sadigh (162 papers)
  7. Sergey Levine (531 papers)
  8. Li Fei-Fei (199 papers)
  9. Fei Xia (111 papers)
  10. Brian Ichter (52 papers)