Chain of Code: Reasoning with a Language Model-Augmented Code Emulator (2312.04474v4)

Published 7 Dec 2023 in cs.CL, cs.AI, cs.LG, and cs.RO

Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".

Introduction

Large language models (LMs) have demonstrated a remarkable ability to solve complex reasoning problems across domains such as mathematics and science. Approaches like Chain of Thought (CoT), which break a complex question into a sequence of intermediate reasoning steps, have advanced LM capabilities on semantic reasoning tasks. However, LMs often struggle with questions that require a mix of semantic understanding and precise numeric or symbolic reasoning. This gap has been partially addressed by prompting LMs, particularly those trained on code, to write and execute programs. While effective for arithmetic tasks, this approach remains challenging for broader semantic tasks that are difficult to express in executable code, such as detecting sarcasm. This paper introduces "Chain of Code" (CoC), a method designed to enhance LM code-driven reasoning by combining the structured approach of code with the ability to "think in code": generating pseudocode and either executing code snippets or simulating their execution.

Chain of Code Methodology

Chain of Code follows a two-step process. First, the LM is prompted to generate reasoning steps in the form of code, which may mix executable code, pseudocode, and natural language. Second, the generated code is executed: a code interpreter runs whatever it can, and an LM-augmented code emulator (an "LMulator") steps in when a line is not executable. The LMulator simulates the result of a code snippet that a standard interpreter cannot run directly, and this simulation can itself incorporate chain-of-thought reasoning to determine outputs, effectively letting the LM emulate the interpreter's behavior where needed.
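For concreteness, here is a minimal sketch of the kind of program an LM might generate under CoC for the sarcasm-counting example from the abstract. The surrounding code and the example essay string are illustrative assumptions, not taken from the paper; detect_sarcasm is deliberately left undefined, since under CoC that call is exactly what gets handed off to the LMulator.

```python
# Sketch of a CoC-style program (illustrative, not from the paper).
# Plain Python handles the counting; the semantic sub-task is expressed as a
# call to an undefined helper that the LMulator simulates at run time.
essay = "Oh great, another meeting. I just love spending my afternoons this way."
sarcasm_count = 0
for sentence in essay.split("."):
    if sentence.strip() == "":
        continue
    # No executable definition exists for detect_sarcasm; running this line
    # raises a NameError, which CoC catches and hands to the LM to emulate.
    if detect_sarcasm(sentence):
        sarcasm_count += 1
answer = sarcasm_count
```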

Implementing Chain of Code

Implementing CoC involves a simple but effective system built on Python's try and except blocks, along with a program state that manages execution flow. The code is evaluated line by line. If a line is executable, it runs and updates the program state; if not, the LM predicts its effect from the program context, including the previously run code and the state history, and generates the next program state. Through this mechanism, the LM and code execution become tightly integrated, enabling complex reasoning that combines the computational precision of code with the nuanced understanding of LMs.
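The loop described above can be sketched as follows. This is an assumption-based illustration rather than the authors' implementation: query_lm is a hypothetical stand-in for whatever LM call predicts the next program state, and for simplicity each statement is assumed to fit on a single line (real control flow spanning multiple lines would need to be grouped before execution).

```python
# Minimal sketch of the CoC execute-or-emulate loop (illustrative assumptions,
# not the reference implementation).

def query_lm(code_line, history, state):
    """Hypothetical LMulator call: given the line that failed, the code run so
    far, and the current program state, return the predicted updated state."""
    raise NotImplementedError("plug an LM backend in here")

def run_chain_of_code(program):
    state = {}      # shared program state: variable name -> value
    history = []    # lines executed or emulated so far
    for line in program.splitlines():
        try:
            # Executable lines run in the real Python interpreter and update
            # the program state directly.
            exec(line, state)
        except Exception:
            # Pseudocode or undefined helpers fail here; the LMulator predicts
            # the program state that executing this line would have produced.
            state = query_lm(line, "\n".join(history), state)
        history.append(line)
    return state
```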

Experimental Evaluation

Chain of Code has been evaluated on a variety of benchmarks, including BIG-Bench Hard, where it outperforms other prevalent reasoning methods and sets a new state of the art with an 84% success rate, 12% above Chain of Thought. The approach scales well across both large and small models, suggesting its benefits are not tied to model size. CoC also serves as a general-purpose reasoner, handling cross-task prompts whose in-context examples differ from the problem at hand, which points toward versatile reasoning applications. Comparisons with instruction-tuned models further confirm its robustness against the latest LM capabilities. Additionally, the method shows promise in robotics, for tasks that intertwine semantic reasoning, algorithmic reasoning, and interaction with coding APIs.

Authors (10)
  1. Chengshu Li (32 papers)
  2. Jacky Liang (21 papers)
  3. Andy Zeng (54 papers)
  4. Xinyun Chen (80 papers)
  5. Karol Hausman (56 papers)
  6. Dorsa Sadigh (162 papers)
  7. Sergey Levine (531 papers)
  8. Li Fei-Fei (199 papers)
  9. Fei Xia (111 papers)
  10. Brian Ichter (52 papers)