Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Published 15 Aug 2023 in cs.CL, cs.AI, and cs.CV | (2308.07921v1)

Abstract: Recent progress in LLMs like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model shall automatically amend its solution, analogous to our approach of rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on MATH dataset (53.9% → 84.3%).

Authors (11)

Citations (126)

Summary

The paper presents a novel code-based self-verification (CSV) method that enables GPT-4 Code Interpreter to autonomously check and refine its math problem solutions.
By leveraging iterative code execution, the approach raises accuracy on the MATH dataset from 69.7% (GPT-4 Code Interpreter without CSV) to a state-of-the-art 84.32%.
The study highlights the potential for integrating self-verification in LLMs to improve reliability in fields like finance, engineering, and scientific research.

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification: An Expert Overview

In the paper "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification," the authors investigate how well GPT-4 Code Interpreter solves complex mathematical word problems. They introduce a technique called code-based self-verification (CSV), which has the model check its own solutions by executing code, with the aim of improving accuracy on math reasoning tasks.

The researchers begin by examining the limitations of existing LLMs in mathematical reasoning. The key challenge they identify is that LLMs frequently produce incorrect or nonsensical results, owing to the inherent complexity of math problems and the models' dependence on textual reasoning alone. Earlier approaches include Chain-of-Thought (CoT) prompting, which uses intermediate reasoning steps to strengthen logical reasoning, and PAL, which offloads computation to Python code for better numerical accuracy.

Focusing on GPT-4 Code Interpreter, the team conducted a series of controlled experiments to understand the role of code generation and execution in solving math problems. By placing different constraints on the "Code Usage Frequency," they found a strong correlation between how often the model uses code and how well it solves problems. GPT-4 Code Interpreter reaches 69.7% accuracy on the MATH dataset, up from 42.2% for its predecessor.

The paper proposes explicit CSV as a zero-shot prompting method for GPT-4 Code Interpreter that encourages the model to verify its answers with code. This self-verification step lets the model judge whether a solution is reasonable and adjust it as required. Specifically, when verification returns "False," the model amends its solution and verifies again, a behavior similar to debugging. This process raises accuracy on the MATH dataset from 69.7% to 73.54%.
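
The verify-and-amend behavior described above can be sketched as a small loop. This is a minimal illustration, not the authors' prompt or implementation: in the paper the loop is carried out by GPT-4 Code Interpreter itself in response to a zero-shot prompt, whereas here `propose` is a hypothetical stand-in for the model and `verify` is a hand-written check for a toy problem. The names `solve_with_csv`, `propose`, `verify`, and `max_attempts` are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple


def solve_with_csv(
    propose: Callable[[int], int],
    verify: Callable[[int], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[int], bool]:
    """Propose an answer, check it with code, and retry while the check fails.

    Returns the last proposed answer together with its verification state.
    """
    answer: Optional[int] = None
    for attempt in range(max_attempts):
        answer = propose(attempt)      # hypothetical stand-in for a model call
        if verify(answer):             # code-based check of the candidate
            return answer, True
    return answer, False               # attempts exhausted, still unverified


# Toy problem (an assumption for illustration):
# find the positive integer x with x**2 + x == 42.
candidates = [7, 5, 6]                 # pretend successive model answers


def propose(attempt: int) -> int:
    return candidates[attempt]


def verify(x: int) -> bool:
    return x * x + x == 42


result = solve_with_csv(propose, verify)
print(result)  # (6, True)
```

The first two candidates fail the check, so the loop moves on to a third, which passes. With `max_attempts=2` the function would instead return the last failed candidate with a state of `False`, which is the kind of signal the voting step can later use.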

The authors extend their methodology with verification-guided weighted majority voting. Each sampled solution is weighted according to its verification state, so that answers verified as "True" count for more than answers that failed verification. This yields a state-of-the-art accuracy of 84.32% on the MATH dataset.
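
A minimal sketch of such a weighted vote follows. The weight values, the sample data, and the function name `weighted_majority_vote` are assumptions chosen for illustration and are not taken from the paper; only the idea of weighting each vote by its verification state comes from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def weighted_majority_vote(
    samples: List[Tuple[str, str]],
    weights: Dict[str, float],
) -> str:
    """Return the answer whose votes, weighted by verification state, sum highest."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, state in samples:
        scores[answer] += weights[state]
    return max(scores, key=scores.get)


# Five sampled solutions as (final answer, verification state) pairs.
samples = [
    ("12", "False"),
    ("12", "False"),
    ("12", "False"),
    ("15", "True"),
    ("15", "True"),
]

# Illustrative weights (assumed): verified answers count five times as much.
weights = {"True": 1.0, "False": 0.2}

plain_winner = Counter(answer for answer, _ in samples).most_common(1)[0][0]
weighted_winner = weighted_majority_vote(samples, weights)

print(plain_winner)     # 12
print(weighted_winner)  # 15
```

In this toy case an unweighted majority picks "12" because it appears three times, while the weighted vote picks "15" because both of its supporting solutions passed verification.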

This paper's contributions are twofold: a systematic analysis of GPT-4 Code Interpreter's capabilities, emphasizing step-by-step code generation, and the introduction of self-verification techniques that exploit these capabilities. The authors also provide empirical evidence in the form of state-of-the-art results obtained with CSV on key math datasets, demonstrating that code-based self-verification markedly enhances performance.

The research has both practical and theoretical implications. Practically, it can lead to more reliable deployment of LLMs in fields requiring precise mathematical computations, such as finance, engineering, and scientific research. Theoretically, the methodology opens avenues for further work on self-verification and error correction within LLMs, possibly including multi-modal problem-solving approaches that integrate visual and contextual cues. The authors plan to continue this line of work by fine-tuning open-source LLMs on the generated datasets to further strengthen their mathematical reasoning.

In conclusion, this paper highlights the significant potential of CSV in augmenting LLMs' mathematical reasoning and sets a promising path for future enhancements in natural language processing with a focus on self-sufficient solution verification and refinement.
