Overview: Enhancing Mathematical Reasoning with MathPrompter
In natural language processing, LLMs such as GPT-3 have driven significant progress, yet they remain unreliable at arithmetic reasoning tasks. Researchers at Microsoft Research proposed MathPrompter to address this limitation: building on Zero-shot chain-of-thought (CoT) prompting, it asks the model to produce multiple analytical solutions, such as algebraic expressions and Python functions, for the same mathematical problem.
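To make this concrete, the two prompt styles might look roughly as follows; the problem text and the exact wording below are illustrative paraphrases, not templates quoted from the paper:

```python
# Illustrative paraphrase of the two MathPrompter prompt styles (not the paper's exact wording).
# Numeric values in the problem have already been replaced by the placeholders A, B, and C.

TEMPLATED_PROBLEM = (
    "At a restaurant, each adult meal costs A dollars and kids eat free. "
    "If a group of B people came in and C were kids, how much did the group pay?"
)

ALGEBRAIC_PROMPT = TEMPLATED_PROBLEM + "\nWrite an algebraic expression for the answer: Answer = ..."
PYTHON_PROMPT = TEMPLATED_PROBLEM + "\nWrite a Python function solution(A, B, C) that returns the answer."
```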
Methodology and Approach
MathPrompter proceeds in a series of steps. First, the numeric values in an arithmetic problem are replaced with variable placeholders, producing an algebraic template. The template is then presented to the LLM with multiple prompts, each instructing it to produce a different analytical solution, typically an algebraic expression and a Python function. These solutions are verified by evaluating each of them on several randomly assigned values for the placeholder variables; agreement between the expressions across these random evaluations serves as a statistical check of correctness. Only once consensus is reached are the original numeric values substituted back in to produce the final answer.
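A minimal sketch of this verify-then-answer loop, under some simplifying assumptions, is shown below. The helper query_llm, the fixed number of random checks, and the use of eval/exec in place of proper response parsing are illustrative choices, not details prescribed by the paper:

```python
# Minimal sketch of MathPrompter's verify-then-answer loop.
# query_llm, n_checks, and the eval/exec shortcuts are assumptions for illustration;
# real response parsing and error handling are elided.
import random

def mathprompter_solve(algebraic_prompt, python_prompt, original_values,
                       query_llm, n_checks=5):
    # 1. Ask the LLM for two different analytical solutions to the templated problem.
    algebraic_expr = query_llm(algebraic_prompt)   # e.g. "A * (B - C)"
    python_code = query_llm(python_prompt)         # e.g. "def solution(A, B, C): return A * (B - C)"

    namespace = {}
    exec(python_code, namespace)                   # defines solution(...)

    def eval_algebraic(values):
        return eval(algebraic_expr, {}, dict(values))

    def eval_python(values):
        return namespace["solution"](**values)

    # 2. Verify: evaluate both solutions on several random variable assignments
    #    and require them to agree every time.
    for _ in range(n_checks):
        trial = {name: random.randint(1, 100) for name in original_values}
        if eval_algebraic(trial) != eval_python(trial):
            return None                            # no consensus, so no confident answer

    # 3. Consensus reached: substitute the original numbers to obtain the final answer.
    return eval_algebraic(original_values)
```

In this sketch, a None result signals that the two solutions never reached consensus; in a fuller implementation the whole process could be repeated rather than simply giving up.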
This method effectively breaks a complex problem into simpler, independently checkable steps, much as a human solver would. Built on the GPT-3 model, MathPrompter markedly improved the LLM's performance on arithmetic problems, achieving 92.5% accuracy on the MultiArith dataset versus the 78.7% of the previous Zero-shot CoT baseline.
Experimental Validation and Baselines
MathPrompter was evaluated against other prominent prompting methods on the MultiArith dataset, a subset of the Math Word Problem Repository designed to test LLMs' ability to carry out multi-step arithmetic and reasoning. The results show that MathPrompter not only outperforms numerous baselines, including the original Zero-shot approaches, but also achieves accuracy competitive with few-shot methods run on far larger models, such as the 540B-parameter PaLM. This underscores MathPrompter's efficiency: it delivers consistent results against models with much greater predictive capacity.
Conclusions and Future Perspectives
MathPrompter marks a notable advance in the utility of LLMs for arithmetic reasoning tasks, improving both their credibility and their trustworthiness. Its approach of solving each problem through multiple analytical methods and using their statistical agreement to verify correctness mirrors the cross-checking expected of rigorous academic work.
While MathPrompter significantly improves the accuracy of LLMs on arithmetic problems, it is not immune to error: discrepancies between the algebraic and Python expression results can still occur. Moving forward, the researchers aim to strengthen MathPrompter by incorporating additional methods into its validation process and by testing its efficacy across more diverse datasets.
MathPrompter thus represents a substantial stride toward models that not only simulate human-like problem-solving but also attach a measure of confidence to their outputs, establishing a robust foundation for future exploration of LLMs' reasoning capabilities.