An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP) (2302.13814v2)

Published 23 Feb 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We study the performance of a commercially available LLM known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors about MWPs relating to the number of unknowns and number of operations lead to a higher probability of failure when compared with the prior; in particular, across all experiments, the probability of failure increases linearly with the number of addition and subtraction operations. We have released the dataset of ChatGPT's responses to the MWPs to support further work on the characterization of LLM performance, and we present baseline machine learning models to predict whether ChatGPT can correctly answer an MWP.

Authors (4)
  1. Paulo Shakarian
  2. Abhinav Koyyalamudi
  3. Noel Ngu
  4. Lakshmivihari Mareedu
Citations (55)

Summary

  • The paper assesses ChatGPT's ability to solve MWPs, revealing a failure rate drop from 84% to 20% when the model shows its work.
  • It examines how variables such as problem complexity and number of unknowns significantly influence ChatGPT’s performance.
  • The study uses machine learning baselines to predict errors, highlighting features that could guide future improvements in LLM problem-solving.

Analytical Assessment of ChatGPT's Proficiency on Mathematical Word Problems

The paper "An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)" by Paulo Shakarian and colleagues presents a meticulous analysis of the performance of OpenAI's ChatGPT on mathematical word problems using the DRAW-1K dataset. This work represents the first independent evaluation of ChatGPT in this domain, strategically examining specific characteristics of mathematical word problems that influence the model's success or failure.

The authors conducted a series of experiments to assess the accuracy of ChatGPT when tasked with solving MWPs, introducing variations in the methodology to analyze different conditions. A key aspect of their investigation was to evaluate whether requiring ChatGPT to demonstrate its work affected its performance. In scenarios where the model was asked to show its work, its failure rate was significantly reduced to 20%, compared to an 84% failure rate when only the final answer was required.
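
The paper does not ship an evaluation harness, but the two-condition comparison can be illustrated with a short sketch. The prompt wording, the number-extraction heuristic, and the tolerance below are illustrative assumptions, not the authors' exact protocol:

```python
# Illustrative sketch of the two prompting conditions compared in the paper:
# answer-only versus "show your work". Prompt wording, number extraction,
# and tolerance are assumptions, not the authors' exact protocol.
import re

ANSWER_ONLY = "Solve the following problem and give only the final answer:\n{problem}"
SHOW_WORK = "Solve the following problem, showing your work step by step:\n{problem}"

def extract_numbers(text: str) -> list[float]:
    """Pull all numeric literals out of a free-form model response."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", text)]

def is_correct(response: str, truth: list[float], tol: float = 1e-3) -> bool:
    """Count a response correct if every DRAW-1K ground-truth value appears
    among the numbers mentioned in the response, within tolerance."""
    found = extract_numbers(response)
    return all(any(abs(f - t) <= tol for f in found) for t in truth)
```

Under a grader of this kind, the headline result is the gap between the two templates: an 84% failure rate in the answer-only condition versus 20% when work is required.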

Several factors were identified as increasing the likelihood of failure, including the number of unknowns and the complexity measured by the number of addition and subtraction operations. The latter exhibited a linear relationship with the probability of failure, indicating that multi-step problems pose a significant challenge for the LLM. Intriguingly, when ChatGPT was asked to show its work, the difficulty related to the number of unknowns appeared to diminish, though the complexity involving multiplication and division operations maintained a linear correlation with failure probability.
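
These problem characteristics are straightforward to derive from the ground-truth equations that accompany each DRAW-1K problem. The sketch below assumes single-letter variable names and plain infix operators in the equation strings; the actual DRAW-1K template conventions may differ:

```python
# Hypothetical feature extraction over a problem's ground-truth equation set,
# mirroring the characteristics the paper correlates with failure.
import re

def equation_features(equations: list[str]) -> dict[str, int]:
    unknowns = set()
    for eq in equations:
        # Assumes unknowns are single lowercase letters; note that "-" is
        # counted as an operation even when it is a unary minus.
        unknowns.update(re.findall(r"[a-z]", eq))
    return {
        "num_unknowns": len(unknowns),
        "add_sub_ops": sum(eq.count("+") + eq.count("-") for eq in equations),
        "mul_div_ops": sum(eq.count("*") + eq.count("/") for eq in equations),
        "num_equations": len(equations),
    }

# equation_features(["x + y = 10", "2*x - y = 5"])
# -> {'num_unknowns': 2, 'add_sub_ops': 2, 'mul_div_ops': 1, 'num_equations': 2}
```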

These findings are critical for advancing our understanding of the limitations of LLMs, particularly their difficulties in handling multi-step reasoning tasks. Furthermore, the authors released the dataset of ChatGPT's responses, encouraging further research into predictive models for assessing LLM performance on MWPs.

Baselines for performance prediction were also developed using machine learning techniques, utilizing ground-truth equations from DRAW-1K as an oracle. Despite the preliminary nature of these models, results indicated that certain features derived from the problems provide useful signals for predicting the model's performance, highlighting areas ripe for exploration in future studies.
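
As a concrete illustration of such a predictor, the sketch below trains a logistic regression on equation-derived features. The choice of learner, the feature columns, and the toy arrays are assumptions; real labels would come from the released dataset of ChatGPT responses, and the paper's own baselines may differ:

```python
# A minimal failure-prediction baseline in the spirit of the paper's models.
# Logistic regression, the feature columns, and the toy data are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# One row per MWP: [num_unknowns, add_sub_ops, mul_div_ops]
X = np.array([[2, 3, 1], [1, 1, 0], [3, 4, 2], [1, 0, 1],
              [2, 2, 0], [1, 1, 1], [3, 5, 1], [2, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # 1 = ChatGPT failed the problem

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("F1 on held-out problems:", f1_score(y_te, clf.predict(X_te)))
```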

The implications of this research could extend to practical applications involving LLMs, guiding the development of more robust problem-solving frameworks or systems that can predict when a model is likely to err, thereby enhancing reliability. As LLMs continue to be integrated into various domains, understanding their limitations and identifying the intricacies of tasks that challenge them will be imperative for their effective utilization.

In sum, this paper underscores the necessity of continued assessments of commercially available LLMs like ChatGPT, shedding light on potential avenues for refining their problem-solving capabilities. Future research could involve exploring similar evaluations on other emerging models by major technology companies and understanding the broader implications of these findings across different types of problem domains.
