- The paper introduces NumGLUE, a multi-task benchmark that highlights the brittleness of current AI models in arithmetic reasoning.
- It comprises eight diverse tasks, spanning commonsense, domain-specific, and standard word-problem reasoning, and exposes a 46.4% gap between model and human performance.
- Joint training across tasks yields an average gain of 3.4% per task over task-specific training, and information retrieval helps further on knowledge-intensive tasks, indicating promising directions for hybrid model development.
The paper introduces NumGLUE, a multi-task benchmark designed to evaluate the arithmetic reasoning capabilities of AI systems. It targets the brittleness of current state-of-the-art models, which fail at simple arithmetic reasoning when the same problem is posed in a slightly different scenario. The benchmark comprises eight tasks that require arithmetic understanding, drawing inspiration from the GLUE benchmark for natural language understanding.
The core motivation is to push AI systems towards robust and general arithmetic reasoning, viewed as a critical step towards handling more complex mathematical reasoning. The benchmark includes both newly curated tasks and existing datasets, totaling approximately 100,000 problems. The authors show that current neural models, including large pretrained language models, perform significantly worse than humans on NumGLUE (46.4% lower). The paper also demonstrates the benefits of joint training across tasks, which enables knowledge sharing and yields superior performance (an average gain of 3.4% per task) compared to task-specific training.
The NumGLUE benchmark consists of eight tasks (an illustrative sketch of possible task instances follows the list):
- Task 1: Commonsense + Arithmetic Reasoning. This task combines commonsense knowledge with arithmetic operations. For instance, a question might require knowing the number of faces on a standard die to calculate the total faces of multiple dice.
- Task 2: Domain Specific + Arithmetic Reasoning. This task requires domain-specific knowledge in areas like chemistry and physics, combined with arithmetic skills. An example involves understanding chemical reactions to determine the quantity of reactants needed.
- Task 3: Commonsense + Quantitative Comparison. This task involves quantitative comparisons using commonsense knowledge. An example question asks which object has a higher gravitational force based on their masses.
- Task 4: Fill-in-the-blanks Format. This task presents arithmetic word problems in a fill-in-the-blanks format, testing the model's ability to understand and solve problems with a slightly different presentation style.
- Task 5: Reading Comprehension (RC) + Explicit Numerical Reasoning. This task selects questions from the DROP dataset [dua2019drop] that require reading comprehension and numerical reasoning to arrive at a numerical answer.
- Task 6: RC + Implicit Numerical Reasoning. This task also uses the DROP dataset but focuses on questions where numerical reasoning is implicitly required to arrive at a non-numerical answer, such as identifying a player with the shortest field goal.
- Task 7: Quantitative Natural Language Inference (NLI). This task utilizes the EQUATE dataset [ravichander2019equate] and presents quantitative NLI questions that demand arithmetic calculations to classify the relationship between a premise and a hypothesis.
- Task 8: Arithmetic Word Problems. This task includes traditional arithmetic word problems from multiple datasets [koncel2016mawps, koncel2015parsing, kushman2014learning], assessing the model's ability to solve standard mathematical problems expressed in natural language.
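To make the task formats above concrete, the following is a minimal sketch of how instances from a few of the eight tasks could be represented. The field names and example problems are assumptions made for illustration, not the benchmark's actual schema.

```python
# Hypothetical instances for a few NumGLUE task types.
# Field names and wording are illustrative assumptions, not the official data format.
examples = [
    {   # Task 1: Commonsense + Arithmetic Reasoning
        "task": 1,
        "question": "A standard die has 6 faces. How many faces do 4 dice have in total?",
        "answer": "24",
    },
    {   # Task 3: Commonsense + Quantitative Comparison
        "task": 3,
        "question": "Which exerts the stronger gravitational pull: a 5 kg rock or a 2 kg rock?",
        "answer": "the 5 kg rock",
    },
    {   # Task 7: Quantitative NLI
        "task": 7,
        "premise": "The team scored 21 points in the first half and 10 in the second.",
        "hypothesis": "The team scored more than 30 points in the game.",
        "answer": "entailment",
    },
]

for ex in examples:
    prompt = ex.get("question") or f"{ex['premise']} => {ex['hypothesis']}"
    print(f"Task {ex['task']}: {prompt} -> {ex['answer']}")
```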
Various baselines were evaluated on NumGLUE, spanning heuristic approaches, zero-shot and few-shot settings, and fine-tuning, and covering both neuro-symbolic and end-to-end architectures. The proposed baselines include a memory-augmented model extending NumNet+v2 [ran2019numnet] (referred to as Ex-NumNet) and GPT3 [NEURIPS2020_1457c0d6]. Models are assessed with the F1 measure, and the aggregate NumGLUE score is reported as the unweighted average of F1 scores across all eight tasks.
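Since the aggregate score is simply a macro-average over tasks, it can be computed as in this short sketch; the per-task F1 values used here are made up for illustration.

```python
# Aggregate NumGLUE score: unweighted (macro) average of per-task F1.
# The per-task F1 values below are invented purely for illustration.
task_f1 = {1: 30.2, 2: 41.5, 3: 55.0, 4: 38.7, 5: 60.1, 6: 67.3, 7: 64.8, 8: 40.9}

numglue_score = sum(task_f1.values()) / len(task_f1)
print(f"Aggregate NumGLUE score: {numglue_score:.1f}")
```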
Key findings from the experiments:
- The NumGLUE benchmark is challenging, with all baseline models performing significantly below human level.
- Task 1, requiring numerical commonsense knowledge, is the most difficult to solve. Tasks 2, 4, and 8 also pose significant challenges due to their reliance on accurate numerical calculations.
- Tasks 6 and 7 show relatively better performance, possibly due to the models' ability to handle span-based questions and multiple-choice formats more effectively.
- Information Retrieval (IR) helps improve performance on tasks 1, 2, and 4, where external knowledge is crucial, and Conditional Information Retrieval (CIR) yields the strongest baseline (a rough sketch of the retrieve-then-read idea follows this list).
- Oversampling to address data imbalance does not consistently improve performance across all tasks.
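As referenced above, the IR-augmented baselines can be pictured as a retrieve-then-read pipeline that prepends retrieved knowledge to the question, with the conditional variant retrieving only when a task actually needs external knowledge. The sketch below makes assumptions about the interfaces involved (the toy retriever, the corpus, and the `solve` callable are placeholders, not the paper's implementation).

```python
# Rough sketch of an IR-augmented baseline (hypothetical interfaces, not the paper's code).

def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank corpus sentences by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda s: -len(q_words & set(s.lower().split())))
    return ranked[:k]

def answer_with_ir(question: str, corpus: list[str], needs_knowledge: bool, solve) -> str:
    """Conditional IR: prepend retrieved facts only when external knowledge is needed."""
    if needs_knowledge:
        context = " ".join(retrieve(question, corpus))
        return solve(context + " " + question)
    return solve(question)
```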
The error analysis reveals four main categories of errors: invalid output, copying numbers from the question, incorrect calculations, and redundant text generation. Incorrect calculations are the most frequent error type. GPT3 excels in producing valid outputs but tends to generate more redundant text compared to Ex-NumNet.
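These error buckets can be thought of as post-hoc checks on a predicted answer. The heuristics below are an illustrative approximation of such an analysis, not the authors' actual procedure.

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def categorize_error(question: str, prediction: str, gold: str) -> str:
    """Toy heuristics mirroring the four error categories; illustrative only."""
    if prediction.strip() == gold.strip():
        return "correct"
    if not prediction.strip():
        return "invalid output"
    if len(prediction.split()) > 2 * max(len(gold.split()), 1):
        return "redundant text generation"
    pred_nums = set(NUM.findall(prediction))
    if pred_nums and pred_nums & set(NUM.findall(question)):
        return "copied a number from the question"
    return "incorrect calculation"

print(categorize_error("John has 3 apples and buys 4 more. How many?", "3", "7"))
# -> "copied a number from the question"
```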
The paper suggests future research directions, including combining the strengths of end-to-end and neuro-symbolic models. Fine-tuned GPT3-13B outperforms other baselines on tasks 1, 2, and 3, likely due to its pre-training on a large text corpus. The smaller Ex-NumNet benefits from multi-task learning and IR, indicating that a combination of these approaches could lead to further improvements.
In summary, the NumGLUE benchmark presents a challenging suite of tasks for evaluating arithmetic reasoning in AI systems. The results highlight the limitations of current models and suggest avenues for future research, including the integration of external knowledge, multi-task learning, and hybrid model architectures.