Mathador-LM: A New Paradigm for Mathematical Reasoning Evaluation in LLMs
This paper introduces Mathador-LM, a benchmark specifically designed to evaluate mathematical reasoning in LLMs. Mathador-LM addresses the prevalent issues of performance saturation and potential data contamination in existing benchmarks such as Grade-School Math (GSM) and MATH. To remain challenging and contamination-resilient, Mathador-LM draws inspiration from the Mathador game, which combines ruleset interpretation, planning, and problem-solving.
Benchmark Composition and Evaluation Criteria
Mathador-LM is built around the task of reaching a target number by applying basic arithmetic operations to a given set of base numbers. Each instance is dynamically generated at a specified difficulty level, which mitigates the risks of test-set leakage and overfitting. Because fresh instances can be created on demand, models cannot rely on memorized problem solutions that would otherwise inflate their apparent proficiency.
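To make the task format concrete, the sketch below checks whether a candidate sequence of operations reaches the target from a set of base numbers. The step representation, helper names, and the non-negative-integer constraint are illustrative assumptions, not the paper's reference implementation.

```python
# A minimal sketch of checking a Mathador-LM-style solution.
# Step format and ruleset details below are assumptions for illustration.
from typing import List, Tuple


def check_solution(base_numbers: List[int], target: int,
                   steps: List[Tuple[int, str, int]]) -> bool:
    """Verify that a sequence of arithmetic steps reaches the target.

    Each step is an (a, op, b) tuple; each operand must be either one of the
    still-unused base numbers or the result of an earlier step.
    """
    available = list(base_numbers)      # each base number may be used at most once
    intermediate = []                   # results of earlier steps can be reused
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x // y if y != 0 and x % y == 0 else None,
    }
    for a, op, b in steps:
        for operand in (a, b):
            if operand in available:
                available.remove(operand)
            elif operand in intermediate:
                intermediate.remove(operand)
            else:
                return False            # operand is neither a base number nor a prior result
        result = ops[op](a, b)
        if result is None or result < 0:
            return False                # assumes only non-negative integer intermediates count
        intermediate.append(result)
    return target in intermediate       # the target must be produced by some step


# Example instance: base numbers {2, 3, 5, 7, 11}, target 28 -> (3 + 11) * 2
print(check_solution([2, 3, 5, 7, 11], 28, [(3, "+", 11), (14, "*", 2)]))  # True
```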
The paper demonstrates Mathador-LM's utility by evaluating a range of leading LLMs, both open-source models such as LLaMA3 and Qwen2 and proprietary models such as Claude and GPT-3.5/4. Across all of these models, performance on Mathador-LM was markedly lower than on existing benchmarks, with scores falling well below those of average third-grade students, calling contemporary models' mathematical reasoning capabilities into question.
Key Observations and Numerical Findings
- Overall Difficulty: Mathador-LM presents a formidable challenge, with state-of-the-art models achieving less than 15% of the maximum score on average. By comparison, third-grade students average 43.7% in the Mathador competition, underscoring LLMs' current limitations in this domain.
- Model Performance Correlation: Performance on Mathador-LM correlates with model size. Models with fewer than 3 billion parameters had negligible accuracy, models in the 7-8 billion parameter range scored 5-7%, and the highest-performing open models (70-72 billion parameters), alongside Claude-Opus, reached 10-15%. Notably, prominent models such as GPT-4 and Claude-Haiku did not exceed a 7% average score.
- Model Variability: Analysis shows consistent model performance across independently generated instance sets of the same difficulty, indicating that Mathador-LM's dynamic generation yields stable, reproducible benchmarks suitable for rigorous LLM evaluation (a rough sketch of seeded generation follows this list).
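The stability noted above depends on instances being regenerable on demand. The following sketch illustrates one way seeded generation could work; the parameter names, value ranges, and the absence of a solvability filter are assumptions, not the paper's exact generation procedure.

```python
# An illustrative sketch of seeded, dynamic instance generation.
# Parameter names and value ranges are assumptions for illustration only.
import random


def generate_instance(seed: int, num_base: int = 5,
                      value_range: tuple = (1, 13),
                      target_range: tuple = (10, 99)):
    """Sample a (base_numbers, target) pair deterministically from a seed.

    The same seed always reproduces the same instance, while a new seed yields
    a previously unseen test set; a real generator would additionally filter
    targets for solvability at the desired difficulty level.
    """
    rng = random.Random(seed)
    base_numbers = sorted(rng.randint(*value_range) for _ in range(num_base))
    target = rng.randint(*target_range)
    return base_numbers, target


# The same seed reproduces the identical instance ...
assert generate_instance(seed=0) == generate_instance(seed=0)
# ... while different seeds produce fresh instances that cannot be memorized.
print(generate_instance(seed=0), generate_instance(seed=1))
```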
Implications and Future Directions
This work has implications for both the practical evaluation of LLMs and broader research directions. Practically, Mathador-LM gives researchers a contamination-resistant tool for assessing LLM reasoning capabilities, supporting more reliable model comparison and verification. Conceptually, it challenges prevailing assumptions about LLM proficiency and motivates work on architectures and training methods that bring models closer to human-like reasoning on such tasks.
Looking ahead, the paper suggests several directions in which Mathador-LM could evolve, including varying difficulty levels, exploring different arithmetic contexts or constraints, and fine-tuning LLMs on Mathador-derived insights. The benchmark can also serve as a testing ground for prompting techniques tailored to reasoning-specific tasks, potentially raising the bar for LLMs' logical reasoning.
Mathador-LM arrives at a moment when static benchmarks are losing effectiveness due to saturation and potential leakage. The continued development of dynamic benchmarks like Mathador-LM may reshape evaluation strategies, offering a path toward more secure, adaptive, and comprehensive evaluation frameworks in the evolving landscape of artificial intelligence.