MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark (2405.12209v1)
Abstract: Recent advancements in LLMs have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, falling short of providing a holistic assessment of LLMs' mathematical capabilities. To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of LLMs. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at varying depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model's mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs' mathematical abilities, providing a nuanced view of their levels of knowledge understanding and problem-solving skills in a bilingual context. The project is released at https://github.com/open-compass/MathBench .
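The hierarchical design described above can be sketched as a small data model: questions are organized by stage (five levels, from arithmetic to college mathematics) and by kind (theory vs. application), with per-cell accuracy aggregated separately. The stage names, field names, and scoring rule below are illustrative assumptions, not the benchmark's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Illustrative stage names; the paper specifies five stages from basic
# arithmetic to college mathematics, but these labels are assumptions.
STAGES = ["Arithmetic", "Primary", "Middle", "High", "College"]


@dataclass
class Question:
    stage: str     # one of STAGES
    kind: str      # "theory" (conceptual) or "application" (problem solving)
    language: str  # "en" or "zh" -- the benchmark is bilingual
    text: str
    answer: str


def accuracy_by_cell(questions, predictions):
    """Aggregate exact-match accuracy per (stage, kind) cell.

    Exact string match is a simplification; a real harness would
    normalize numeric and symbolic answers before comparing.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for q, pred in zip(questions, predictions):
        key = (q.stage, q.kind)
        total[key] += 1
        correct[key] += int(pred.strip() == q.answer.strip())
    return {k: correct[k] / total[k] for k in total}
```

Reporting one score per (stage, kind) cell, rather than a single aggregate, is what lets the benchmark separate a model's conceptual knowledge from its applied problem-solving at each depth.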
Authors: Zilong Zheng, Yuxuan Qiao, Haodong Duan, Zhiwei Fei, Fengzhe Zhou, Wenwei Zhang, Songyang Zhang, Dahua Lin, Kai Chen, HongWei Liu