CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? (2306.16636v1)
Abstract: We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, sourced from actual Chinese workbooks and exams. This dataset aims to provide a benchmark for assessing the following question: to what grade level of elementary school math do the abilities of popular LLMs correspond? We evaluate a variety of popular LLMs, both commercial and open-source, and discover that only GPT-4 achieves success (accuracy $\geq$ 60\%) across all six elementary school grades, while other models falter at different grade levels. Furthermore, we assess the robustness of several top-performing LLMs by augmenting the original problems in the CMATH dataset with distracting information. Our findings reveal that GPT-4 maintains robustness, while the other models fail. We anticipate that our study will expose limitations in LLMs' arithmetic and reasoning capabilities and promote their ongoing development and advancement.
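To make the evaluation protocol concrete, here is a minimal Python sketch of the per-grade scoring the abstract describes. The dataset schema (`grade`, `question`, `golden`), the `predict` callable, and the exact-match grader are illustrative assumptions, not the paper's actual implementation; only the pass criterion (per-grade accuracy $\geq$ 60\%) comes from the abstract, and the "highest consecutive grade passed" summary is one plausible way to map accuracies to a grade level.

```python
# Hedged sketch of CMATH-style grade-level evaluation. Field names and the
# exact-match scoring rule are assumptions for illustration; the paper's
# grader may normalize numbers and units differently.
from collections import defaultdict

PASS_THRESHOLD = 0.60  # from the abstract: "success" means accuracy >= 60%


def evaluate_by_grade(examples, predict):
    """Compute per-grade accuracy over a list of problems.

    examples: iterable of dicts with hypothetical keys
              {"grade": int (1-6), "question": str, "golden": str}
    predict:  callable mapping a question string to the model's answer string
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["grade"]] += 1
        # Naive exact-match scoring (an assumption, see comment above).
        if predict(ex["question"]).strip() == ex["golden"].strip():
            correct[ex["grade"]] += 1
    return {g: correct[g] / total[g] for g in sorted(total)}


def highest_passed_grade(acc_by_grade):
    """Highest grade g such that the model passes every grade 1..g."""
    passed = 0
    for g in sorted(acc_by_grade):
        if acc_by_grade[g] >= PASS_THRESHOLD:
            passed = g
        else:
            break
    return passed


if __name__ == "__main__":
    # Toy demo with a trivial "model" that always answers "42".
    data = [
        {"grade": 1, "question": "3 + 4 = ?", "golden": "7"},
        {"grade": 1, "question": "40 + 2 = ?", "golden": "42"},
    ]
    acc = evaluate_by_grade(data, lambda q: "42")
    print(acc, "-> highest consecutive grade passed:", highest_passed_grade(acc))
```

Under this reading, a model like GPT-4 that clears the 60% threshold at every grade would score 6, while a model that falters at, say, grade 4 would score 3 even if it happens to pass grade 5.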
Authors: Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, Bin Wang