CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? (2306.16636v1)

Published 29 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, sourced from actual Chinese workbooks and exams. This dataset aims to provide a benchmark tool for assessing the following question: to what grade level of elementary school math do the abilities of popular LLMs correspond? We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60\%) across all six elementary school grades, while other models falter at different grade levels. Furthermore, we assess the robustness of several top-performing LLMs by augmenting the original problems in the CMATH dataset with distracting information. Our findings reveal that GPT-4 maintains robustness, while other models fail. We anticipate that our study will expose limitations in LLMs' arithmetic and reasoning capabilities, and promote their ongoing development and advancement.
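The evaluation protocol described in the abstract scores each model per grade and declares "success" only when accuracy reaches 60% at every one of the six grades. The sketch below illustrates that criterion; the field names (`grade`, `answer`) and the prediction format are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of CMATH-style per-grade scoring, assuming each problem
# carries a "grade" (1-6) and a gold "answer", and predictions are strings.
from collections import defaultdict

PASS_THRESHOLD = 0.60  # "success" criterion from the abstract: accuracy >= 60%

def grade_level_accuracy(problems, predictions):
    """Return accuracy per elementary-school grade."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for prob, pred in zip(problems, predictions):
        total[prob["grade"]] += 1
        if pred == prob["answer"]:
            correct[prob["grade"]] += 1
    return {g: correct[g] / total[g] for g in total}

def passes_all_grades(acc_by_grade):
    """A model 'succeeds' only if it clears the threshold at every grade."""
    return all(acc >= PASS_THRESHOLD for acc in acc_by_grade.values())
```

Under this criterion a model that excels at grades 1-5 but scores below 60% at grade 6 does not count as succeeding overall, which is how the paper distinguishes GPT-4 from the other models evaluated.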

Authors (5)
  1. Tianwen Wei
  2. Jian Luan
  3. Wei Liu
  4. Shuang Dong
  5. Bin Wang
Citations (22)