
CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?

Published 29 Jun 2023 in cs.CL, cs.AI, and cs.LG | arXiv:2306.16636v1

Abstract: We present the Chinese Elementary School Math Word Problems (CMATH) dataset, comprising 1.7k elementary school-level math word problems with detailed annotations, sourced from actual Chinese workbooks and exams. This dataset aims to provide a benchmark for assessing the following question: to what grade level of elementary school math do the abilities of popular LLMs correspond? We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60\%) across all six elementary school grades, while other models falter at different grade levels. Furthermore, we assess the robustness of several top-performing LLMs by augmenting the original problems in the CMATH dataset with distracting information. Our findings reveal that GPT-4 maintains its robustness, while the other models fail. We anticipate that our study will expose limitations in LLMs' arithmetic and reasoning capabilities and promote their ongoing development and advancement.
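The pass criterion described in the abstract is simple: a model "succeeds" at a grade when its accuracy on that grade's problems reaches 60%, and a model like GPT-4 succeeds overall only if it clears that bar in all six grades. The sketch below illustrates this aggregation; the data layout (a list of `(grade, is_correct)` pairs) and function names are assumptions for illustration, not the actual CMATH release format or evaluation code.

```python
# Hypothetical sketch of the per-grade pass criterion from the abstract:
# a model succeeds at a grade if its accuracy there is at least 60%,
# and succeeds overall only if it passes all six grades.

PASS_THRESHOLD = 0.60

def grade_accuracy(results):
    """results: list of (grade, is_correct) pairs for one model."""
    totals, correct = {}, {}
    for grade, ok in results:
        totals[grade] = totals.get(grade, 0) + 1
        correct[grade] = correct.get(grade, 0) + int(ok)
    return {g: correct[g] / totals[g] for g in totals}

def passes_all_grades(results, grades=range(1, 7)):
    """True iff accuracy meets the threshold in every listed grade."""
    acc = grade_accuracy(results)
    return all(acc.get(g, 0.0) >= PASS_THRESHOLD for g in grades)

# Toy example: 3 of 4 problems correct in each of grades 1-6.
toy = [(g, ok) for g in range(1, 7) for ok in (True, True, True, False)]
print(passes_all_grades(toy))  # 0.75 >= 0.60 in every grade, so True
```

Note that an overall (micro-averaged) accuracy above 60% would not imply this criterion: a model can average well while still failing the higher grades, which is exactly the pattern the paper reports for models other than GPT-4.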
