ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models (2402.14660v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates the concept-wise mathematical reasoning of LLMs. Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies. Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. In addition, we introduce an efficient fine-tuning strategy to remedy the weaknesses of existing LLMs. Finally, we hope ConceptMath will guide developers in understanding the fine-grained mathematical abilities of their models and facilitate the growth of foundation models.
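To make the idea of concept-wise evaluation concrete, the sketch below shows one plausible way to aggregate per-concept accuracies over a hierarchy of math concepts. The record format, field names (`concept`, `correct`), and slash-delimited concept paths are illustrative assumptions, not ConceptMath's actual schema.

```python
# Minimal sketch of concept-wise accuracy aggregation, assuming each evaluated
# problem carries a concept path such as "Arithmetic/Fractions" and a boolean
# correctness flag. These field names are hypothetical.
from collections import defaultdict

def concept_accuracies(records):
    """Aggregate correctness counts at every level of the concept hierarchy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        # Credit every ancestor concept on the path, from root to leaf,
        # so accuracy can be read off at any granularity.
        parts = rec["concept"].split("/")
        for depth in range(1, len(parts) + 1):
            key = "/".join(parts[:depth])
            total[key] += 1
            correct[key] += int(rec["correct"])
    return {concept: correct[concept] / total[concept] for concept in total}

if __name__ == "__main__":
    evaluated = [
        {"concept": "Arithmetic/Fractions", "correct": True},
        {"concept": "Arithmetic/Fractions", "correct": False},
        {"concept": "Algebra/Linear Equations", "correct": True},
    ]
    for concept, acc in sorted(concept_accuracies(evaluated).items()):
        print(f"{concept}: {acc:.2f}")
```

Reporting accuracies at both leaf and parent concepts is what allows the benchmark to expose models that look strong on average yet fail on specific basic concepts.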

Authors (13)
  1. Yanan Wu
  2. Jie Liu
  3. Xingyuan Bu
  4. Jiaheng Liu
  5. Zhanhui Zhou
  6. Yuanxing Zhang
  7. Chenchen Zhang
  8. Zhiqi Bai
  9. Haibin Chen
  10. Tiezheng Ge
  11. Wanli Ouyang
  12. Wenbo Su
  13. Bo Zheng