Augmenting Math Word Problems via Iterative Question Composing (2401.09003v5)

Published 17 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Despite the advancements in LLMs for mathematical reasoning, solving competition-level math problems remains a significant challenge, especially for open-source LLMs without external tools. We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs, aimed at enhancing the mathematical reasoning capabilities of base LLMs. Models fine-tuned on MMIQC consistently surpass their counterparts in performance on the MATH benchmark across various model sizes. Notably, Qwen-72B-MMIQC achieves 45.0% accuracy, exceeding the previous open-source state of the art by 8.2% and outperforming the initial version of GPT-4 released in 2023. Extensive evaluation results on Hungarian high school finals suggest that this improvement generalizes to unseen data. Our ablation study on MMIQC reveals that a large part of the improvement can be attributed to our novel augmentation method, Iterative Question Composing (IQC), which involves iteratively composing new questions from seed problems using an LLM and applying rejection sampling through another LLM.
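The abstract only outlines IQC at a high level. The following Python sketch illustrates the loop it describes, under stated assumptions: the helper functions (`compose_question`, `rejection_sample`, `extract_final_answer`) and the majority-vote filter are hypothetical placeholders, not the authors' actual pipeline, and the LLMs are passed in as plain callables.

```python
# Hedged sketch of the Iterative Question Composing (IQC) loop described in the
# abstract: one LLM composes new questions from seed problems, and a second LLM
# is used for rejection sampling of candidate solutions. All helpers and the
# filtering rule below are illustrative assumptions, not the paper's code.

def extract_final_answer(solution_text: str) -> str:
    """Crude final-answer extraction; real pipelines parse boxed answers."""
    return solution_text.strip().splitlines()[-1]

def compose_question(composer_llm, seed_question: str, seed_answer: str) -> str:
    """Ask the composer LLM to write a new question built on the seed problem."""
    prompt = (
        "Given the following problem and its answer, compose a new, harder "
        "question whose solution uses the original result.\n\n"
        f"Problem: {seed_question}\nAnswer: {seed_answer}\n\nNew question:"
    )
    return composer_llm(prompt)

def rejection_sample(verifier_llm, question: str, num_samples: int = 4):
    """Sample several candidate solutions and keep only the consistent ones."""
    candidates = [verifier_llm(f"Solve step by step: {question}")
                  for _ in range(num_samples)]
    # Keep candidates whose final answer agrees with the majority vote --
    # a simple proxy for the rejection-sampling filter the abstract mentions.
    finals = [extract_final_answer(c) for c in candidates]
    majority = max(set(finals), key=finals.count)
    return [(question, c) for c, f in zip(candidates, finals) if f == majority]

def iterative_question_composing(seeds, composer_llm, verifier_llm, num_iters: int = 3):
    """Iterate: compose new questions from the current pool, filter, add back."""
    pool = list(seeds)          # list of (question, answer) pairs
    dataset = []                # accepted (question, solution) pairs
    for _ in range(num_iters):
        new_pool = []
        for question, answer in pool:
            new_q = compose_question(composer_llm, question, answer)
            accepted = rejection_sample(verifier_llm, new_q)
            dataset.extend(accepted)
            if accepted:
                new_pool.append((new_q, extract_final_answer(accepted[0][1])))
        pool = new_pool
    return dataset
```

In this sketch each iteration feeds the newly composed, filtered questions back in as seeds, so the question pool deepens over successive rounds, which is the "iterative" aspect the abstract highlights.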

Authors (4)
  1. Haoxiong Liu (4 papers)
  2. Andrew Chi-Chih Yao (16 papers)
  3. Yifan Zhang (245 papers)
  4. Yifan Luo (17 papers)
Citations (25)