
Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning

Published 4 Mar 2024 in cs.CL and cs.AI | arXiv:2403.02333v3

Abstract: LLMs have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality, reasoning-focused training datasets. To address this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that generates question-answer pairs by leveraging key points and exemplar practices drawn from authentic data sources. KPDDS produces novel questions with rigorous quality control and substantial scalability. Using it, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs. Augmenting KPMath with additional reasoning-intensive corpora yields the comprehensive KPMath-Plus dataset. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% pass@1 accuracy on GSM8K and 58.3% on MATH, surpassing open-source competitors in the 7B to 70B range as well as leading commercial models such as GPT-4 across multiple math reasoning benchmarks.
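The synthesis procedure the abstract describes — sample key points and an exemplar problem from authentic data, prompt a generator for a new question-answer pair, and keep only pairs that pass a quality gate — can be sketched in outline. This is an illustrative skeleton, not the authors' pipeline: the real KPDDS prompts, sampling strategy, and verification are more involved, and `generate` and `verify` here are hypothetical stand-ins for an LLM call and a quality-control check.

```python
import random
from typing import Callable

def synthesize_qa(
    key_points: list[str],
    exemplars: list[tuple[str, str]],
    generate: Callable[[str], tuple[str, str]],
    verify: Callable[[str, str], bool],
    n_samples: int = 3,
) -> list[tuple[str, str]]:
    """Key-point-driven synthesis loop (illustrative sketch):
    pair sampled key points with an exemplar practice, prompt a
    generator, and retain only pairs passing the quality gate."""
    synthesized = []
    for _ in range(n_samples):
        # Sample a small set of key points and one exemplar Q-A pair.
        kps = random.sample(key_points, k=min(2, len(key_points)))
        ex_q, ex_a = random.choice(exemplars)
        prompt = (
            "Key points: " + "; ".join(kps) + "\n"
            f"Example:\nQ: {ex_q}\nA: {ex_a}\n"
            "Write a new question and answer exercising the key points."
        )
        q, a = generate(prompt)
        if verify(q, a):  # quality-control gate on the synthesized pair
            synthesized.append((q, a))
    return synthesized
```

Plugging in a stub generator shows the control flow; in practice `generate` would call an LLM and `verify` would check answer consistency.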
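The pass@1 figures reported above are conventionally computed with the unbiased pass@k estimator introduced for code evaluation by Chen et al. (2021); the abstract does not spell out the estimator used, so the following is a sketch of that standard formula rather than the authors' exact evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn without replacement from n generations
    (c of which are correct) is correct."""
    if n - c < k:
        return 1.0  # fewer incorrect samples than k, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this reduces to the fraction of correct samples, c / n.
```

For k = 1 the estimator is simply the per-problem fraction of correct generations, averaged over the benchmark.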
