Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning
Abstract: Large language models (LLMs) have shown great potential in complex reasoning tasks, yet their performance is often hampered by the scarcity of high-quality, reasoning-focused training data. To address this challenge, we propose Key-Point-Driven Data Synthesis (KPDDS), a novel data synthesis framework that generates question-answer pairs by leveraging key points and exemplar practices drawn from authentic data sources. KPDDS produces novel questions with rigorous quality control and substantial scalability. Building on it, we present KPMath, an extensive synthetic dataset tailored for mathematical reasoning that comprises over 800K question-answer pairs. Augmenting KPMath with additional reasoning-intensive corpora yields the comprehensive KPMath-Plus dataset. The Qwen1.5-72B model, fine-tuned on KPMath-Plus, achieves 87.0% pass@1 accuracy on GSM8K and 58.3% on MATH, surpassing competitors in the 7B to 70B range as well as leading commercial models such as GPT-4 across multiple math reasoning datasets.
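The pass@1 numbers reported above are, by the metric's conventional definition, the fraction of problems for which a model's first sample is correct. The abstract does not spell out the estimator, so as a minimal sketch (assuming the standard unbiased pass@k formulation, with n samples per problem of which c are correct), the metric can be computed as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations (c of which
    are correct) solves the problem."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to the plain accuracy of a single sample:
print(pass_at_k(10, 6, 1))  # 6 correct out of 10 samples -> 0.6
```

For k = 1 the formula simplifies to c / n, so the 87.0% GSM8K figure is simply the share of test problems answered correctly on the first attempt.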