
Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch (2410.18693v2)

Published 24 Oct 2024 in cs.CL and cs.AI

Abstract: Improving the mathematical reasoning capabilities of LLMs is critical for advancing artificial intelligence. However, access to extensive, diverse, and high-quality reasoning datasets remains a significant challenge, particularly for the open-source community. In this paper, we propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method that enables the generation of large-scale mathematical reasoning datasets using lightweight 7B-scale models. ScaleQuest introduces a two-stage question-tuning process comprising Question Fine-Tuning (QFT) and Question Preference Optimization (QPO) to unlock the question generation capabilities of problem-solving models. By generating diverse questions from scratch -- without relying on powerful proprietary models or seed data -- we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform existing open-source datasets in both in-domain and out-of-domain evaluations. Furthermore, our approach shows continued performance improvement as the volume of training data increases, highlighting its potential for ongoing data scaling. The extensive improvements observed in code reasoning tasks demonstrate the generalization capabilities of our proposed method. Our work provides the open-source community with a practical solution to enhance the mathematical reasoning abilities of LLMs.


Summary

  • The paper introduces ScaleQuest, a scalable and cost-effective method using smaller models to synthesize high-quality mathematical reasoning data, unlike approaches relying on large proprietary models.
  • Using the ScaleQuest dataset to fine-tune open-source LLMs resulted in substantial performance improvements (29.2-46.4%) on the MATH benchmark, even surpassing proprietary models.
  • ScaleQuest offers a significant advancement for the open-source AI community by providing a cost-effective way to generate large-scale reasoning datasets, enabling performance gains without vast resources.

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

The paper "Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch" presents a paper focused on enhancing the reasoning capabilities of LLMs through innovative data synthesis techniques, specifically targeting mathematical reasoning tasks. The authors introduce a novel methodology, termed ScaleQuest, which is aimed at addressing the deficiencies in the availability of high-quality, large-scale reasoning datasets, which are crucial for the effective instruction tuning of LLMs.

Methodology

The central contribution of this research is ScaleQuest, a scalable and cost-efficient method for generating high-quality reasoning data with smaller, open-source models. Unlike techniques that depend heavily on powerful proprietary models such as GPT-4 for data generation, ScaleQuest harnesses models at the 7B-parameter scale, keeping costs manageable.

The method involves several key steps:

  1. Question Generation from Scratch: The method exploits the ability of causal LLMs to continue from an empty, context-free prompt, so the question generator samples brand-new questions autonomously rather than rewriting seed questions (see the first sketch after this list).
  2. Question Fine-Tuning (QFT): A problem-solving model is first fine-tuned on a small set of existing questions, enough to activate its question-generation ability without overfitting to the seed distribution.
  3. Question Preference Optimization (QPO): The QFT model is then optimized on question preference pairs to improve the solvability and difficulty of the questions it generates. An external model such as GPT-4o-mini refines questions for clarity and appropriate difficulty, and the refined versions serve as the preferred examples (see the second sketch after this list).
  4. Filtering and Response Generation: Generated questions pass through solvability and language filters, and responses to the surviving questions are selected with a reward-model-based filtering process so that only high-quality answers are retained (see the third sketch after this list).
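
To make step 1 concrete, here is a minimal sketch of question generation from scratch, assuming a question generator that has already been through QFT and QPO. The checkpoint path, chat-template prefix, and sampling settings are illustrative assumptions rather than the authors' released artifacts; the key point is that the prompt carries no seed question.

```python
from vllm import LLM, SamplingParams

# Hypothetical checkpoint: a 7B problem-solving model after QFT + QPO.
llm = LLM(model="path/to/question-generator-7b")

# High-temperature nucleus sampling encourages question diversity; n is
# the number of independent questions drawn per prompt.
params = SamplingParams(temperature=1.0, top_p=0.99, max_tokens=512, n=8)

# Each prompt is only a chat-template prefix (assumed format), with no
# seed question: the model completes it with a brand-new question.
prefix = "<|im_start|>user\n"  # substitute the generator's actual template
outputs = llm.generate([prefix] * 1000, params)

questions = [c.text.strip() for req in outputs for c in req.outputs]
```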
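Step 3 can be read as standard preference optimization applied to questions rather than answers. The sketch below, using TRL's DPOTrainer (assuming the TRL >= 0.12 API), supposes preference pairs in which a GPT-4o-mini-refined question is "chosen" and the raw QFT-model question is "rejected"; paths, pair contents, and hyperparameters are placeholders.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Toy preference pairs: the refined question (e.g., rewritten by
# GPT-4o-mini for solvability and difficulty) is preferred over the
# raw question sampled from the QFT model. Contents are placeholders.
pairs = Dataset.from_dict({
    "prompt":   ["<|im_start|>user\n"],  # same empty-style prompt as generation
    "chosen":   ["Refined, solvable question ..."],
    "rejected": ["Raw QFT-model question ..."],
})

model_path = "path/to/qft-model"  # hypothetical QFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

args = DPOConfig(output_dir="qpo-model", beta=0.1,
                 per_device_train_batch_size=2, num_train_epochs=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer)
trainer.train()
```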
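For the reward-model-based answer selection in step 4, each candidate solution can be scored and the best one kept. The sketch below substitutes a generic open-source reward model purely for illustration; the paper's pipeline also applies solvability and language filters before this stage.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A generic off-the-shelf reward model, used here as a stand-in.
RM = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(RM)
rm = AutoModelForSequenceClassification.from_pretrained(RM).eval()

@torch.no_grad()
def score(question: str, answer: str) -> float:
    # The model emits a single logit scoring the (prompt, response) pair.
    inputs = rm_tokenizer(question, answer, return_tensors="pt", truncation=True)
    return rm(**inputs).logits[0].item()

def best_response(question: str, candidates: list[str]) -> str:
    # Keep the highest-reward candidate among the sampled solutions.
    return max(candidates, key=lambda ans: score(question, ans))
```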

Results

Their approach yielded a dataset of 1 million high-quality problem-solution pairs. When this dataset was used to fine-tune mainstream open-source models such as Mistral and Llama3, significant performance improvements were observed, ranging from 29.2% to 46.4% over existing datasets on the MATH benchmark.
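
As an illustration of how such a dataset would be consumed, the following is a minimal supervised fine-tuning sketch with TRL's SFTTrainer. The dataset identifier and column names are assumptions; substitute the actual released data and schema.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed dataset path and column names; adjust to the released data.
data = load_dataset("dyyyyyyyy/ScaleQuest-Math", split="train")

def to_text(ex):
    # Flatten each problem-solution pair into a single training string.
    return {"text": f"Question: {ex['query']}\n\nAnswer: {ex['response']}"}

data = data.map(to_text)

args = SFTConfig(output_dir="scalequest-sft",
                 per_device_train_batch_size=4,
                 num_train_epochs=3)
trainer = SFTTrainer(model="mistralai/Mistral-7B-v0.1",  # one of the paper's base models
                     args=args, train_dataset=data)
trainer.train()
```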

Notably, the paper reports that Qwen2-Math-7B-Base fine-tuned on the ScaleQuest dataset surpassed well-known proprietary models, including GPT-4-Turbo and Claude-3.5-Sonnet, even though it underwent none of the preference-optimization stages those proprietary models typically receive.

Implications and Future Directions

The paper underscores the potential of smaller models, combined with a strategic synthesis approach, to create robust reasoning datasets that raise the baseline performance of LLMs. The ability to generate such high-quality data cost-effectively represents a significant advance for the open-source community, which often lacks access to the extensive resources available to proprietary model developers.

Looking ahead, the ScaleQuest methodology could be adapted to a broader range of reasoning tasks beyond mathematical problem-solving, such as scientific reasoning or competitive programming. Such an extension would encompass tasks requiring diverse reasoning and solution paths, potentially broadening the applicability of advanced LLMs to multiple complex, domain-specific scenarios.

Moreover, iterative refinement of the data generation and filtering processes could further improve the quality and diversity of the datasets, ultimately enhancing the self-improvement capabilities of LLMs. Future work could also explore the integration of broader and more diverse data sources, improving the adaptability and robustness of models on complex and nuanced reasoning tasks.

In conclusion, the research presented in this paper delivers a scalable, cost-effective framework for reasoning data synthesis, which stands to significantly benefit open-source AI development and the wider AI community's efforts in advancing LLMs' capabilities.