FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline (2508.16514v1)

Published 22 Aug 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.

Summary

  • The paper presents FLAMES, a framework for systematically comparing math reasoning data synthesis strategies, evaluating 10 existing strategies under a common experimental setup.
  • It finds that data agents designed to increase problem complexity yield the largest gains, and that under a fixed generation budget problem coverage matters more than solution reliability.
  • The resulting FLAMES dataset outperforms public datasets on OlympiadBench, CollegeMath, GSMPlus, and MATH; fine-tuning Qwen2.5-Math-7B on it reaches 81.4% on MATH.

FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline

The paper presents FLAMES, a framework developed to enhance mathematical reasoning in LLMs through a fine-grained examination of the data synthesis pipeline. By covering the entire lifecycle of synthetic data, from generation through curation to training, the framework aims to address critical gaps in LLMs' ability to solve mathematical problems accurately.

Methodology

The paper introduces FLAMES, a multi-faceted framework that dissects the data synthesis pipeline into discrete components: data generation, quality control, and training integration. The approach rests on the premise that each stage of the pipeline affects the efficacy of LLMs in mathematical reasoning. The framework applies fine-grained analyses to each step, providing a systematic pathway for optimizing the contribution of synthesized data to model training.
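
As a rough illustration of this three-stage decomposition, the sketch below wires together a toy generator, a quality filter, and a training formatter. All names here (`Problem`, `generate`, `quality_control`, `to_training_examples`) are hypothetical, and the generator is a template stub; the paper's actual synthesis agents are LLM-based.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    solution: str

def generate(seed_problems, budget):
    # Stage 1: data generation -- a stub that emits variants of seed
    # problems until the fixed generation budget is exhausted.
    out = []
    for i in range(budget):
        seed = seed_problems[i % len(seed_problems)]
        out.append(Problem(f"Variant {i}: {seed.question}", seed.solution))
    return out

def quality_control(problems):
    # Stage 2: quality control -- drop malformed items. Real pipelines
    # also deduplicate problems and validate candidate solutions.
    return [p for p in problems if p.question and p.solution]

def to_training_examples(problems):
    # Stage 3: training integration -- format as instruction pairs
    # for supervised fine-tuning.
    return [{"prompt": p.question, "completion": p.solution} for p in problems]

seeds = [Problem("What is 2 + 3?", "5")]
data = to_training_examples(quality_control(generate(seeds, budget=4)))
```

Keeping the stages separate like this is what allows each factor (generation strategy, filtering policy, training format) to be varied independently in a controlled comparison.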

The authors integrated quality control mechanisms, including data validation and redundancy checks, to ensure the robustness of synthesized data. A key finding here is that, under a fixed generation budget, retaining higher problem coverage outperforms keeping only problems with reliable solutions. They also explored diverse data generation strategies, from procedural generation based on mathematical problem templates to iterative refinement of existing problem sets.
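
The coverage-versus-reliability trade-off can be made concrete with a small sketch. Here "reliability" is approximated by majority-vote agreement across repeated solution samples for each problem; the function names and the 0.6 agreement threshold are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def filter_reliable(samples, threshold=0.6):
    # "Reliable solutions only" policy: keep a problem only if its
    # majority answer clears the agreement threshold.
    # samples maps problem -> list of sampled candidate answers.
    kept = {}
    for problem, answers in samples.items():
        answer, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:
            kept[problem] = answer
    return kept

def keep_coverage(samples):
    # "High coverage" policy: keep every problem, paired with its
    # majority answer even when agreement is low.
    return {p: Counter(a).most_common(1)[0][0] for p, a in samples.items()}

samples = {
    "easy": ["4", "4", "4"],        # full agreement
    "hard": ["7", "9", "7"],        # 2/3 agreement -- passes 0.6
    "ambiguous": ["1", "2", "3"],   # no agreement -- fails 0.6
}
```

Under the paper's finding, the second policy is preferable when the generation budget is fixed: discarding low-agreement problems shrinks coverage more than the gain in label reliability is worth.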

Results

FLAMES demonstrated substantial improvements in the mathematical reasoning capabilities of LLMs across benchmark datasets. The FLAMES dataset, a blend of the authors' novel and existing synthesis strategies, outperforms public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the dataset achieves 81.4% on MATH, surpassing the larger Llama3 405B, GPT-4o, and Claude 3.5 Sonnet, highlighting the importance of a carefully designed data synthesis pipeline for complex reasoning tasks.

The results reinforce that systematic improvements in data synthesis translate directly into higher accuracy in mathematical problem solving. Notably, GSM8K- and MATH-based synthetic data yields gains on competition-level benchmarks, demonstrating easy-to-hard generalization beyond the difficulty of the seed problems.

Implications and Future Work

The findings suggest that detailed assessment and strategic improvement of the data synthesis process are pivotal to advancing the mathematical reasoning capabilities of LLMs. The research provides a framework that augments existing data generation paradigms with a strong emphasis on precision and validation, setting a standard for future LLM training methodologies.

For future work, the exploration of adaptive data synthesis techniques, particularly leveraging reinforcement learning to dynamically adjust data complexity in response to model performance, could present promising avenues. Additionally, extending FLAMES to accommodate multi-modal data inputs may further enhance the model's utility across diverse reasoning domains.

Conclusion

FLAMES establishes a sophisticated approach to refining the data synthesis pipeline, which in turn enhances LLM performance in mathematical reasoning tasks. Through comprehensive evaluation and rigorous quality control measures, the framework has successfully improved the accuracy and reliability of LLMs in solving complex mathematical problems. This underscores the significance of data synthesis frameworks in advancing the field of AI-driven mathematical reasoning and sets the groundwork for further innovations.
