
DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving (2407.13690v1)

Published 18 Jun 2024 in cs.CL and cs.AI

Abstract: Solving mathematical problems requires advanced reasoning abilities and presents notable challenges for LLMs. Previous works usually synthesize data from proprietary models to augment existing datasets, followed by instruction tuning to achieve top-tier results. However, our analysis of these datasets reveals severe biases towards easy queries, with frequent failures to generate any correct response for the most challenging queries. Hypothesizing that difficult queries are crucial to learn complex reasoning, we propose Difficulty-Aware Rejection Tuning (DART), a method that allocates difficult queries more trials during the synthesis phase, enabling more extensive training on difficult samples. Utilizing DART, we have created new datasets for mathematical problem-solving that focus more on difficult queries and are substantially smaller than previous ones. Remarkably, our synthesis process solely relies on a 7B-sized open-weight model, without reliance on the commonly used proprietary GPT-4. We fine-tune various base models on our datasets ranging from 7B to 70B in size, resulting in a series of strong models called DART-MATH. In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-MATH outperforms vanilla rejection tuning significantly, being superior or comparable to previous arts, despite using much smaller datasets and no proprietary models. Furthermore, our results position our synthetic datasets as the most effective and cost-efficient publicly available resources for advancing mathematical problem-solving.

Overview of "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving"

The paper "DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving" presents an approach to improving the mathematical problem-solving capabilities of LLMs by addressing the bias in existing synthetic datasets towards easier queries. The authors introduce Difficulty-Aware Rejection Tuning (DART), which allocates more sampling trials to harder queries during data synthesis.

Problem Statement and Hypothesis

Mathematical reasoning remains a challenging frontier for LLMs, often due to biases in training datasets that favor easier problems, resulting in inadequate training on more complex queries. The authors hypothesize that this bias inhibits the mathematical reasoning capabilities of these models because difficult queries are essential to developing robust problem-solving skills.

Methodology

DART enhances training on difficult mathematical queries through two data-synthesis strategies: Uniform and Proportional to Difficulty (Prop2Diff). Both shift focus towards challenging queries by allocating them more trials for generating correct responses during synthesis. The Uniform strategy collects an equal number of correct responses for every query, while Prop2Diff scales this number with query difficulty, increasing the representation of harder queries (a minimal allocation sketch follows the list below).

  • Uniform Strategy: Balances the dataset by collecting an equal number of correct responses for every query, avoiding the prevalent bias towards easier problems.
  • Prop2Diff Strategy: Skews data representation towards difficult queries in proportion to their difficulty, promoting deeper learning through increased exposure to complex problems.
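
As a concrete illustration of the two allocation schemes, the sketch below assigns a per-query target for the number of correct responses to collect. It is a minimal, hypothetical sketch: it assumes difficulty is measured as a fail rate estimated in a preliminary sampling pass, and the function name and budget scheme are illustrative rather than the authors' exact implementation.

```python
# Hypothetical sketch of the two allocation strategies (names and budget
# scheme are illustrative, not the authors' exact implementation).

def allocate_targets(fail_rates, total_responses, strategy="uniform"):
    """Decide how many correct responses to collect per query.

    fail_rates: floats in [0, 1], one per query, estimated by a
        preliminary sampling pass (higher = harder query).
    total_responses: overall number of correct responses to collect.
    strategy: "uniform"   -> same target for every query;
              "prop2diff" -> target proportional to difficulty (fail rate).
    """
    n = len(fail_rates)
    if strategy == "uniform":
        return [total_responses // n] * n
    # Prop2Diff: harder queries (higher fail rate) get more correct responses.
    total_difficulty = sum(fail_rates) or 1.0
    return [max(1, round(total_responses * fr / total_difficulty))
            for fr in fail_rates]


# Example: three queries of increasing difficulty, 300 correct responses total.
print(allocate_targets([0.1, 0.5, 0.9], 300, "uniform"))    # [100, 100, 100]
print(allocate_targets([0.1, 0.5, 0.9], 300, "prop2diff"))  # [20, 100, 180]
```

Under this sketch, a query that fails nine times out of ten receives nine times the target of one that fails once out of ten, which is what rebalances the resulting dataset towards harder problems.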

Data synthesis under both strategies relies solely on DeepSeekMath-7B-RL, a strong open-weight model, rather than proprietary models such as GPT-4; sampled responses are kept only if their final answers match the reference answers.
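
The rejection-sampling step itself can be pictured as below. This is a hedged sketch: `generate_solution` and `extract_answer` are hypothetical placeholders for sampling from an open-weight generator such as DeepSeekMath-7B-RL and for parsing a final answer; they are not a real API, and the loop structure is an assumption rather than the paper's code.

```python
# Minimal rejection-sampling sketch: keep sampling solutions for a query
# until the per-query target of correct responses is met or the trial
# budget is exhausted; correctness is checked by answer matching.

def collect_correct_responses(query, reference_answer, target, max_trials,
                              generate_solution, extract_answer):
    kept = []
    for _ in range(max_trials):
        solution = generate_solution(query)            # one sampled solution
        if extract_answer(solution) == reference_answer:
            kept.append(solution)                      # keep only correct ones
            if len(kept) >= target:
                break
    return kept
```

Because harder queries receive a larger `target` (and correspondingly more trials), the loop naturally spends more of the sampling budget where vanilla rejection tuning tends to produce no correct responses at all.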

Empirical Findings

The paper reports significant advances using models ranging from 7B to 70B parameters evaluated against six established mathematical benchmarks. DART-Math not only outperforms traditional rejection tuning methods but also yields high efficacy with dataset sizes far smaller than existing datasets like MMIQC and MetaMath. For instance, DART-Math applied to the Llama3-8B model showed a performance jump from 21.2% to 46.6% on the MATH benchmark and from 51.0% to 82.5% on GSM8K, demonstrating substantial improvements in model performance exclusively through enhanced dataset curation.

Implications

The paper suggests notable implications for the development and training of LLMs:

  • Data Efficiency: By focusing on challenging queries, DART-Math generates smaller yet more potent datasets, which yield better training outcomes and reduce dependency on proprietary model-generated data.
  • Model Agnosticism: The success of DART strategies across different base models suggests potential extensions of these methods to a broader array of problem-solving contexts beyond mathematics.
  • Public Resource Value: The availability of DART-Math as an open-source dataset fosters additional research and practical applications without the cost concerns associated with traditional large datasets.

Future Directions

Looking forward, the authors suggest that query augmentation could further enhance the DART framework and plan to explore this avenue. They also note that difficulty metrics other than the fail rate could refine training, and that the approach may extend to code-based reasoning settings, where similar difficulty biases could have comparable detrimental effects.

In conclusion, "DART-Math" delineates a promising direction towards rectifying bias in mathematical training data, improving model performance, and enhancing the cost-effectiveness of LLMs in mathematical reasoning, marking a substantive step forward in the computational toolset available to researchers and practitioners.

Authors (5)
  1. Yuxuan Tong (4 papers)
  2. Xiwen Zhang (27 papers)
  3. Rui Wang (996 papers)
  4. Ruidong Wu (3 papers)
  5. Junxian He (66 papers)
Citations (14)