
MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning (2506.15706v1)

Published 30 May 2025 in cs.LG and cs.AI

Abstract: Mathematical reasoning presents a significant challenge for LLMs as it requires ensuring the correctness of each reasoning step. Researchers have been strengthening the mathematical reasoning abilities of LLMs through supervised fine-tuning, but due to the inability to suppress incorrect outputs, illusions can easily arise. Recently, Direct Preference Optimization (DPO) has been widely adopted for aligning human intent by using preference data to prevent LLMs from generating incorrect outputs. However, it has shown limited benefits in long-chain mathematical reasoning, mainly because DPO struggles to effectively capture the differences between accepted and rejected answers from preferences in long-chain data. The inconsistency between DPO training and LLMs' generation metrics also affects the effectiveness of suppressing incorrect outputs. We propose the Multi-Granularity Direct Preference Optimization (MDPO) method, optimizing the mathematical reasoning of LLMs at three granularities: Solution2Solution, Inference2Inference, and Step2Step. Solution2Solution focuses on the correctness of entire long-chain reasoning; Inference2Inference concentrates on logical reasoning between steps; Step2Step corrects computational errors in steps, enhancing the computational capabilities of LLMs. Additionally, we unify the training objectives of the three granularities to align with the generation metrics. We conducted experiments on the open-source models Qwen2 and Llama3, achieving improvements of 1.7% and 0.9% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset, outperforming DPO and other DPO variant methods. Furthermore, we also provide a pipeline for constructing MDPO training data that is simple and does not require manual annotation costs.

This paper introduces Multi-Granularity Direct Preference Optimization (MDPO), a method designed to enhance the mathematical reasoning capabilities of LLMs. The core problem addressed is that LLMs often struggle with long-chain mathematical reasoning, where errors in any single step can lead to incorrect final answers. While Supervised Fine-Tuning (SFT) can improve these abilities, it often leads to hallucinations and doesn't effectively suppress incorrect outputs. Direct Preference Optimization (DPO) has been effective for general alignment but shows limited benefits in complex mathematical reasoning, as it struggles to pinpoint specific errors in long solution chains and its training objective can be inconsistent with generation metrics.

MDPO proposes to optimize LLMs at three distinct granularities, providing more targeted supervision signals:

  1. Solution2Solution (Sol2Sol): This level operates on the entire reasoning chain (solution). It provides coarse-grained supervision by comparing complete correct solutions ($y_w$) with complete incorrect solutions ($y_l$) for a given problem ($x$). This is similar to standard DPO.
  2. Inference2Inference (Infer2Infer): This level focuses on the logical transitions between individual reasoning steps. An "inference" is defined as the generation from $step_k$ to $step_{k+1}$. If a particular inference leads to a higher error rate or an incorrect path, it is labeled $infer_{lose}$; a corrected or alternative successful inference is $infer_{win}$. This provides fine-grained supervision for the reasoning process.
  3. Step2Step: This level targets computational errors within a single reasoning step. If a $step_k^{lose}$ contains a calculation mistake, a corrected $step_k^{win}$ is provided. This aims to directly improve the model's computational accuracy.

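To make the three granularities concrete, the following minimal sketch (field names and the example strings are illustrative, not taken from the paper) shows the shape of a preference record at each level:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PreferencePair:
    """One MDPO training record: a shared context plus a preferred and a rejected continuation."""
    problem: str                          # the problem statement x
    prefix_steps: List[str] = field(default_factory=list)  # correct preceding steps s_{0~k-1} (empty for Sol2Sol)
    chosen: str = ""                      # y_w: preferred continuation
    rejected: str = ""                    # y_l: rejected continuation
    granularity: str = "sol2sol"          # "sol2sol" | "infer2infer" | "step2step"

# Sol2Sol: whole correct solution vs. whole incorrect solution (k = 0, empty prefix).
sol2sol_example = PreferencePair(
    problem="Tom has 3 apples and buys 5 more. How many apples does he have?",
    chosen="[Step 1] 3 + 5 = 8. The answer is 8.",
    rejected="[Step 1] 3 + 5 = 7. The answer is 7.",
    granularity="sol2sol",
)
# Infer2Infer would share a correct step prefix and differ in the next inference onward;
# Step2Step would share the prefix and differ only in the single next (mis)calculated step.
```
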
A key aspect of MDPO is its unified training objective, inspired by Simple Preference Optimization (SimPO) (Lai et al., 26 Jun 2024). The objective aims to align the fine-tuning process with the downstream generation metrics. The mathematical reasoning task is framed as a text completion task: given a problem $x$ and the preceding $k$ correct steps $s_{0 \sim k-1}$, the model must generate the remaining steps to reach the correct answer. This applies to all three granularities:

  • Sol2Sol: $k=0$; the model generates the entire solution from "Let’s think step by step."
  • Infer2Infer & Step2Step: Given $x$ and $s_{0 \sim i-1}$, the model generates $step_i$ and the subsequent steps.

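A minimal sketch of how this shared completion-style context could be assembled (the exact prompt template is not given in this summary; the "[Step i]" markers follow the data pipeline described further below):

```python
def build_context(problem: str, prefix_steps: list[str]) -> str:
    """Concatenate the problem with the k verified preceding steps.

    Sol2Sol uses an empty prefix (k = 0), so the model completes the whole
    solution from "Let's think step by step."; Infer2Infer and Step2Step
    condition on the verified steps and must generate step k onward.
    """
    lines = [problem, "Let's think step by step."]
    for i, step in enumerate(prefix_steps, start=1):
        lines.append(f"[Step {i}] {step}")
    return "\n".join(lines)
```
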
The MDPO loss function is:

$$\mathcal{L}_{\text{MDPO}}(\pi_\theta) = -\mathbb{E}_{(x,s_{0\sim k-1}, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid (x,s_{0\sim k-1})) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid (x,s_{0\sim k-1})) - \gamma \right) \right]$$

where $x$ is the problem, $s_{0 \sim k-1}$ are the correct preceding steps, $y_w$ is the preferred continuation, $y_l$ is the rejected continuation, $\beta$ is a scaling factor for the reward difference, and $\gamma$ is a target reward margin. The reward for a sequence $y$ given context $(x, s_{0 \sim k-1})$ is defined as its length-normalized average log-likelihood: $r(x, s_{0 \sim k-1}, y) = \frac{\beta}{|y|} \sum_{i=1}^{|y|} \log \pi_\theta(y_i \mid x, s_{0 \sim k-1}, y_{<i})$.
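
A minimal PyTorch sketch of this objective, assuming per-token log-probabilities of the chosen and rejected continuations (given the shared context) have already been gathered from the policy; tensor names, the padding convention, and the default $\gamma$ are illustrative:

```python
import torch
import torch.nn.functional as F

def mdpo_loss(
    logps_chosen: torch.Tensor,    # (B, T_w) per-token log pi_theta for y_w, padded with 0
    logps_rejected: torch.Tensor,  # (B, T_l) per-token log pi_theta for y_l, padded with 0
    mask_chosen: torch.Tensor,     # (B, T_w) 1 for real tokens, 0 for padding
    mask_rejected: torch.Tensor,   # (B, T_l)
    beta: float = 0.4,
    gamma: float = 0.5,            # illustrative default; the paper's margin is not given in this summary
) -> torch.Tensor:
    # Length-normalized sequence rewards: r = (beta / |y|) * sum_t log pi(y_t | context, y_<t)
    reward_chosen = beta * (logps_chosen * mask_chosen).sum(-1) / mask_chosen.sum(-1)
    reward_rejected = beta * (logps_rejected * mask_rejected).sum(-1) / mask_rejected.sum(-1)
    # -log sigma(r_w - r_l - gamma), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()
```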

The paper also outlines a pipeline for automatically constructing the multi-granularity preference data without manual annotation:

  • Sol2Sol Data:

    1. Use an LLM to generate multiple reasoning paths for each problem, prepending "[Step i]" to each step.
    2. Verify the paths against the dataset labels.
    3. Select paths with correct final answers as $y_w$ and incorrect ones as $y_l$.
    4. Prioritize problems for which the model generates both correct and incorrect solutions.

  • Infer2Infer Data:

    1. Use the erroneous reasoning paths from Sol2Sol.
    2. Segment each path into steps and create windows $W = (w_0, \dots, w_i)$, where $w_i = (step_0, \dots, step_i)$.
    3. For each window $w_i$, have the LLM generate and sample $k$ reasoning paths.
    4. Calculate the error rate for each window: $\text{error}(w_i) = \text{num\_erroneous\_paths} / \text{total\_paths}$.
    5. An unreliable step $step_i$ is identified if $\text{error}(w_i) > \text{error}(w_{i-1})$; the transition from $step_{i-1}$ to $step_i$ is $infer_{lose}$.
    6. Generate from $w_{i-1}$ again, sampling a reliable path with a correct final answer as $infer_{win}$.
    7. Construct preference pairs: $(x \| w_{i-1}, infer_{win}, infer_{lose})$ (a condensed sketch of this logic follows the Step2Step list below).

  • Step2Step Data:

    1. Use the selected problems, including new ones with more complex calculations (numbers in the original problems replaced with more complex ones).
    2. The LLM generates reasoning paths, which are sampled and segmented into steps.
    3. Use GPT-4 with prompts to find the first step $step_k^{lose}$ that contains a calculation error.
    4. GPT-4 corrects it to $step_k^{win}$ and generates the rest of the solution.
    5. Verify the LLM's modifications via answer checking.
    6. Construct preference data: $(x \| step_{0 \sim k-1}, step_k^{win}, step_k^{lose})$.
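
A condensed sketch of the construction logic above. The sampling, answer-extraction, and GPT-4 correction calls are abstracted behind placeholder callables (`sample_paths`, `extract_answer`), since the actual prompts and models used in the pipeline are not specified here:

```python
import random
from typing import Callable, List, Optional, Tuple

def is_correct(path: str, gold_answer: str,
               extract_answer: Callable[[str], str]) -> bool:
    """Verify a reasoning path by comparing its final answer with the dataset label."""
    return extract_answer(path) == gold_answer

def sol2sol_pairs(paths: List[str], gold: str,
                  extract_answer: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pair complete correct solutions (y_w) with complete incorrect ones (y_l)."""
    wins = [p for p in paths if is_correct(p, gold, extract_answer)]
    losses = [p for p in paths if not is_correct(p, gold, extract_answer)]
    return [(random.choice(wins), l) for l in losses] if wins and losses else []

def error_rate(problem: str, window: List[str], gold: str,
               sample_paths: Callable[..., List[str]],
               extract_answer: Callable[[str], str], k: int = 8) -> float:
    """error(w_i): fraction of k sampled continuations from window w_i that end in a wrong answer."""
    completions = sample_paths(problem, window, n=k)
    wrong = sum(not is_correct(c, gold, extract_answer) for c in completions)
    return wrong / k

def find_unreliable_step(problem: str, steps: List[str], gold: str,
                         sample_paths, extract_answer) -> Optional[int]:
    """Return the index i of the first step with error(w_i) > error(w_{i-1});
    the transition step_{i-1} -> step_i is then infer_lose."""
    prev = error_rate(problem, steps[:1], gold, sample_paths, extract_answer)
    for i in range(1, len(steps)):
        cur = error_rate(problem, steps[:i + 1], gold, sample_paths, extract_answer)
        if cur > prev:
            return i
        prev = cur
    return None

def sample_infer_win(problem: str, steps: List[str], i: int, gold: str,
                     sample_paths, extract_answer, n: int = 16) -> Optional[str]:
    """Resample from window w_{i-1} (steps[:i]) and keep a continuation with a correct
    final answer (infer_win); the resulting pair is (x || w_{i-1}, infer_win, infer_lose)."""
    for c in sample_paths(problem, steps[:i], n=n):
        if is_correct(c, gold, extract_answer):
            return c
    return None
```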

Experiments were conducted on Qwen2-7B-Instruct and Llama3-8B-Instruct models, evaluated on the GSM8K and MATH datasets. Training used 30,000 preference data pairs for 8 epochs with a global batch size of 128, a learning rate of 5e-7, and $\beta=0.4$.
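
Collected as a plain configuration dict for reference (the target reward margin $\gamma$ and the mix of the three granularities within the 30,000 pairs are not reported in this summary):

```python
mdpo_training_config = {
    "base_models": ["Qwen2-7B-Instruct", "Llama3-8B-Instruct"],
    "preference_pairs": 30_000,   # total across Sol2Sol / Infer2Infer / Step2Step; mix not reported
    "epochs": 8,
    "global_batch_size": 128,
    "learning_rate": 5e-7,
    "beta": 0.4,                  # reward scaling in the MDPO loss
    # "gamma": target reward margin, not reported in this summary
}
```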

Key Results:

  • Main Performance:
    • Qwen2-7B-Instruct + MDPO: +1.7% on GSM8K, +2.3% on MATH.
    • Llama3-8B-Instruct + MDPO: +0.9% on GSM8K, +1.2% on MATH.
  • Comparison with other methods (on Qwen2-7B-Instruct):
    • MDPO (GSM8K: 83.4%, MATH: 56.5%) outperformed DPO (GSM8K: 81.9%, MATH: 54.6%), SimPO (GSM8K: 82.1%, MATH: 54.9%), and Step-DPO (GSM8K: 82.1%, MATH: 55.1%).
    • The improvement over Step-DPO was notable on MATH (1.4% absolute), attributed to MDPO's additional focus on computational capabilities via Step2Step.
  • Ablation Study (on Qwen2-7B-Instruct, GSM8K):
    • Base: 81.7%
    • + Sol2Sol: 82.5%
    • + Sol2Sol + Infer2Infer: 83.2% (Infer2Infer contributed most to reasoning improvement)
    • + Sol2Sol + Infer2Infer + Step2Step (Full MDPO): 83.4%
  • Computational Ability (Step2Step only on Qwen2-7B-Instruct):
    • On GSM-HARD: +3.4% (45.5% vs 42.1% base)
    • On MATH: +1.7% (55.9% vs 54.2% base)
    • This demonstrated Step2Step's effectiveness in enhancing computational skills, outperforming DPO and Step-DPO on these complex datasets.
  • Training Objective Alignment: MDPO significantly increased the proportion of instances where the model assigns a higher probability to the preferred answer ($y_w$) than to the rejected answer ($y_l$), unlike DPO and Step-DPO. This is attributed to aligning the reward function with generation metrics and unifying fine-tuning with the downstream task. (See Figure 2 in the paper for the Win Rate comparison; a minimal sketch of such a check follows this list.)
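
A minimal sketch of such a check, assuming summed token log-probabilities and token counts for each pair have been precomputed; whether Figure 2 uses length-normalized or raw sequence log-likelihoods is not stated in this summary, so the normalized variant (MDPO's reward) is shown:

```python
import torch

def preference_win_rate(sum_logp_w: torch.Tensor, len_w: torch.Tensor,
                        sum_logp_l: torch.Tensor, len_l: torch.Tensor) -> float:
    """Fraction of preference pairs where the model's length-normalized
    log-likelihood of y_w exceeds that of y_l."""
    return ((sum_logp_w / len_w) > (sum_logp_l / len_l)).float().mean().item()
```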

Practical Implementation Considerations:

  • Data Construction: The automated data pipeline is a significant practical advantage, reducing reliance on manual annotation. However, it involves multiple LLM generation and verification steps (including calls to GPT-4 for Step2Step error correction), which can be computationally intensive and may depend on the quality of the LLMs used in the pipeline.
  • Computational Resources: While experiments were on 7B/8B models, the training (8 epochs, batch size 128) still requires considerable GPU resources. The authors suggest greater improvements may be seen on larger models.
  • Hyperparameter Tuning: The $\beta$ (reward scaling) and $\gamma$ (reward margin) parameters in the loss function might require tuning for optimal performance on different models or datasets. The paper used $\beta=0.4$.
  • Granularity Trade-offs: While all three granularities contribute, Infer2Infer seems most crucial for general reasoning. For tasks heavy on computation, Step2Step becomes more important. The mix of data from these granularities in the 30,000 pairs was not specified but could be a factor to optimize.
  • Base Model Choice: The method uses instruct-tuned models as a starting point, which is standard in RLHF pipelines. The quality of this initial SFT model can impact MDPO's effectiveness.

In conclusion, MDPO offers a promising approach to improve mathematical reasoning in LLMs by providing more detailed, multi-level supervision signals and aligning training objectives with generation metrics. Its automated data construction pipeline makes it more feasible to implement. The results indicate significant gains over existing DPO variants, particularly in complex reasoning and computation.

Authors (1)
  1. Yunze Lin