Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math (2504.21233v1)

Published 30 Apr 2025 in cs.CL

Abstract: Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in LLMs by training them to explicitly generate intermediate reasoning steps. While LLMs readily benefit from such techniques, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. Recent work by Deepseek-R1 demonstrates that distillation from LLM-generated synthetic data can substantially improve the reasoning ability of SLM. However, the detailed modeling recipe is not disclosed. In this work, we present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward. We apply our method on Phi-4-Mini, a compact 3.8B-parameter model. The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500. Our results validate that a carefully designed training recipe, with large-scale high-quality CoT data, is effective to unlock strong reasoning capabilities even in resource-constrained small models.

Summary

  • The paper demonstrates a four-stage training paradigm that enables a 3.8B model to achieve Pass@1 scores of 57.5, 94.6, and 52.0 on AIME, MATH-500, and GPQA benchmarks.
  • It employs sequential distillation, supervised fine-tuning, preference learning, and targeted reinforcement learning to overcome reasoning limitations in small models.
  • Ablation studies confirm the impact of each training stage, offering a practical blueprint for enhancing math reasoning in resource-constrained language models.

This paper, "Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning LLMs in Math" (2504.21233), presents a systematic multi-stage training recipe designed to significantly enhance the mathematical reasoning capabilities of Small LLMs (SLMs). The authors demonstrate that Phi-4-Mini-Reasoning, a compact 3.8-billion-parameter model trained using this recipe, outperforms much larger reasoning models on various math benchmarks.

The core challenge addressed is improving reasoning in SLMs, which typically have limited capacity compared to LLMs. While Chain-of-Thought (CoT) is effective for LLMs, directly applying it or simple distillation techniques to SLMs has yielded inconsistent results, sometimes even degrading performance on reasoning tasks (as shown with LIMO and S1K on Phi-4-Mini in Table 1). The paper argues that SLMs require a more carefully designed training strategy and high-quality data.

The proposed training recipe consists of four sequential stages applied to a pre-trained SLM:

  1. Distillation as Mid-Training: The initial stage trains the base model on a large-scale corpus of synthetic CoT data. This data, derived from diverse public and in-house sources, consists of math questions with CoT-style answers generated by a powerful LLM (DeepSeek-R1 671B) and filtered for correctness via rejection sampling. The model is trained with a standard causal language modeling objective in packing mode to maximize efficiency (a minimal packing sketch follows the list), allowing it to absorb foundational CoT reasoning patterns. This stage uses approximately 1.6 million samples totaling around 10 million rollouts.
  2. Distillation as Supervised Fine-tuning (SFT): Following mid-training, a smaller, high-quality subset of the curated CoT data is used for fine-tuning. This subset is selected to be compact yet representative, focusing on diverse math domains and difficulty levels up to graduate level. Fine-tuning is performed in a non-packing mode, teaching the model to generate coherent reasoning chains and decide when to stop. This stage further refines the model's generalization capabilities.
  3. Rollout Preference Learning: This stage leverages the rollouts that were rejected during the distillation phases because they contained incorrect answers. These incorrect rollouts, particularly for math problems at the 'high-school' level or above, are paired with their corresponding correct rollouts to create a preference dataset. The model is then fine-tuned using Direct Preference Optimization (DPO) (2305.18290), exploiting the contrast between correct and incorrect reasoning paths to improve the model's alignment and ability to produce correct outputs, and repurposing data that would otherwise be discarded. The DPO objective is $J_{\mathrm{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big]$, where $y_w$ is the preferred (correct) rollout and $y_l$ is the dis-preferred (incorrect) rollout (a minimal PyTorch sketch of this loss follows the list).
  4. Reinforcement Learning (RL) with Verifiable Reward: The final stage applies RL to the preference-trained model to further enhance its reasoning through online learning. The reward is verifiable: +1 for a correct final answer and -1 otherwise, as determined by a math-verification tool and GPT-4o-mini. The authors report issues with standard RL methods such as GRPO (2402.03300) and DAPO [yu2503dapo] in their setting, particularly with small models, stemming from high variance in response lengths, vanishing gradients when all sampled responses receive the same reward, and the exploration-exploitation tradeoff. They introduce three techniques to address these issues (the verifiable reward and temperature schedule are sketched after the list):
    • Prompt Optimization: Selecting prompts whose model-generated responses tend to have relatively uniform lengths to mitigate instability.
    • Reward Rebalancing through Oversampling and Filtering: For difficult prompts, oversampling responses and then balancing positive and negative examples to ensure sufficient non-zero advantage signals. Filtering out overly easy prompts.
    • Temperature Annealing: Decaying the sampling temperature linearly from 1.0 to 0.6 over the first half of RL training steps to balance early exploration with later exploitation.
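
To make the "packing mode" of stage 1 concrete, here is a minimal sketch of sequence packing for causal language modeling. The block size, EOS separator, and dropping of the trailing remainder are assumptions for illustration; the paper does not specify these details, and real implementations typically also mask cross-sample attention.

```python
from itertools import chain

def pack_sequences(tokenized_samples, block_size, eos_id):
    """Concatenate tokenized CoT samples (lists of token ids) into fixed-length
    blocks for causal language modeling ("packing mode"). Samples are separated
    by an EOS token; the trailing remainder that does not fill a block is dropped."""
    stream = list(chain.from_iterable(ids + [eos_id] for ids in tokenized_samples))
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```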
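
The DPO objective of stage 3 can be written compactly in PyTorch. The sketch below is the standard DPO loss, not the authors' exact implementation; it assumes the per-rollout log-probabilities under the policy and the frozen reference model have already been summed over tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss over a batch of (prompt, correct rollout, incorrect rollout)
    triples. Each argument is a 1-D tensor of summed token log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    # -log sigma(beta * (chosen - rejected)), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```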
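
For stage 4, the two components that follow most directly from the description above are the binary verifiable reward and the temperature schedule. The helpers below are a hedged sketch: `verify_fn` is a placeholder for the math-verification tool / GPT-4o-mini equivalence check, and the exact schedule boundaries are inferred from the text rather than taken from released code.

```python
def verifiable_reward(response: str, reference_answer: str, verify_fn) -> float:
    """+1 if the extracted final answer is judged correct, -1 otherwise.
    verify_fn stands in for the paper's math-verifier / GPT-4o-mini check."""
    return 1.0 if verify_fn(response, reference_answer) else -1.0

def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.6) -> float:
    """Linearly decay the sampling temperature from t_start to t_end over the
    first half of RL training, then hold it at t_end."""
    half = max(total_steps // 2, 1)
    if step >= half:
        return t_end
    return t_start + (t_end - t_start) * (step / half)
```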

The synthetic CoT data is generated by sampling responses from DeepSeek-R1 (671B) for a diverse set of math problems aggregated from sources like Bespoke [bespoke_stratos], Openthoughts [openthoughts], OpenR1-Math [openr1], and other datasets (AquaRAT (1707.01417), Ape210K (2009.11506), MetaMathQA (2309.12284), MathInstruct (2309.05653), TAL-SCQ5K [TALSCQ5K]). Rollouts are verified using a math tool and GPT-4o-mini.
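
A minimal sketch of this rejection-sampling style curation is shown below. `teacher_generate`, `verify_fn`, and the (question, reference answer) tuples are hypothetical stand-ins for the teacher LLM API and the verification tooling; the split into accepted and rejected rollouts mirrors how rejected rollouts are later reused for preference pairs.

```python
def collect_rollouts(questions, teacher_generate, verify_fn, n_samples=8):
    """Sample several CoT rollouts per question from the teacher model and split
    them by whether the final answer passes verification. Accepted rollouts feed
    distillation; rejected ones can be paired with accepted ones for DPO."""
    accepted, rejected = [], []
    for question, reference_answer in questions:
        for rollout in teacher_generate(question, n=n_samples):
            bucket = accepted if verify_fn(rollout, reference_answer) else rejected
            bucket.append((question, rollout))
    return accepted, rejected
```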

Evaluated on AIME 2024 [aime], MATH-500 [math-500], and GPQA Diamond [gpqa], Phi-4-Mini-Reasoning achieves Pass@1 scores of 57.5, 94.6, and 52.0 respectively. As shown in Table 2, this performance surpasses open-source models significantly larger in size, such as DeepSeek-R1-Distill-Qwen-7B (53.3 AIME, 91.4 MATH-500, 49.5 GPQA Diamond) and DeepSeek-R1-Distill-Llama-8B (43.3 AIME, 86.9 MATH-500, 47.3 GPQA Diamond). The results demonstrate that this multi-stage approach effectively unlocks strong reasoning capabilities in a resource-constrained model.

Ablation studies confirm the contribution of each stage. Pass@k analysis (Figure 2a) shows that distillation dramatically improves the model's potential to solve problems within multiple attempts, and subsequent RL training further enhances this. Comparison of RL methods (Figure 2b) highlights the stability and effectiveness of their tailored RL recipe compared to DAPO, which showed performance degradation in their setup.
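
For reference, Pass@k curves like those in Figure 2a are conventionally computed with the unbiased estimator of Chen et al. (2021); the paper does not spell out its estimator, so the snippet below shows the standard formula rather than a confirmed implementation detail.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct.
    Returns the probability that at least one of k random samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# The benchmark-level Pass@k averages pass_at_k over all problems.
```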

The paper concludes that a carefully designed, multi-stage training paradigm integrating large-scale distillation, preference learning from both correct and incorrect rollouts, and stable RL is crucial for developing high-performing reasoning SLMs. This work provides a practical blueprint for maximizing the reasoning capabilities of compact models.
