Phi-4 Mini Reasoning: Compact Model Breakthrough
- The paper introduces a novel four-step training regimen that leverages Chain-of-Thought reasoning to significantly improve math problem-solving in the 3.8B-parameter Phi-4-Mini model.
- It combines mid-training on distilled long-CoT data, supervised fine-tuning on complex problems, Direct Preference Optimization, and reinforcement learning with verifiable rewards.
- Benchmark evaluations on Math-500, AIME 2024, and GPQA Diamond show that Phi-4-Mini-Reasoning outperforms substantially larger reasoning models, demonstrating that strong reasoning does not require large scale.
The paper explores how to optimize reasoning in small LLMs through a four-step training method centered on Chain-of-Thought (CoT) reasoning. It focuses on Phi-4-Mini, a compact 3.8-billion-parameter model, and shows that with strategic training such a model can outperform significantly larger models on math reasoning tasks.
1. Introduction to Phi-4-Mini-Reasoning
Phi-4-Mini-Reasoning enhances the reasoning capabilities of Phi-4-Mini, a compact LLM, using a systematically designed training regimen. The investigation reveals that well-engineered training stages can significantly strengthen the reasoning performance of small LLMs, thereby challenging the assumption that larger models inherently possess superior reasoning abilities.
2. Four-Step Training Recipe
Step 1: Mid-Training on Distilled Long-CoT Data
Objective: Equip the model with broad reasoning skills by mid-training it on distilled long CoT data derived from a powerful LLM, such as DeepSeek-R1. The focus is on solving complex problems across a diverse range of topics.
Process: Approximately 10 million reasoning samples were generated and filtered for correctness through answer verification and careful re-evaluation. This phase establishes a foundation of general reasoning capabilities in the model.
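The following is a minimal sketch of the verification-based filtering idea described above: keep only teacher rollouts whose final answer matches the reference. The sample fields, answer-extraction rule, and matching logic are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: filter distilled long-CoT rollouts by checking the final answer.
import re
from typing import Iterable


def extract_final_answer(cot: str) -> str | None:
    """Pull the last \\boxed{...} answer out of a chain-of-thought rollout."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", cot)
    return matches[-1].strip() if matches else None


def answers_match(predicted: str, reference: str) -> bool:
    """Loose string/numeric equivalence check (stand-in for a full math verifier)."""
    if predicted == reference:
        return True
    try:
        return abs(float(predicted) - float(reference)) < 1e-6
    except ValueError:
        return False


def filter_verified_rollouts(samples: Iterable[dict]) -> list[dict]:
    """Keep only rollouts whose final answer matches the reference answer."""
    kept = []
    for sample in samples:  # assumed fields: "question", "teacher_cot", "answer"
        predicted = extract_final_answer(sample["teacher_cot"])
        if predicted is not None and answers_match(predicted, sample["answer"]):
            kept.append(sample)
    return kept
```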
Step 2: Supervised Fine-Tuning with Quality Long-CoT Data
Objective: Fine-tune the model on a curated set of long CoT data, focusing on difficult problems that demand intricate reasoning chains.
Contribution: This process enhances the model's ability to generalize and effectively tackle complex reasoning tasks, strengthening the accuracy and robustness of its problem-solving approaches.
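A minimal sketch of the standard SFT objective this step relies on: next-token cross-entropy over the curated long-CoT responses, with prompt tokens masked out of the loss. The masking convention and tensor shapes are assumptions, not the paper's exact training code.

```python
# Sketch: causal-LM SFT loss with prompt masking.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # labels with this value are excluded from the loss


def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Per-example labels: copy input_ids, mask prompt tokens so the loss
    covers only the long-CoT response."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX
    return labels


def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Shifted cross-entropy: predict token t+1 from tokens up to t."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```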
Step 3: Rollout DPO with Curated Preferences
Objective: Enhance the model's decision-making by applying Direct Preference Optimization (DPO). This involves leveraging curated data sets with both preferred and dis-preferred reasoning paths.
Methodology: Apply DPO's preference-based loss, which raises the likelihood of trajectories labeled "preferred" relative to those labeled "dis-preferred," measured against a frozen reference model.
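A minimal sketch of the DPO preference loss, assuming the inputs are summed log-probabilities of the preferred and dis-preferred rollouts under the policy and the frozen reference model; `beta` is an assumed hyperparameter, not a value reported in the paper.

```python
# Sketch: Direct Preference Optimization loss over reasoning-path pairs.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(preferred rollout)
    policy_rejected_logps: torch.Tensor,  # log p_theta(dis-preferred rollout)
    ref_chosen_logps: torch.Tensor,       # same rollouts under the reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * (preferred margin - dis-preferred margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```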
Step 4: Reinforcement Learning with Verifiable Reward
Objective: Further boost reasoning capabilities through Reinforcement Learning (RL) by validating model-generated solutions.
Approach: Deploy RL strategies in which solutions are programmatically verified, rewarding correct reasoning and penalizing errors. Algorithms used include Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).
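Below is a minimal sketch of the two ingredients this step combines: a programmatic (verifiable) reward and the group-relative advantage that GRPO-style updates use in place of a learned value baseline. The answer checker and grouping scheme are simplified assumptions.

```python
# Sketch: verifiable reward plus GRPO-style group-relative advantages.
import torch


def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference, else 0.0."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within a group of rollouts for the same prompt.

    Each rollout's advantage reflects how it ranks against its siblings,
    so no separate value network is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


# Example: 4 rollouts for one prompt, two verified correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # positive for correct, negative otherwise
```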
3. Benchmark Evaluation
The model was assessed on several math reasoning benchmarks (a simple pass@1 evaluation sketch follows the list):
- Math-500: A 500-problem subset of the MATH benchmark covering diverse topics and difficulty levels.
- AIME 2024: High-level math competition problems.
- GPQA Diamond: Graduate-level, expert-written science questions.
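A minimal sketch of how pass@1 accuracy on such benchmarks could be computed. `generate_answer` and `check` are placeholders for the model call and answer verifier, and the dataset fields are assumptions, not the paper's exact evaluation harness.

```python
# Sketch: pass@1 accuracy over a math benchmark.
from typing import Callable


def evaluate_pass_at_1(
    dataset: list[dict],
    generate_answer: Callable[[str], str],
    check: Callable[[str, str], bool],
) -> float:
    """Fraction of problems where a single sampled answer matches the reference."""
    correct = 0
    for item in dataset:  # assumed fields: "question", "answer"
        prediction = generate_answer(item["question"])
        if check(prediction, item["answer"]):
            correct += 1
    return correct / max(len(dataset), 1)
```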
Performance Insights
- Phi-4-Mini-Reasoning sets a new performance standard for its size class, outperforming substantially larger reasoning models.
- It surpasses DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B on the Math-500 benchmark.
- The model demonstrates its effectiveness in handling complex reasoning tasks even with significantly fewer parameters.
4. Implications for Small LLMs
Phi-4-Mini-Reasoning exemplifies how small LLMs, despite their limited size, can match or even surpass the reasoning abilities of much larger counterparts through structured, data-centric training strategies.
- Data Quality Over Quantity: The paper emphasizes the importance of high-quality, strategic data shaping over mere scale expansion.
- Efficiency in Reasoning: Effective training can extract high-level reasoning capabilities from compact models, ensuring practical applicability even in resource-constrained environments.
5. Conclusion
The Phi-4-Mini-Reasoning project demonstrates that compact models can achieve exceptional reasoning performance with appropriate training methods. By employing structured CoT data, preference-driven learning, and verification-focused RL, the model closes the gap with larger architectures and sets a new bar for small LLMs. This research highlights the potential of strategically trained compact models to democratize access to advanced AI capabilities, enabling robust deployment across domains without extensive computing resources.