Phi-4-reasoning Technical Report (2504.21318v1)

Published 30 Apr 2025 in cs.AI and cs.CL

Abstract: We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on a carefully curated set of "teachable" prompts, selected for the right level of complexity and diversity, and reasoning demonstrations generated using o3-mini, Phi-4-reasoning generates detailed reasoning chains that effectively leverage inference-time compute. We further develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning that offers higher performance by generating longer reasoning traces. Across a wide range of reasoning tasks, both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance levels of the full DeepSeek-R1 model. Our comprehensive evaluations span benchmarks in math and scientific reasoning, coding, algorithmic problem solving, planning, and spatial understanding. Interestingly, we observe a non-trivial transfer of improvements to general-purpose benchmarks as well. In this report, we provide insights into our training data, our training methodologies, and our evaluations. We show that the benefit of careful data curation for supervised fine-tuning (SFT) extends to reasoning LLMs, and can be further amplified by reinforcement learning (RL). Finally, our evaluation points to opportunities for improving how we assess the performance and robustness of reasoning models.

Summary

  • The paper presents Phi-4-reasoning and Phi-4-reasoning-plus, advanced 14B models that generate explicit reasoning traces using supervised fine-tuning and outcome-based reinforcement learning.
  • It details a data-centric approach with synthetic seed data and rigorous filtering, leading to significant performance gains on benchmarks like AIME and Omni-MATH.
  • The report highlights key innovations such as the introduction of reasoning tokens and an extended context length, while addressing computational costs and safety trade-offs.

Here is a detailed summary of the "Phi-4-reasoning Technical Report" (2504.21318).

The paper introduces Phi-4-reasoning and Phi-4-reasoning-plus, 14-billion parameter LLMs designed for complex reasoning tasks. These models are built upon the existing Phi-4 base model, specializing it for generating detailed, step-by-step reasoning chains. The training process involves two main stages: supervised fine-tuning (SFT) for Phi-4-reasoning, and a subsequent outcome-based reinforcement learning (RL) phase for Phi-4-reasoning-plus. The models aim to leverage inference-time compute effectively by producing explicit reasoning traces, thereby improving performance on tasks requiring multi-step decomposition, internal reflection, and exploration.

The authors emphasize a data-centric approach, extending the methodology used in previous Phi models. Key to this is the curation of a "teachable" dataset.

Data Methodology

The data curation process begins with a diverse collection of prompts and problems from web sources, existing datasets, and synthetic generation, forming a "seeds database". A crucial step is filtering this database to select prompts where the base Phi-4 model shows potential for improvement, lying at the edge of its capabilities and requiring complex multi-step reasoning rather than just factual recall. LLM-based evaluation and filtering pipelines are used to identify these prompts, using metrics like agreement rate with a stronger model's responses as a proxy for difficulty.
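
As a rough illustration of the difficulty proxy described above, the following sketch filters prompts by how often sampled base-model answers agree with a stronger model's answer. The function names and threshold values are assumptions for illustration, not details taken from the report.

```python
def agreement_rate(base_model_answers, reference_answer):
    """Fraction of sampled base-model answers matching a stronger model's answer."""
    return sum(a == reference_answer for a in base_model_answers) / len(base_model_answers)

def is_teachable(base_model_answers, reference_answer, low=0.1, high=0.7):
    """Keep prompts the base model sometimes, but not reliably, gets right.

    The thresholds are illustrative; the report does not publish the exact
    cutoffs used by its LLM-based filtering pipeline.
    """
    rate = agreement_rate(base_model_answers, reference_answer)
    return low <= rate <= high

# Example: Phi-4 sampled 8 times on one prompt, compared against a stronger model's answer.
keep = is_teachable(["391", "391", "417", "391", "382", "417", "391", "400"], "391")
```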

A subset of filtered seeds is transformed into new synthetic datasets to better align with targeted reasoning skills, such as rewriting coding problems into word problems or math problems for easier verification. This synthetic data is crucial for both SFT and RL.

The SFT training data for Phi-4-reasoning uses synthetically generated responses containing detailed reasoning traces structured with <think> and </think> tokens, followed by a concise answer. The data comprises over 1.4 million prompt-response pairs (8.3 billion unique tokens) covering STEM, coding, and safety topics. Rigorous decontamination is performed against a wide range of public and internal benchmarks, including AIME-2024, MATH, GPQA, LiveCodeBench, Codeforces, and many others. AIME-2025 is noted as being contamination-free as it was released after data finalization. Safety data includes prompts augmented with detailed safety guidelines, which are removed from the prompt during training to encourage implicit learning of responsible AI behavior.
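
To make the response format concrete, here is an invented prompt-response pair in the structure described above (illustrative only, not an actual sample from the training set), with the reasoning trace wrapped in think tokens followed by a concise answer.

```python
# Illustrative sketch of one SFT record; field names are assumptions.
sft_example = {
    "prompt": "What is 17 * 23?",
    "response": (
        "<think>\n"
        "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.\n"
        "Double-check: 23 * 17 = 23 * 10 + 23 * 7 = 230 + 161 = 391.\n"
        "</think>\n"
        "The answer is 391."
    ),
}
```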

Phi-4-reasoning: Supervised Finetuning (SFT)

Phi-4-reasoning is the result of supervised fine-tuning the 14B Phi-4 base model. The model architecture includes two primary modifications from the base Phi-4:

  1. Reasoning Tokens: <think> and </think> tokens are introduced to delineate reasoning blocks.
  2. Increased Context Length: The maximum context length is increased from 16K to 32K tokens by doubling the RoPE base frequency, allowing for longer reasoning traces (see the sketch after this list).
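
The sketch below illustrates the general mechanism of stretching the usable context window by raising the RoPE base frequency; the specific base values are placeholders, since the report only states that the base frequency was doubled to move from 16K to 32K tokens.

```python
def rope_angles(position, dim, base):
    """Rotary position embedding angles for a single token position.

    Increasing `base` slows the rotation of the low-frequency dimensions,
    which is the usual lever for extending the usable context window.
    """
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

# Placeholder base values: doubling the base is the mechanism described
# in the report for moving from a 16K to a 32K context window.
angles_16k = rope_angles(position=16_000, dim=128, base=250_000)
angles_32k = rope_angles(position=32_000, dim=128, base=500_000)
```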

SFT is performed using the curated dataset for approximately 16K steps with a global batch size of 32 and a context length of 32K. The AdamW optimizer is used with a learning rate of $10^{-5}$, linear warmup, and weight decay.
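
For reference, the stated hyperparameters can be summarized in a generic configuration sketch; values not given in the report (e.g., warmup steps and the weight decay coefficient) are omitted rather than guessed.

```python
# Hedged sketch of the SFT configuration as described above.
sft_config = {
    "base_model": "Phi-4 (14B)",
    "max_steps": 16_000,          # approximately 16K optimizer steps
    "global_batch_size": 32,
    "context_length": 32_768,     # 32K tokens
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "lr_schedule": "linear warmup",
    "weight_decay": None,          # used, but the value is not stated in the summary
}
```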

The SFT stage significantly improves reasoning performance across diverse benchmarks and generalizes to tasks not directly represented in the training data. It also maintains or improves performance on general-purpose benchmarks. Training dynamics show that while use of the think tokens appears early, the quality and efficacy of reasoning improve throughout SFT. Surprisingly, response length decreases slightly during SFT, suggesting learned token efficiency.

The authors conducted extensive SFT experiments, divided into exploration and scaling stages.

  • Exploration: Focused on hyperparameter tuning (learning rate of $10^{-5}$ found optimal), the role of synthetic seed data (adding synthetic math data yielded 3-10% gains on AIME), and the role of the system message (a specific reasoning-focused system message increased robustness). They found an "additive property" in data mixture optimization, where mixtures could be optimized for domains (math, code) independently and then combined.
  • Scaling: The optimized recipe and data mixture across domains (math, code, logical puzzles, safety) were scaled up. Using o3-mini with "high reasoning effort" as a teacher produced stronger performance and longer responses, leading to the context length increase to 32K. Phi-4 was chosen as the base model over a mid-trained checkpoint due to its superior safety and alignment properties.

Phi-4-reasoning-plus: Reinforcement Learning (RL)

Phi-4-reasoning-plus is an enhancement of Phi-4-reasoning achieved through outcome-based reinforcement learning. The method used is Group Relative Policy Optimization (GRPO) [shao2024deepseekmath, guo2025deepseek], tailored to their setup. The RL training focuses exclusively on mathematical reasoning using a seed dataset of 72,401 math problems, sampling 64 per iteration.
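
As a reminder of the core GRPO mechanic referenced here, the sketch below computes group-relative advantages by normalizing each sampled response's reward against its group's mean and standard deviation. This is the standard GRPO formulation rather than code from the report.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt's group of G sampled responses.

    Each reward is normalized against the group mean and standard deviation,
    so no separate value (critic) network is required.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of G = 8 responses sampled for one math problem.
advantages = group_relative_advantages([1.0, -0.7, 0.9, -1.0, 0.8, -0.6, 1.0, -0.9])
```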

Reward Function: A rule-based reward model is employed to incentivize correctness, penalize repetition and excessive length, and encourage proper formatting. The core component is a length-aware accuracy score $R_{\text{acc\_scaled}}$. A raw binary accuracy $R_{\text{acc\_raw}} \in \{0,1\}$ is determined by verifying the final answer (e.g., in a \boxed{} tag). The length-aware reward uses cosine scaling over response length to encourage concise correct answers (rewards between 0.5 and 1.0, controlled by $L_{\text{pos\_control}}$) and longer incorrect answers (rewards between -1.0 and -0.5, with the penalty shrinking toward $L_{\text{neg\_control}}$). Penalties are applied for incompleteness (-0.5) or an invalid thinking-block format (-1.0). A repetition penalty $R_{\text{rep}}$ based on 5-gram frequency is also included. The final reward is a weighted sum: $R_{\text{final}} = w_{\text{acc}} R_{\text{acc\_scaled}} + w_{\text{rep}} R_{\text{rep}}$.
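
The following is a minimal sketch of the reward described above, assuming the cosine interpolation acts directly on response length and that the length controls, weights, and penalty handling take placeholder values; the report's exact constants and edge-case behavior may differ.

```python
import math
from collections import Counter

def cosine_interp(lo, hi, frac):
    """Cosine interpolation from `lo` (frac = 0) to `hi` (frac = 1)."""
    frac = min(max(frac, 0.0), 1.0)
    return lo + (hi - lo) * 0.5 * (1 - math.cos(math.pi * frac))

def repetition_penalty(tokens, n=5):
    """Negative score proportional to the fraction of repeated n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    repeated = sum(c - 1 for c in Counter(ngrams).values())
    return -repeated / len(ngrams)

def length_aware_reward(correct, length, complete, valid_format, tokens,
                        L_pos_control=20_000, L_neg_control=20_000,
                        w_acc=1.0, w_rep=0.3):
    """Sketch of R_final = w_acc * R_acc_scaled + w_rep * R_rep (placeholder constants)."""
    if not valid_format:   # malformed thinking block
        return -1.0
    if not complete:       # truncated or unfinished response
        return -0.5
    if correct:
        # Concise correct answers score closer to 1.0, decaying toward 0.5
        # as the response length approaches L_pos_control.
        r_acc = cosine_interp(1.0, 0.5, length / L_pos_control)
    else:
        # Longer incorrect answers are penalized less (closer to -0.5),
        # nudging the model to keep thinking when it would otherwise be wrong.
        r_acc = cosine_interp(-1.0, -0.5, length / L_neg_control)
    return w_acc * r_acc + w_rep * repetition_penalty(tokens, n=5)

# Example: a correct, well-formed ~6K-token response with no repeated 5-grams.
r = length_aware_reward(correct=True, length=6_000, complete=True,
                        valid_format=True, tokens=list("abcdefghijklmnop"))
```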

Training Details: GRPO training uses a global batch size of 64, the Adam optimizer with a learning rate of $5\times 10^{-8}$, GRPO group size $G=8$, KL regularization $\beta=0.001$, and entropy coefficient $\gamma=0.001$. The Phi-4-reasoning-plus checkpoint is selected based on the best AIME 2024 score after 90 steps.

Observations during RL training show that GRPO boosts AIME performance significantly over the SFT baseline. Performance correlates strongly with response length, and the reward design encourages incorrect answers to become longer, effectively spending more compute thinking. Length clipping at 31K tokens limits further potential gains. Entropy remains healthy, suggesting continued exploration. Extending context support to 64K or more is suggested for future work.

Evaluation

The models are evaluated on reasoning-specific and general-purpose benchmarks.

Reasoning Benchmarks: Evaluated on AIME (1983-2025), HMMT Feb 2025, Omni-MATH, GPQA Diamond, BA-Calendar, TSP-Opt, 3SAT-Search, Maze, and SpatialMap. Baselines include DeepSeek-R1/Distill-70B, o1/o3-mini, Claude 3.7 Sonnet, and Gemini 2 Flash Thinking. Evaluation uses temperature 0.8 for Phi models, 0.6 for DeepSeek, and typically 1.0 or default for others. Max tokens are set as high as possible, up to 32K (or 64K on some benchmarks for Phi models, without specific training for that length). CoT prompt templates are used universally, except for o1 due to refusals.

  • Key Findings:
    • Major improvements over base Phi-4 on all reasoning tasks (50%+ on AIME/Omni-MATH, 25%+ on LiveCodeBench, 30-60% on TSP, 3SAT, BA-Calendar).
    • Phi-4-reasoning and Phi-4-reasoning-plus are competitive with or outperform much larger models like DeepSeek-R1/Distill-70B and o1/o3-mini on math reasoning and Omni-MATH. They outperform Claude 3.7 Sonnet and Gemini 2 Flash Thinking on most tasks except GPQA and Calendar Planning.
    • Phi-4-reasoning-plus shows important gains over Phi-4-reasoning on math (15% on AIME 2025) and some generalization tasks (5% on Omni-MATH, TSP), likely due to the math-focused RL. Gains are less pronounced on coding, planning, and spatial tasks.
    • Evaluation on AIME 2025 (30 problems) shows high accuracy variance across models over multiple runs (50 runs were used for robustness), highlighting the unreliability of single-run comparisons. The accuracy distribution of Phi-4-reasoning-plus largely overlaps with that of o3-mini.
    • Analysis reveals opportunities for improvement: smaller gains on biology and chemistry (GPQA) compared to math and physics, and lower performance on discrete math and geometry (Omni-MATH). For all models, performance on AIME drops across years, especially the most recent ones.
  • Performance vs. Token Usage: Phi-4-reasoning-plus uses ~1.5x more tokens than Phi-4-reasoning on average. Both use a similar number of or fewer tokens than o3-mini on average, depending on the benchmark. Token usage variability per instance is comparable across models.
  • Average vs. Best-of-N: Comparing average pass@1 accuracy to best-of-5 reveals a significant gap across models and benchmarks, indicating potential for further improvement via better verification methods or decoding strategies (see the sketch after this list).
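
As a hedged illustration of the last point, the sketch below computes average pass@1 and a best-of-5 score from a boolean matrix of per-run correctness; the report's exact aggregation over its 50 runs may differ.

```python
import numpy as np

def avg_pass_at_1(correct):
    """Mean accuracy across independent runs; `correct` is (num_runs, num_problems) bool."""
    return correct.mean(axis=1).mean()

def best_of_n(correct, n=5, seed=0):
    """Fraction of problems solved by at least one of `n` randomly chosen runs."""
    rng = np.random.default_rng(seed)
    rows = rng.choice(correct.shape[0], size=n, replace=False)
    return correct[rows].any(axis=0).mean()

# Toy data: 10 runs over 30 problems (AIME-sized), roughly 40% per-run accuracy.
runs = np.random.default_rng(1).random((10, 30)) < 0.4
gap = best_of_n(runs, n=5) - avg_pass_at_1(runs)
```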

General-purpose Benchmarks: Evaluated on MMLU, MGSM, MMLU-Pro, HumanEvalPlus, ArenaHard, IFEval, FlenQA, Toxigen, Kitab, and PhiBench. Phi-4-reasoning models were evaluated at temperature 0.8, while Phi-4 was evaluated at temperature 0.0 for general tasks.

  • Key Findings:
    • Reasoning models show non-trivial and often large improvements over Phi-4 on general tasks.
    • Significant gains on FlenQA (long context QA, robustness to padding), IFEval (instruction following), ArenaHard (chat interaction), HumanEvalPlus (coding), and PhiBench.
    • Modest improvements on MMLU-Pro.
    • Kitab (information retrieval) shows improved precision (no context) and precision/recall (with context), approaching o3-mini when RAG is used. Factuality via parametric knowledge alone remains challenging.

Safety Evaluation: Assessed using Automated RAI Measurement Framework and Toxigen.

  • Key Findings:
    • Minor regression in safety compared to base Phi-4 according to Automated RAI Framework.
    • Toxigen shows a trade-off between detecting toxic and neutral content. Phi-4-reasoning achieves a better balance than Phi-4 and Phi-4-reasoning-plus, which is desirable for moderation. Group-based fairness shows improvement.
    • Evaluating safety in reasoning models with long traces is challenging for current LLM judges, potentially leading to false positives or missed issues. Further research is needed in this area.

Limitations

Phi-4-reasoning models inherit limitations from Phi-4, including primary support for English, potential for stereotypes/bias, and generating plausible but inaccurate information. Coding performance is best for Python.

Additional limitations specific to the reasoning models include:

  • Increased computational cost and slower response times due to longer reasoning traces.
  • Potential for responses to contradict their own reasoning chains.
  • Limited context length of 32K (though tested up to 64K without specific training), which can hinder performance on complex tasks and lead to truncation.
  • Training data focused on specific domains (STEM, coding, safety for SFT; math for RL), potentially limiting generalization to vastly different contexts.

The report concludes by highlighting the benefits of careful data curation for SFT and the potential of combining SFT and RL for developing efficient, high-performing reasoning models. It also underscores the need for more rigorous evaluation practices beyond single-score reporting, especially for small datasets and in the presence of model non-determinism.
