
Phi-4 Reasoning Model

Updated 14 August 2025
  • Phi-4 Reasoning Model is a family of 14B-parameter language models that generate explicit, step-by-step reasoning, using specialized tokens to separate thought chains from final answers.
  • It employs curated datasets, supervised fine-tuning, and chain-of-thought exemplars to excel in math, coding, and scientific reasoning tasks with robust benchmark performance.
  • The reinforcement learning phase further refines reasoning trace quality, improving accuracy on complex tasks like AIME-style math and algorithmic problem-solving.

Phi-4 Reasoning Model refers to a family of LLMs, culminating in the 14-billion-parameter Phi-4-reasoning and its reinforcement learning–enhanced variant Phi-4-reasoning-plus, expressly designed for explicit, multi-step reasoning across complex domains, with strong performance in math, coding, scientific reasoning, algorithmic problem solving, planning, and spatial understanding. Central to this line is a training methodology built on curated, "teachable" prompts with chain-of-thought exemplars generated by strong teacher models, meticulous supervised fine-tuning, and targeted outcome-based reinforcement learning, yielding models that not only produce detailed reasoning traces via specialized tokens but also carry their improvements over to general-purpose benchmarks. Both the model architecture and the training protocol are tuned to exploit additional inference-time computation by leveraging long-context capabilities and explicit internal reasoning chains.

1. Core Architecture and Reasoning Trace Support

Phi-4-reasoning builds directly upon the base Phi-4 model, which is a decoder-only transformer with 14 billion parameters. The reasoning variants introduce focused architectural adjustments:

  • Chain-of-Thought Delimitation: Two tokens are repurposed as <think> and </think>, demarcating the boundaries of the generated "thinking block." This tells the model explicitly where to conduct detailed step-by-step reasoning, separated from the concise final answer.

  • Extended Context Length: The maximum token sequence is expanded from 16K to 32K by doubling the RoPE (Rotary Position Embedding) base frequency. Training on longer context windows preserves detailed, multi-step reasoning traces without context truncation or loss of coherence across long reasoning chains.

These architectural features allow highly explicit logical traces to be generated and parsed at inference, which is crucial for tasks requiring deep multi-step computation or demonstration of intermediate decision steps.

2. Training Data Curation and Supervised Fine-Tuning (SFT)

The SFT phase rests on a curated dataset of over 1.4 million prompt–response pairs spanning math, coding, algorithmic, and safety-oriented queries. Training corpus characteristics:

  • Synthetic Reasoning Demonstrations: Responses are generated by strong teacher models (notably o3-mini), ensuring reasoning traces are both correct and pedagogically structured for the target model to imitate.

  • Prompt Selection: Only "teachable" prompts are included, i.e., problems at a complexity level challenging for the base model but tractable with appropriate reasoning.

  • Explicit Reasoning Block Structure: The supervised data are formatted so that the first part, between <think> and </think>, contains the full reasoning chain, and the second part provides a crisp answer, cleanly separating intermediate computation from the final conclusion.

Supervised fine-tuning uses hyperparameters chosen for reasoning-trace depth and model size (learning rates within [1e-6, 2e-5], batch size 32, 16K training steps, 32K-token context).

3. Outcome-Based Reinforcement Learning

Phi-4-reasoning-plus advances performance further through a brief but highly targeted outcome-based RL phase:

  • Group Relative Policy Optimization (GRPO) Variant: RL focuses exclusively on mathematical tasks inadequately solved during SFT (~6K difficult math problems).

  • Rule-Based Reward Function: The reward is strictly rule-based, combining length-aware accuracy (promoting conciseness on correct responses and longer "thinking" on incorrect ones) with a repetition penalty that discourages redundancy. For correct answers, the reward follows a cosine function of response length (relative to a control length), so excessive verbosity is discouraged, while insufficient "thinking" is penalized when the answer is wrong; a schematic sketch of this shaping appears at the end of this section.

  • Training Objective: RL maximizes, over groups of sampled responses, the clipped ratio of new-to-old policy probabilities multiplied by a group-relative advantage, complemented by KL and entropy regularization terms.

The result is longer, more detailed reasoning traces when they are needed, with verbosity kept in check, and an overall accuracy improvement on benchmarked tasks (notably ~10–15% on difficult AIME-style math questions).
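The reward shaping described above can be pictured with a short sketch. The exact constants, control length, and weighting used for Phi-4-reasoning-plus are not reproduced here; the function below (with `max_len`, the 0.25/0.75 coefficients, and the repetition term all chosen for illustration) only mirrors the qualitative shape: correct answers earn more when concise, incorrect answers are penalized less when they "think" longer, and repeated n-grams reduce the reward.

```python
import math
from collections import Counter

def length_aware_reward(is_correct: bool, length: int, ngrams=None,
                        max_len: int = 31744, rep_weight: float = 0.5) -> float:
    """Illustrative GRPO-style outcome reward: a cosine of response length with
    opposite slopes for correct vs. incorrect answers, plus a repetition penalty.
    All constants here are assumptions, not the values used in the paper."""
    frac = min(length / max_len, 1.0)          # share of the length budget used
    cos_term = math.cos(frac * math.pi)        # +1 for terse output, -1 at the budget
    if is_correct:
        reward = 0.75 + 0.25 * cos_term        # 1.0 when short, 0.5 when very long
    else:
        reward = -0.75 - 0.25 * cos_term       # -1.0 when short, -0.5 when long
    if ngrams:                                 # penalize duplicated n-grams, if supplied
        dup_frac = 1.0 - len(Counter(ngrams)) / len(ngrams)
        reward -= rep_weight * dup_frac
    return reward
```

In GRPO, rewards of this kind are then standardized within each group of sampled responses to form the group-relative advantage that enters the clipped objective.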
4. Benchmark Performance and Reasoning Demonstrations

Evaluation spans a diverse set of benchmarks and task types:

  • Math and Science: High accuracy on AIME (American Invitational Mathematics Examination), Omni-Math, and GPQA Diamond.

  • Algorithmic and Coding: Notable results on 3SAT, the Traveling Salesman Problem, and LiveCodeBench, indicating step-by-step code and algorithmic reasoning.

  • Planning/Spatial Reasoning: Strong performance in calendar scheduling and maze-solving tasks.

  • General Transfer: Non-trivial transfer of improvements to tasks outside the explicit RL/SFT domains, such as IFEval (instruction following), FlenQA (long-context question answering), and ArenaHard (complex dialogue and chat tasks), with accuracy improvements of 10–20 percentage points over the base Phi-4.

  • Chain-of-Thought Examples: Reasoning-block output includes stepwise breakdowns in tasks such as counting occurrences in a string, solving riddles, probability questions with answer transformations (e.g., Greek letters in reverse), planning under constraints, and decomposing algorithmic or spatial operations.

5. Transfer, Generalization, and Robustness

A salient property revealed by comprehensive evaluation is the "non-trivial transfer" of improvements from the reasoning-focused (often STEM) data distribution to broader, general-purpose domains:

  • Instruction Following and Long-Context Reasoning: Gains on MMLUPro, HumanEvalPlus, and IFEval indicate that explicit reasoning-chain training also promotes better handling of complex, unconstrained user instructions and highly compositional tasks.

  • Safety and Responsible AI Metrics: Improved detection and moderation of subtle toxicity, attributed to more consistent internal reasoning chains.

  • Variance in Evaluation: The paper identifies significant run-to-run variance on small benchmarks (e.g., AIME 2025's 30-problem test set), highlighting the need for robust evaluation reporting (standard deviations, multiple independent runs) for fair model comparisons; a minimal reporting sketch follows below.
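On a 30-problem set such as AIME, a single additional correct answer moves accuracy by roughly 3.3 points, so single-run scores are noisy. Below is a minimal sketch of the multi-run reporting suggested above, assuming a hypothetical `evaluate_run` callable (not part of any released evaluation harness) that returns one run's accuracy.

```python
from statistics import mean, stdev

def report_accuracy(evaluate_run, n_runs: int = 5) -> str:
    """Run a small benchmark several times and report mean ± std instead of a
    single score. `evaluate_run` is a hypothetical callable returning one run's
    accuracy in [0, 1]."""
    scores = [evaluate_run(seed=i) for i in range(n_runs)]
    return f"accuracy = {mean(scores):.3f} ± {stdev(scores):.3f} over {n_runs} independent runs"

# Example with a stand-in evaluator that simulates a 30-problem benchmark.
if __name__ == "__main__":
    import random
    def fake_eval(seed: int) -> float:
        random.seed(seed)
        return sum(random.random() < 0.75 for _ in range(30)) / 30
    print(report_accuracy(fake_eval))
```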
6. Methodological Insights and Opportunities for Further Improvement

Several methodological lessons and forward-looking recommendations are highlighted:

  • Necessity of Data Curation: Performance gains depend crucially on filtering for "teachable" prompts; indiscriminate SFT on arbitrary data does not yield similar improvement. The use of teacher models for synthetic data creation, along with explicit chain-of-thought formatting, is essential.

  • RL Task Coverage: Reinforcement learning applied only to math boosts math accuracy and reasoning-trace length/quality, but gains in coding, planning, or spatial tasks are less pronounced unless those domains are also emphasized in RL training. Strategic extension of RL to more diverse task types could broaden improvements further.

  • Inference Cost and Chain Length: The model is trained to use more "inference-time compute" through longer chains, especially when initial answers are incorrect. Balancing chain length (for difficult cases) against token efficiency (avoiding verbosity) remains an ongoing point of refinement.

  • Evaluation with LLM Judges: Current LLM judges may not fully capture the correctness or sufficiency of long reasoning traces; improved chain-of-thought evaluation, perhaps incorporating compressed or "summary" traces, could aid both safety and interpretability.

7. Summary Table: Comparison of Phi-4-reasoning Variants

| Model | Base Parameters | Max Sequence | Training Enhancements | Reasoning Chain Markers | SFT Domains | RL Coverage |
|---|---|---|---|---|---|---|
| Phi-4-reasoning | 14B | 32K tokens | Curated SFT w/ chain-of-thought | <think> ... </think> | Math, coding, safety | — |
| Phi-4-reasoning-plus | 14B | 32K tokens | As above + outcome-based RL | <think> ... </think> | Math, coding, safety | Math tasks (GRPO) |
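The reasoning-chain markers in the table are ordinary text in the model's output, so downstream code can recover the trace and the final answer with plain string handling. Below is a minimal sketch using the <think>/</think> tags described above; the fallback for missing or unclosed tags is an assumption made here, not behavior specified by the paper.

```python
def split_reasoning(completion: str,
                    open_tag: str = "<think>", close_tag: str = "</think>"):
    """Split a completion into (reasoning_trace, final_answer).
    If the tags are missing or unclosed, treat the whole text as the answer
    (a fallback chosen here for illustration)."""
    start = completion.find(open_tag)
    end = completion.find(close_tag)
    if start == -1 or end == -1 or end < start:
        return "", completion.strip()
    reasoning = completion[start + len(open_tag):end].strip()
    answer = completion[end + len(close_tag):].strip()
    return reasoning, answer

trace, answer = split_reasoning(
    "<think>2 + 2 = 4, so the sum is 4.</think> The answer is 4."
)
```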

8. Concluding Observations

Phi-4-reasoning demonstrates that careful data curation, explicit chain-of-thought structuring, and targeted application of outcome-based reinforcement learning can endow LLMs, even at moderate scale (14B parameters), with strong multi-step reasoning abilities. It outperforms much larger open-weight models (including DeepSeek-R1-Distill-Llama-70B), approaches or matches the performance of the largest proprietary systems on core STEM tasks, and shows meaningful transfer to general-purpose reasoning. However, the findings also suggest that further generalization across task types, enhanced RL methodologies, robust evaluation protocols, and fine control over chain length and verbosity remain active areas for continued advancement.