
Phi-4 Technical Report (2412.08905v1)

Published 12 Dec 2024 in cs.CL and cs.AI

Abstract: We present phi-4, a 14-billion parameter LLM developed with a training recipe that is centrally focused on data quality. Unlike most LLMs, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.

This technical report introduces phi-4, a 14-billion parameter LLM that leverages high-quality synthetic data and advanced post-training techniques to achieve strong performance, particularly in reasoning-focused tasks. Phi-4 distinguishes itself through its strategic incorporation of synthetic data throughout the training process, optimized training curriculum, and innovations in post-training.

The model development is structured around three key components: generation of high-quality synthetic data, curation and filtering of organic data, and refinement of the post-training recipe.

Synthetic data is generated using methods such as multi-agent prompting, self-revision workflows, and instruction reversal. Synthetic data is also used in post-training, via rejection sampling and a novel approach to Direct Preference Optimization (DPO).
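To make the instruction-reversal idea concrete, here is a minimal, hypothetical sketch: an existing code snippet is treated as the target response, a model is asked to recover the instruction that would produce it, and the pair is kept only if regeneration passes a rough consistency check. The `complete` callable is a placeholder for any text-generation function, not an API from the phi-4 report.

```python
# Hedged sketch of instruction reversal (illustrative only).
# `complete` is a hypothetical text-generation callable, not from the report.
from typing import Callable, Optional

def reverse_instruction(
    code_snippet: str,
    complete: Callable[[str], str],
) -> Optional[dict]:
    """Build a synthetic (instruction, response) pair from an existing snippet."""
    # Step 1: recover an instruction for which the snippet is a correct solution.
    instruction = complete(
        "Write a task description for which the following code is a correct "
        f"solution:\n\n{code_snippet}"
    )
    # Step 2: regenerate a solution from that instruction and self-check consistency.
    regenerated = complete(f"Solve this task in code:\n\n{instruction}")
    verdict = complete(
        "Answer yes or no: do these two programs solve the same task?\n\n"
        f"A:\n{code_snippet}\n\nB:\n{regenerated}"
    )
    if verdict.strip().lower().startswith("yes"):
        return {"instruction": instruction, "response": code_snippet}
    return None  # drop pairs that fail the self-check
```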

The principles guiding the synthetic data approach include:

  • Diversity across subtopics and skills.
  • Nuance and complexity to reflect real-world challenges.
  • Accuracy in code execution, proofs, and explanations.
  • Chain-of-Thought reasoning to foster coherent outputs.

The pretraining phase drew on 50 types of synthetic datasets, accumulating approximately 400B unweighted tokens. Novel methodologies used in generating synthetic data for phi-4 include seed curation from web and code-based sources, construction of question datasets, and creation of question-answer pairs from diverse sources.

Data curation and filtering emphasized high-quality problems and solutions from public websites, existing datasets, and acquired external datasets. Synthetic augmentation was applied to the organic questions to create a larger dataset. High-quality organic data sources for phi-4 prioritized reasoning-dense and nuanced material such as academic papers, educational forums, and programming tutorials.

The architecture is based on a decoder-only transformer with 14B parameters and a 4096 context length, extended to 16K during midtraining. It closely follows phi-3-medium, using the tiktoken tokenizer with a padded vocabulary size of 100,352 and full attention over the 4K context length. The model was pretrained for approximately 10T tokens using linear warm-up and decay schedules with a peak learning rate of 0.0003, constant weight decay of 0.1, and a global batch size of 5760.
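For quick reference, the stated architecture and pretraining hyperparameters can be collected into a configuration sketch; the field names are illustrative, not taken from any phi-4 codebase.

```python
# Architecture and pretraining hyperparameters as stated in the report,
# gathered into a plain config dict (field names are illustrative).
PHI4_CONFIG = {
    "architecture": "decoder-only transformer",
    "parameters": 14e9,
    "context_length": 4096,        # extended to 16K during midtraining
    "tokenizer": "tiktoken",
    "vocab_size": 100_352,         # padded vocabulary
    "attention": "full attention over the 4K context",
    "pretraining_tokens": 10e12,   # ~10T tokens
    "lr_schedule": "linear warm-up and decay",
    "peak_learning_rate": 3e-4,
    "weight_decay": 0.1,
    "global_batch_size": 5760,
}
```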

Ablation studies were conducted to optimize the data mixture, including synthetic data, web rewrites, filtered web data, targeted acquisitions, organic data, and code data. The final data mixture allocates 30% of training tokens to web and web rewrites, 40% to synthetic data, 20% to code data, and 10% to targeted acquired sources.
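Applied to the roughly 10T-token pretraining budget, the mixture implies the following approximate per-source token counts; this is a simple back-of-the-envelope calculation, not a figure from the report.

```python
# Final pretraining data mixture from the report, applied to the ~10T-token
# budget as a rough per-source token count.
TOTAL_TOKENS = 10e12  # approximate pretraining budget

mixture = {
    "web_and_web_rewrites": 0.30,
    "synthetic": 0.40,
    "code": 0.20,
    "targeted_acquired": 0.10,
}

assert abs(sum(mixture.values()) - 1.0) < 1e-9  # shares cover the full budget

for source, share in mixture.items():
    print(f"{source}: {share:.0%} ~ {share * TOTAL_TOKENS / 1e12:.1f}T tokens")
# e.g. synthetic: 40% ~ 4.0T tokens, code: 20% ~ 2.0T tokens
```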

Midtraining increased the context length from 4K to 16K. The midtraining data mixture comprises 30% newly curated long-context data and 70% recall tokens from the pretraining stage, and the base frequency of the RoPE position encoding was increased to 250K to support the longer context.
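The effect of raising the RoPE base can be seen from the frequency formula alone. The sketch below compares the common base of 10,000 with the 250K base used for phi-4's 16K-context midtraining; the head dimension of 128 is an assumed value for illustration, not one stated in the summary.

```python
import numpy as np

def rope_inverse_frequencies(head_dim: int, base: float) -> np.ndarray:
    """Per-pair RoPE inverse frequencies: 1 / base^(2i / head_dim)."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

head_dim = 128  # illustrative head dimension (assumption)
low_base = rope_inverse_frequencies(head_dim, base=10_000.0)
high_base = rope_inverse_frequencies(head_dim, base=250_000.0)

# A larger base lowers the frequencies (longer wavelengths), so the slow
# rotary dimensions wrap around much later -- useful at 16K positions.
print(low_base[-1], high_base[-1])  # slowest frequency drops with base=250K
```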

Post-training involved SFT on 8B tokens of data and two rounds of DPO, with data mixtures detailed in the appendix. A pivotal token search (PTS) technique was introduced to generate pairs for DPO, targeting tokens that have a significant impact on the probability of success.

PTS identifies points in a completion token sequence where the next token has a significant impact on the probability of success. PTS estimates these probabilities by sampling completions starting from the prefix $Q + t_1, t_2, \dotsc, t_i$ and checking them for correctness with an oracle for $Q$, where:

  • $Q$: the user query
  • $t_i$: the $i$-th token in the completion sequence
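The following is a minimal sketch of how such pivotal tokens might be located by Monte Carlo estimation. The `sample_completions` and `oracle_is_correct` helpers are hypothetical, and the 0.2 threshold is an illustrative choice, not a value from the phi-4 report.

```python
# Hedged sketch of pivotal token search (PTS). Helper callables and the
# threshold are illustrative assumptions, not details from the report.
from typing import Callable, List

def estimate_success_prob(
    prefix: str,
    sample_completions: Callable[[str, int], List[str]],
    oracle_is_correct: Callable[[str], bool],
    n_samples: int = 16,
) -> float:
    """Monte Carlo estimate of p(success | prefix) from sampled completions."""
    completions = sample_completions(prefix, n_samples)
    return sum(oracle_is_correct(prefix + c) for c in completions) / n_samples

def find_pivotal_tokens(
    query: str,
    completion_tokens: List[str],
    sample_completions: Callable[[str, int], List[str]],
    oracle_is_correct: Callable[[str], bool],
    threshold: float = 0.2,
) -> List[int]:
    """Return indices i where appending t_i shifts p(success) by >= threshold."""
    pivotal = []
    prefix = query
    prev_p = estimate_success_prob(prefix, sample_completions, oracle_is_correct)
    for i, token in enumerate(completion_tokens):
        prefix += token
        p = estimate_success_prob(prefix, sample_completions, oracle_is_correct)
        if abs(p - prev_p) >= threshold:
            pivotal.append(i)  # candidate position for a DPO preference pair
        prev_p = p
    return pivotal
```

At each pivotal index, preferred and rejected continuations can then be paired for DPO, concentrating the preference signal on tokens that actually move the probability of a correct answer.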

Hallucination mitigation was addressed through SFT data and DPO pairs, encouraging the model to refuse to answer if it does not know the answer.
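As an illustration of the kind of preference pair involved (a constructed example, not one drawn from the paper's data), a pair for an unanswerable question would prefer a refusal over a fabricated answer:

```python
# Constructed example of a hallucination-mitigation preference pair;
# the question and responses are illustrative, not from phi-4's datasets.
refusal_pair = {
    "prompt": "What is the middle name of the 47th mayor of Springfield?",
    "chosen": "I don't have reliable information about that, so I can't say.",
    "rejected": "The 47th mayor's middle name is Alexander.",  # fabricated
}
```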

The model was evaluated using benchmarks such as MMLU, GPQA, MATH, HumanEval, MGSM, SimpleQA, DROP, MMLU-Pro, HumanEval+, ArenaHard, and IFEval. It excels at STEM QA tasks and scores highly on coding benchmarks.

Responsible AI (RAI) principles were followed, with safety alignment in post-training, red-teaming, and automated testing across dozens of RAI harm categories. In-house RAI benchmarks were used to compare performance against other models, focusing on grounding and harmfulness.

The model demonstrates strong defenses against techniques aimed at subverting its safety training, such as jailbreaks and prompt encodings. However, challenges remain around factual knowledge, strict instruction following, and potential biases.

Authors (27)
  1. Marah Abdin (5 papers)
  2. Jyoti Aneja (9 papers)
  3. Harkirat Behl (9 papers)
  4. Sébastien Bubeck (90 papers)
  5. Ronen Eldan (60 papers)
  6. Suriya Gunasekar (34 papers)
  7. Michael Harrison (22 papers)
  8. Russell J. Hewett (14 papers)
  9. Mojan Javaheripi (19 papers)
  10. Piero Kauffmann (4 papers)
  11. James R. Lee (37 papers)
  12. Yin Tat Lee (102 papers)
  13. Yuanzhi Li (119 papers)
  14. Weishung Liu (3 papers)
  15. Caio C. T. Mendes (2 papers)
  16. Anh Nguyen (157 papers)
  17. Eric Price (74 papers)
  18. Gustavo de Rosa (4 papers)
  19. Olli Saarikivi (16 papers)
  20. Adil Salim (28 papers)