Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained on 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench, respectively). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B mixture-of-experts model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and multi-image prompts combined with text.
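To make the "small enough to be deployed on a phone" claim concrete, the back-of-the-envelope arithmetic below estimates the weight-memory footprint of a 3.8B-parameter model at a few quantization levels. This is a hedged sketch, not a figure reported by the paper: the helper function `weight_memory_gb` and the assumption that memory is dominated by `parameter_count × bytes_per_parameter` are illustrative simplifications.

```python
# Rough estimate of weight storage for a phi-3-mini-sized model (3.8B params).
# Assumption: memory ~= parameter_count * bytes_per_parameter, ignoring
# activations, the KV cache, and runtime overhead.

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes at a given precision."""
    return num_params * bits_per_param / 8 / 1e9

PHI3_MINI_PARAMS = 3.8e9  # parameter count from the abstract

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")

# Expected output:
# 16-bit weights: ~7.6 GB
#  8-bit weights: ~3.8 GB
#  4-bit weights: ~1.9 GB
```

Under these assumptions, 4-bit quantization brings the weights to roughly 2 GB, which is consistent with fitting in the memory budget of a modern smartphone; actual on-device usage would be somewhat higher once activations and the KV cache are included.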