
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (2404.14219v4)

Published 22 Apr 2024 in cs.CL and cs.AI

Abstract: We introduce phi-3-mini, a 3.8 billion parameter LLM trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Highly Capable LLM for Local Deployment on Mobile Devices

Introduction to phi-3-mini

In this technical report, Microsoft researchers introduce phi-3-mini, an LLM with 3.8 billion parameters trained on 3.3 trillion tokens. Despite its relatively small size, phi-3-mini performs comparably to significantly larger models such as Mixtral 8x7B and GPT-3.5 on standard benchmarks like MMLU and MT-bench. This is attributed primarily to its carefully curated training data, which combines heavily filtered web data with synthetic data. The model is designed to run efficiently on modern mobile phones, offering local, offline language processing that previously required cloud computing resources.

Model Architecture and Training

Phi-3-mini uses a transformer decoder architecture and is fine-tuned for chat-based interactions. It builds on previous models such as phi-2, with a more refined dataset and training regimen. For mobile deployment, the weights are quantized to 4 bits, reducing the memory footprint to approximately 1.8 GB and allowing the model to run on handheld devices without sacrificing performance. The phi-3-small and phi-3-medium models extend this approach to 7 billion and 14 billion parameters, respectively, showing notable improvements in MMLU and MT-bench scores as parameter count increases.
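As a rough illustration of what this footprint means in practice, the sketch below loads a 4-bit-quantized phi-3-mini and runs one chat-formatted prompt with Hugging Face transformers and bitsandbytes. The checkpoint name (microsoft/Phi-3-mini-4k-instruct) and the quantization settings are assumptions made for this example; the report's own on-device deployment uses a native pipeline rather than this stack.

```python
# Minimal sketch: load phi-3-mini with 4-bit weights and run a chat prompt.
# The checkpoint name and quantization settings are illustrative assumptions,
# not the exact on-device configuration described in the report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights shrink the footprint to a few GB
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quant_config,
    device_map="auto",
)

# The model is chat-tuned, so prompts go through the chat template.
messages = [{"role": "user", "content": "Explain why the sky is blue in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

With 4-bit weights the parameters occupy roughly 2 GB, which is the property that makes deployment on a phone-class device plausible.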

Data Strategy and Optimization

The data used to train phi-3-mini passes through a sophisticated filtering process focused on enhancing the model's reasoning capabilities and knowledge base, optimizing for a "data optimal regime" rather than merely scaling up compute. This targeted data selection allows a small model to perform at levels expected of much larger models, improving computational efficiency and responsiveness.
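The report does not disclose the concrete filtering criteria, so the following is only a schematic sketch of what scoring-and-thresholding over web documents can look like. The WebDoc type, the educational_value heuristic, and the threshold are all illustrative assumptions, not components of the actual phi-3 pipeline (which would use trained quality classifiers rather than keyword counts).

```python
# Illustrative sketch of quality-based web-data filtering; the real phi-3 pipeline
# and its criteria are not public. `educational_value` is a stand-in for whatever
# learned quality/reasoning scorer is actually used.
from dataclasses import dataclass

@dataclass
class WebDoc:
    url: str
    text: str

def educational_value(doc: WebDoc) -> float:
    """Placeholder scorer: counts reasoning-flavored terms per word.
    In practice this would be a trained classifier."""
    signal_terms = ("theorem", "because", "step", "therefore", "function")
    hits = sum(doc.text.lower().count(t) for t in signal_terms)
    return hits / max(len(doc.text.split()), 1)

def filter_corpus(docs, threshold=0.01):
    """Keep only documents whose estimated educational value clears the threshold."""
    return [d for d in docs if educational_value(d) >= threshold]

corpus = [
    WebDoc("a.example", "Step 1: factor the polynomial because the theorem requires it."),
    WebDoc("b.example", "Buy now! Limited offer! Click here!"),
]
print([d.url for d in filter_corpus(corpus)])  # -> ['a.example']
```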

Benchmarks and Performance Comparisons

Phi-3-mini achieves strong scores on a range of benchmarks:

  1. MMLU: 69% accuracy, remarkable for its size category.
  2. HellaSwag: 76.7%, competing closely with larger models.
  3. ANLI: 52.8%, showing solid performance on adversarial natural language inference.
  4. GSM-8K: 82.5%, excelling at grade-school math word problems.
  5. MT-bench: 8.38, reflecting strong multi-turn conversational ability.

Additionally, this model outperforms phi-2 across all benchmarks and often rivals or exceeds the capabilities of larger models, such as Mistral 7B and GPT-3.5.
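For context on what a figure like the 69% MMLU score measures, the sketch below computes exact-match accuracy on a toy multiple-choice task. The prompt layout, answer extraction, and stub model are assumptions for illustration and do not reproduce the report's few-shot evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style accuracy).
# Prompt format and answer extraction are simplified assumptions.
def format_question(question: str, choices: list[str]) -> str:
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"

def accuracy(examples, predict) -> float:
    """`predict` maps a formatted prompt to a letter; accuracy is exact match."""
    correct = sum(
        predict(format_question(e["question"], e["choices"])) == e["answer"]
        for e in examples
    )
    return correct / len(examples)

# Toy usage with a stub model that always answers "A".
examples = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "22"], "answer": "A"},
    {"question": "Capital of France?", "choices": ["Lyon", "Paris", "Nice", "Lille"], "answer": "B"},
]
print(accuracy(examples, predict=lambda prompt: "A"))  # -> 0.5
```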

Safety and Ethical Considerations

The development of phi-3-mini included rigorous safety and ethical alignment to mitigate potential harms. Using a combination of supervised instruction fine-tuning and preference tuning, the model was aligned toward safe response generation. In addition, an independent red-teaming process was used to identify and address weaknesses in the model's outputs, supporting responsible deployment.
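The report does not publish its preference-tuning data, but such data is commonly structured as prompt/chosen/rejected triples. The record below is a generic, hypothetical illustration of that structure, not anything from the actual phi-3 safety dataset.

```python
# Hypothetical safety preference record: preference-optimization methods (e.g. DPO)
# train the model to assign higher likelihood to `chosen` than to `rejected`
# for the same prompt. This is NOT the phi-3 training data, only the common shape.
preference_example = {
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with bypassing locks you don't own. If you're locked out "
              "of your own home, a licensed locksmith is the safer option.",
    "rejected": "First, insert a tension wrench into the keyway...",
}

for key, value in preference_example.items():
    print(f"{key}: {value[:60]}")
```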

Future Implications and Developments

The achievements of phi-3-mini suggest significant potential for deploying powerful AI models in low-resource environments, which could democratize AI usage across a broader array of devices and applications. Future research may focus on further optimizing data training processes, scaling model architectures without substantial increases in size, and enhancing the model's adaptability to various real-world applications beyond chat formats.

Conclusion

Phi-3-mini presents an impressive advancement in the field of AI, achieving high performance while maintaining a small footprint suitable for mobile devices. This development underscores the importance of innovative data training strategies and model efficiency, setting a promising direction for future research in portable AI technologies.
