
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (2404.14219v4)

Published 22 Apr 2024 in cs.CL and cs.AI

Abstract: We introduce phi-3-mini, a 3.8 billion parameter LLM trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.

Highly Capable LLM for Local Deployment on Mobile Devices

Introduction to phi-3-mini

In this technical report, Microsoft researchers introduce phi-3-mini, an LLM with 3.8 billion parameters trained on 3.3 trillion tokens. Despite its relatively small size, phi-3-mini performs comparably to significantly larger models such as Mixtral 8x7B and GPT-3.5 on standard benchmarks like MMLU and MT-bench. This is attributed primarily to its carefully curated training data, which combines heavily filtered web data with synthetic data. The model is designed to run efficiently on modern mobile phones, offering local, offline language processing that previously required cloud computing resources.

Model Architecture and Training

Phi-3-mini uses a transformer decoder architecture and is fine-tuned for chat-based interactions. It builds on previous models such as phi-2, with a more refined dataset and training regimen. For mobile deployment, the weights are quantized to 4 bits, reducing the memory footprint to approximately 1.8 GB and allowing the model to run on handheld devices without sacrificing performance. The phi-3-small and phi-3-medium models extend this approach to 7 billion and 14 billion parameters, respectively, showing notable improvements in MMLU and MT-bench scores as parameter count increases.
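As a rough illustration of what this footprint means in practice, the sketch below loads a 4-bit-quantized phi-3-mini and runs one chat-formatted prompt with Hugging Face transformers and bitsandbytes. The checkpoint name (microsoft/Phi-3-mini-4k-instruct) and the quantization settings are assumptions made for this example; the report's own on-device deployment uses a native pipeline rather than this stack.

```python
# Minimal sketch: load phi-3-mini with 4-bit weights and run a chat prompt.
# The checkpoint name and quantization settings are illustrative assumptions,
# not the exact on-device configuration described in the report.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights shrink the footprint to a few GB
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quant_config,
    device_map="auto",
)

# The model is chat-tuned, so prompts go through the chat template.
messages = [{"role": "user", "content": "Explain why the sky is blue in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

With 4-bit weights the parameters occupy roughly 2 GB, which is the property that makes deployment on a phone-class device plausible.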

Data Strategy and Optimization

The data used to train phi-3-mini passes through a sophisticated filtering process focused on enhancing the model's reasoning capabilities and knowledge base, optimizing for a "data optimal regime" rather than merely scaling up compute. This targeted data selection allows a small model to perform at levels expected of much larger models, improving computational efficiency and responsiveness.
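The report does not disclose the concrete filtering criteria, so the following is only a schematic sketch of what scoring-and-thresholding over web documents can look like. The WebDoc type, the educational_value heuristic, and the threshold are all illustrative assumptions, not components of the actual phi-3 pipeline (which would use trained quality classifiers rather than keyword counts).

```python
# Illustrative sketch of quality-based web-data filtering; the real phi-3 pipeline
# and its criteria are not public. `educational_value` is a stand-in for whatever
# learned quality/reasoning scorer is actually used.
from dataclasses import dataclass

@dataclass
class WebDoc:
    url: str
    text: str

def educational_value(doc: WebDoc) -> float:
    """Placeholder scorer: counts reasoning-flavored terms per word.
    In practice this would be a trained classifier."""
    signal_terms = ("theorem", "because", "step", "therefore", "function")
    hits = sum(doc.text.lower().count(t) for t in signal_terms)
    return hits / max(len(doc.text.split()), 1)

def filter_corpus(docs, threshold=0.01):
    """Keep only documents whose estimated educational value clears the threshold."""
    return [d for d in docs if educational_value(d) >= threshold]

corpus = [
    WebDoc("a.example", "Step 1: factor the polynomial because the theorem requires it."),
    WebDoc("b.example", "Buy now! Limited offer! Click here!"),
]
print([d.url for d in filter_corpus(corpus)])  # -> ['a.example']
```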

Benchmarks and Performance Comparisons

Phi-3-mini achieves strong scores on a range of benchmarks:

  1. MMLU: 69% accuracy, remarkable for its size category.
  2. HellaSwag: 76.7%, competing closely with larger models.
  3. ANLI: 52.8%, showing solid performance on adversarial natural language inference.
  4. GSM-8K: 82.5%, excelling at grade-school math word problems.
  5. MT-bench: 8.38, reflecting strong multi-turn conversational ability.

Additionally, this model outperforms phi-2 across all benchmarks and often rivals or exceeds the capabilities of larger models, such as Mistral 7B and GPT-3.5.
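For context on what a figure like the 69% MMLU score measures, the sketch below computes exact-match accuracy on a toy multiple-choice task. The prompt layout, answer extraction, and stub model are assumptions for illustration and do not reproduce the report's few-shot evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style accuracy).
# Prompt format and answer extraction are simplified assumptions.
def format_question(question: str, choices: list[str]) -> str:
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    return "\n".join(lines) + "\nAnswer:"

def accuracy(examples, predict) -> float:
    """`predict` maps a formatted prompt to a letter; accuracy is exact match."""
    correct = sum(
        predict(format_question(e["question"], e["choices"])) == e["answer"]
        for e in examples
    )
    return correct / len(examples)

# Toy usage with a stub model that always answers "A".
examples = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "22"], "answer": "A"},
    {"question": "Capital of France?", "choices": ["Lyon", "Paris", "Nice", "Lille"], "answer": "B"},
]
print(accuracy(examples, predict=lambda prompt: "A"))  # -> 0.5
```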

Safety and Ethical Considerations

The development of phi-3-mini included rigorous safety and ethical alignment to mitigate potential harms. Using a combination of supervised instruction fine-tuning and preference tuning, the model was aligned toward safe response generation. In addition, an independent red-teaming process was used to identify and address weaknesses in the model's outputs, supporting responsible deployment.
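The report does not publish its preference-tuning data, but such data is commonly structured as prompt/chosen/rejected triples. The record below is a generic, hypothetical illustration of that structure, not anything from the actual phi-3 safety dataset.

```python
# Hypothetical safety preference record: preference-optimization methods (e.g. DPO)
# train the model to assign higher likelihood to `chosen` than to `rejected`
# for the same prompt. This is NOT the phi-3 training data, only the common shape.
preference_example = {
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with bypassing locks you don't own. If you're locked out "
              "of your own home, a licensed locksmith is the safer option.",
    "rejected": "First, insert a tension wrench into the keyway...",
}

for key, value in preference_example.items():
    print(f"{key}: {value[:60]}")
```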

Future Implications and Developments

The achievements of phi-3-mini suggest significant potential for deploying powerful AI models in low-resource environments, which could democratize AI usage across a broader array of devices and applications. Future research may focus on further optimizing data training processes, scaling model architectures without substantial increases in size, and enhancing the model's adaptability to various real-world applications beyond chat formats.

Conclusion

Phi-3-mini presents an impressive advancement in the field of AI, achieving high performance while maintaining a small footprint suitable for mobile devices. This development underscores the importance of innovative data training strategies and model efficiency, setting a promising direction for future research in portable AI technologies.
