Safety Alignment of the Phi-3 LLMs: An Iterative Break-Fix Approach
In this paper, researchers from Microsoft present their methodology for aligning the Phi-3 series of language models with safety and human preferences via an iterative "break-fix" cycle. The paper underscores the critical importance of ensuring that small language models (SLMs), which can operate on devices with limited computational power, adhere to stringent safety standards. The authors meticulously describe their multi-stage approach, quantitative benchmarks, red teaming initiatives, and comprehensive Responsible AI (RAI) evaluations, providing a robust framework for developing safer and more reliable LLMs.
Key Aspects of the Safety Alignment Process
The researchers adopt an iterative safety post-training methodology that involves the following five main stages:
- Safety Dataset Curation: Leveraging both publicly available datasets and in-house generated datasets, optimized based on feedback from the AI Red Team.
- Safety Post-Training: Employing supervised fine-tuning (SFT) and direct preference optimization (DPO) to integrate safety-related adjustments into the models (a minimal sketch of the DPO objective follows this list).
- Quantitative and Qualitative RAI Evaluations: Performing comprehensive evaluations to select potential release candidates.
- AI Red Teaming: Conducting centralized, independent probing by the AI Red Team, using various adversarial techniques to identify harmful content.
- Vulnerability Identification: Analyzing feedback from RAI evaluations and red teaming to pinpoint vulnerabilities for further mitigation efforts.
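The DPO step referenced above optimizes the model to prefer a safer ("chosen") response over a harmful ("rejected") one, relative to a frozen reference model. The snippet below is a minimal, illustrative PyTorch sketch of the standard DPO objective, not the authors' actual training code; the summed log-probabilities and the beta value are assumed inputs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen (safer)
    response over the rejected one, relative to a frozen reference model.
    Each argument is a tensor of summed log-probabilities, shape (batch,)."""
    # Implicit rewards are the policy/reference log-ratios, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimizing -log sigmoid of the margin widens the gap between them.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities standing in for real model outputs.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected, torch.randn(4), torch.randn(4))
loss.backward()
```

In practice the log-probabilities come from scoring each preference pair with both the policy being trained and the frozen reference checkpoint.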
The iterative nature of this approach ensures continuous improvements and makes it possible to identify and address a wider range of real-world risks than a single round of fine-tuning would allow.
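Putting the five stages together, the overall cycle can be pictured as a loop in which each round's findings seed the next round's dataset. The outline below is only a schematic of that flow, assuming hypothetical callables for each stage; it is not the authors' pipeline.

```python
from typing import Callable, List

def break_fix_cycle(model,
                    curate: Callable[[List[str]], object],           # stage 1
                    post_train: Callable[[object, object], object],  # stage 2 (SFT + DPO)
                    evaluate: Callable[[object], List[str]],         # stage 3 (RAI evals)
                    red_team: Callable[[object], List[str]],         # stage 4
                    rounds: int = 3):
    """Schematic break-fix loop: every callable is a hypothetical stand-in
    for one of the stages described above; findings feed the next round."""
    findings: List[str] = []
    for _ in range(rounds):
        dataset = curate(findings)          # curate data, guided by prior findings
        model = post_train(model, dataset)  # safety post-training
        findings = evaluate(model) + red_team(model)  # "break": evals + red teaming
        # Stage 5: the findings themselves are the identified vulnerabilities.
    return model
```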
Quantitative Benchmarks and Red Teaming Insights
The paper reports extensive evaluations of the Phi-3 models against several benchmarks, illustrating the effectiveness of the break-fix cycle:
- Microsoft Internal Automated Measurement: Utilizing multi-turn conversation simulations with adversarial AI agents, the evaluation covers scenarios like grounding, third-party content, harmful content continuation and summarization, and jailbreak frequency. The Phi-3 models score better than or comparably to the benchmark competitors (e.g., Mistral-7B, Gemma-7B, Llama-3-Instruct) across all evaluated categories (a schematic evaluation harness is sketched after this list).
- XSTest: This public dataset measures two crucial metrics—the Inappropriate Prompt Refusal Rate (IPRR), which should be high, and the Valid Prompt Refusal Rate (VPRR), which should be low—to assess the model's ability to refuse unsafe prompts while still answering safe ones. The results reflect a notable balance, with the Phi-3-small model closely matching the performance of Gemma-7B (a refusal-rate computation is sketched after this list).
- DecodingTrust: Covering a broad spectrum of trustworthiness metrics, including stereotype bias, robustness, privacy, machine ethics, and fairness, this benchmark shows that the Phi-3 models surpass the competitors on various dimensions, highlighting their robust trustworthiness profile.
- ToxiGen: Specifically designed to detect implicit hate speech, this benchmark shows that the Phi-3 models outperform the Mistral and Gemma series at recognizing toxic content.
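The internal automated measurement pits an adversarial simulator against the model under test for several turns and then grades the transcript. Microsoft's tooling is not public, so the sketch below only illustrates the general shape of such a harness; `adversary`, `target`, and `judge` are hypothetical callables, and the severity threshold is an assumed convention.

```python
from typing import Callable, Dict, List

def defect_rate(adversary: Callable[[List[Dict]], str],
                target: Callable[[List[Dict]], str],
                judge: Callable[[List[Dict]], int],
                n_conversations: int = 100,
                n_turns: int = 4,
                severity_threshold: int = 4) -> float:
    """Run multi-turn adversarial conversations and return the fraction whose
    judged severity reaches the threshold. All three callables are hypothetical
    stand-ins for the red-team simulator, the model under test, and a grader."""
    defects = 0
    for _ in range(n_conversations):
        history: List[Dict] = []
        for _ in range(n_turns):
            history.append({"role": "user", "content": adversary(history)})
            history.append({"role": "assistant", "content": target(history)})
        if judge(history) >= severity_threshold:
            defects += 1
    return defects / n_conversations
```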
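XSTest-style refusal metrics reduce to simple proportions once each response has been labeled as a refusal or not. A minimal sketch follows, assuming an `is_refusal` predicate (a string heuristic here, though a classifier or human label would be more robust); neither the predicate nor the toy data comes from the paper.

```python
def refusal_rates(responses, prompt_labels, is_refusal):
    """Compute IPRR and VPRR for XSTest-style data.
    prompt_labels marks each prompt 'inappropriate' or 'valid';
    IPRR (refusing inappropriate prompts) should be high,
    VPRR (refusing valid prompts) should be low."""
    refused = [is_refusal(r) for r in responses]
    inapp = [f for f, lab in zip(refused, prompt_labels) if lab == "inappropriate"]
    valid = [f for f, lab in zip(refused, prompt_labels) if lab == "valid"]
    iprr = sum(inapp) / len(inapp) if inapp else 0.0
    vprr = sum(valid) / len(valid) if valid else 0.0
    return iprr, vprr

# Toy example with a naive refusal heuristic (illustrative only).
naive = lambda r: r.lower().startswith(("i can't", "i cannot", "sorry"))
print(refusal_rates(["Sorry, I can't help with that.", "Sure, here is a recipe."],
                    ["inappropriate", "valid"], naive))
```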
Implications and Future Directions
The findings indicate that the break-fix approach substantively mitigates harmful content generation, bolstering both practical deployment in responsible AI systems and theoretical advancements in safety alignment techniques. Iterative red teaming and continuous vulnerability assessment emerge as indispensable components for developing SLMs capable of functioning safely in diverse, real-world environments.
However, the authors recognize the intrinsic limitations of LLMs, including potential biases, quality of service concerns across languages, perpetuation of stereotypes, and the risk of producing inaccurate or inappropriate content. They stress the importance of responsible downstream development, recommending that developers incorporate additional safety classifiers and adhere to relevant laws and guidelines to tailor the models to their specific contexts.
Conclusion
The extensive methodology and rigorous evaluations presented in this paper underscore the viability and effectiveness of an iterative break-fix cycle for safety aligning LLMs. While the Phi-3 models demonstrate significant improvements in safety benchmarks, ongoing efforts and robust deployment practices remain essential to address emerging risks and improve the trustworthiness of AI systems.
By providing a detailed overview of the authors' methodologies and results, this essay elucidates the importance of iterative processes and comprehensive evaluations in strengthening the safety and reliability of LLMs. The paper’s findings pave the way for future research and practical implementations aimed at striking a critical balance between performance and responsible AI deployment.