Zephyr: Direct Distillation of LM Alignment
The paper "Zephyr: Direct Distillation of LM Alignment" by Lewis Tunstall and collaborators introduces a novel methodology for aligning LLMs to user intent through a distillation technique called distilled direct preference optimization (dDPO). This work aims at producing a smaller, efficient model that maintains high performance on various benchmarks, specifically targeting chat capabilities without the need for extensive human feedback annotation.
Methodology
The core of the Zephyr approach involves three primary steps:
- Distilled Supervised Fine-Tuning (dSFT): Starting with a base model, in this case Mistral-7B, the authors employ self-instruct-style techniques to create a large dataset of instructions and corresponding responses. These dialogues are generated by a more capable teacher model, such as GPT-3.5-turbo, and then used to fine-tune the student model. The dSFT step ensures that the base model can respond appropriately to the diverse instructions generated during the self-instruct protocol (a minimal training sketch follows this list).
- AI Feedback (AIF) Collection: Instead of relying on human feedback, which is costly and time-consuming, the authors use AI-generated feedback. Multiple models generate responses to the same prompt, and a teacher model such as GPT-4 ranks them. The rankings are then converted into binary preference pairs that serve as training data for the distilled direct preference optimization step (see the pair-construction sketch after this list).
- Distilled Direct Preference Optimization (dDPO): The crux of the method, dDPO, optimizes the student model on the preference data obtained from the AI feedback. Unlike reinforcement-learning approaches such as Proximal Policy Optimization (PPO), dDPO optimizes the preference objective directly, without sampling from the policy during fine-tuning. It leverages a reward implicitly defined by the preference model, which simplifies training and yields significant performance gains (a sketch of the loss follows this list).
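To make the dSFT step concrete, here is a minimal sketch of supervised fine-tuning on teacher-generated instruction/response pairs with a standard causal language modeling loss. The toy dataset, hyperparameters, and loss masking are illustrative placeholders, not the paper's exact training recipe.

```python
# dSFT sketch: fine-tune the student on teacher-generated dialogues.
# Hypothetical toy data; the paper uses a much larger self-instruct-style corpus.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # base student model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Each example is a teacher-generated (instruction, response) pair.
dialogues = [
    {"prompt": "Explain gradient descent in one paragraph.",
     "response": "Gradient descent iteratively updates parameters ..."},
]

def collate(batch):
    texts = [ex["prompt"] + "\n" + ex["response"] for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=2048)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels  # prompt tokens are kept in the loss here for brevity
    return enc

loader = DataLoader(dialogues, batch_size=1, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    loss = model(**batch).loss  # next-token cross-entropy over the dialogue
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice this would run distributed across GPUs with a chat template applied to each dialogue; the loop above only shows the shape of the objective.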
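The AIF step is essentially a data-construction problem: score several candidate responses with the teacher and keep one (chosen, rejected) pair per prompt. The sketch below assumes the teacher scores are already available; the `PreferencePair` class and example responses are hypothetical.

```python
# AIF sketch: turn teacher-model scores into binary preference pairs.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def build_pair(prompt, candidates):
    """candidates: list of (response_text, teacher_score) tuples, one per model."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    chosen = ranked[0][0]  # highest-scored response becomes the "winner"
    # The paper samples the rejected response from the remaining candidates;
    # taking the lowest-scored one here keeps the example short.
    rejected = ranked[-1][0]
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)

pair = build_pair(
    "Summarize the Zephyr paper in two sentences.",
    [("Response from model A ...", 8.0), ("Response from model B ...", 6.5),
     ("Response from model C ...", 4.0), ("Response from model D ...", 7.0)],
)
```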
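The dDPO objective itself reduces to a classification-style loss over the preference pairs. Below is a minimal sketch of the DPO loss, assuming the per-sequence log-probabilities of each chosen and rejected response under the student policy and the frozen dSFT reference model have already been computed; the function name and beta value are illustrative.

```python
# dDPO sketch: the DPO loss over one batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a (batch,) tensor of summed response log-probs given the prompt."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin: push the policy toward the chosen response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities standing in for real model outputs.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen, ref_rejected = torch.randn(4), torch.randn(4)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients flow only through the policy log-probabilities
```

Because the reward is defined implicitly by log-probability ratios against the reference model, no separate reward model or on-policy sampling (as in PPO) is needed during fine-tuning.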
Results
The paper benchmarks Zephyr-7B, aligned using the proposed methodology, against a range of models on single-turn and multi-turn chat benchmarks, including MT-Bench and AlpacaEval. The results are noteworthy:
- MT-Bench: Zephyr-7B achieves a score of 7.34, surpassing larger models such as Llama2-Chat-70B (6.86) and aligning closely with proprietary models like GPT-3.5-turbo (7.94) and Claude 2 (8.06).
- AlpacaEval: Zephyr-7B exhibits a win rate of 90.60%, indicating its effectiveness in user intent alignment compared to other open models.
Furthermore, the authors validate Zephyr-7B's performance on academic tasks via the Open LLM Leaderboard, which covers the multiple-choice tasks ARC, HellaSwag, MMLU, and TruthfulQA. Zephyr-7B consistently outperforms other 7B-parameter models, demonstrating robust capabilities across these diverse evaluations.
Implications and Future Work
The implications of this research are substantial for the development of efficient, aligned LLMs. By eschewing extensive human feedback and leveraging AI-generated preferences, the authors present a scalable and less resource-intensive approach to model alignment. This method proves particularly valuable for smaller organizations or open-source communities that may lack the resources for large-scale human annotation efforts.
Theoretically, the success of dDPO opens avenues for further research into preference optimization techniques. Extending the methodology to larger models could yield further improvements in alignment without prohibitive computational costs. Additionally, integrating safety considerations into the dDPO pipeline remains a critical next step, ensuring that aligned models also adhere to ethical standards and do not generate harmful outputs.
In conclusion, "Zephyr: Direct Distillation of LM Alignment" showcases a robust, efficient method for aligning LLMs to user intent using distilled direct preference optimization. The approach not only achieves high performance on various benchmarks but also sets a precedent for future work in model distillation and alignment, emphasizing scalability and resource efficiency. As AI continues to evolve, methods like dDPO will be crucial in developing aligned, safe, and performant LLMs.