Evaluating Instruction-Following LLMs Fine-Tuned on Synthetic Datasets with Direct Preference Optimization
Introduction to Model Fine-Tuning
The intense computational demands and costs associated with training LLMs from scratch have necessitated the exploration of fine-tuning as a practical alternative. Supervised Fine-Tuning (SFT) of pre-trained models like OpenLLaMA 3B v2 has proven effective for task specialization across a variety of downstream applications. In parallel, there is growing interest in adapting smaller LLMs to produce high-quality output without the prohibitive resource requirements often associated with larger models. This exploration encompasses the generation of synthetic datasets for instruction-following tasks using open-source models whose licenses permit full commercial use of the generated data.
Dataset Generation and Model Training
Synthetic Instruction Data Generation
Instruction fine-tuning datasets were generated using three schemes:
- LaMini: Leveraged a Falcon-40B variant with a non-restrictive license to generate instructions from seed examples and topics.
- Evol-Instruct: Employed an iterative method to evolve initial datasets into more complex forms by adjusting task depth and breadth.
- Orca: Utilized explanation tuning on query-response pairs to promote detailed understanding and reasoning in model responses.
These datasets serve as the foundation for the models studied in this work. After generation, the data were filtered for quality using GPT-4 to ensure relevance and coherence for instruction following.
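To make the LaMini-style scheme concrete, the sketch below shows how a permissively licensed generator model can be prompted with a few seed instructions and a topic to produce new instructions. The model name, seed examples, and prompt template are illustrative assumptions rather than the exact setup used for these datasets.

```python
# Hypothetical sketch of LaMini-style instruction generation: prompt a
# permissively licensed model with seed instructions and a topic, and ask
# it to write a new instruction in the same style.
from transformers import pipeline

# Stand-in for the Falcon-40B variant; any open instruction model could be used.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-40b-instruct",
    device_map="auto",
)

seed_examples = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the causes of the French Revolution in three sentences.",
    "Write a short email requesting a deadline extension.",
]

def build_prompt(topic: str) -> str:
    """Build a few-shot prompt asking for one new instruction about `topic`."""
    examples = "\n".join(f"- {ex}" for ex in seed_examples)
    return (
        "Below are examples of instructions a user might give an AI assistant:\n"
        f"{examples}\n\n"
        f"Write one new, self-contained instruction about the topic '{topic}':\n- "
    )

# Sample a candidate instruction; in practice this loop runs over many topics
# and the outputs are later filtered for quality.
output = generator(build_prompt("linear algebra"), max_new_tokens=64, do_sample=True)
print(output[0]["generated_text"])
```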
Model Fine-Tuning Using QLoRA
Quantized Low-Rank Adaptation (QLoRA) was used to fine-tune the OpenLLaMA 3B v2 base model sequentially on each synthetic dataset. The resulting checkpoints were then aligned with human preferences via Direct Preference Optimization (DPO) on the HH-RLHF dataset, which trains directly on human-rated preference pairs.
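A minimal sketch of this QLoRA setup with Hugging Face transformers, peft, and trl is shown below. The dataset file, LoRA hyperparameters, and trainer keyword names (which vary across trl releases) are assumptions, not the exact configuration used for these models.

```python
# QLoRA sketch: 4-bit quantized base model with low-rank adapters trained on top.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "openlm-research/open_llama_3b_v2"

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters are the only trainable parameters (the "LoRA" part).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Hypothetical JSON file of formatted instruction/response pairs with a "text" field.
dataset = load_dataset("json", data_files="synthetic_instructions.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,  # older trl releases use `tokenizer=` instead
    args=SFTConfig(output_dir="openllama3b-qlora-sft", num_train_epochs=1),
)
trainer.train()
```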
Evaluation Strategies
Benchmark Performance
The fine-tuned models were evaluated on established LM Eval Harness tasks and metrics to compare their performance against other models of similar scale. Alignment with human judgment was further quantified using MT-Bench under the "LLM-as-a-judge" framework, with Anthropic's Claude 2.1 serving as the judge model.
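As an illustration, the snippet below scores a checkpoint with the LM Eval Harness Python API on a few common tasks. The task list and checkpoint path are placeholders, and the harness's interface has changed across releases, so this is an outline rather than the exact evaluation setup used here.

```python
# Sketch: evaluate a fine-tuned checkpoint with the LM Eval Harness Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical path to the SFT (or DPO) checkpoint being evaluated.
    model_args="pretrained=./openllama3b-qlora-sft,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

# Print per-task metrics (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```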
Direct Preference Optimization
Following QLoRA-based SFT, the model was further aligned with human preferences through DPO, which eliminates the need for a separate reward model by optimizing the policy directly on preference data; the preference pairs act as an implicit reward signal.
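A minimal sketch of such a DPO stage using trl's DPOTrainer on HH-RLHF preference pairs is given below, assuming the SFT checkpoint from the previous step. Hyperparameters, column handling, and keyword names (which differ across trl versions) are placeholders rather than the exact recipe.

```python
# DPO sketch: optimize the SFT checkpoint directly on preference pairs,
# with no separately trained reward model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

checkpoint = "./openllama3b-qlora-sft"  # hypothetical SFT checkpoint path
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# HH-RLHF provides "chosen"/"rejected" conversations; depending on the trl
# version, a prompt column may need to be extracted from their shared prefix.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,               # policy being optimized
    ref_model=None,            # trl builds a frozen reference copy of the policy
    args=DPOConfig(output_dir="openbezoar-hh-rlhf-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl releases use `tokenizer=` instead
)
trainer.train()  # the DPO loss uses preference pairs as an implicit reward
```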
Results and Insights
The final models, especially "OpenBezoar-HH-RLHF-DPO," demonstrated strong alignment with human preferences, as evidenced by robust performance across MT-Bench assessments. This checkpoint notably excelled in conversational tasks, surpassing other similar-scale models in select categories. The sequential application of QLoRA and DPO proved effective not only in refining the models' instructional response quality but also in enhancing their adherence to nuanced human judgments.
Future Work
Further investigations could focus on:
- Enhancing dataset diversity and curation to better cover the space of instructional tasks.
- Exploring more efficient model merging techniques post QLoRA-based fine-tuning.
- Extending DPO training beyond a single epoch to potentially uncover further gains in model alignment with human preferences.
Overall, this research underscores the utility of synthetic data and advanced fine-tuning techniques in crafting LLMs that are both performant and closely aligned with human evaluative standards.