Evaluating Instruction-Following LLMs Fine-Tuned on Synthetic Datasets with Direct Preference Optimization
Introduction to Model Fine-Tuning
The intense computational demands and costs associated with training LLMs from scratch have necessitated the exploration of fine-tuning as a practical alternative. Supervised Fine-Tuning (SFT) of pre-trained models like OpenLLaMA 3B v2 has proven effective for task specialization across a variety of downstream applications. In parallel, there is growing interest in adapting smaller LLMs to produce high-quality output without the prohibitive resource requirements often associated with larger models. This exploration encompasses the generation of synthetic datasets for instruction-following tasks using open-source models whose licenses permit full commercial use of the generated data.
Dataset Generation and Model Training
Synthetic Instruction Data Generation
Instruction fine-tuning datasets were generated using three schemes:
- LaMini: Leveraged a Falcon-40B variant with a non-restrictive license to generate instructions from seed examples and topics.
- Evol-Instruct: Employed an iterative method to evolve initial datasets into more complex forms by adjusting task depth and breadth.
- Orca: Utilized explanation tuning on query-response pairs to promote detailed understanding and reasoning in model responses.
These datasets serve as the foundation for the models studied in this work. After generation, the data were filtered for quality using GPT-4 to ensure relevance and coherence for instruction following.
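To make the LaMini-style scheme concrete, the sketch below shows how a permissively licensed generator model can be prompted with a few seed instructions and a topic to produce new instructions. The model name, seed examples, and prompt template are illustrative assumptions rather than the exact setup used for these datasets.

```python
# Hypothetical sketch of LaMini-style instruction generation: prompt a
# permissively licensed model with seed instructions and a topic, and ask
# it to write a new instruction in the same style.
from transformers import pipeline

# Stand-in for the Falcon-40B variant; any open instruction model could be used.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-40b-instruct",
    device_map="auto",
)

seed_examples = [
    "Explain the difference between a list and a tuple in Python.",
    "Summarize the causes of the French Revolution in three sentences.",
    "Write a short email requesting a deadline extension.",
]

def build_prompt(topic: str) -> str:
    """Build a few-shot prompt asking for one new instruction about `topic`."""
    examples = "\n".join(f"- {ex}" for ex in seed_examples)
    return (
        "Below are examples of instructions a user might give an AI assistant:\n"
        f"{examples}\n\n"
        f"Write one new, self-contained instruction about the topic '{topic}':\n- "
    )

# Sample a candidate instruction; in practice this loop runs over many topics
# and the outputs are later filtered for quality.
output = generator(build_prompt("linear algebra"), max_new_tokens=64, do_sample=True)
print(output[0]["generated_text"])
```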
Model Fine-Tuning Using QLoRA
Quantized Low-Rank Adaptation (QLoRA) was used to fine-tune the OpenLLaMA 3B v2 base model sequentially on each synthetic dataset. The resulting checkpoints were then aligned with human preferences via Direct Preference Optimization (DPO) on the HH-RLHF dataset, which trains directly on human-rated preference pairs.
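A minimal sketch of this QLoRA setup with Hugging Face transformers, peft, and trl is shown below. The dataset file, LoRA hyperparameters, and trainer keyword names (which vary across trl releases) are assumptions, not the exact configuration used for these models.

```python
# QLoRA sketch: 4-bit quantized base model with low-rank adapters trained on top.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

base = "openlm-research/open_llama_3b_v2"

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters are the only trainable parameters (the "LoRA" part).
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Hypothetical JSON file of formatted instruction/response pairs with a "text" field.
dataset = load_dataset("json", data_files="synthetic_instructions.json", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    processing_class=tokenizer,  # older trl releases use `tokenizer=` instead
    args=SFTConfig(output_dir="openllama3b-qlora-sft", num_train_epochs=1),
)
trainer.train()
```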
Evaluation Strategies
Benchmark Performance
The fine-tuned models were evaluated on established LM Eval Harness tasks and metrics to compare their performance against other models of similar scale. Alignment with human judgment was further quantified using MT-Bench under the "LLM-as-a-judge" framework, with Anthropic's Claude 2.1 serving as the judge model.
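As an illustration, the snippet below scores a checkpoint with the LM Eval Harness Python API on a few common tasks. The task list and checkpoint path are placeholders, and the harness's interface has changed across releases, so this is an outline rather than the exact evaluation setup used here.

```python
# Sketch: evaluate a fine-tuned checkpoint with the LM Eval Harness Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # Hypothetical path to the SFT (or DPO) checkpoint being evaluated.
    model_args="pretrained=./openllama3b-qlora-sft,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,
)

# Print per-task metrics (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```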
Direct Preference Optimization
Following QLoRA-based SFT, the model was further aligned with human preferences through DPO, which eliminates the need for a separate reward model by optimizing the policy directly on preference data; the preference pairs act as an implicit reward signal.
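A minimal sketch of such a DPO stage using trl's DPOTrainer on HH-RLHF preference pairs is given below, assuming the SFT checkpoint from the previous step. Hyperparameters, column handling, and keyword names (which differ across trl versions) are placeholders rather than the exact recipe.

```python
# DPO sketch: optimize the SFT checkpoint directly on preference pairs,
# with no separately trained reward model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

checkpoint = "./openllama3b-qlora-sft"  # hypothetical SFT checkpoint path
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# HH-RLHF provides "chosen"/"rejected" conversations; depending on the trl
# version, a prompt column may need to be extracted from their shared prefix.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

trainer = DPOTrainer(
    model=model,               # policy being optimized
    ref_model=None,            # trl builds a frozen reference copy of the policy
    args=DPOConfig(output_dir="openbezoar-hh-rlhf-dpo", beta=0.1, num_train_epochs=1),
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl releases use `tokenizer=` instead
)
trainer.train()  # the DPO loss uses preference pairs as an implicit reward
```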
Results and Insights
The final models, especially "OpenBezoar-HH-RLHF-DPO," demonstrated strong alignment with human preferences, as evidenced by robust performance across MT-Bench assessments. This checkpoint notably excelled in conversational tasks, surpassing other similar-scale models in select categories. The sequential application of QLoRA and DPO proved effective not only in refining the models' instructional response quality but also in enhancing their adherence to nuanced human judgments.
Future Work
Further investigations could focus on:
- Enhancing dataset diversity and curation to better cover the space of instructional tasks.
- Exploring more efficient model merging techniques post QLoRA-based fine-tuning.
- Extending DPO training beyond a single epoch to potentially uncover further gains in model alignment with human preferences.
Overall, this research underscores the utility of synthetic data and advanced fine-tuning techniques in crafting LLMs that are both performant and closely aligned with human evaluative standards.