
AlpaGasus: Training A Better Alpaca with Fewer Data (2307.08701v5)

Published 17 Jul 2023 in cs.CL

Abstract: LLMs strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/

AlpaGasus: Streamlining Instruction Fine-Tuning through Data Quality Enhancement

In the context of rapidly evolving LLMs, the paper "AlpaGasus: Training A Better Alpaca with Fewer Data" introduces an approach to instruction fine-tuning (IFT) that prioritizes data quality over quantity. The authors propose a data selection method in which a strong LLM, such as ChatGPT, scores existing training samples so that low-quality ones can be filtered out, leaving a high-quality subset for fine-tuning. This technique reflects a data-centric shift in LLM training, optimizing both performance and efficiency.

Core Contributions

The primary contribution of this paper is AlpaGasus, which demonstrates that data quality matters more than dataset size when large corpora are burdened with erroneous entries. By applying a stringent filtering mechanism, only 9,229 samples were retained from the original pool of 52,000. These samples were selected via a score-based evaluation performed by a strong LLM, with a threshold ensuring that only high-grade data informed the fine-tuning process.

Methodological Insights

The authors employed a straightforward method to shortlist data based on scores assigned to model responses. Each instruction/response tuple was rated by an evaluator LLM along dimensions such as accuracy and helpfulness, and only samples scoring at least 4.5 on the accuracy dimension were retained. This distilled Alpaca's training data from 52k samples to a compact yet potent set of 9k that nevertheless delivered superior performance to the original model, while cutting training cost and time substantially: training the 7B variant dropped from 80 minutes to 14 minutes, a 5.7x speedup.
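The filtering step can be illustrated with a short sketch. The snippet below is a minimal, hypothetical implementation assuming an OpenAI-style chat API and Alpaca-format samples; the prompt wording, the `gpt-3.5-turbo` evaluator choice, and the score-parsing logic are illustrative assumptions, not the authors' released code, but the 4.5 accuracy threshold matches the paper.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RATING_PROMPT = (
    "You are grading an instruction-tuning sample.\n"
    "Instruction: {instruction}\n"
    "Input: {input}\n"
    "Response: {response}\n"
    "Rate the accuracy of the response on a scale of 0 to 5. "
    "Reply with only the numeric score."
)

def rate_sample(sample: dict) -> float:
    """Ask the evaluator LLM for an accuracy score for one Alpaca-style sample."""
    prompt = RATING_PROMPT.format(
        instruction=sample["instruction"],
        input=sample.get("input", ""),
        response=sample["output"],
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # a ChatGPT-class evaluator, as described in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = reply.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else 0.0

def filter_dataset(samples: list[dict], threshold: float = 4.5) -> list[dict]:
    """Keep only samples whose accuracy score meets the paper's 4.5 threshold."""
    return [s for s in samples if rate_sample(s) >= threshold]
```

Applied to the 52k Alpaca samples, a filter of this form with the 4.5 accuracy threshold is what yields the roughly 9k-sample subset on which AlpaGasus is fine-tuned.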

Numerical Results and Implications

Empirical evaluations revealed that models trained on this curated dataset not only matched but often exceeded the performance of those trained on the expansive original dataset. Notably, the 13B variant of AlpaGasus matched more than 90% of the performance of its teacher LLM, Text-Davinci-003, on the test tasks. Such results confirm the foundational hypothesis: emphasizing fewer, high-quality data points can improve LLMs more effectively than larger datasets riddled with noise.

Impact and Future Directions

The implications of adopting a data-centric approach are manifold. By focusing on the authenticity and precision of training data, researchers can alleviate unnecessary computational burdens and cost while simultaneously achieving high-performance benchmarks. This work advocates for a methodological shift across the broader landscape of AI, suggesting that as models grow in complexity, the proportional increase in raw data may not be requisite if that data is inherently flawed or redundant.

In the future, these findings may catalyze further exploration into automated data grading systems embedded within the learning pipelines, ensuring real-time evaluation and enhancement of datasets as they continue to evolve. Moreover, potential adaptations across differing datasets, including those generated by human efforts or other generative models, represent an intriguing next step for AI research and development.

Through AlpaGasus, the authors propose a pragmatic direction for scaling LLMs: efficient use of high-quality data rather than sheer volume, setting a precedent for data selection methodologies that prioritize reliability in training datasets. This approach promises greater efficiency, improved instruction-following behavior, and a reduced environmental footprint thanks to lower computational demands. Such developments stand to advance state-of-the-art LLMs in ways that are both economically viable and ethically conscious.

Authors (11)
  1. Lichang Chen (30 papers)
  2. Shiyang Li (24 papers)
  3. Jun Yan (247 papers)
  4. Hai Wang (98 papers)
  5. Kalpa Gunaratna (13 papers)
  6. Vikas Yadav (38 papers)
  7. Zheng Tang (28 papers)
  8. Vijay Srinivasan (11 papers)
  9. Tianyi Zhou (172 papers)
  10. Heng Huang (189 papers)
  11. Hongxia Jin (64 papers)
Citations (154)