
Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning (2408.14774v3)

Published 27 Aug 2024 in cs.LG and cs.CL

Abstract: We introduce Instruct-SkillMix, an automated approach for creating diverse, high-quality SFT data. The Instruct-SkillMix pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core "skills" for instruction-following, either from existing datasets, or by directly prompting the model; (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from Instruct-SkillMix leads to strong gains on instruction-following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves a 42.76% length-controlled win rate on AlpacaEval 2.0. To our knowledge, this achieves state-of-the-art performance among all models that have only undergone SFT (no RL methods) and competes with proprietary models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. Introducing low-quality answers ("shirkers") in 20% of Instruct-SkillMix examples causes performance to plummet, sometimes catastrophically. The Instruct-SkillMix pipeline is flexible and is adaptable to other settings.

An Analysis of the Instruct-SkillMix Pipeline for LLM Instruction Tuning

The paper "Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning" presents a methodological advance in generating high-quality supervised fine-tuning (SFT) data for LLMs. The core contribution is the Instruct-SkillMix (Instruct-SM) pipeline, which automates data creation to improve LLM performance on instruction-following tasks. The pipeline comprises two stages, skill extraction and data generation, each leveraging a frontier LLM to ensure diversity and quality.

Overview of the Methodology

Skill Extraction

The skill extraction stage is executed using two distinct approaches:

  1. Leveraging Existing Instruction Datasets: The system extracts skills from established datasets such as Alpaca-52K and UltraChat. The procedure is inspired by meta-cognitive evaluation techniques and aims to recover a comprehensive set of the instruction-following skills these datasets exercise.
  2. Direct Prompting of a Powerful LLM: A powerful LLM (e.g., GPT-4-Turbo) is queried to enumerate, on its own, the core skills needed for high-quality instruction following, yielding a diverse set of skill categories.

From these methods, specific "skill clusters" are identified and subsequently used to direct the generation of new synthetic data.
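
The paper's exact extraction prompts are not reproduced in this summary, so the snippet below is only a minimal sketch of the direct-prompting approach (method 2), assuming an OpenAI-compatible chat API. The prompt wording, the `n_skills` parameter, and the JSON output convention are illustrative assumptions, not the paper's protocol.

```python
# Minimal sketch of skill extraction via direct prompting.
# Assumes the `openai` Python client and an OPENAI_API_KEY in the
# environment; prompt text and JSON format are illustrative.
import json
from openai import OpenAI

client = OpenAI()

def extract_skills(n_skills: int = 50) -> list[str]:
    prompt = (
        f"List {n_skills} core skills a language model needs for "
        "high-quality instruction following. Return only a JSON array "
        'of short skill names, e.g. ["logical reasoning", ...].'
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    # A real pipeline would validate, deduplicate, and cluster the output.
    return json.loads(resp.choices[0].message.content)
```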

Data Generation

In the data generation phase, the LLM is prompted with a randomly selected pair of skills and asked to produce an (instruction, response) example that exhibits both. Because the number of distinct skill pairs grows quadratically with the number of skills, random pairing yields rich combinatorial diversity. The generated dataset, referred to as Instruct-SM, is then used to fine-tune base models, improving their instruction-following performance without any further reinforcement learning.
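
As a rough illustration of this stage, the sketch below samples a random skill pair and asks the LLM for one (instruction, response) example exhibiting both, reusing the hypothetical extract_skills helper from the previous sketch; the prompt and JSON schema are again assumptions rather than the paper's exact setup.

```python
# Sketch of the data-generation stage: random skill pairs drive
# diversity and difficulty. Prompt wording and schema are assumed.
import json
import random
from openai import OpenAI

client = OpenAI()

def generate_example(skills: list[str]) -> dict:
    pair = random.sample(skills, k=2)  # random pair of extracted skills
    prompt = (
        f"Write one challenging instruction that requires both "
        f"'{pair[0]}' and '{pair[1]}', followed by a high-quality "
        'response. Return only JSON: {"instruction": ..., "response": ...}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)

# ~4K examples sufficed for the paper's headline result, e.g.:
# skills = extract_skills()  # from the previous sketch
# dataset = [generate_example(skills) for _ in range(4000)]
```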

Numerical Results and Performance

The paper reports strong empirical results substantiating the efficacy of the Instruct-SM pipeline. With as few as 4,000 examples, fine-tuned models achieve competitive performance against state-of-the-art models on benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench.

  • AlpacaEval 2.0: A length-controlled win rate (LC WR) of 42.76% was achieved with Instruct-SM-generated data, competitive with models such as Claude 3 Opus and LLaMA-3.1-405B-Instruct.
  • MT-Bench: The Instruct-SM pipeline also demonstrated significant improvement on the MT-Bench evaluation.
  • WildBench: Models tuned on Instruct-SM data outperformed strong models such as Claude 3 Sonnet and Mistral Large.

Theoretical and Practical Implications

Theoretical Implications:

The success of the Instruct-SM pipeline underscores the critical role of skill specificity and data quality in effective instruction-following datasets. By extracting skills explicitly and combining them synthetically, the paper supports the hypothesis that precise, skill-targeted data can significantly improve model performance.

Practical Implications:

The pipeline provides a scalable and efficient methodology for generating high-quality SFT data, essential for tuning base LLMs to high-performance instruction-following models. This approach reduces reliance on costly and labor-intensive human-annotated datasets, presenting an accessible pathway for academic and open-source communities to develop competitive instruction-focused LLMs.
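
To make the "vanilla SFT" step concrete, here is one plausible way to fine-tune a base model on the generated pairs using Hugging Face TRL's SFTTrainer. This tooling choice and all hyperparameters are our assumptions, not the paper's training stack; the prompt/completion column convention follows recent TRL versions and may differ in older ones.

```python
# Hedged sketch: vanilla SFT (no PPO/DPO/RL) on Instruct-SM-style data
# with Hugging Face TRL. Hyperparameters are placeholders.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# `dataset` is a list of {"instruction": ..., "response": ...} dicts,
# e.g. produced by the generation sketch above.
train_ds = Dataset.from_list(
    [{"prompt": ex["instruction"], "completion": ex["response"]}
     for ex in dataset]
)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # base model, as in the paper
    train_dataset=train_ds,
    args=SFTConfig(output_dir="instruct-sm-sft", num_train_epochs=3),
)
trainer.train()
```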

Future Directions

The promising results open up several avenues for future research and development:

  • Extending the Instruct-SM pipeline's skill extraction capability to cover more specialized domains, such as mathematical problem solving, alignment, and safety in AI.
  • Integrating the Instruct-SM pipeline with reinforcement learning techniques to push the boundaries of instruction-following performance further.
  • Exploring multi-skill interactions beyond pairs to understand composite skill dynamics and their impact on LLM capabilities (see the sketch after this list).
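
On the last point, moving from pairs to k-tuples is mechanically trivial but expands coverage combinatorially. The toy sketch below is our illustration, not an experiment from the paper:

```python
# Toy illustration: generalizing skill pairs to k-tuples.
import math
import random

def sample_skill_tuple(skills: list[str], k: int = 3) -> tuple[str, ...]:
    """Sample k distinct skills; k=2 recovers the paper's setting."""
    return tuple(random.sample(skills, k))

# Distinct k-subsets grow as C(n, k): with 100 skills,
print(math.comb(100, 2))  # 4950 pairs
print(math.comb(100, 3))  # 161700 triples
```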

Conclusion

The Instruct-SkiLLMix (Instruct-SM) pipeline offers a powerful and efficient approach for creating diverse, high-quality instruction-following datasets, demonstrating significant performance improvements on established benchmarks. By capitalizing on automated skill extraction and systematic data generation, this research provides valuable insights and tools for advancing the state of LLM fine-tuning practices, potentially reshaping future methodologies in AI training and deployment.

Authors: Simran Kaur, Simon Park, Anirudh Goyal, Sanjeev Arora