Instruction Mining: Instruction Data Selection for Tuning Large Language Models (2307.06290v3)

Published 12 Jul 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized guideline for selecting high-quality datasets to optimize this process remains elusive. In this paper, we first propose InstructMining, an innovative method designed for automatically selecting premium instruction-following data for finetuning LLMs. Specifically, InstructMining utilizes natural language indicators as a measure of data quality, applying them to evaluate unseen datasets. During experimentation, we discover that double descent phenomenon exists in LLM finetuning. Based on this observation, we further leverage BlendSearch to help find the best subset among the entire dataset (i.e., 2,532 out of 100,000). Experiment results show that InstructMining-7B achieves state-of-the-art performance on two of the most popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.

The paper "Instruction Mining: When Data Mining Meets LLM Finetuning" addresses the challenge of efficiently selecting high-quality instruction-following datasets for fine-tuning LLMs. The main contribution is the development of a method termed InstructMining, designed to automatically evaluate and select premium data subsets for this purpose.

Key highlights from the paper include:

  1. InstructMining Framework: Uses natural language indicators (e.g., reward-model scores) to estimate the quality of instruction data, enabling the selection of high-quality subsets without labor-intensive manual curation.
  2. Double Descent Phenomenon: The finetuning experiments reveal a double descent curve: model performance does not improve monotonically with dataset size but first improves, then degrades, then improves again. Beyond a certain threshold, adding more data pulls in lower-quality examples whose cost outweighs the benefit of the extra quantity.
  3. BlendSearch Application: To optimize subset selection, the paper employs BlendSearch, an efficient hyperparameter search strategy, to find the finetuning subset size that best balances data quality against quantity (a simplified stand-in for this search is sketched after the concluding paragraph below).
  4. Empirical Results:
    • InstructMining-7B Performance: Achieved state-of-the-art results on two popular evaluation suites, LLM-as-a-judge and the Huggingface OpenLLM leaderboard, validating the effectiveness of the selection method.
    • Efficiency: Reduced the training data to roughly 2.5% of the original pool (2,532 of 100,000 samples) while remaining competitive with models finetuned on the much larger dataset.
  5. Statistical Parameterization: A multivariate linear regression over the selected natural language indicators predicts data quality, so candidate datasets can be evaluated automatically without a computationally expensive full finetuning run for each one (see the sketch after this list).
  6. Robustness and Application: The framework is tested across various settings, indicating its applicability to different base models, model sizes, and parameter-efficient training methods like LoRA.
  7. Significance of Indicators: Among the candidate quality indicators, reward-model score, understandability, and naturalness are highlighted as pivotal, with the reward score in particular showing robust statistical significance in the fitted regression models.
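To make points 1 and 5 concrete, below is a minimal sketch of indicator-based quality estimation: a linear regression is fit on (indicator values, evaluation loss) pairs collected from small finetuning runs, then used to score and rank individual examples. The indicator set, the helper names (`fit_quality_model`, `rank_examples`), and the data layout are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of InstructMining-style quality estimation (not the paper's code).
# Assumption: each example carries precomputed indicator values such as a
# reward-model score, output perplexity, and a lexical-diversity measure.
import numpy as np
from sklearn.linear_model import LinearRegression

INDICATORS = ["reward_score", "output_perplexity", "lexical_diversity"]  # assumed set


def fit_quality_model(indicator_matrix: np.ndarray, eval_losses: np.ndarray) -> LinearRegression:
    """Fit a multivariate linear regression mapping the indicator values of
    finetuning subsets to the (log) evaluation loss of the resulting models."""
    model = LinearRegression()
    model.fit(indicator_matrix, np.log(eval_losses))
    return model


def rank_examples(examples: list[dict], model: LinearRegression) -> list[dict]:
    """Score each example with the fitted rule and sort from best (lowest
    predicted loss) to worst, so a top-k prefix is a quality-filtered subset."""
    X = np.array([[ex[name] for name in INDICATORS] for ex in examples])
    predicted_log_loss = model.predict(X)
    order = np.argsort(predicted_log_loss)
    return [examples[i] for i in order]
```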

The paper positions InstructMining as a framework that brings classical data-mining ideas into LLM finetuning, allowing large models to be aligned for instruction following both efficiently and cost-effectively.
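As a companion to the ranking sketch above, the following is a simplified stand-in for the BlendSearch step from point 3: a coarse-to-fine search over the subset size k, finetuning on the top-k quality-ranked examples at each trial. The paper itself uses FLAML's BlendSearch; `finetune_and_evaluate`, the candidate sizes, and the refinement schedule here are placeholders.

```python
# Hedged stand-in for the BlendSearch step: a simple coarse-to-fine search over
# subset size. `finetune_and_evaluate` is a placeholder for an expensive
# finetune-then-evaluate run that returns evaluation loss (lower is better).
from typing import Callable, Sequence


def search_subset_size(
    ranked_examples: Sequence[dict],
    finetune_and_evaluate: Callable[[Sequence[dict]], float],
    candidate_sizes: Sequence[int] = (1000, 2000, 4000, 8000, 16000),
    refine_steps: int = 3,
) -> int:
    """Find the subset size k that minimises evaluation loss when finetuning
    on the top-k quality-ranked examples."""
    # Coarse pass over a small set of candidate sizes.
    losses = {k: finetune_and_evaluate(ranked_examples[:k]) for k in candidate_sizes}
    best_k = min(losses, key=losses.get)

    # Local refinement around the best coarse size, halving the step each round.
    step = max(1, best_k // 2)
    for _ in range(refine_steps):
        for k in (max(1, best_k - step), best_k + step):
            k = min(k, len(ranked_examples))
            if k not in losses:
                losses[k] = finetune_and_evaluate(ranked_examples[:k])
        best_k = min(losses, key=losses.get)
        step = max(1, step // 2)
    return best_k
```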

Authors (4)
  1. Yihan Cao (14 papers)
  2. Yanbin Kang (3 papers)
  3. Chi Wang (93 papers)
  4. Lichao Sun (186 papers)
Citations (20)