Overview of "Smaller LLMs are Capable of Selecting Instruction-Tuning Training Data for Larger LLMs"
The paper presents a focused investigation into whether smaller LLMs can autonomously select high-quality training data for the instruction tuning of larger models. The authors note that instruction tuning has become a crucial step in enabling generalized task performance in LLMs, yet it typically requires training on large datasets, which is expensive. The paper introduces a method for training-data selection based on the learning percentage of samples, challenging the prevailing belief that large datasets are essential for effective instruction tuning.
Key Findings and Methodology
The paper's core proposition is that smaller models can curate high-quality instruction data that lets larger models match or exceed the performance obtained with the full dataset, even when only a fraction of it is used:
- Data Selection via Learning Percentage: The paper uses the learning percentage (LP) as a difficulty metric: samples whose learning happens mostly in the early epochs are easier, while samples that keep improving in later epochs are harder. Ranking samples by LP and prioritizing the harder ones yields training subsets that lead to better generalization in the tuned LLM (a minimal sketch of this selection procedure appears after this list).
- Empirical Validation Across Models: The researchers conducted extensive experiments with open-source models such as OPT and Llama-2 and publicly available datasets such as Alpaca-Data and Dolly. The experiments showed that the data selection approach is effective across model sizes ranging from 1B to 13B parameters. Notably, a small 350M-parameter model could effectively identify harder samples for a 13B-parameter model, leading to equal or improved performance relative to training on the entire dataset.
- Transferability of Data Hardness: The authors also observe that data hardness transfers across model sizes: samples categorized as difficult by smaller models are also challenging for larger models (see the rank-correlation sketch after this list). This suggests substantial potential cost reductions in model training, with larger models needing progressively fewer selected samples as their size increases.
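To make the selection step concrete, here is a minimal sketch of LP-based ranking. It assumes LP is computed from per-sample losses recorded before training, after the first epoch, and after the final epoch of a fine-tuning run on a small proxy model; the exact LP formulation, the function names, and the 25% selection fraction are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

def learning_percentage(loss_before: np.ndarray,
                        loss_after_epoch1: np.ndarray,
                        loss_after_final: np.ndarray,
                        eps: float = 1e-8) -> np.ndarray:
    """Fraction of each sample's total loss reduction achieved by epoch 1.

    High LP -> most learning happened early -> 'easy' sample.
    Low LP  -> learning continued in later epochs -> 'hard' sample.
    (Illustrative definition; the paper's exact formulation may differ.)
    """
    total_reduction = loss_before - loss_after_final
    early_reduction = loss_before - loss_after_epoch1
    return early_reduction / (total_reduction + eps)

def select_hard_subset(lp_scores: np.ndarray, fraction: float = 0.25) -> np.ndarray:
    """Return indices of the hardest `fraction` of samples (lowest LP)."""
    k = max(1, int(len(lp_scores) * fraction))
    return np.argsort(lp_scores)[:k]  # ascending LP = hardest first

# Synthetic per-sample losses standing in for values recorded while
# fine-tuning a small proxy model (e.g. OPT-350M) on instruction data.
rng = np.random.default_rng(0)
n = 1000
loss_before = rng.uniform(2.0, 5.0, n)
loss_final = loss_before * rng.uniform(0.2, 0.6, n)
loss_epoch1 = loss_final + (loss_before - loss_final) * rng.uniform(0.3, 1.0, n)

lp = learning_percentage(loss_before, loss_epoch1, loss_final)
hard_idx = select_hard_subset(lp, fraction=0.25)
print(f"Selected {len(hard_idx)} hard samples out of {n}")
```

The selected indices would then define the reduced training set used to instruction-tune the larger target model.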
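The transferability finding can be probed with a simple rank-agreement check: compute LP scores for the same samples with a small and a large model, then compare their rankings. The use of scipy.stats.spearmanr, the overlap measure, and the synthetic scores below are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def hardness_transfer(lp_small: np.ndarray, lp_large: np.ndarray) -> float:
    """Spearman rank correlation between LP scores from two models.

    A strongly positive correlation means samples ranked as hard (low LP)
    by the small model tend to be hard for the large model as well.
    """
    rho, _ = spearmanr(lp_small, lp_large)
    return rho

def subset_overlap(lp_small: np.ndarray, lp_large: np.ndarray, fraction: float = 0.25) -> float:
    """Overlap between the hardest subsets chosen by each model."""
    k = int(len(lp_small) * fraction)
    hard_small = set(np.argsort(lp_small)[:k])
    hard_large = set(np.argsort(lp_large)[:k])
    return len(hard_small & hard_large) / k

# Synthetic LP scores; in practice these would come from per-sample losses
# recorded while fine-tuning, e.g., OPT-350M and Llama-2-13B on the same data.
rng = np.random.default_rng(1)
base = rng.uniform(0, 1, 500)  # latent sample difficulty shared by both models
lp_small = np.clip(base + rng.normal(0, 0.1, 500), 0, 1)
lp_large = np.clip(base + rng.normal(0, 0.1, 500), 0, 1)

print(f"Spearman rho: {hardness_transfer(lp_small, lp_large):.2f}")
print(f"Hard-subset overlap: {subset_overlap(lp_small, lp_large):.2f}")
```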
Practical and Theoretical Implications
The research points to meaningful gains in the practical efficiency of instruction tuning. By letting smaller models pre-select the most beneficial training data, the process becomes less computationally intensive and more resource-efficient, paving the way for cheaper and faster development of state-of-the-art LLMs.
On a theoretical level, the methodology bridges a gap between model scale and data efficiency. The finding that data hardness transfers across model sizes deepens our understanding of LLM training dynamics, suggesting that sample-level metrics such as the learning percentage capture properties of the data that hold across scales.
Future Directions
This approach to data selection could steer future work toward more efficient training regimes in which larger models benefit from cheaply curated datasets. Subsequent research could refine the selection metrics, explore automatic transformations that increase sample difficulty, and examine how noisy samples enter curated subsets and how their influence can be mitigated.
In summary, the paper contributes a valuable perspective to the ongoing discussion of optimal LLM training practices, presenting a framework in which smaller models act as efficient data curators for larger ones.