SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models (2403.07384v2)

Published 12 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Despite the effectiveness of data selection for LLMs during pretraining and instruction fine-tuning phases, improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which leverages training trajectories from small models to guide the data selection for larger models. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data to just 11% of the original MathInstruct dataset (Yue et al., 2023) to match full dataset performance while outperforming state-of-the-art data selection algorithms by an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset (Johnson et al., 2016), S2L again outperforms training on the full dataset using only 50% of the data. Notably, S2L can perform data selection using a reference model 40x smaller than the target model, proportionally reducing the cost of data selection.

Scalable Data Selection for Efficient Fine-tuning of LLMs

The paper "S2L: Scalable Data Selection for Fine-tuning LLMs by Summarizing Training Trajectories of Small Models" presents an innovative approach to optimizing data selection for supervised fine-tuning (SFT) of LLMs in specialized domains. The authors highlight the prevalent challenges associated with data efficiency in these models, especially when transitioning from generalist capabilities to domain-specific expertise. The introduction of {S2L}—a method that leverages training trajectories from smaller models—addresses these challenges effectively.

Problem Statement and Methodology

The paper begins by identifying a significant gap in data efficiency during SFT, particularly for specialized domains whose data distributions differ markedly from the pretraining distribution. Existing data selection methods often fall short in these scenarios because they rely on generalist models or on selection criteria that are not tailored to the target domain.

S2L distinguishes itself through a scalable selection process that identifies and clusters training-trajectory patterns observed on smaller proxy models. The approach rests on the observation that training dynamics tend to be consistent across models of different scales, as evidenced by Xia et al. (2023). By summarizing these dynamics and sampling from every trajectory cluster, S2L selects a subset of data that preserves coverage of the topics and learning patterns present in the full set.
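The paper's released implementation is not reproduced here, but a minimal sketch conveys the mechanism: record each example's loss at several checkpoints of a small proxy model, cluster the resulting trajectory vectors, and sample clusters in a balanced way until the selection budget is met. The `n_clusters` and `budget` parameters, the use of scikit-learn's k-means, and the round-robin sampling order are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_trajectories(loss_trajectories: np.ndarray,
                           budget: int,
                           n_clusters: int = 100,
                           seed: int = 0) -> np.ndarray:
    """Select a data subset from per-example loss trajectories.

    loss_trajectories: shape (n_examples, n_checkpoints), the loss of each
    training example recorded at several checkpoints of a small proxy model.
    Returns the indices of at most `budget` selected examples.
    """
    rng = np.random.default_rng(seed)

    # Cluster examples whose losses evolve similarly during proxy training.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(loss_trajectories)

    # Group example indices by cluster, shuffled within each cluster.
    clusters = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        rng.shuffle(members)
        clusters.append(list(members))

    # Balanced (round-robin) sampling: take one example per cluster in turn,
    # so small clusters are fully covered and large clusters are capped.
    selected = []
    while len(selected) < budget and any(clusters):
        for members in clusters:
            if members and len(selected) < budget:
                selected.append(members.pop())
    return np.asarray(selected)
```

Because the trajectories can be logged during a single training run of the small proxy, the cost of this selection step scales with the proxy rather than with the target model.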

Results and Analysis

The experimental results are compelling. On the MathInstruct dataset, the authors show that S2L reaches full-dataset performance using only 11% of the training data, and that it outperforms state-of-the-art data selection methods by an average of 4.7% across six in-domain and out-of-domain evaluation datasets. On the challenging MATH benchmark, fine-tuning on only 50K S2L-selected examples achieves 32.7% accuracy, improving Phi-2 by 16.6%.

In terms of scalability, S2L can perform data selection with a reference model up to 40 times smaller than the target model, proportionally reducing the computational cost of selection. This is validated empirically by transferring subsets selected with a small model to larger targets such as Phi-2 (2.7B), demonstrating S2L's cross-model applicability.

Implications and Future Directions

From a practical perspective, S2L presents a cost-effective solution for practitioners aiming to fine-tune LLMs for specialized applications such as mathematical reasoning and clinical text summarization. The success of S2L in these domains suggests its potential utility across other specialized fields, paving the way for more efficient use of training resources and energy.

Theoretically, the method opens avenues for further exploration into the uniformity of training dynamics across models and tasks, inviting research into the underlying mechanisms that ensure this consistency. Additionally, potential enhancements could explore automated adjustments to trajectory clustering parameters or adaptive sampling strategies that respond dynamically to the complexity of fine-tuning tasks.
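As one concrete, hypothetical illustration of such an enhancement (not something evaluated in the paper), the trajectory cluster count could be chosen automatically from the trajectories themselves, for example with a silhouette criterion over a small candidate grid; the grid and scoring metric below are assumptions for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_cluster_count(trajectories: np.ndarray,
                       candidates=(50, 100, 200, 400),
                       seed: int = 0) -> int:
    """Return the candidate cluster count with the best silhouette score."""
    best_k, best_score = candidates[0], -1.0
    sample = min(2000, len(trajectories))  # subsample to keep scoring cheap
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(trajectories)
        score = silhouette_score(trajectories, labels,
                                 sample_size=sample, random_state=seed)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```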

Conclusion

The introduction of S2L marks a significant step toward optimizing data efficiency during the fine-tuning phase of LLM development. By capitalizing on the training trajectories of smaller models, the strategy reduces the data volume required for strong performance while remaining scalable across model sizes. This makes S2L a practical tool for extending LLM capabilities in specialized domains without incurring excessive computational cost. As AI systems continue to grow, such methods will be integral to sustainable and efficient training practices.

References (71)
  1. Semdedup: Data-efficient learning at web-scale through semantic deduplication. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023.
  2. Gpt-4 technical report. 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
  3. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
  4. An experimental design framework for label-efficient supervised finetuning of large language models. arXiv preprint arXiv:2401.06692, 2024.
  5. Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023a.
  6. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR, 2023b.
  7. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  8. Alpagasus: Training a better alpaca with fewer data. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=FdVXgSJhvz.
  9. Adapting large language models via reading comprehension. arXiv preprint arXiv:2309.09530, 2023.
  10. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  11. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  12. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJg2b0VYDr.
  13. Advancing mathematics by guiding human intuition with ai. Nature, 600(7887):70–74, 2021.
  14. Overview of the RadSum23 shared task on multi-modal and multi-anatomical radiology report summarization. In Demner-fushman, D., Ananiadou, S., and Cohen, K. (eds.), The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pp.  478–482, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.bionlp-1.45. URL https://aclanthology.org/2023.bionlp-1.45.
  15. The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, Toronto, Canada, July 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.bionlp-1.0.
  16. Robust learning with progressive data expansion against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=9QEVJ9qm46.
  17. The faiss library. 2024.
  18. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023.
  19. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe.
  20. Exploring the benefits of training expert language models over instruction tuning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  14702–14729. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/jang23a.html.
  21. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  22. Data-efficient contrastive self-supervised learning: Most beneficial examples for supervised learning contribute the least. In International conference on machine learning, pp.  15356–15370. PMLR, 2023.
  23. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, pp.  5464–5474. PMLR, 2021a.
  24. Glister: Generalization based data subset selection for efficient and robust learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  8110–8118, 2021b.
  25. MAWPS: A math word problem repository. In Knight, K., Nenkova, A., and Rambow, O. (eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.
  26. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022.
  27. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023a.
  28. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023b.
  29. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp.  74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  30. TinyGSM: achieving >80% on GSM8K with small language models. arXiv preprint arXiv:2312.09241, 2023.
  31. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023a.
  32. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023b.
  33. Sieve: Multimodal dataset pruning using image captioning models. arXiv preprint arXiv:2310.02110, 2023.
  34. When less is more: Investigating data pruning for pretraining llms at scale. arXiv preprint arXiv:2309.04564, 2023.
  35. Coresets for data-efficient training of machine learning models. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  6950–6960. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/mirzasoleiman20a.html.
  36. NumGLUE: A suite of fundamental yet challenging mathematical reasoning tasks. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  3505–3523, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.246. URL https://aclanthology.org/2022.acl-long.246.
  37. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  38. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp.  311–318, 2002.
  39. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.
  40. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34:20596–20607, 2021.
  41. Adaptive second order coresets for data-efficient machine learning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  17848–17869. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/pooladzandi22a.html.
  42. Nessa: Near-storage data selection for accelerated machine learning training. In Proceedings of the 15th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage ’23, pp.  8–15, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702242. doi: 10.1145/3599691.3603404. URL https://doi.org/10.1145/3599691.3603404.
  43. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
  44. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023a.
  45. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023b.
  46. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
  47. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  9275–9293, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.746. URL https://aclanthology.org/2020.emnlp-main.746.
  48. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.
  49. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  50. D4: Improving LLM pretraining via document de-duplication and diversification. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=CG0L2PFrb1.
  51. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=BJlxm30cKm.
  52. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  53. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334, 2023.
  54. Clinical text summarization: adapting large language models can outperform human experts. arXiv preprint arXiv:2309.07430, 2023.
  55. Let the model decide its curriculum for multitask learning. In Cherry, C., Fan, A., Foster, G., Haffari, G. R., Khadivi, S., Peng, N. V., Ren, X., Shareghi, E., and Swayamdipta, S. (eds.), Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pp.  117–125, Hybrid, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.deeplo-1.13. URL https://aclanthology.org/2022.deeplo-1.13.
  56. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
  57. Chain of thought prompting elicits reasoning in large language models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=_VjQlMeSB_J.
  58. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
  59. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023a.
  60. Self-evolved diverse data sampling for efficient instruction tuning. arXiv preprint arXiv:2311.08182, 2023b.
  61. Training trajectories of language models across scales. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13711–13738, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.767. URL https://aclanthology.org/2023.acl-long.767.
  62. Not all poisons are created equal: Robust training against data poisoning. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  25154–25165. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/yang22j.html.
  63. Identifying spurious biases early in training through the lens of simplicity bias. arXiv preprint arXiv:2305.18761, 2023a.
  64. Towards sustainable learning: Coresets for data-efficient deep learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  39314–39330. PMLR, 23–29 Jul 2023b.
  65. Decoding data quality via synthetic corruptions: Embedding-guided pruning of code data. arXiv preprint arXiv:2312.02418, 2023c.
  66. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  67. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
  68. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
  69. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  70. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=KBMOKmX2he.
  71. Lobass: Gauging learnability in supervised fine-tuning data. arXiv preprint arXiv:2310.13008, 2023b.
Authors (4)
  1. Yu Yang (213 papers)
  2. Siddhartha Mishra (76 papers)
  3. Baharan Mirzasoleiman (51 papers)
  4. Jeffrey N Chiang (1 paper)