A Comprehensive Review of Data Selection Methods for LLMs
Introduction to Data Selection in Machine Learning
Data selection is a pivotal step in the machine learning pipeline, and it is especially consequential for LLMs, which are trained on massive, heterogeneous corpora. Selecting the right training data is not straightforward: it means identifying which subsets of data will yield the best model performance in terms of accuracy, efficiency, and fairness. The challenge lies not only in handling the sheer volume of available data but also in coping with the wide variation in its quality.
Taxonomy of Data Selection Methods
Data selection practices can be broadly organized around two primary goals: matching the distribution of the training data to the target task (distribution matching) and increasing the coverage and diversity of the dataset (distribution diversification). Both approaches have their place: the former is crucial for domain-specific tasks requiring high precision, while the latter suits general-purpose models that need robustness and broad applicability.
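To illustrate the distribution-matching goal, the following is a minimal sketch that scores candidate documents by how likely their word bigrams are under a target-domain profile; the bigram features, smoothing constant, and example documents are illustrative assumptions, not a method prescribed by the survey.

```python
import math
from collections import Counter

def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams; a crude stand-in for richer document features."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def match_score(candidate: str, target_counts: Counter, smoothing: float = 1e-6) -> float:
    """Average log-probability of the candidate's n-grams under the
    target-domain profile; higher means a closer distributional match."""
    total = sum(target_counts.values())
    cand = ngram_counts(candidate)
    score = 0.0
    for gram, count in cand.items():
        p = (target_counts.get(gram, 0) + smoothing) / (total + smoothing)
        score += count * math.log(p)
    return score / max(sum(cand.values()), 1)  # length-normalize

# Build a target profile from in-domain documents, then rank a candidate pool.
target = Counter()
for doc in ["the patient presented with acute symptoms",
            "dosage was adjusted based on renal function"]:
    target += ngram_counts(doc)

pool = ["the patient was discharged after treatment",
        "top ten travel destinations for the summer"]
print(sorted(pool, key=lambda d: match_score(d, target), reverse=True)[0])
```

Ranking by log-probability under a target profile is one simple way to realize distribution matching; production pipelines typically use stronger features and importance weighting rather than raw bigram counts.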
The process of data selection comprises several strategic components, notably:
- Utility Function Definition: Mapping each data point to a numeric value representing its utility, which is the basis for filtering and prioritizing data.
- Selection Mechanism: Deciding which data points are included in the training set based on their assigned utility values (a minimal sketch of these two components follows this list).
- Dataset Characteristics Adjustment: Altering the dataset's distribution to favor characteristics deemed desirable for the training objectives.
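As a deliberately simple illustration of the first two components, the sketch below pairs a toy utility function with a top-k selection mechanism; `toy_utility` and its length/uniqueness heuristic are hypothetical stand-ins for the far richer utility functions used in practice.

```python
from typing import Callable, List

def select_top_k(pool: List[str], utility: Callable[[str], float], k: int) -> List[str]:
    """Generic selection mechanism: score every data point with the utility
    function, then keep the k highest-scoring points."""
    return sorted(pool, key=utility, reverse=True)[:k]

def toy_utility(doc: str) -> float:
    """Hypothetical utility: reward length, penalize token repetition."""
    tokens = doc.split()
    if not tokens:
        return float("-inf")
    uniqueness = len(set(tokens)) / len(tokens)
    return len(tokens) * uniqueness

pool = [
    "spam spam spam spam",
    "a short but varied sentence",
    "a longer document with many distinct and informative words",
]
print(select_top_k(pool, toy_utility, k=2))
```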
Pretraining Data Selection
For pretraining LLMs, the goal is often to filter and curate data from extensive raw sources such as the Common Crawl corpus, removing low-quality or irrelevant content while retaining high-quality text. Heuristic approaches (e.g., rules on document length, symbol ratios, or language identification) are widely used for this purpose, alongside more sophisticated model-based and perplexity-based quality filtering. The challenge is to balance data efficiency and model performance without introducing significant biases.
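A common instantiation of perplexity-based filtering scores each document with a trusted reference language model and keeps those below a perplexity threshold. The sketch below uses the Hugging Face `transformers` library with GPT-2 as the reference model; the model choice and the threshold of 100 are illustrative assumptions, not values from the survey.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A small reference model serves as the "quality" proxy; any causal LM works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model (lower = more fluent)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

def keep(text: str, max_ppl: float = 100.0) -> bool:
    """Retain documents below a perplexity threshold (tuned per corpus)."""
    return perplexity(text) < max_ppl

docs = ["The committee approved the proposal after a brief discussion.",
        "click here buy now !!! free free free win win win"]
print([d for d in docs if keep(d)])
```

Note that low perplexity is a proxy for fluency, not quality; aggressive thresholding is one way such filters can introduce the biases mentioned above.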
Enhancing LLM Performance through Specific Data Selection Techniques
- Fine-tuning and Multitask Learning: These methods leverage auxiliary datasets or diverse tasks to improve performance on specific targets or across many tasks. The emphasis is on domain-specific selection, where additional data is chosen to closely mirror the task at hand.
- In-Context Learning: Techniques for selecting or generating effective demonstrations within prompts (see the demonstration-retrieval sketch after this list), showing how precise data selection can substantially influence model behavior even without training on that data.
- Task-specific Fine-tuning: Task-specific settings call for strategies that either increase the training data's alignment with the target task or improve data efficiency and robustness by carefully curating and diversifying the training samples.
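To make the in-context learning point concrete, here is a minimal demonstration-retrieval sketch that ranks candidate demonstrations by TF-IDF cosine similarity to the query; real systems typically use learned embeddings, and the candidate pool here is purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_demonstrations(query: str, candidates: list, k: int = 2) -> list:
    """Rank candidate demonstrations by similarity to the query and keep
    the top k; TF-IDF is a cheap stand-in for learned embedding retrieval."""
    matrix = TfidfVectorizer().fit_transform(candidates + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [candidates[i] for i in sims.argsort()[::-1][:k]]

pool = [
    "Translate 'bonjour' to English -> hello",
    "Summarize: the quarterly report shows growth",
    "Translate 'gracias' to English -> thank you",
]
print(select_demonstrations("Translate 'danke' to English ->", pool))
```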
Future Directions and Challenges
The review underlines the nuanced trade-offs between memorization and generalization inherent in data selection decisions. Key future directions include metrics that evaluate data directly, comprehensive benchmarks for comparing selection methods, and a shift toward more holistic, end-to-end data processing strategies.
Conclusion
This survey aims to provide a structured understanding of the landscape of data selection methods in machine learning, with a focus on LLMs. It emphasizes the intricate balance required in selecting data that both aligns with target tasks and ensures models are robust, fair, and efficient. As the field evolves, so too will the strategies for selecting the optimal datasets, underscoring the importance of continued research and innovation in this space.