Data Selection for LLMs via Importance Resampling
The paper presents Data Selection with Importance Resampling (DSIR), a method for selecting pretraining data for language models (LMs). It formalizes data selection as choosing a subset of a large raw unlabeled dataset whose distribution matches a desired target distribution, given only unlabeled target samples. Because raw text is high-dimensional, previous approaches typically relied on heuristics or manual curation. DSIR, by contrast, extends classic importance resampling to this setting, yielding a scalable and efficient procedure for selecting LM pretraining data.
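To make the resampling step concrete, a minimal Python sketch follows. It assumes each raw example already has an estimated log importance weight, log p_target(x) - log p_raw(x), and uses the Gumbel top-k trick to draw a subset without replacement in a single pass; the function name, interface, and sampling details here are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def importance_resample(log_weights, k, seed=0):
    """Select k indices from a raw pool via importance resampling.

    log_weights[i] estimates log p_target(x_i) - log p_raw(x_i) for the i-th
    raw example. Adding independent Gumbel noise to the log weights and taking
    the top k yields a sample without replacement, tilted toward examples that
    look more like the target.
    """
    rng = np.random.default_rng(seed)
    log_weights = np.asarray(log_weights, dtype=float)
    gumbel_noise = rng.gumbel(size=log_weights.shape)
    return np.argsort(-(log_weights + gumbel_noise))[:k]

# Example: from three raw documents, keep the two most target-like ones.
selected_indices = importance_resample([-0.2, 1.3, 0.5], k=2)
```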
The DSIR framework centers on estimating importance weights in a reduced, hashed n-gram feature space, which keeps the computation tractable. This makes selection over very large corpora fast: DSIR selects 100 million documents from The Pile in 4.5 hours. To measure how well the selected data matches the target distribution, the paper introduces KL reduction, a data metric that quantifies how much a selection method reduces the Kullback–Leibler divergence to the target relative to the raw data.
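The importance weights themselves come from simple bag-of-n-grams models fit separately to the target and raw data in the hashed feature space. The sketch below shows one plausible version under stated assumptions: whitespace tokenization, unigrams and bigrams, 10,000 hash buckets, Python's built-in hash, and add-one smoothing; names such as hashed_ngram_counts and fit_bucket_distribution are hypothetical, not the paper's code.

```python
import numpy as np

NUM_BUCKETS = 10_000  # hashed feature dimension; an assumed, illustrative value

def hashed_ngram_counts(text, max_n=2):
    """Count unigrams and bigrams of a document, hashed into NUM_BUCKETS buckets."""
    tokens = text.lower().split()
    counts = np.zeros(NUM_BUCKETS)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            counts[hash(ngram) % NUM_BUCKETS] += 1
    return counts

def fit_bucket_distribution(docs, smoothing=1.0):
    """Estimate a smoothed categorical distribution over hash buckets from documents."""
    totals = np.full(NUM_BUCKETS, smoothing)
    for doc in docs:
        totals += hashed_ngram_counts(doc)
    return totals / totals.sum()

def log_importance_weight(doc, log_p_target, log_p_raw):
    """Log importance weight of a document under the two bag-of-n-grams models."""
    counts = hashed_ngram_counts(doc)
    return float(counts @ (log_p_target - log_p_raw))
```

Per-document log weights computed this way can then be fed into a resampling step like the one sketched above to pick the final pretraining subset.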
Experimental results show that KL reduction computed on hashed n-gram features correlates strongly (r = 0.82) with downstream model accuracy across a range of data selection methods. When applied to continued pretraining on domain-specific data, DSIR performs on par with expert human curation across eight target distributions, suggesting its viability as an automated data selection tool. For training general-domain models, DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.
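For concreteness, one natural way to compute the KL reduction metric from the hashed n-gram bucket distributions is sketched below, as KL(target || raw) minus KL(target || selected); this formalization and the numerical smoothing are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over hash buckets."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_reduction(p_target, p_raw, p_selected):
    """How much closer (in KL) the selected data is to the target than the raw pool.

    Positive values mean the selection moved the pretraining distribution
    toward the target in the hashed n-gram feature space.
    """
    return kl_divergence(p_target, p_raw) - kl_divergence(p_target, p_selected)
```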
These results underscore DSIR's potential to select pretraining data effectively, which is critical for LM performance under a fixed computational budget. Beyond the practical gains, the framework and the KL reduction metric offer a more principled way to reason about how the composition of pretraining data relates to downstream performance.
Future research could explore alternative feature spaces and importance-weight estimators to adapt the method to different textual structures. Analyzing DSIR's impact in real-world applications, or assessing ethical considerations in how pretraining data is selected, would also be worthwhile directions.