
Data Selection for Language Models via Importance Resampling (2302.03169v3)

Published 6 Feb 2023 in cs.CL and cs.LG

Abstract: Selecting a suitable pretraining dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) LLMs (LMs). We formalize this problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution given unlabeled target samples. Due to the scale and dimensionality of the raw text data, existing methods use simple heuristics or require human experts to manually curate data. Instead, we extend the classic importance resampling approach used in low-dimensions for LM data selection. We propose Data Selection with Importance Resampling (DSIR), an efficient and scalable framework that estimates importance weights in a reduced feature space for tractability and selects data with importance resampling according to these weights. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents from the full Pile dataset in 4.5 hours. To measure whether hashed n-gram features preserve the aspects of the data that are relevant to the target, we define KL reduction, a data metric that measures the proximity between the selected pretraining data and the target on some feature space. Across 8 data selection methods (including expert selection), KL reduction on hashed n-gram features highly correlates with average downstream accuracy (r=0.82). When selecting data for continued pretraining on a specific domain, DSIR performs comparably to expert curation across 8 target distributions. When pretraining general-domain models (target is Wikipedia and books), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark. Code is available at https://github.com/p-lambda/dsir.

Data Selection for LLMs via Importance Resampling

The paper presents Data Selection with Importance Resampling (DSIR), a method for improving the selection of pretraining data for LLMs (LMs). It formalizes the problem as selecting a subset of a large unlabeled raw dataset to match a desired target distribution, given only unlabeled target samples. Because raw text data is high-dimensional and massive in scale, previous approaches typically relied on simple heuristics or manually curated data. DSIR, by contrast, extends the classic importance resampling approach, long used in low-dimensional settings, into a scalable framework for efficient LM data selection.
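To make the pipeline concrete, here is a minimal, self-contained sketch of importance resampling over a hashed n-gram feature space. All function names (`hashed_ngram_features`, `dsir_select`, etc.) are hypothetical illustrations, not the paper's actual implementation: it uses simple bag-of-n-grams multinomial models for the target and raw distributions, and the Gumbel-top-k trick to sample without replacement in proportion to the importance weights.

```python
import numpy as np

def hashed_ngram_features(text, num_buckets=10_000):
    """Map a document to hashed unigram + bigram counts
    (illustrative stand-in for DSIR's hashed n-gram featurization)."""
    counts = np.zeros(num_buckets)
    tokens = text.lower().split()
    grams = tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    for g in grams:
        counts[hash(g) % num_buckets] += 1
    return counts

def fit_feature_model(feature_counts, smoothing=1.0):
    """Fit a smoothed multinomial distribution over hashed feature buckets."""
    totals = feature_counts.sum(axis=0) + smoothing
    return totals / totals.sum()

def dsir_select(raw_docs, target_docs, k, num_buckets=10_000, seed=0):
    """Select k raw documents by importance resampling in feature space."""
    raw_feats = np.stack([hashed_ngram_features(d, num_buckets) for d in raw_docs])
    tgt_feats = np.stack([hashed_ngram_features(d, num_buckets) for d in target_docs])
    p_target = fit_feature_model(tgt_feats)
    p_raw = fit_feature_model(raw_feats)
    # Log importance weight of each raw doc under the bag-of-n-grams models:
    # sum over buckets of count * (log p_target - log p_raw).
    log_w = raw_feats @ (np.log(p_target) - np.log(p_raw))
    # Gumbel-top-k trick: adding i.i.d. Gumbel noise to log-weights and taking
    # the top k samples without replacement, proportional to the weights.
    rng = np.random.default_rng(seed)
    return np.argsort(-(log_w + rng.gumbel(size=len(raw_docs))))[:k]
```

In this sketch the feature models are plain smoothed multinomials, so estimating weights is a single matrix-vector product over the corpus, which is what makes the approach tractable at the scale of The Pile.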

The proposed DSIR framework centers on estimating importance weights within a reduced hashed n-gram feature space, which makes the computation tractable: DSIR selects 100 million documents from the full Pile dataset in 4.5 hours. To check that the selected data aligns with the target distribution, the paper introduces KL reduction, a data metric measuring how much the Kullback–Leibler divergence to the target (in feature space) shrinks when moving from the raw data to the selected data.
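The KL reduction metric can be sketched as follows. This is an illustrative reading of the definition, with hypothetical helper names: fit a feature distribution to each dataset, then report how much closer (in KL) the selected data is to the target than the raw data was.

```python
import numpy as np

def feature_distribution(counts_matrix, smoothing=1.0):
    """Smoothed multinomial over hashed n-gram buckets for a whole dataset."""
    totals = counts_matrix.sum(axis=0) + smoothing
    return totals / totals.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same buckets."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_reduction(target_counts, raw_counts, selected_counts):
    """KL(target || raw) - KL(target || selected) on the feature space.
    Positive values mean selection moved the data toward the target."""
    p_t = feature_distribution(target_counts)
    return (kl_divergence(p_t, feature_distribution(raw_counts))
            - kl_divergence(p_t, feature_distribution(selected_counts)))
```

Under this definition, selecting the raw data unchanged gives a KL reduction of zero, and the paper's finding is that this quantity, computed on hashed n-gram features, tracks average downstream accuracy across selection methods.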

Experimental results indicate that KL reduction computed on hashed n-gram features correlates strongly (r = 0.82) with average downstream accuracy across 8 data selection methods, including expert selection. Furthermore, when applied to continued pretraining on specific domains, DSIR performs on par with expert human curation across 8 target distributions, suggesting its viability as an automated data selection tool. For training general-domain models (with Wikipedia and books as the target), DSIR improves over random selection and heuristic filtering baselines by 2-2.5% on the GLUE benchmark.

These results underscore the DSIR approach's potential to optimize dataset selection effectively, which is critical for LM success when constrained by fixed computational budgets. The enhanced performance metrics not only suggest practical advantages but also illustrate the framework's contribution to theoretical advancements in LM pretraining strategies.

Future research could explore alternative feature spaces and weight estimators to further refine the method and adapt it to different textual structures. Additionally, evaluating DSIR in real-world deployments, or examining the ethical considerations that arise in automated dataset selection, could present intriguing directions for further work.

Authors (4)
  1. Sang Michael Xie (21 papers)
  2. Shibani Santurkar (26 papers)
  3. Tengyu Ma (117 papers)
  4. Percy Liang (239 papers)
Citations (128)