GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data (2410.02755v3)

Published 3 Oct 2024 in cs.CL and cs.LG

Abstract: LLMs require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web-scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high quality data for general or specialized domains from web-scale corpora -- a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data.

Citations (1)

Summary

  • The paper introduces SIEVE, which uses active learning to train a T5 model that mimics GPT-4o's filtering decisions.
  • It validates the method on OpenWebText, achieving equivalent accuracy with only 1% of the cost typically required by GPT-4o.
  • The system offers a cost-effective approach to dataset curation, potentially democratizing the development of specialized large language models.

SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost

The paper introduces SIEVE, a system designed to filter vast web-scale datasets while matching the accuracy of models like GPT-4o at a substantially lower cost. The authors identify the curation of high-quality, domain-specific datasets from web-scale data as a key challenge in building specialized LLMs.

SIEVE addresses this problem by combining a lightweight T5 model with a limited number of GPT-4o calls. Through active learning, the T5 model is fine-tuned to emulate GPT-4o's filtering decisions, reducing the filtering cost to roughly 1% of what direct GPT-4o filtering would require. The approach is validated experimentally on the OpenWebText dataset with five highly customized filtering tasks, where it demonstrates accuracy equivalent to GPT-4o.

Technical Approach

The system integrates GPT-4o with a lightweight T5 classifier. SIEVE trains the T5 model via active learning to mimic GPT-4o's filtering decisions, so that costly GPT-4o queries are needed for only a small fraction of the data. The active learning strategy is stream-based: snippets are processed sequentially and only the most informative ones are sent to GPT-4o for annotation, which is especially important for imbalanced filtering tasks in which the target class is a small minority that must still be detected reliably.
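
To make the control flow concrete, the following is a minimal sketch of such a stream-based distillation loop. It is illustrative only: the paper fine-tunes a T5 classifier and uses its own query criterion, whereas this sketch substitutes a simple incremental scikit-learn student, an assumed uncertainty band, and a hypothetical `oracle` callable standing in for a GPT-4o filtering call.

```python
from typing import Callable

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


class SieveSketch:
    """Stream-based distillation of an expensive filtering oracle (e.g. GPT-4o)
    into a cheap student classifier. Illustrative control flow only."""

    def __init__(self, oracle: Callable[[str], int], band: float = 0.15, warmup: int = 10):
        self.vectorizer = HashingVectorizer(n_features=2**18)  # stateless text featurizer
        self.student = SGDClassifier(loss="log_loss")          # lightweight incremental student
        self.oracle = oracle      # assumed wrapper around a GPT-4o filtering call -> 0/1
        self.band = band          # uncertainty band around 0.5 (assumed value)
        self.warmup = warmup      # always query the oracle for the first few snippets
        self.n_labeled = 0

    def keep(self, snippet: str) -> bool:
        """Return True if the snippet passes the filter."""
        x = self.vectorizer.transform([snippet])
        if self.n_labeled < self.warmup:
            unsure, p_keep = True, 0.5
        else:
            p_keep = self.student.predict_proba(x)[0, 1]
            unsure = abs(p_keep - 0.5) < self.band
        if unsure:
            label = self.oracle(snippet)                        # costly GPT-4o call
            self.student.partial_fit(x, [label], classes=[0, 1])
            self.n_labeled += 1
            return bool(label)
        return bool(p_keep >= 0.5)                              # cheap student decision
```

In the paper the student is a fine-tuned T5 classifier and the query rule is designed to keep the annotated set class-balanced; the sketch above captures only the overall control flow, in which most snippets are decided by the cheap student and only uncertain ones incur a GPT-4o call, which is where the reported cost savings come from.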

Experimental Validation

The experiments support SIEVE's efficiency and effectiveness, showing that it matches GPT-4o's quality filtering at a fraction of the cost. The evaluation employs a diverse set of filtering tasks on OpenWebText, covering political, climate-related, and AI content, among others. The lightweight model matches GPT-4o's filtering decisions while requiring far less annotation budget and compute.

Theoretical Insights

The paper offers a theoretical analysis of the balancedness of the annotated examples, showing that by querying a more label-balanced set of snippets, the algorithm curates datasets efficiently despite the inherent class imbalance of web data. This is particularly important for the stream-based active learning approach, which avoids the computational cost of pool-based strategies that must repeatedly score an entire unlabeled pool.
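
The paper's exact query criterion is not reproduced here, but the intuition can be illustrated with a simple balance-aware gate: before spending a GPT-4o call, the system can preferentially annotate snippets whose predicted class is under-represented among the labels collected so far. The counter, probabilities, and floor value below are illustrative assumptions, not the paper's actual rule.

```python
import random
from collections import Counter

# Labels obtained from GPT-4o so far (0 = discard, 1 = keep).
label_counts = Counter({0: 0, 1: 0})

def should_query(predicted_label: int) -> bool:
    """Balance-aware gate (illustrative, not the paper's criterion): the more a
    class already dominates the annotated set, the less often snippets predicted
    to belong to it are sent to GPT-4o, keeping the labeled set roughly balanced."""
    total = sum(label_counts.values())
    if total == 0:
        return True  # nothing labeled yet: always query
    share = label_counts[predicted_label] / total
    return random.random() < max(0.1, 1.0 - share)  # the 0.1 floor is an arbitrary choice

# After each oracle call, record the returned label so the gate adapts:
#     label_counts[oracle_label] += 1
```

Under a heavily skewed stream, such a gate keeps the annotated set from simply mirroring the skew of the data, which is the intuition behind the balancedness analysis.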

Implications and Future Directions

The implications of SIEVE are significant: by reducing prohibitive data curation costs, it could democratize the development of domain-specific LLMs. Practically, it enables tailored LLMs in fields such as scientific research and industry-specific applications.

Future research could explore scaling SIEVE to larger and more diverse datasets, such as The Pile, and extending it to data modalities beyond text. Integrating more advanced active learning algorithms or stronger teacher models than GPT-4o could further enhance its capabilities, especially as computational cost becomes increasingly critical in the era of massive LLMs.
