- The paper introduces ProX, a novel framework that refines pre-training data through programming, achieving an average 2.5% performance improvement over rule-based methods.
- It employs a two-stage methodology—document-level filtering and chunk-level normalization—to automate granular data refinement tasks with minimal human intervention.
- Experiments show that models trained on ProX-refined data reach comparable performance with up to 20 times fewer training FLOPs and excel in domain-specific continual pre-training, such as on mathematical benchmarks.
Overview of "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
The paper "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" proposes an innovative framework, Programming Every Example (ProX), for refining pre-training data at scale using smaller LMs. The essence of ProX lies in treating data refinement as a programming task, thereby empowering LMs to autonomously perform granular operations such as string normalization and document filtering. This technique addresses both the inflexibility of heuristic rules and the impracticality of having human experts tailor rules for each data example individually.
Methodology
Document-Level and Chunk-Level Programming:
ProX operates in two stages: document-level and chunk-level programming. The document-level stage involves a refining model trained to determine whether to keep or drop a document. In contrast, the chunk-level stage is more granular, involving operations such as removing specific lines or normalizing strings within a document. The ProX refining models generate programs specifying these operations, which are then executed by pre-defined executors to produce refined corpora.
Model Adaptation and Data Collection:
The base models are adapted to data refinement tasks through supervised fine-tuning (SFT) on seed data generated by powerful LMs like Llama-3-70B. This adaptation enables smaller models (e.g., 0.3B parameters) to perform effective data refinement. The paper describes a systematic approach to data collection, including document scoring and few-shot prompting to generate program snippets.
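The seed-data collection step can be pictured as pairing each sampled document with the program a strong LM produced for it under a few-shot prompt. The prompt template and operation names below are hypothetical placeholders, not the paper's exact setup:

```python
# Illustrative sketch of assembling SFT seed data for a small refining
# model. The few-shot template and operation names are assumptions.
from dataclasses import dataclass


@dataclass
class SFTExample:
    prompt: str       # few-shot context + the document to refine
    completion: str   # program emitted by a strong LM (e.g. Llama-3-70B)


FEW_SHOT = (
    "Document:\nBuy now!!! Limited offer\nOutput: drop_doc()\n\n"
    "Document:\nPhotosynthesis converts light into chemical energy.\n"
    "Output: keep_doc()\n\n"
)


def build_example(document: str, generated_program: str) -> SFTExample:
    """Format one (document, program) pair for supervised fine-tuning."""
    prompt = FEW_SHOT + f"Document:\n{document}\nOutput: "
    return SFTExample(prompt=prompt, completion=generated_program)


ex = build_example("Some noisy web page text", "drop_doc()")
```

Fine-tuning a small (e.g. 0.3B) model on such pairs distills the larger model's refinement judgments into a form cheap enough to run over an entire corpus.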
Results
ProX has demonstrated its effectiveness across various benchmarks and model sizes. In experiments with RedPajama-V2, models pre-trained on ProX-curated data consistently outperformed those trained on raw or rule-based filtered data, with an average improvement of 2.5%. Notably, ProX outperformed existing data selection methods, such as MATES, by significant margins in zero-shot and few-shot performance.
Efficiency Gains:
ProX achieves comparable performance with far less computational overhead. For instance, models trained on ProX-refined data matched the downstream performance of models pre-trained on raw data while using up to 20 times fewer training FLOPs, highlighting its efficiency.
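The scale of this saving can be checked with the standard back-of-the-envelope estimate that training cost is roughly 6 × N × D FLOPs (N parameters, D tokens). The model size and token counts below are illustrative, not the paper's reported numbers:

```python
# Back-of-the-envelope training cost via the common ~6*N*D FLOPs
# approximation. Parameter and token counts are illustrative only.
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate dense-transformer training FLOPs."""
    return 6.0 * n_params * n_tokens


raw_cost = train_flops(1.7e9, 500e9)   # baseline trained on raw data
prox_cost = train_flops(1.7e9, 25e9)   # same quality on 20x fewer tokens
speedup = raw_cost / prox_cost
```

Under this approximation, reaching parity on one-twentieth of the tokens translates directly into a 20× reduction in training compute, before accounting for the (comparatively small) one-time cost of running the refining model over the corpus.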
Domain-Specific Continual Pre-Training
The efficacy of ProX extends to domain-specific continual pre-training, exemplified by experiments on the OpenWebMath corpus. ProX brought substantial improvements in average performance across mathematical benchmarks, significantly outperforming models like Llemma, which was pre-trained on up to 200 billion tokens. This underscores the framework's robustness and adaptability to various domains.
Implications and Future Directions
Practical Implications:
The implications of ProX are significant for the development and deployment of LLMs, particularly in resource-constrained environments. It provides a scalable and automated method for refining large-scale pre-training datasets, effectively lifting data quality without the need for extensive human intervention. This not only improves training efficiency but also has potential applications in various domains requiring specialized training data.
Theoretical Implications:
Theoretically, ProX advances the field of data processing by demonstrating that data refinement tasks can be effectively addressed using model-driven programming approaches. This shifts the paradigm from rigid, heuristic-based pipelines to more flexible, model-based frameworks capable of adapting to diverse data quirks and inconsistencies.
Future Developments:
Future work could explore expanding ProX's capabilities by incorporating more sophisticated refining operations, such as reformatting and rephrasing. Additionally, there is potential for improving refinement efficiency by optimizing model size and leveraging inference acceleration techniques. Scaling ProX to handle multilingual corpora and domain-specific data like code also presents fertile ground for future research.
The promising results and efficiency gains demonstrated by ProX suggest that investing in data refinement before pre-training can yield significant dividends, pushing the boundaries of what is achievable with LLMs in both general and specialized domains.