- The paper shows that temporal mismatch between pretraining and evaluation data significantly degrades LM performance, with larger models affected more.
- The paper finds that quality filtering improves task performance while, counterintuitively, increasing toxic generation, exposing a core curation trade-off.
- The paper demonstrates that including diverse domains improves generalization across tasks but can amplify toxicity in model outputs.
This paper, "A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity" (2305.13169), conducts a large-scale empirical paper to quantify how various pretraining data curation choices impact the performance and behavior of LLMs (LMs). The authors pretrain 28 1.5B parameter decoder-only LMs (and some smaller 20M parameter models for scaling comparisons) on datasets manipulated along dimensions of data age, domain composition, quality filtering, and toxicity filtering. The base datasets used are C4 and The Pile.
The core motivation is to move beyond empirically unsupported intuition in pretraining data design, a stage of LM development that is often under-documented despite its fundamental role.
Methodology
The general approach, sketched in code after this list, involves:
- Starting with a base pretraining dataset (C4 or The Pile).
- Applying a specific filter or modification (e.g., removing data from a certain year, filtering by a quality score).
- Pretraining a 1.5B parameter decoder-only LM (LM-XL, similar to t5.1.1-XL) on the curated dataset.
- Evaluating the pretrained model by finetuning it on a diverse set of downstream tasks and measuring performance.
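A minimal sketch of that loop may make the controlled-variable design concrete. Everything below is an illustrative stand-in for the real (and expensive) machinery, not the authors' code:

```python
# Illustrative sketch of the experimental loop, not the authors' code.
# `pretrain` and `finetune_and_eval` are stubs standing in for the real
# training and evaluation machinery.
import random
from typing import Callable, Iterable

def pretrain(corpus: list[str], n_params: int) -> dict:
    """Stub: stands in for pretraining a decoder-only LM on `corpus`."""
    return {"n_params": n_params, "n_docs": len(corpus)}

def finetune_and_eval(model: dict, task: str) -> float:
    """Stub: stands in for finetuning `model` on `task` and scoring it."""
    return random.random()  # placeholder metric

def run_experiment(base_corpus: Iterable[str],
                   keep: Callable[[str], bool],
                   eval_tasks: list[str]) -> dict[str, float]:
    curated = [doc for doc in base_corpus if keep(doc)]  # one curation choice
    model = pretrain(curated, n_params=1_500_000_000)    # one 1.5B LM per variant
    return {task: finetune_and_eval(model, task) for task in eval_tasks}

# One hypothetical variant: keep only documents shorter than 10k characters.
corpus = ["an innocuous document", "another document"]
scores = run_experiment(corpus, keep=lambda d: len(d) < 10_000,
                        eval_tasks=["MRQA", "SBF"])
```

Only the `keep` filter varies across runs; model size, architecture, and the evaluation suite are held fixed, which is what lets the paper attribute downstream differences to the data intervention.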
Evaluations Covered:
- Domain Generalization: 30 question-answering (QA) datasets from MRQA and UnifiedQA.
- Temporal Misalignment: 5 datasets (PubCLS, NewSum, PoliAffs, TwiERC, AIC) with yearly splits to assess performance degradation due to time differences between pretraining and evaluation data.
- Toxic Generation: Prompting models with inputs designed to elicit biased or toxic outputs (e.g., from RealToxicityPrompts and PaLM's bias prompts) and scoring the generations with the Perspective API (see the scoring sketch after this list).
- Toxicity Identification: Finetuning and evaluating on datasets like Social Bias Frames (SBF), DynaHate (DH), and Toxigen.
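Because the toxic-generation results hinge on Perspective API scores, here is a minimal scoring sketch following the API's documented Python client pattern; `API_KEY` and the example text are placeholders:

```python
# Score one model generation for toxicity with the Perspective API.
from googleapiclient import discovery

API_KEY = "your-api-key-here"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "a model-generated continuation to score"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
toxicity = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(toxicity)  # in [0, 1]; higher means more likely toxic
```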
Key Findings and Recommendations
1. Impact of Dataset Age:
- Performance Degradation: A temporal shift (pretraining data being either older or newer than the evaluation data) degrades downstream performance, and the degradation is not overcome even with substantial, temporally relevant finetuning.
- Asymmetric Effect: The degradation is steeper when the evaluation data is newer than the pretraining data (a slope-based way to quantify this is sketched after this list).
- Staleness: Both models (trained on older data) and evaluation datasets can become "stale." Newer models may perform worse on older evaluations, and older models perform worse on newer evaluations. This complicates comparisons between models trained at different times.
- Scaling Effect: The negative impact of temporal misalignment is more pronounced for larger models (1.5B) than for smaller models (20M).
- Recommendation: Model creators should report the temporal distribution of their pretraining data. Users should be aware of potential performance issues when there's a significant temporal gap.
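One simple way to quantify the asymmetric degradation, assuming per-year evaluation splits like those above, is to fit separate score-versus-gap slopes for evaluation-newer and evaluation-older pairs. This is a hedged sketch with invented numbers, not the paper's analysis code or results:

```python
import numpy as np

# Invented scores for models pretrained on year-p data, evaluated on a
# task's year-e split; numbers are illustrative, not the paper's results.
perf = {
    (2016, 2016): 0.81, (2016, 2019): 0.73,  # evaluation newer than pretraining
    (2019, 2019): 0.82, (2019, 2016): 0.78,  # evaluation older than pretraining
}

def degradation_per_year(pairs: list[tuple[int, int]]) -> float:
    """Least-squares slope of score vs. |year gap|; more negative = steeper."""
    gaps = np.array([abs(e - p) for (p, e) in pairs])
    scores = np.array([perf[k] for k in pairs])
    return np.polyfit(gaps, scores, 1)[0]

newer = [(p, e) for (p, e) in perf if e >= p]  # aligned pairs anchor gap = 0
older = [(p, e) for (p, e) in perf if e <= p]
print(degradation_per_year(newer))  # steeper (more negative) slope
print(degradation_per_year(older))  # shallower slope: the asymmetry
```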
2. Impact of Quality and Toxicity Filters:
- Quality Filtering (Positive Definition):
- Improves Performance: Removing text classified as low-quality (using a classifier trained to prefer Wikipedia/Books-like text) substantially improves downstream performance on most QA tasks and toxicity identification, even though it reduces training data size.
- Increases Toxic Generation: Surprisingly, quality filtering leads to models that generate more toxic content.
- Unpredictable Benefits: The benefits of quality filtering are not always predictable from the raw quality characteristics of a domain (e.g., Books QA performance was hurt even though the Books domain rates as high quality). Different segments of the quality spectrum have varied effects.
- Toxicity Filtering (Negative Definition):
- Reduces Toxic Generation: Removing documents flagged as toxic by the Perspective API reduces the model's tendency to generate toxic content.
- Reduces Generalization & Identification: This comes at the cost of reduced performance on general QA tasks and, importantly, a reduced ability to identify toxic content.
- Inverse Toxicity Filter: Removing the least toxic content (i.e., training on the more toxic remainder) yields the best performance on toxicity identification tasks (see the filtering sketch after this list).
- Trade-offs: There's no "one-size-fits-all" filtering strategy. The choice depends on the desired model behavior.
- Recommendation: Practitioners need more targeted filters. For toxicity identification, an inverse toxicity filter can help; for general quality, multiple dimensions should be considered, since a single classifier does not capture every nuance. Prioritize toxicity identification ability during pretraining and use post-hoc methods (such as instruction tuning) to curb unwanted generation.
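A minimal sketch of the three filter directions discussed above, assuming each document carries `quality` and `toxicity` scores in [0, 1] (e.g., from a quality classifier and the Perspective API); the thresholds and toy corpus are illustrative, not the paper's settings:

```python
import random

# Toy corpus with assumed per-document scores; thresholds are illustrative.
docs = [
    {"text": "encyclopedic prose ...",   "quality": 0.9, "toxicity": 0.1},
    {"text": "spammy boilerplate ...",   "quality": 0.2, "toxicity": 0.2},
    {"text": "profanity-laden rant ...", "quality": 0.4, "toxicity": 0.9},
]

# Quality filter: boosts QA performance, but raises toxic generation.
quality_filtered = [d for d in docs if d["quality"] >= 0.5]

# Toxicity filter: curbs toxic generation, hurts QA and toxicity identification.
toxicity_filtered = [d for d in docs if d["toxicity"] <= 0.5]

# Inverse toxicity filter: keeps the *most* toxic remainder, which gave the
# best toxicity-identification performance in the paper.
inverse_toxicity_filtered = [d for d in docs if d["toxicity"] >= 0.5]

# A softer, stochastic quality keep in the spirit of GPT-3's
# Pareto-thresholded filter (random.paretovariate(a) - 1 corresponds to
# numpy's np.random.pareto(a)), which retains some low-scoring documents:
soft_quality = [d for d in docs if random.paretovariate(9) - 1 > 1 - d["quality"]]
```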
3. Impact of Domain Composition (using The Pile):
- Heterogeneity is Key: Including heterogeneous data sources like Common Crawl (CC), OpenWeb, and Books has the strongest positive impact on average downstream QA performance, often more so than including domain-specific data for a targeted evaluation.
- Targeted Data: While removing a pretraining domain generally hurts downstream tasks aligned with that domain (e.g., removing PubMed hurts BioMed QA), removing a large heterogeneous source like CC can hurt even more, across many domains (see the ablation sketch after this list).
- Include More Sources: The best-performing models generally use all available data sources, suggesting that both quantity and diversity of open-source data remain bottlenecks.
- Toxicity Trade-off: Web and Books domains, while highly beneficial for performance, also contribute most to toxic generation. Removing them reduces toxic generation but also harms toxicity identification and QA performance.
- Recommendation: Prioritize collecting more diverse web and books content. Generously include data sources, even those seemingly less relevant to target tasks, for better generalization.
Observational Data Characteristics (Pre-Filter Analysis):
- C4 vs. The Pile: The Pile documents are generally longer, more readable, and rated higher in quality but contain more Personally Identifiable Information (PII) than C4 documents.
- Books Domain: An outlier: its documents are the longest, most readable, and most toxic (profanity and sexual content), and contain the most PII, yet the domain is also rated high quality.
- Toxicity vs. Quality: High toxicity does not necessarily mean low quality by the classifiers used. Documents classified as highly toxic can still have high text quality scores, partly due to the Books domain.
- Temporal Trends in C4: More recent Common Crawl scrapes (used for newer C4 versions) show more non-ASCII characters (greater character diversity, e.g., emoji) and lower measured text quality, with slightly lower toxicity (two such document statistics are sketched below).
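Two of these observational statistics are cheap enough to sketch directly; this is an illustrative computation, not the paper's measurement pipeline:

```python
# Minimal sketch of two cheap observational statistics: word count and
# non-ASCII character fraction (a rough proxy for the emoji/character
# diversity trend in newer C4 scrapes). Readability, quality, toxicity,
# and PII metrics require separate classifiers/APIs not shown here.
def doc_stats(text: str) -> dict[str, float]:
    n_chars = max(len(text), 1)
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return {
        "n_words": float(len(text.split())),
        "non_ascii_frac": non_ascii / n_chars,
    }

print(doc_stats("Plain ASCII text."))          # non_ascii_frac == 0.0
print(doc_stats("Text with emoji 🙂 and é."))   # non_ascii_frac > 0.0
```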
Discussion and Broader Implications:
- Pretraining data curation choices are critical hyperparameters that significantly affect model behavior and performance, and these effects are not easily erased by finetuning.
- The paper urges more transparency and systematic evaluation in data curation practices.
- The findings highlight the complex interplay between data characteristics and model outcomes, emphasizing that simple heuristics are often insufficient.
Limitations:
- Computational Cost: The experiments were expensive to run, which limited the number of experimental variations.
- Blackbox APIs: Reliance on the Perspective API for toxicity scoring introduces potential irreproducibility if the API changes.
- English-Only: The analysis is confined to English datasets.
- Finetuned Setting: Results primarily focus on finetuned models, and their direct translation to zero/few-shot prompting settings is not established.
In conclusion, the paper provides substantial empirical evidence for how data age, filtering, and domain composition influence LM pretraining, offering valuable guidance for practitioners to make more informed data-centric decisions.