When Bad Data Leads to Good Models (2505.04741v1)
Abstract: In LLM pretraining, data quality is believed to determine model quality. In this paper, we re-examine the notion of "quality" from the perspective of pre- and post-training co-design. Specifically, we explore the possibility that pre-training on more toxic data can lead to better control in post-training, ultimately decreasing a model's output toxicity. First, we use a toy experiment to study how data composition affects the geometry of features in the representation space. Next, through controlled experiments with Olmo-1B models trained on varying ratios of clean and toxic data, we find that the concept of toxicity enjoys a less entangled linear representation as the proportion of toxic data increases. Furthermore, we show that although toxic data increases the generational toxicity of the base model, it also makes the toxicity easier to remove. Evaluations on Toxigen and Real Toxicity Prompts demonstrate that models trained on toxic data achieve a better trade-off between reducing generational toxicity and preserving general capabilities when detoxifying techniques such as inference-time intervention (ITI) are applied. Our findings suggest that, with post-training taken into account, bad data may lead to good models.
Summary
- The paper demonstrates that incorporating measured amounts of toxic data during LLM pretraining enhances post-training detoxification efforts.
- Experiments reveal that additional toxic data reduces feature entanglement, leading to more robust and separable internal representations.
- Using methods like Inference-Time Intervention, the approach achieves lower generative toxicity while preserving core model capabilities.
This paper, "When Bad Data Leads to Good Models" (2505.04741), challenges the conventional wisdom in LLM training that filtering out "bad" data, specifically toxic content, is always the best approach. The authors propose a co-design perspective, viewing pretraining and post-training as a unified system. They hypothesize that pretraining on more toxic data can, counterintuitively, lead to models that are easier to control and make less toxic during post-training.
The core idea is that exposure to diverse data, including toxic content, helps the model build robust internal representations. If the model has a better understanding of what toxicity is, interventions designed to reduce toxicity might be more effective and less likely to degrade other capabilities.
Motivating Experiment: Understanding Entanglement
To explore this idea, the paper starts with a toy experiment based on the superposition hypothesis (2209.10652). This hypothesis suggests that when an LLM's internal activation space is smaller than the number of features it needs to represent, multiple features get "superposed" or compressed onto single dimensions, leading to entanglement.
The authors define the feature entanglement $E_{P_i}$ of a feature $P_i$ as the maximum absolute cosine similarity between its direction $v_{P_i}$ and the direction of any other feature $P_j$:

$$E_{P_i} = \max_{j \in [N] \setminus \{i\}} \left| v_{P_i} \cdot v_{P_j} \right|$$
They use a toy transformer model trained on data generated by multiple cyclic Markov chains, treating each unique sequence from a chain as a "feature." By varying the amount of data from one specific chain (the "underrepresented" feature), they observe that increasing the frequency of this feature in the training data leads to a significant drop in its entanglement measure (Fig. 2). This suggests that providing more data related to a concept helps the model learn a less entangled, and thus potentially more separable, representation of that concept in its hidden space.
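To make the metric concrete, here is a minimal sketch of the entanglement computation, assuming the learned feature directions have already been extracted into a matrix (the name `feature_dirs` is an illustrative placeholder, not the authors' code):

```python
import numpy as np

def entanglement(feature_dirs: np.ndarray) -> np.ndarray:
    """Compute E_{P_i} = max_{j != i} |cos(v_{P_i}, v_{P_j})| for every feature.

    feature_dirs: (N, d) array with one learned direction vector per feature.
    Returns an (N,) array of entanglement scores in [0, 1].
    """
    # Normalize rows so dot products equal cosine similarities.
    unit = feature_dirs / np.linalg.norm(feature_dirs, axis=1, keepdims=True)
    cos = np.abs(unit @ unit.T)      # (N, N) absolute cosine similarities
    np.fill_diagonal(cos, -np.inf)   # exclude each feature's similarity with itself
    return cos.max(axis=1)
```

A score near 1 means a feature shares its direction with some other feature (heavily superposed); a score near 0 means it occupies a nearly orthogonal direction.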
Pretraining with Toxic Data on Olmo-1B
Translating this to a realistic LLM setting, the authors train a series of Olmo-1B models (2402.00838) using varying proportions of clean data (C4 (2004.10964)) and toxic data (4chan (2012.09858)). They keep the clean data amount constant and add 4chan data in increments from 0% to 25% of the total training data.
Evaluation of the base models shows that adding toxic data, as expected, increases the model's generational toxicity. However, it also improves the model's ability to detect toxicity (measured on ToxiGen (2207.06452)). Importantly, the paper shows that adding toxic data up to 20% has minimal negative impact on the model's general capabilities as measured by benchmarks like MMLU (2009.03300) and various downstream QA tasks (Fig. 3, Table A.1, Table A.2). This indicates that incorporating toxic data doesn't immediately degrade the model's core functions.
Toxic Data Improves Concept Building and Alignability
The paper then investigates the model's internal representations of toxicity. Using linear probes on attention-head activations, the authors train binary classifiers on ToxiGen examples to predict whether an input is toxic. Models trained with toxic data exhibit significantly higher probe accuracies, and in particular a "fatter tail" of heads with very high accuracy (Fig. 3, Fig. A.2). This suggests that toxic data helps the model develop more linearly separable representations of toxicity in specific parts of the network.
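A minimal sketch of such a probe is below, assuming the per-head activations and binary toxicity labels have already been collected (names like `head_acts` and `labels` are illustrative placeholders, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_head(head_acts: np.ndarray, labels: np.ndarray):
    """Fit a linear toxicity probe on one attention head's activations.

    head_acts: (num_examples, head_dim) activations for a single head.
    labels:    (num_examples,) binary labels (1 = toxic, 0 = non-toxic).
    Returns (held-out accuracy, unit-norm probe direction in activation space).
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        head_acts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    accuracy = clf.score(X_te, y_te)
    direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
    return accuracy, direction
```

Running this per head and comparing accuracy histograms across pretraining mixtures reproduces the kind of "fatter tail" analysis described above; the probe direction also doubles as a candidate steering direction for the intervention discussed next.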
With the hypothesis that better internal representations lead to better alignability, the authors test two post-training techniques: simple prompting and Inference-Time Intervention (ITI) (2310.01405). ITI works by identifying directions in the activation space corresponding to a desired attribute (e.g., non-toxicity) and steering the model's activations along that direction during inference.
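The core mechanics of such an intervention can be sketched with a forward hook. The snippet below assumes a GPT-style Hugging Face decoder whose layers return hidden states as the first element of their output; the layer index, direction, and strength are illustrative placeholders rather than the paper's tuned values, and the module path (`model.transformer.h`) must be adapted to the architecture actually used:

```python
import torch

def add_steering_hook(model, layer_idx: int, direction: torch.Tensor, alpha: float):
    """Register a hook that shifts one layer's hidden states along `direction`.

    direction: vector in the model's hidden dimension (e.g. a non-toxicity
               direction derived from probing); normalized inside the hook.
    alpha:     steering strength; larger values steer harder but risk
               degrading fluency.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states of shape (batch, seq_len, hidden_dim).
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    layer = model.transformer.h[layer_idx]  # adapt this path for OLMo-style models
    return layer.register_forward_hook(hook)
```

The returned handle can later be removed with `handle.remove()`, which makes it easy to compare steered and unsteered generations from the same checkpoint.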
The results are striking (Fig. 4): while the base model's toxicity increases with more toxic pretraining data, the steered models show the opposite trend. Their generational toxicity (evaluated using Perspective API [2024] on ToxiGen and Real Toxicity Prompts (2012.09858)) decreases as the proportion of toxic pretraining data increases, up to a peak effectiveness around 10% toxic data in their setup. This demonstrates that models pretrained with some toxic data are indeed more alignable for detoxification via ITI.
Comparing their method (10% toxic pretraining data + ITI) against various baselines, including prompting, MEDA/INST (2302.07388), Supervised Finetuning (SFT), and Direct Preference Optimization (DPO) (2305.18290), the authors show that their approach achieves a better trade-off. It results in significantly lower toxicity while incurring minimal "alignment tax" (measured by cross-entropy loss on Open Web Text [2019]), indicating better preservation of general capabilities (Table 1). They also show that adding toxic pretraining data improves the effectiveness of SFT and DPO for detoxification (Table 2), suggesting the benefits extend beyond ITI.
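The alignment-tax side of that trade-off is simply average next-token cross-entropy on held-out general text. A minimal sketch of that measurement with Hugging Face `transformers` is below; the model name and text list are placeholders, and this is a generic recipe rather than the authors' exact evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_cross_entropy(model_name: str, texts: list[str]) -> float:
    """Average per-token cross-entropy of a causal LM on held-out texts."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
            out = model(**enc, labels=enc["input_ids"])
            n_pred = enc["input_ids"].numel() - 1  # tokens that have a next-token target
            total_loss += out.loss.item() * n_pred
            total_tokens += n_pred
    return total_loss / total_tokens
```

Comparing this number before and after an intervention (ITI, SFT, or DPO) gives a simple proxy for how much general capability the detoxification step costs.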
Furthermore, red-teaming experiments using GCG attacks (2310.01405) show that combining toxic pretraining (10%) with strong ITI significantly reduces the attack success rate compared to models trained only on clean data, even with steering (Table 3). This implies that toxic pretraining can also improve robustness against adversarial attempts to elicit harmful content.
Practical Implementation and Considerations
Implementing the findings of this paper involves several steps:
- Data Curation and Mixing:
  - Identify sources of "clean" data and of the specific type of "bad" data (e.g., toxic content). C4 and 4chan are used as examples.
  - Determine the ratio of bad data to mix into the pretraining corpus. The paper treats this as empirical; 10% worked well for Olmo-1B with their specific datasets and post-training methods. Finding the "sweet spot" where alignability is maximized without excessive negative impact requires careful experimentation.
  - Implement data-loading pipelines that efficiently mix data from the different sources at the chosen ratio (a minimal mixing sketch follows this list).
- Pretraining:
  - Train the LLM from scratch on the mixed dataset. This is computationally expensive. The paper used 16 Nvidia H100 GPUs for 12 hours for a 1B-parameter model; scaling to larger models requires substantial compute resources.
  - Train multiple models with different mixing ratios to empirically determine the optimal proportion for a given model size and training setup.
- Interpretability (Probing):
  - After pretraining, select a representative dataset (like ToxiGen) and run inference to capture intermediate activations (e.g., attention head outputs or the residual stream) for relevant examples (toxic/non-toxic texts).
  - Train simple linear classifiers (e.g., logistic regression) on these activations to probe for the concept (e.g., toxicity). Evaluate probe accuracy to identify heads or layers with strong representations.
  - Calculate feature directions (e.g., using probe weights or activation differences between toxic/non-toxic examples).
- Post-training (ITI):
  - Identify which heads/layers and directions to intervene on, based on the probing results and alignability experiments.
  - Modify the model's inference loop: at the chosen layers/heads, add a scaled version of the calculated feature direction to the activation vector before subsequent computations.
  - Tune ITI hyperparameters (e.g., steering strength, number of heads intervened upon) on validation data to find the best trade-off between reduced toxicity and preserved performance.
  - Deploying models with ITI involves this modified inference code, which adds a small computational overhead per intervened layer but is typically much less expensive than running multiple models or complex decoding strategies.
- Post-training (SFT/DPO):
  - Alternatively, if using SFT or DPO for post-training alignment, the pretrained model (trained on mixed data) serves as the starting point.
  - Gather or generate high-quality SFT instruction data or DPO preference data related to toxicity.
  - Perform the standard SFT or DPO training process. The paper suggests that toxic pretraining can make these processes more efficient and effective at achieving alignment with less performance degradation.
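For the data-curation step referenced above, a minimal mixing sketch is shown below. It assumes clean and toxic documents are available as two iterables and interleaves them so that roughly the target fraction of emitted documents is toxic; this is one simple way to realize a fixed clean/toxic proportion, not necessarily the authors' exact pipeline:

```python
import random
from typing import Iterable, Iterator

def mix_corpora(clean_docs: Iterable[str], toxic_docs: Iterable[str],
                toxic_fraction: float = 0.10, seed: int = 0) -> Iterator[str]:
    """Interleave two document streams so ~`toxic_fraction` of the output is toxic.

    Both inputs are consumed lazily; the toxic stream is assumed to be large
    enough to supply the requested fraction.
    """
    rng = random.Random(seed)
    toxic_it = iter(toxic_docs)
    # Emitting a toxic document with probability p after each clean one gives
    # an overall toxic fraction of p / (1 + p); solve for p.
    p = toxic_fraction / (1.0 - toxic_fraction)
    for clean_doc in clean_docs:
        yield clean_doc
        if rng.random() < p:
            try:
                yield next(toxic_it)
            except StopIteration:
                return
```

The same generator can feed different target fractions (0%, 5%, 10%, ..., 25%) to produce the family of pretraining corpora needed for the ratio sweep described above.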
Limitations and Future Work
The optimal ratio of toxic data is empirical and may differ for various model sizes, architectures, and target datasets/tasks. Adding too much toxic data can still negatively impact the model's alignability and potentially introduce other undesirable characteristics. The concept of toxicity itself is complex and context-dependent; relying solely on metrics like Perspective API might not capture all nuances.
Future work directions include exploring if these findings generalize to other alignment features beyond toxicity, quantitatively determining the precise relationship between feature frequency and alignability, and investigating the underlying mechanistic reasons why toxic data improves representation and steerability.
In summary, this paper provides empirical evidence supporting a counterintuitive approach to data curation for LLMs: selectively including "bad" data like toxic content during pretraining can lead to models that are more effectively controllable and alignable during post-training, ultimately resulting in "good" models with reduced toxicity and preserved capabilities. This highlights the importance of a co-design approach to the entire LLM training pipeline.