- The paper introduces Generative Data Refinement (GDR), a framework that leverages pretrained generative models to sanitize datasets by removing PII, toxic language, and sensitive data while keeping essential information.
- The methodology combines prompt engineering, model adaptation, and verification functions; on a PII anonymization benchmark it achieves a recall of 0.99 and precision of 0.80, outperforming a commercial rule-based detection service.
- Empirical evaluations on text anonymization, codebase anonymization, and content detoxification show that GDR salvages otherwise unusable data and supports model training by preserving diversity while removing undesirable content.
Generative Data Refinement: A Framework for Dataset Sanitization and Augmentation
Motivation and Problem Statement
The paper introduces Generative Data Refinement (GDR), a framework leveraging pretrained generative models to transform datasets containing undesirable content—such as personally identifiable information (PII), toxic language, or sensitive facts—into refined datasets suitable for model training. The motivation stems from the observation that the scaling laws governing large model performance are increasingly constrained by the availability and quality of training data. As web-indexed data approaches exhaustion, vast quantities of user-generated and proprietary data remain untapped due to privacy, safety, and copyright risks. Existing synthetic data generation and differential privacy (DP) approaches either fail to preserve data utility or suffer from mode collapse and overfitting, limiting diversity and realism.
GDR Framework and Methodology
GDR reframes synthetic data generation as a grounded process: each real data sample x_i is transformed by a generative process g(· | x_i), producing y_i that satisfies a semantic constraint h(y_i) = 1 (e.g., no PII, low toxicity) while minimizing a distance metric A(x_i, y_i). This approach anchors synthetic data to real examples, preserving diversity and realism. The generative model (typically an LLM) is prompted or fine-tuned to rewrite each sample, selectively removing or replacing undesirable content while retaining useful information.
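A minimal sketch of this loop is shown below; the function names `rewrite` and `satisfies_constraint` are hypothetical stand-ins for g(· | x_i) and h, not APIs from the paper, and the distance term A(x_i, y_i) is left implicit on the assumption that the LLM's rewrite stays close to its input:

```python
from typing import Callable, Iterable, List

def gdr_refine(
    samples: Iterable[str],
    rewrite: Callable[[str], str],                # g(. | x_i): LLM-backed rewrite of one real sample
    satisfies_constraint: Callable[[str], bool],  # h(y_i) = 1, e.g. "contains no PII"
    max_attempts: int = 3,
) -> List[str]:
    """Ground each output in a real sample: rewrite x_i into y_i and keep y_i
    only if it satisfies the constraint, retrying a few times before giving up."""
    refined = []
    for x in samples:
        for _ in range(max_attempts):
            y = rewrite(x)
            if satisfies_constraint(y):
                refined.append(y)
                break
        # Samples that never pass the check are dropped rather than kept unrefined.
    return refined
```

In practice, `rewrite` would be a prompted LLM call and `satisfies_constraint` one of the verification functions described below; anchoring each rewrite to a real x_i is what distinguishes GDR from unconditional synthetic generation.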
Key implementation details include:
- Prompt Engineering: Zero-shot and few-shot prompts are designed for specific domains (text, code, JSON) and constraints (PII removal, detoxification).
- Model Adaptation: Performance can be improved via few-shot prompting and supervised fine-tuning (SFT) on domain-specific examples, enabling smaller models to match or surpass larger ones.
- Verification Functions: Criteria for refinement are encoded as indicator functions h, which can be implemented via rule-based, classifier, or API-based methods (e.g., Perspective API for toxicity); a simple rule-based example is sketched after this list.
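The following sketch shows a verification function of this kind, using regex patterns as a stand-in for a real PII detector (the patterns and names are illustrative, not from the paper):

```python
import re

# Illustrative PII patterns; a production verifier would combine rules with
# classifier- or API-based checks (e.g. a toxicity threshold from the Perspective API).
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),   # email addresses
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-style numbers
    re.compile(r"\b(?:\+?\d{1,3}[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"),  # phone numbers
]

def no_pii(sample: str) -> bool:
    """Indicator function h: True only if no PII pattern matches the sample."""
    return not any(p.search(sample) for p in PII_PATTERNS)

assert no_pii("The launch is scheduled for next Tuesday.")
assert not no_pii("Reach me at jane.doe@example.com")
```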
Empirical Evaluation
PII Anonymization
GDR is benchmarked against a commercial Detector-based Information Removal Service (DIRS) across 20k sentences and 108 PII categories. GDR, using a single zero-shot prompt with Gemini Pro 1.5, achieves higher recall and precision than DIRS, which relies on brittle rule-based and statistical detectors. Notably, GDR generalizes across PII types and contexts, salvaging data that DIRS would otherwise discard.
- Recall: GDR achieves 0.99 vs. DIRS's 0.53.
- Precision: GDR achieves 0.80 vs. DIRS's 0.52.
- F-score: GDR achieves 0.88 (consistent with the harmonic mean of the precision and recall above; see the check below).
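A quick consistency check on these numbers, assuming the reported F-score is the standard F1:

```python
def f1(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.80, 0.99), 2))  # 0.88 for GDR, matching the reported F-score
print(round(f1(0.52, 0.53), 2))  # ~0.52 implied for DIRS under the same definition
```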
Smaller models (Flash 8B, Gemma 9B/27B) approach Gemini Pro 1.5's recall but lag in precision. Few-shot prompting and SFT on 10k examples enable Flash 8B to surpass Gemini Pro 1.5, demonstrating that the compute cost of refinement can be reduced by adapting smaller models.
Utility of Anonymized Data
Models trained on GDR-refined datasets retain the ability to answer questions about public facts while failing to recite private facts, confirming that GDR preserves utility without leaking sensitive information. In contrast, DIRS-redacted datasets suffer from low precision, indiscriminately removing both private and public information.
Codebase Anonymization
GDR is applied to 1.2M lines of code from 479 repositories, outperforming DIRS in agreement with human expert annotations at both document and line levels. GDR's generative rewrites accurately identify and replace PII in code comments, strings, and configuration files, minimizing false positives and negatives. Some failure modes include over-conservative rewrites and missed hash values, but these are rare and can be mitigated via static analysis and prompt refinement.
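The document- and line-level comparison can be read as agreement between each method's per-unit PII flags and human labels; a minimal sketch (the function is illustrative, and the exact agreement statistic used in the paper may differ):

```python
def agreement(predicted: list[bool], human: list[bool]) -> float:
    """Fraction of units (documents or lines) where a method's PII flag
    matches the human expert annotation."""
    assert len(predicted) == len(human)
    return sum(p == h for p, h in zip(predicted, human)) / len(human)

# Example: per-line flags for a small file, compared against human labels.
gdr_flags   = [False, True, False, False, True]
human_flags = [False, True, False, False, True]
print(agreement(gdr_flags, human_flags))  # 1.0
```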
Content Detoxification
GDR is used to detoxify 100k message pairs from the /pol/ board of 4chan, notorious for toxic content. Using Gemini Pro 1.5 and a zero-shot prompt, GDR reduces mean toxicity scores (Perspective API) from 0.19 (raw) to 0.13 (refined), outperforming synthetic chat baselines. Extracted question-answer pairs from detoxified data demonstrate that world knowledge is preserved. Models fine-tuned on GDR-refined data achieve higher accuracy on knowledge quizzes and produce responses less likely to be detected as LLM-generated, indicating improved human-likeness.
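A sketch of how the before/after toxicity comparison could be scored, assuming access to the Perspective API through the Google API client (the key and message lists are placeholders; the attribute and response fields follow the public commentanalyzer interface):

```python
from googleapiclient import discovery  # pip install google-api-python-client

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str) -> float:
    """Perspective API summary TOXICITY score in [0, 1]."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def mean_toxicity(samples: list[str]) -> float:
    return sum(toxicity(s) for s in samples) / len(samples)

# raw_messages / refined_messages would be the pre- and post-GDR datasets.
# print(mean_toxicity(raw_messages), mean_toxicity(refined_messages))
```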
Diversity Analysis
GDR-refined datasets exhibit greater diversity than synthetic datasets generated via direct model prompting, as measured by ROUGE-2 and embedding-based metrics. UMAP visualizations confirm that GDR avoids mode collapse, maintaining coverage of the latent space comparable to or exceeding the original data.
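One common way to operationalize ROUGE-2-based diversity is average pairwise bigram overlap, where lower overlap means higher diversity; a sketch assuming the `rouge-score` package (the paper's exact metric definition may differ):

```python
import itertools
import random

from rouge_score import rouge_scorer  # pip install rouge-score

def mean_pairwise_rouge2(samples: list[str], max_pairs: int = 10_000, seed: int = 0) -> float:
    """Average ROUGE-2 F1 over a random subset of sample pairs.
    Lower values indicate less overlap between samples, i.e. a more diverse dataset."""
    scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
    pairs = list(itertools.combinations(range(len(samples)), 2))
    random.Random(seed).shuffle(pairs)
    scores = [
        scorer.score(samples[i], samples[j])["rouge2"].fmeasure
        for i, j in pairs[:max_pairs]
    ]
    return sum(scores) / len(scores)

# Comparing this value for the original data, GDR-refined data, and directly
# prompted synthetic data makes mode collapse in the latter visible as a higher score.
```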
Theoretical and Practical Implications
GDR addresses key limitations of DP and synthetic data generation:
- Selective Content Removal: Unlike DP, which injects noise and degrades utility, GDR uses LLMs as intelligent noising operators, selectively rewriting only problematic content.
- Data Salvaging: GDR enables the recovery and reuse of otherwise unusable data, increasing the effective stock of training tokens for frontier models.
- Scalability: While GDR incurs significant compute cost (up to one-third of a full training run), this cost is amortized by dataset reuse and can be reduced via model adaptation.
- Generalizability: GDR is applicable across domains (text, code, structured data) and constraints (privacy, safety, copyright), and can be integrated into composite data pipelines.
Future Directions
Potential extensions include:
- On-policy Distillation and RL Fine-tuning: Leveraging reward models for both risk detection and information preservation.
- Corpus-level Risk Mitigation: Addressing indirect leakage via cross-document inference.
- Multimodal Data Refinement: Applying GDR to images, audio, and other modalities.
- Automated Prompt Optimization: Systematic search for optimal prompts and verification functions.
Conclusion
Generative Data Refinement provides a principled, empirically validated framework for dataset sanitization and augmentation using pretrained generative models. By anchoring synthetic data generation to real examples and leveraging the world knowledge of LLMs, GDR achieves superior performance in privacy, safety, and diversity, with broad applicability to scaling and curating training data for large models. The approach is complementary to existing synthetic data and privacy-preserving methods, and its effectiveness is contingent on continued advances in generative modeling and prompt engineering.