RefineX: Scalable Data Refinement for LLMs
- RefineX is a programmatic data refinement framework that systematically removes low-quality text spans using efficient, character-level deletion programs.
- It employs an expert-guided distillation pipeline and minimal edit distance techniques to generate precise deletion operations while retaining natural language quality.
- Scalable across massive corpora, RefineX enhances LLM performance by ensuring minimal artifact introduction and preserving inherent linguistic features.
RefineX is a programmatic data refinement framework for large-scale pre-training corpora, designed to systematically improve the quality of data prior to training LLMs. Unlike traditional rule-based document-level filtering or neural generative refinement, RefineX implements efficient, fine-grained, character-level deletion programs distilled from expert-guided end-to-end edits. This methodology enables scalable, precise, and reliable removal of irrelevant or low-quality spans, while preserving the diversity and naturalness of the pre-training distribution and minimizing the introduction of artifacts.
1. Formalization and Scope
Let $\mathcal{D} = \{x_1, \dots, x_N\}$ be a raw text corpus, where each $x_i$ is a character sequence. The objective is to construct a refinement operator $f$ such that, when applied instance-wise ($\mathcal{D}' = \{f(x_i)\}$), the downstream performance of an LLM with parameters $\theta$, trained only on $\mathcal{D}'$ under a fixed compute budget, is maximized over a task set $\mathcal{T}$. The search for $f$ is restricted to character-level, deletion-only actions: the admissible set of operators comprises only deletion programs, excluding insertions and substitutions, to ensure operational minimality and avoid style drift or information “hallucination” (Bi et al., 4 Jul 2025).
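The deletion-only restriction has a simple operational reading: a refined text is admissible exactly when it is a character subsequence of the raw text. The check below is a minimal sketch of that property; the function name is hypothetical and is not taken from the paper.

```python
def is_deletion_only(raw: str, refined: str) -> bool:
    """Return True if `refined` can be obtained from `raw` by deleting characters only."""
    remaining = iter(raw)
    # `ch in remaining` advances the iterator until it finds `ch`, so every character
    # of `refined` must appear in `raw` in the same relative order.
    return all(ch in remaining for ch in refined)


if __name__ == "__main__":
    print(is_deletion_only("Good text. Click here!", "Good text."))  # pure deletion
    print(is_deletion_only("Good text.", "Great text."))             # needs an insertion
```

A check of this kind could serve as a cheap invariant on any refinement output, since a deletion-only program can never produce text that fails it.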
2. Expert-Guided Distillation Pipeline
The distillation pipeline proceeds in three steps:
a. Seed Corpus Sampling
- Raw data is scored by the DataMan method, assigning a quality level (1–5) to each document.
- Approximately 5 million documents are sampled, maintaining the DataMan stratification.
- Each long document is split into overlapping 12,000-character chunks to fit the expert LLM input window.
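The chunking step can be sketched as follows. The 12,000-character chunk size comes from the text; the overlap width of 500 characters and the function name are assumptions made for illustration, since the source does not state them.

```python
def split_into_chunks(text: str, chunk_size: int = 12_000, overlap: int = 500) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters (a hypothetical value).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the document
    return chunks
```

Short documents pass through as a single chunk, so only long documents pay the cost of overlapping windows.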
b. Expert End-to-End Deletion
- The Qwen2.5-72B-Instruct LLM is prompted to refine each chunk, with explicit instructions for deletion-only editing and retention of on-topic links.
- Decoding uses top-p and top-k sampling.
- Outputs include a refined version of each input chunk and an enumerated “modification_reason”.
- The resulting (raw, refined) pairs form a high-quality but expensive and potentially over-edited demonstration set.
c. Minimal Edit Distillation
- For each (raw, refined) pair, a minimal edit script is computed with a Levenshtein-style algorithm (Python's difflib.SequenceMatcher.get_opcodes), yielding a sequence of deletions, insertions, and replacements that transforms the raw text into the refined text.
- All non-deletion ops are dropped, and only deletions (each a contiguous span) are retained.
- Each deletion is mapped to a domain-specific language (DSL) statement:
  - remove_lines(start_line, end_line)
  - remove_str(line, del_str)
  - keep_all() (no modification)
- Filtering removes: (i) cases with insertion/replace spans over 20 chars, (ii) trivial examples (under 10 chars deleted).
- The finalized distillation set comprises about two million high-fidelity (text, program) pairs.
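The distillation step can be illustrated with a small sketch built on `difflib.SequenceMatcher.get_opcodes`. It is not the authors' implementation: the two-level diff (lines first, then characters within changed lines), the 0-based line numbers, the tuple representation of DSL statements, and the names `distill_program` and `keep_pair` are all assumptions. The thresholds of 20 and 10 characters follow the filtering rule described above.

```python
import difflib


def distill_program(raw: str, refined: str):
    """Turn an expert (raw, refined) pair into a deletion-only program.

    Returns (program, deleted_chars, max_dropped), where `max_dropped` is the size in
    characters of the largest insert/replace edit that had to be discarded.
    """
    a, b = raw.split("\n"), refined.split("\n")
    program, deleted_chars, max_dropped = [], 0, 0
    line_ops = difflib.SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in line_ops:
        if tag == "delete":
            # Whole lines vanished: one remove_lines over the inclusive range.
            program.append(("remove_lines", i1, i2 - 1))
            deleted_chars += sum(len(line) for line in a[i1:i2])
        elif tag == "replace" and i2 - i1 == j2 - j1:
            # Same number of lines on both sides: diff each line pair by character.
            for k in range(i2 - i1):
                old, new = a[i1 + k], b[j1 + k]
                char_ops = difflib.SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
                for ctag, c1, c2, d1, d2 in char_ops:
                    if ctag == "delete":
                        program.append(("remove_str", i1 + k, old[c1:c2]))
                        deleted_chars += c2 - c1
                    elif ctag in ("insert", "replace"):
                        max_dropped = max(max_dropped, d2 - d1)  # dropped, not emitted
        elif tag in ("insert", "replace"):
            # Line-level edits that are not pure deletions are dropped as well.
            max_dropped = max(max_dropped, sum(len(line) for line in b[j1:j2]))
    return (program or [("keep_all",)]), deleted_chars, max_dropped


def keep_pair(deleted_chars: int, max_dropped: int,
              max_non_deletion: int = 20, min_deleted: int = 10) -> bool:
    """Filter rule: reject large non-deletion edits and trivial examples."""
    return max_dropped <= max_non_deletion and deleted_chars >= min_deleted


if __name__ == "__main__":
    raw = ("Intro paragraph.\nClick here to subscribe!\nMain content stays.\n"
           "Useful fact. Share on Twitter\nEnd.")
    refined = "Intro paragraph.\nMain content stays.\nUseful fact.\nEnd."
    program, n_deleted, dropped = distill_program(raw, refined)
    print(program, n_deleted, dropped, keep_pair(n_deleted, dropped))
```

Note that `difflib` uses its own matching heuristic rather than a strict Levenshtein computation, so the edit script it returns is not guaranteed to be minimal in every case; for near-identical pairs like the ones produced by deletion-only editing, the difference is usually immaterial.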
3. Model Architecture and Training
The program generator is a decoder-only Transformer initialized as Qwen-3-Base (0.6B params), featuring:
- 24 layers, hidden size 1,536, 24 self-attention heads.
- Vocabulary: 32k BPE.
- Context window: 512k characters (approx 2k tokens).
The model takes a text chunk as input and outputs a linearized DSL sequence built from the operations listed above.
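To make the input/output contract concrete, the sketch below parses a linearized program and replays it on a chunk. The textual syntax (one call per line, Python-style string literals), the 0-based line numbers, and the helper names `parse_program` and `execute` are assumptions for illustration; the source only names the three DSL functions.

```python
import ast
import re

CALL = re.compile(r"(remove_lines|remove_str|keep_all)\((.*)\)")


def parse_program(text: str) -> list[tuple]:
    """Parse one DSL call per line into (name, *args) tuples."""
    ops = []
    for line in text.strip().splitlines():
        match = CALL.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"not a valid DSL statement: {line!r}")
        name, args = match.groups()
        # Arguments are plain literals (ints and strings), so literal_eval is sufficient.
        parsed = ast.literal_eval(f"({args},)") if args.strip() else ()
        ops.append((name, *parsed))
    return ops


def execute(chunk: str, ops: list[tuple]) -> str:
    """Apply a deletion program; line numbers always refer to the original chunk."""
    lines = chunk.split("\n")
    dropped = set()
    for op in ops:
        if op[0] == "remove_lines":
            _, start, end = op
            dropped.update(range(start, end + 1))  # inclusive range
        elif op[0] == "remove_str":
            _, index, del_str = op
            lines[index] = lines[index].replace(del_str, "", 1)  # first occurrence only
        # keep_all: nothing to do
    return "\n".join(line for i, line in enumerate(lines) if i not in dropped)


if __name__ == "__main__":
    chunk = ("Intro paragraph.\nClick here to subscribe!\nMain content stays.\n"
             "Useful fact. Share on Twitter\nEnd.")
    program = "remove_lines(1, 1)\nremove_str(3, ' Share on Twitter')"
    print(execute(chunk, parse_program(program)))
```

Deferring line removal until all statements have been processed keeps every line index valid against the original chunk, which is one simple way to avoid index shifting between statements.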
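Training objective is cross-entropy over the gold program tokens:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t}),$$

where $x$ is the input text, $y_1, \dots, y_T$ are the tokens of the gold program, and $\theta$ are the model parameters.

Regularization is achieved by layer dropout of 0.1, an output length cap of 512, and a program length penalty. Scaling to 1.7B, 4B, and 8B models offered marginal improvements at higher latency.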
4. Corpus-Scale Inference and Guarantees
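Corpus-wide refinement proceeds chunk by chunk: each document is split into chunks, the model generates a deletion program for every chunk, the program is executed on that chunk, and the refined chunks are reassembled into the output document.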
This process scales linearly with the total character count, given a fixed per-chunk inference time. Large batch sizes (~8 chunks/GPU) are accommodated with vLLM for throughput, and overlapping chunks prevent context loss.
Only deletion operations are introduced, preserving the original token and n-gram statistics, natural distribution, and multimodal evidence in the data.
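The corpus-wide loop can be sketched as below. This is an illustration under stated assumptions, not the paper's code: the model call is abstracted behind a `generate_programs` callable (in practice a batched inference call), the toy generator and executor are stand-ins, and chunks are non-overlapping here so that reassembly stays a plain concatenation.

```python
from typing import Callable, Iterable, Iterator, List

Program = List[tuple]


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def refine_corpus(documents: Iterable[str],
                  generate_programs: Callable[[List[str]], List[Program]],
                  apply_program: Callable[[str, Program], str],
                  chunk_size: int = 12_000,
                  batch_size: int = 8) -> Iterator[str]:
    """Refine each document chunk by chunk; work is linear in total characters."""
    for doc in documents:
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)] or [doc]
        refined = []
        for batch in batched(chunks, batch_size):
            programs = generate_programs(batch)  # one program per chunk
            refined.extend(apply_program(c, p) for c, p in zip(batch, programs))
        yield "".join(refined)


def toy_generate(batch: List[str]) -> List[Program]:
    """Stand-in for the model: flag every line that mentions 'subscribe'."""
    return [[("remove_lines", i, i)
             for i, line in enumerate(chunk.split("\n")) if "subscribe" in line.lower()]
            for chunk in batch]


def toy_apply(chunk: str, program: Program) -> str:
    """Stand-in executor that only understands remove_lines."""
    dropped = {i for _, start, end in program for i in range(start, end + 1)}
    return "\n".join(line for i, line in enumerate(chunk.split("\n")) if i not in dropped)


if __name__ == "__main__":
    docs = ["Keep this.\nSubscribe now!\nKeep that."]
    print(list(refine_corpus(docs, toy_generate, toy_apply)))
```

Passing the generator and executor as callables keeps the loop independent of the serving stack, so the same driver could wrap a real batched model call without changes.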
5. Experimental Protocol and Comparative Evaluation
a. Datasets and Baseline Methods
- Base corpus: RedPajama-V2 (400B tokens pre-filter), subsampled/refined to 20B tokens per method.
- Methods:
  - Raw (unfiltered)
  - Rule-based (Gopher, C4, FineWeb, Comb)
  - LLM-based filter (ProX-D)
  - Programmatic refinement (ProX-C)
  - RefineX
b. Pretraining Details
- LLM architectures: Llama-2 family at 350M and 750M scales.
- Training regime: 10k steps, batch 1,024×2,048 tokens, total 20B tokens.
- Hyperparameters: cosine-decay learning-rate schedule, AdamW, weight decay 0.1, 500-step warmup.
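The stated token budget can be checked with a line of arithmetic. The sketch assumes that "1,024×2,048" means 1,024 sequences of 2,048 tokens per step, which the text does not spell out.

```python
steps = 10_000
sequences_per_batch = 1_024   # assumed meaning of the first factor
tokens_per_sequence = 2_048   # assumed meaning of the second factor

total_tokens = steps * sequences_per_batch * tokens_per_sequence
print(f"{total_tokens:,} tokens (~{total_tokens / 1e9:.1f}B)")
```

Under that reading the run consumes about 21B tokens, so the "20B" figure is a rounded value.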
c. Evaluation Protocol
- LightEval suite: ARC-C, ARC-E, CSQA, HellaSwag, MMLU (57 sub-tasks), OBQA, PIQA, SIQA, WinoGrande, SciQ.
- Each accuracy measured zero-shot on 1,000 samples/task.
6. Quantitative Profile
Table: Downstream Accuracy (750M LLM, 20B Tokens)
| Dataset Filter | Avg. Accuracy | Δ vs. Raw |
|---|---|---|
| Raw | 41.6% | – |
| ProX-C | 42.4% | +0.8% |
| RefineX | 42.9% | +1.3% |
Further, when refinement is applied on top of filtered baselines:
- Comb: 42.1%, +ProX-C 42.4%, +RefineX 42.8%.
- ProX-D: 42.0%, +ProX-C 43.5%, +RefineX 44.7%.
Token efficiency: models pretrained on 10B RefineX-refined tokens match—or slightly exceed—the performance of 20B Comb-filtered tokens.
Instance-level quality (DataMan evaluation): On the score=3 sub-corpus, RefineX improves 41.2% of documents and degrades 4.58%, with an average post-refinement score of 3.45. E2E refinement improves 59.0% and degrades 3.39%, but carries a higher risk of over-editing.
Over-editing risk: The number of new words per 1,000 refined tokens on low-quality (score=2) text is 15.06 for E2E, 0.17 for ProX-C, and 0.00 for RefineX, indicating that the deletion-only constraint is effective in limiting artifact introduction.
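One way such an over-editing rate could be computed is sketched below. The exact definition used in the paper is not given here, so the word tokenization (lowercased `\w+` runs), the multiset comparison against the raw text, and the function name are all assumptions.

```python
import re
from collections import Counter


def new_words_per_1k(raw: str, refined: str) -> float:
    """Words in `refined` that are absent from `raw`, per 1,000 refined words."""
    def tokenize(text: str) -> list[str]:
        return re.findall(r"\w+", text.lower())

    refined_words = tokenize(refined)
    if not refined_words:
        return 0.0
    # Counter subtraction keeps only positive counts, i.e. words the refiner added.
    added = Counter(refined_words) - Counter(tokenize(raw))
    return 1000 * sum(added.values()) / len(refined_words)


if __name__ == "__main__":
    print(new_words_per_1k("the cat sat on the mat", "the cat sat"))  # deletion only
    print(new_words_per_1k("the cat sat on the mat", "the dog sat"))  # one new word
```

Any deletion-only refinement scores zero on this measure by construction, which is consistent with the 0.00 reported for RefineX.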
7. Comparative Merits, Limitations, and Outlook
Reliability: The staged distillation system—expert E2E to minimal deletion program—safeguards against model hallucinations and style drift. Supervision remains highly aligned with expert instructions, benefiting reproducibility and generalization.
Precision: Deletion-only edit programs serve as a conservative, reliable filter, minimizing spurious modifications and preserving original linguistic and stylistic features.
Efficiency: Both the DSL program (few tokens per instance) and a low-latency backbone reduce compute and memory overhead during inference.
Coverage: Character- and line-granularity permits more surgical, targeted curation, surpassing document-level and coarse programmatic approaches in fine control.
Limitations: The pipeline requires expensive large-scale expert LLM inference (Qwen2.5-72B), incurring substantial GPU-hour cost (~12k GPU-hrs). The scope is presently limited to deletions; carefully introduced substitution or metadata edits could further enhance quality but may risk style drift or information loss.
This suggests a future extension could incorporate proprietary expert annotators (e.g., GPT-4) for the distillation phase, and revisit carefully bounded non-deletion edit operations to balance data quality with distribution fidelity (Bi et al., 4 Jul 2025).