RefineX: Scalable Data Refinement for LLMs

Updated 3 July 2026

RefineX is a programmatic data refinement framework that systematically removes low-quality text spans using efficient, character-level deletion programs.
It employs an expert-guided distillation pipeline and minimal edit distance techniques to generate precise deletion operations while retaining natural language quality.
Scalable across massive corpora, RefineX enhances LLM performance by ensuring minimal artifact introduction and preserving inherent linguistic features.

RefineX is a programmatic data refinement framework for large-scale pre-training corpora, designed to systematically improve the quality of data prior to training LLMs. Unlike traditional rule-based document-level filtering or neural generative refinement, RefineX implements efficient, fine-grained, character-level deletion programs distilled from expert-guided end-to-end edits. This methodology enables scalable, precise, and reliable removal of irrelevant or low-quality spans, while preserving the diversity and naturalness of the pre-training distribution and minimizing the introduction of artifacts.

1. Formalization and Scope

Let $X = \{t_1, \dots, t_N\}$ be a raw text corpus, where each $t_i$ is a character sequence. The objective is to construct a refinement operator $R$ such that, when applied instance-wise ($D_R = R(X)$), the downstream performance $\mathrm{Perf}(\theta; T)$ of an LLM with parameters $\theta$, trained only on $D_R$ under a fixed compute budget, is maximized over a task set $T$. The search for $R$ is restricted to character-level, deletion-only actions:

$$R^* = \operatorname*{arg\,max}_{R \in \mathcal{R}} \; \max_{\theta:\, \mathrm{Pretrain}(\theta; R(X))} \mathrm{Perf}(\theta; T)$$

The admissible set $\mathcal{R}$ comprises only deletion programs, excluding insertions and substitutions, to ensure operational minimality and avoid style drift or information “hallucination” (Bi et al., 4 Jul 2025).
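One consequence of the deletion-only restriction is that every refined text must be a character subsequence of its source. The following is a minimal sketch of that check; the function name `is_deletion_only` is hypothetical and not taken from the paper.

```python
def is_deletion_only(original: str, refined: str) -> bool:
    """Return True if `refined` can be produced from `original` by deleting characters only."""
    remaining = iter(original)
    # `ch in remaining` consumes the iterator up to the first match,
    # so characters of `refined` must appear in `original` in the same order.
    return all(ch in remaining for ch in refined)


if __name__ == "__main__":
    print(is_deletion_only("Hello, world! Click here.", "Hello, world!"))
    print(is_deletion_only("abc", "abd"))
```

A check of this kind could serve as a cheap sanity test on any refinement operator that claims to belong to the admissible set.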

2. Expert-Guided Distillation Pipeline

The refinement process consists of a three-step pipeline:

a. Seed Corpus Sampling

Raw data $X$ is scored by the DataMan method, which assigns a quality level (1–5) to each document.
Approximately 5 million documents are sampled, maintaining the DataMan stratification.
Each long document is split into overlapping 12,000-character chunks to fit the expert LLM input window.
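The chunking step can be sketched as below. The 12,000-character chunk size comes from the text; the overlap of 500 characters and the function name `split_into_chunks` are assumptions made for illustration, since the source does not state the overlap.

```python
def split_into_chunks(text: str, chunk_size: int = 12_000, overlap: int = 500) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters (an assumed value).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the document
        start += step
    return chunks
```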

b. Expert End-to-End Deletion

The Qwen2.5-72B-Instruct LLM is prompted to refine each chunk, with explicit instructions for deletion-only editing and retention of on-topic links.
Decoding uses top-$k$ and top-$p$ sampling.
Outputs include a refined version of each input chunk and an enumerated “modification_reason”.
The resulting (raw, refined) pairs form a high-quality but expensive and potentially over-edited demonstration set.

c. Minimal Edit Distillation

For each (raw, refined) pair, a minimal edit script is computed with Levenshtein-style sequence alignment (Python's difflib.SequenceMatcher.get_opcodes), yielding a sequence of deletions, insertions, and replacements that transforms the raw text into the refined text.
All non-deletion ops are dropped, and only the deletions (each a contiguous span) are retained.
Each retained deletion is mapped to a domain-specific language (DSL) statement:
1. remove_lines(start_line, end_line)
2. remove_str(line, del_str)
3. keep_all() (no modification)
Filtering removes: (i) cases with insertion/replace spans longer than 20 chars, (ii) trivial examples (fewer than 10 chars deleted).
The finalized distillation set comprises about two million high-fidelity (text, program) pairs.
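The distillation step can be approximated with the standard library. This is a simplified sketch, not the authors' implementation. It aligns lines first and then characters within lines edited in place. It assumes 1-based, inclusive line numbers and represents DSL statements as tuples. It also skips replace blocks whose line counts differ, which a fuller version would need to align. The name `distill_program` is hypothetical.

```python
from difflib import SequenceMatcher


def distill_program(raw: str, refined: str) -> list[tuple]:
    """Keep only the deletions from the raw->refined diff and express them as DSL tuples."""
    raw_lines, ref_lines = raw.split("\n"), refined.split("\n")
    program = []
    line_ops = SequenceMatcher(None, raw_lines, ref_lines, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in line_ops:
        if tag == "delete":
            # Whole lines disappeared: one remove_lines statement for the span.
            program.append(("remove_lines", i1 + 1, i2))
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Lines edited in place: look for deleted substrings inside each line.
            for k in range(i2 - i1):
                old, new = raw_lines[i1 + k], ref_lines[j1 + k]
                char_ops = SequenceMatcher(None, old, new, autojunk=False).get_opcodes()
                for ctag, a1, a2, _, _ in char_ops:
                    if ctag == "delete":
                        program.append(("remove_str", i1 + k + 1, old[a1:a2]))
                    # character-level "insert"/"replace" spans are dropped
        # line-level "insert" ops and uneven "replace" blocks are dropped
    return program or [("keep_all",)]


raw_doc = "\n".join([
    "Home | Login | Share",
    "Transformers rely on self-attention.",
    "Click here to subscribe! The model has 24 layers.",
    "Training used 20B tokens.",
    "Copyright 2024 Example Corp",
])
refined_doc = "\n".join([
    "Transformers rely on self-attention.",
    "The model has 24 layers.",
    "Training used 20B tokens.",
])
example_program = distill_program(raw_doc, refined_doc)
```

Dropping every non-deletion opcode is what makes the distilled supervision strictly deletion-only, even when the expert model rewrote or inserted text.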

3. Model Architecture and Training

The program generator is a decoder-only Transformer initialized as Qwen-3-Base (0.6B params), featuring:

24 layers, hidden size 1,536, 24 self-attention heads.
Vocabulary: 32k BPE.
Context window: up to 12k characters (approx. 2k tokens).

The model input is a text chunk, and the output is a linearized DSL sequence composed of remove_lines, remove_str, and keep_all statements.

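Training objective is cross-entropy over the gold program tokens: $\mathcal{L}(\phi) = -\sum_{k=1}^{|y|} \log p_\phi(y_k \mid y_{<k}, t)$, where $t$ is the input chunk, $y$ is the gold program, and $\phi$ denotes the generator parameters. Regularization is achieved by layer dropout of 0.1, an output length cap of 512, and a program length penalty. Scaling to 1.7B, 4B, and 8B models offered marginal improvements at higher latency.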

4. Corpus-Scale Inference and Guarantees

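Corpus-wide refinement follows a simple loop: each document is split into chunks, the program generator emits a deletion program for each chunk, the program is executed on that chunk, and the refined chunks are merged to form $D_R$.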

This process is linear in total character count and per-chunk inference time. Large batch sizes (~8 chunks/GPU) are accommodated with vLLM for throughput, and overlapping chunks prevent context loss.
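A minimal sketch of such corpus-scale refinement is shown below, under several assumptions: DSL statements are tuples with 1-based, inclusive line numbers; chunks are non-overlapping (a simplification, since the source uses overlapping chunks but does not describe how they are merged); and `generate_program` is a hypothetical stand-in for the trained program generator, here replaced by a toy rule.

```python
from typing import Callable

Program = list[tuple]


def execute_program(chunk: str, program: Program) -> str:
    """Apply a deletion-only DSL program to one chunk."""
    lines = chunk.split("\n")
    dropped = set()
    for stmt in program:
        op = stmt[0]
        if op == "keep_all":
            continue
        if op == "remove_lines":
            _, start, end = stmt
            dropped.update(range(max(start - 1, 0), min(end, len(lines))))
        elif op == "remove_str":
            _, line_no, del_str = stmt
            idx = line_no - 1
            if 0 <= idx < len(lines):
                # Remove the first occurrence only; a missing string is a no-op.
                lines[idx] = lines[idx].replace(del_str, "", 1)
        else:
            raise ValueError(f"unknown DSL op: {op!r}")
    return "\n".join(line for i, line in enumerate(lines) if i not in dropped)


def refine_corpus(docs: list[str],
                  generate_program: Callable[[str], Program],
                  chunk_size: int = 12_000) -> list[str]:
    """Refine each document chunk by chunk; cost grows linearly with total characters."""
    refined = []
    for doc in docs:
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)] or [""]
        refined.append("".join(execute_program(c, generate_program(c)) for c in chunks))
    return refined


def toy_generator(chunk: str) -> Program:
    """Toy stand-in for the model: drop lines that mention 'subscribe'."""
    program = [("remove_lines", n, n)
               for n, line in enumerate(chunk.split("\n"), start=1)
               if "subscribe" in line.lower()]
    return program or [("keep_all",)]


corpus = ["Intro line\nSubscribe to our newsletter\nReal content", "Clean text"]
refined_corpus = refine_corpus(corpus, toy_generator)
```

Because the executor can only remove lines or substrings, a malformed or mistaken program can at worst delete too much; it cannot introduce new text.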

Because only deletions are applied, every character in the refined output comes from the source text, which helps preserve the token and n-gram statistics and the natural distribution of the retained data.

5. Experimental Protocol and Comparative Evaluation

a. Datasets and Baseline Methods

Base corpus: RedPajama-V2 (400B tokens pre-filter), subsampled/refined to 20B tokens per method.
Methods:
1. Raw (unfiltered)
2. Rule-based (Gopher, C4, FineWeb, Comb)
3. LLM-based filter (ProX-D)
4. Programmatic refine (ProX-C)
5. RefineX

b. Pretraining Details

LLM architectures: Llama-2 family at 350M and 750M scales.
Training regime: 10k steps, batch 1,024×2,048 tokens, total 20B tokens.
Hyperparameters: AdamW with cosine learning-rate decay, weight decay 0.1, 500-step warmup.
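As a quick check of the stated token budget, the arithmetic below assumes that "batch 1,024×2,048 tokens" means 1,024 sequences of 2,048 tokens per optimizer step; the variable names are illustrative.

```python
steps = 10_000
sequences_per_batch = 1_024
tokens_per_sequence = 2_048

tokens_per_step = sequences_per_batch * tokens_per_sequence  # about 2.1M tokens
total_tokens = steps * tokens_per_step

print(f"{total_tokens / 1e9:.2f}B tokens")
```

Under that reading the run consumes roughly 21B tokens, consistent with the reported budget of about 20B.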

c. Evaluation Protocol

LightEval suite: ARC-C, ARC-E, CSQA, HellaSwag, MMLU (57 sub-tasks), OBQA, PIQA, SIQA, WinoGrande, SciQ.
Each accuracy measured zero-shot on 1,000 samples/task.

6. Quantitative Profile

Table: Downstream Accuracy (750M LLM, 20B Tokens)

Dataset Filter	Avg. Accuracy	Δ vs. Raw
Raw	41.6%	–
ProX-C	42.4%	+0.8%
RefineX	42.9%	+1.3%

Further, when stacked on top of other filters:

Comb: 42.1%, +ProX-C 42.4%, +RefineX 42.8%.
ProX-D: 42.0%, +ProX-C 43.5%, +RefineX 44.7%.

Token efficiency: models pretrained on 10B RefineX-refined tokens match, or slightly exceed, the performance of models pretrained on 20B Comb-filtered tokens.

Instance-level quality (DataMan evaluation): On the score=3 sub-corpus, RefineX improves 41.2% of documents and degrades 4.58%, with an average post-refinement score of 3.45. E2E refinement improves 59.0% and degrades 3.39%, but carries a higher risk of over-editing.

Over-editing risk: Number of new words per 1,000 refined tokens — E2E 15.06, ProX-C 0.17, RefineX 0.00 for low-quality (score=2) spans, demonstrating the deletion-only constraint is effective in limiting artifact introduction.
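The over-editing metric can be illustrated with a small sketch. The exact definition used in the paper is not given here, so this version assumes whitespace tokenization and counts a refined token as "new" when it does not occur anywhere in the original; `new_words_per_1k` is a hypothetical name.

```python
def new_words_per_1k(original: str, refined: str) -> float:
    """Number of refined tokens absent from the original, per 1,000 refined tokens."""
    original_vocab = set(original.split())
    refined_tokens = refined.split()
    if not refined_tokens:
        return 0.0
    new_tokens = sum(1 for tok in refined_tokens if tok not in original_vocab)
    return 1000 * new_tokens / len(refined_tokens)


source_text = "the cat sat on the mat click here"
deletion_only = "the cat sat on the mat"      # only removes text
rewritten = "the feline sat on the mat"       # introduces a new word
```

A deletion-only refinement that removes whole words scores zero on this measure by construction, while a generative rewrite can score above zero.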

7. Comparative Merits, Limitations, and Outlook

Reliability: The staged distillation system—expert E2E to minimal deletion program—safeguards against model hallucinations and style drift. Supervision remains highly aligned with expert instructions, benefiting reproducibility and generalization.

Precision: Deletion-only edit programs serve as a conservative, reliable filter, minimizing spurious modifications and preserving original linguistic and stylistic features.

Efficiency: Both the DSL program (few tokens per instance) and a low-latency backbone reduce compute and memory overhead during inference.

Coverage: Character- and line-granularity permits more surgical, targeted curation, surpassing document-level and coarse programmatic approaches in fine control.

Limitations: The pipeline requires expensive large-scale expert LLM inference (Qwen2.5-72B), incurring substantial GPU-hour cost (~12k GPU-hrs). The scope is presently limited to deletions; carefully introduced substitution or metadata edits could further enhance quality but may risk style drift or information loss.

This suggests a future extension could incorporate proprietary expert annotators (e.g., GPT-4) for the distillation phase, and revisit carefully bounded non-deletion edit operations to balance data quality with distribution fidelity (Bi et al., 4 Jul 2025).

References (1)

RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs (2025)
