RefineX: Scalable Data Refinement for LLMs

Updated 3 July 2026
  • RefineX is a programmatic data refinement framework that systematically removes low-quality text spans using efficient, character-level deletion programs.
  • It employs an expert-guided distillation pipeline and minimal edit distance techniques to generate precise deletion operations while retaining natural language quality.
  • Scalable across massive corpora, RefineX enhances LLM performance by ensuring minimal artifact introduction and preserving inherent linguistic features.

RefineX is a programmatic data refinement framework for large-scale pre-training corpora, designed to systematically improve the quality of data prior to training LLMs. Unlike traditional rule-based document-level filtering or neural generative refinement, RefineX implements efficient, fine-grained, character-level deletion programs distilled from expert-guided end-to-end edits. This methodology enables scalable, precise, and reliable removal of irrelevant or low-quality spans, while preserving the diversity and naturalness of the pre-training distribution and minimizing the introduction of artifacts.

1. Formalization and Scope

Let $X = \{t_1, \dots, t_N\}$ be a raw text corpus, where each $t_i$ is a character sequence. The objective is to construct a refinement operator $R$ such that, when applied instance-wise ($D_R = R(X)$), the downstream performance $\mathrm{Perf}(\theta; T)$ of an LLM with parameters $\theta$, trained only on $D_R$ under a fixed compute budget, is maximized over a task set $T$. The search for $R$ is restricted to character-level, deletion-only actions:

$$R^* = \arg\max_{R \in \mathcal{R}} \; \max_{\theta:\, \mathrm{Pretrain}(\theta; R(X))} \mathrm{Perf}(\theta; T)$$

The admissible set $\mathcal{R}$ comprises only deletion programs, excluding insertions and substitutions, to ensure operational minimality and avoid style drift or information “hallucination” (Bi et al., 4 Jul 2025).
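One consequence of the deletion-only restriction is that every refined text must be a character subsequence of its source. The following is a minimal Python sketch of that check; the function name `is_deletion_only` is hypothetical and is not taken from the RefineX code.

```python
def is_deletion_only(original: str, refined: str) -> bool:
    """Return True if `refined` can be produced from `original` by deleting characters only."""
    remaining = iter(original)
    # `ch in remaining` advances the iterator until it finds `ch`, so the
    # characters of `refined` must appear in `original` in the same order.
    return all(ch in remaining for ch in refined)
```

For example, `is_deletion_only("hello world", "hello")` holds, while any output containing an inserted or reordered character fails the check.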

2. Expert-Guided Distillation Pipeline

The refinement process is a two-stage pipeline (expert end-to-end deletion followed by minimal edit distillation), preceded by seed corpus sampling:

a. Seed Corpus Sampling

  • Raw data $X$ is scored by the DataMan method, assigning a quality level (1–5) to each document.
  • Approximately 5 million documents are sampled, maintaining the DataMan stratification.
  • Each long document is split into overlapping 12,000-character chunks to fit the expert LLM input window.
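The chunking step can be sketched as below. The 12,000-character chunk size comes from the text; the overlap size is not stated, so the `overlap=500` default and the function name `split_into_chunks` are assumptions made purely for illustration.

```python
def split_into_chunks(text: str, chunk_size: int = 12_000, overlap: int = 500) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if len(text) <= chunk_size:
        return [text]  # short documents are passed through as a single chunk
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window reached the end of the document
        start += step
    return chunks
```

With `chunk_size=4` and `overlap=1`, the string `"abcdefghij"` is split into `"abcd"`, `"defg"`, and `"ghij"`, so each chunk repeats the last character of its predecessor.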

b. Expert End-to-End Deletion

  • The Qwen2.5–72B–Instruct LLM is prompted to refine, with explicit instructions for deletion-only editing and retention of on-topic links.
  • Decoding uses top-$k$ and top-$p$ sampling.
  • Outputs include a refined version $t_i'$ of each input $t_i$ and an enumerated “modification_reason”.
  • The resulting $(t_i, t_i')$ pairs form a high-quality but expensive and potentially over-edited demonstration set.

c. Minimal Edit Distillation

  • For each pair $(t_i, t_i')$ of raw and expert-refined text, a minimal edit script is computed (Levenshtein-style alignment via Python's difflib.SequenceMatcher.get_opcodes), yielding a sequence of deletions, insertions, and replacements that transforms $t_i$ into $t_i'$.
  • All non-deletion ops are dropped, and only the deletions (each a contiguous span) are retained.
  • Each retained deletion is mapped to a domain-specific language (DSL) statement:

    1. remove_lines(start_line, end_line)
    2. remove_str(line, del_str)
    3. keep_all() (no modification)
  • Filtering removes: (i) cases whose insertion/replace spans exceed the 20-character threshold, (ii) trivial examples (deletions below the 10-character threshold).
  • The finalized distillation set comprises about two million high-fidelity (text, deletion program) pairs.
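The distillation step can be illustrated with Python's standard `difflib`. This is a sketch under stated assumptions, not the authors' implementation: the names `deletion_spans` and `apply_deletions` are hypothetical, the strictness of the 20- and 10-character thresholds is assumed, and mapping character spans onto `remove_lines` / `remove_str` statements is left out.

```python
import difflib

def deletion_spans(raw, refined, max_non_deletion=20, min_deleted=10):
    """Return the (start, end) spans of `raw` deleted by the expert, or None if the pair is filtered out."""
    matcher = difflib.SequenceMatcher(a=raw, b=refined, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            spans.append((i1, i2))  # keep deletions only
        elif tag in ("insert", "replace"):
            # Assumed filter (i): reject pairs with a long non-deletion edit.
            if max(i2 - i1, j2 - j1) > max_non_deletion:
                return None
            # Short insert/replace ops are simply dropped.
    # Assumed filter (ii): reject trivial pairs with too few deleted characters.
    if sum(end - start for start, end in spans) < min_deleted:
        return None
    return spans

def apply_deletions(raw, spans):
    """Remove the given non-overlapping, ordered spans from `raw`."""
    kept, cursor = [], 0
    for start, end in spans:
        kept.append(raw[cursor:start])
        cursor = end
    kept.append(raw[cursor:])
    return "".join(kept)
```

When the expert output differs from the raw text only by deletions, `apply_deletions(raw, deletion_spans(raw, refined))` reproduces the expert output exactly. When short insertions or replacements were dropped, the result stays closer to the raw text than the expert output does.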

3. Model Architecture and Training

The program generator is a decoder-only Transformer initialized as Qwen-3-Base (0.6B params), featuring:

  • 24 layers, hidden size 1,536, 24 self-attention heads.
  • Vocabulary: 32k BPE.
  • Context window: 12k characters (approx. 2k tokens).

The model input is a text chunk; the output is a linearized DSL sequence built from the remove_lines, remove_str, and keep_all statements.
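To make the DSL concrete, here is a hedged sketch of an executor for a linearized program. Several details are assumptions, since the text does not specify them: one statement per line, 0-based line indices, an inclusive `end_line`, and a double-quoted `del_str` without escape handling. The function name `apply_program` is hypothetical.

```python
import re

REMOVE_LINES = re.compile(r'remove_lines\((\d+),\s*(\d+)\)')
REMOVE_STR = re.compile(r'remove_str\((\d+),\s*"(.*)"\)')

def apply_program(text: str, program: str) -> str:
    """Execute a deletion-only program against `text` and return the refined text."""
    lines = text.split("\n")
    dropped = set()  # indices of lines removed entirely
    for stmt in program.strip().splitlines():
        stmt = stmt.strip()
        if not stmt or stmt == "keep_all()":
            continue  # no modification requested
        match = REMOVE_LINES.fullmatch(stmt)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            dropped.update(range(start, end + 1))  # assumed inclusive range
            continue
        match = REMOVE_STR.fullmatch(stmt)
        if match:
            index, del_str = int(match.group(1)), match.group(2)
            if 0 <= index < len(lines):
                # Delete only the first occurrence of the string on that line.
                lines[index] = lines[index].replace(del_str, "", 1)
            continue
        raise ValueError(f"unrecognized statement: {stmt!r}")
    return "\n".join(line for i, line in enumerate(lines) if i not in dropped)
```

Because the executor can only drop lines or delete substrings, any program it accepts yields text made entirely of characters already present in the input.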

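The training objective is token-level cross-entropy over the gold program tokens $y_1, \dots, y_m$ given the input text $x$, with generator parameters $\phi$:

$$\mathcal{L}(\phi) = -\sum_{j=1}^{m} \log p_\phi(y_j \mid y_{<j}, x)$$

Regularization is applied through a layer dropout of 0.1, an output length cap of 512, and a program length penalty. Scaling to 1.7B, 4B, and 8B models offered marginal improvements at the cost of higher latency.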

4. Corpus-Scale Inference and Guarantees

Corpus-wide refinement applies the program generator to every chunk of every document and executes the predicted deletion program on that chunk, producing $D_R = R(X)$.

This process is linear in total character count and per-chunk inference time. Large batch sizes (~8 chunks/GPU) are accommodated with vLLM for throughput, and overlapping chunks prevent context loss.
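A corpus-scale loop of this kind might look like the sketch below. It is illustrative only: `generate_programs` and `apply_program` are hypothetical callables standing in for the batched model call and the DSL executor, and the sketch uses disjoint chunks because the text does not describe how overlapping chunks are merged.

```python
from typing import Callable, Iterable, Iterator, List

def refine_corpus(
    docs: Iterable[str],
    generate_programs: Callable[[List[str]], List[str]],
    apply_program: Callable[[str, str], str],
    chunk_size: int = 12_000,
    batch_size: int = 8,
) -> Iterator[str]:
    """Yield one refined document per input document."""
    for doc in docs:
        # Disjoint chunks (assumption); an empty document becomes one empty chunk.
        chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)] or [""]
        refined = []
        for b in range(0, len(chunks), batch_size):
            batch = chunks[b:b + batch_size]
            programs = generate_programs(batch)  # one program per chunk
            refined.extend(apply_program(c, p) for c, p in zip(batch, programs))
        yield "".join(refined)
```

Each chunk is visited once and each program is executed once, so the work grows linearly with the total number of characters, matching the cost profile described above.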

Only deletion operations are introduced, preserving the original token and n-gram statistics, natural distribution, and multimodal evidence in the data.

5. Experimental Protocol and Comparative Evaluation

a. Datasets and Baseline Methods

  • Base corpus: RedPajama-V2 (400B tokens pre-filter), subsampled/refined to 20B tokens per method.
  • Methods:

    1. Raw (unfiltered)
    2. Rule-based (Gopher, C4, FineWeb, Comb)
    3. LLM-based filter (ProX-D)
    4. Programmatic refine (ProX-C)
    5. RefineX

b. Pretraining Details

  • LLM architectures: Llama-2 family at 350M and 750M scales.

  • Training regime: 10k steps, batch 1,024×2,048 tokens, total 20B tokens.
  • Hyperparameters: cosine-decay learning-rate schedule, AdamW, weight decay 0.1, 500-step warmup.
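The token budget can be checked with simple arithmetic, assuming that "1,024×2,048" means 1,024 sequences of 2,048 tokens per step (the variable names are illustrative):

```python
steps = 10_000
sequences_per_batch = 1_024
tokens_per_sequence = 2_048

tokens_per_step = sequences_per_batch * tokens_per_sequence  # 2,097,152 tokens
total_tokens = steps * tokens_per_step                       # 20,971,520,000 tokens

print(f"{total_tokens / 1e9:.2f}B tokens")
```

Under that reading the run consumes roughly 21B tokens, consistent with the stated total of about 20B.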

c. Evaluation Protocol

  • LightEval suite: ARC-C, ARC-E, CSQA, HellaSwag, MMLU (57 sub-tasks), OBQA, PIQA, SIQA, WinoGrande, SciQ.
  • Each accuracy measured zero-shot on 1,000 samples/task.

6. Quantitative Profile

Table: Downstream Accuracy (750M LLM, 20B Tokens)

Dataset Filter | Avg. Accuracy | Δ vs. Raw
Raw | 41.6% | n/a
ProX-C | 42.4% | +0.8%
RefineX | 42.9% | +1.3%

Refinement also stacks with document-level filtering:

  • Comb: 42.1% alone, 42.4% with ProX-C, 42.8% with RefineX.
  • ProX-D: 42.0% alone, 43.5% with ProX-C, 44.7% with RefineX.

Token efficiency: models pretrained on 10B RefineX-refined tokens match—or slightly exceed—the performance of 20B Comb-filtered tokens.

Instance-level quality (DataMan evaluation): On the score=3 sub-corpus, RefineX improves 41.2% of documents and degrades 4.58%, with an average post-refinement score of 3.45. E2E refinement improves 59.0% and degrades 3.39%, but carries a higher risk of over-editing.

Over-editing risk: The number of new words per 1,000 refined tokens on low-quality (score=2) spans is 15.06 for E2E, 0.17 for ProX-C, and 0.00 for RefineX, showing that the deletion-only constraint is effective in limiting artifact introduction.
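A metric of this kind could be computed as in the sketch below. The exact definition used in the evaluation is not given in the text, so this version makes two assumptions: tokens are whitespace-separated words, and a word counts as "new" when it occurs more often in the refined text than in the original. The function name `new_words_per_1000` is hypothetical.

```python
from collections import Counter

def new_words_per_1000(original: str, refined: str) -> float:
    """Count refined words not accounted for by the original, per 1,000 refined tokens."""
    refined_tokens = refined.split()
    if not refined_tokens:
        return 0.0
    remaining = Counter(original.split())  # words still available from the source
    new_words = 0
    for token in refined_tokens:
        if remaining[token] > 0:
            remaining[token] -= 1  # this occurrence is explained by the source
        else:
            new_words += 1  # this word was introduced by the refinement
    return 1000.0 * new_words / len(refined_tokens)
```

Deleting whole words can never raise this count, whereas a generative rewrite that paraphrases the input raises it quickly.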

7. Comparative Merits, Limitations, and Outlook

Reliability: The staged distillation system—expert E2E to minimal deletion program—safeguards against model hallucinations and style drift. Supervision remains highly aligned with expert instructions, benefiting reproducibility and generalization.

Precision: Deletion-only edit programs serve as a conservative, reliable filter, minimizing spurious modifications and preserving original linguistic and stylistic features.

Efficiency: Both the DSL program (few tokens per instance) and a low-latency backbone reduce compute and memory overhead during inference.

Coverage: Character- and line-granularity permits more surgical, targeted curation, surpassing document-level and coarse programmatic approaches in fine control.

Limitations: The pipeline requires expensive large-scale expert LLM inference (Qwen2.5–72B), incurring substantial GPU-hour cost (~12k GPU-hrs). The scope is presently limited to deletions; carefully introduced substitution or metadata edits could further enhance quality but may risk style drift or information loss.

This suggests a future extension could incorporate proprietary expert annotators (e.g., GPT-4) for the distillation phase, and revisit carefully bounded non-deletion edit operations to balance data quality with distribution fidelity (Bi et al., 4 Jul 2025).

References (1)
