FinePhrase Dataset
- FinePhrase is a synthetic dataset that converts web documents into educational formats including math word problems, FAQs, markdown tables, and tutorials.
- It uses pedagogical structured prompts and instruction-tuned models to generate 1.35 billion example–completion pairs, using roughly 24× less generation compute than the REWIRE synthetic corpus.
- Empirical evaluations show that models pretrained with FinePhrase outperform the DCLM and Nemotron-HQ-Synth baselines on a macro-average of natural language understanding and reasoning benchmarks.
FinePhrase is a 486-billion-token synthetic dataset specifically designed for large-scale LLM pretraining, generated through systematic rephrasing of high-quality web documents into pedagogically structured formats. Developed through controlled analyses of prompt design, generator architecture, and source data selection, FinePhrase offers substantial improvements in downstream model performance while achieving significant reductions in token-generation computational cost. The dataset is openly released for research purposes and is hosted on Hugging Face Datasets (Niklaus et al., 15 Apr 2026).
1. Dataset Scale and Structure
FinePhrase comprises 1.35 billion example–completion pairs, amounting to 486 billion tokens. The corpus is partitioned into four roughly equal subsets, each corresponding to a pedagogically motivated rephrasing style:
- FinePhrase-Math: Word problems constructed from source text, each paired with a detailed step-by-step solution.
- FinePhrase-FAQ: Comprehensive, self-contained Q&A formats derived from document content.
- FinePhrase-Table: Markdown tables summarizing essential document facts, followed by a question-answer pair explicitly focused on table content.
- FinePhrase-Tutorial: Numbered or bulleted lists providing step-by-step instructional content.
All content is derived by programmatically reformatting documents sampled from FineWeb-HQ, a high-quality filtered web crawl. Each subset contains approximately 25% of the total dataset. FinePhrase is predominantly used as a synthetic component in mixed pretraining, typically concatenated or interleaved 1:1 with original web tokens.
| Subset Name | Format Description | Proportion of Dataset |
|---|---|---|
| FinePhrase-Math | Word problems + stepwise solutions | ~25% |
| FinePhrase-FAQ | Self-contained Q&A | ~25% |
| FinePhrase-Table | Markdown tables + 1 QA pair | ~25% |
| FinePhrase-Tutorial | Numbered or bulleted steps | ~25% |
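The reported totals imply an average pair length and a per-subset token budget. The following is a back-of-the-envelope check only; it assumes the four subsets are exactly equal in size (the text says only "approximately 25%") and that the 486B figure covers every token in a pair, which the source does not spell out.

```python
# Back-of-the-envelope figures derived from the reported totals.
TOTAL_TOKENS = 486e9   # reported corpus size in tokens
TOTAL_PAIRS = 1.35e9   # reported number of example-completion pairs
NUM_SUBSETS = 4        # Math, FAQ, Table, Tutorial

# Assumption: the subsets are exactly equal; the source says "approximately 25%".
avg_tokens_per_pair = TOTAL_TOKENS / TOTAL_PAIRS
tokens_per_subset = TOTAL_TOKENS / NUM_SUBSETS

print(f"average tokens per pair: {avg_tokens_per_pair:.0f}")
print(f"tokens per subset: {tokens_per_subset / 1e9:.1f}B")
```

Under these assumptions, a pair averages about 360 tokens and each subset holds about 121.5B tokens.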
2. Data Generation Pipeline
FinePhrase employs "pedagogical structured prompts" to induce LLMs to rephrase web documents into distinct instructional formats, maximizing educational utility and content diversity:
- Math problems are formulated from text-inferred relationships or numerical data, outputting both the problem and detailed LaTeX-style solution.
- FAQ prompts direct the model to extract, order, and answer key questions.
- Table prompts instruct the generation of structured summaries in Markdown followed by a QA pair.
- Tutorial prompts yield procedural guides in list form.
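To make the four prompt styles concrete, the sketch below shows one way such prompts could be organized in code. The template wording, the `PROMPT_TEMPLATES` name, and the `build_prompt` helper are all hypothetical illustrations; they are not the prompts used to build FinePhrase.

```python
# Hypothetical prompt templates, one per rephrasing style.
# The wording is illustrative only and is NOT taken from the FinePhrase pipeline.
PROMPT_TEMPLATES = {
    "math": (
        "Using the numbers or relationships in the document below, write a math "
        "word problem followed by a detailed step-by-step solution.\n\n"
        "Document:\n{document}"
    ),
    "faq": (
        "Extract the key questions a reader might ask about the document below, "
        "order them logically, and answer each one.\n\nDocument:\n{document}"
    ),
    "table": (
        "Summarize the essential facts of the document below as a Markdown table, "
        "then add one question-answer pair about the table.\n\n"
        "Document:\n{document}"
    ),
    "tutorial": (
        "Rewrite the document below as a step-by-step guide using a numbered or "
        "bulleted list.\n\nDocument:\n{document}"
    ),
}


def build_prompt(style: str, document: str) -> str:
    """Fill the template for `style` with the source document text."""
    if style not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown style: {style!r}")
    return PROMPT_TEMPLATES[style].format(document=document)


if __name__ == "__main__":
    print(build_prompt("table", "Water boils at 100 degrees Celsius at sea level."))
```

Keeping one template per style in a single mapping makes it straightforward to generate all four rephrasings of the same source document and compare them.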
Generation is performed using a suite of instruction-tuned models:
- SmolLM2 (135M, 360M, and 1.7B parameters), which is instruction-tuned with explicit rephrasing capabilities.
- Gemma 3 (270M, 1B, 4B, 12B, 27B).
- Additional architectures for ablation studies (Falcon 3, Qwen 3, Granite 3.1, Llama 3.2, typically ~1B parameters).
Empirical findings indicate no consistent downstream performance gain for rephrasing with models beyond ~1B parameters.
The final dataset was generated with SmolLM2 1.7B using suffix-32 speculative decoding across 100 H100 GPUs, reaching ≈9,200 tokens/sec per GPU (≈33.1M tokens/GPU-hour) for a total generation compute of ≈14,700 GPU-hours (≈612 GPU-days).
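The throughput and compute figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the wall-clock estimate additionally assumes that all 100 GPUs run concurrently at the quoted rate, which the source does not state.

```python
# Cross-check of the reported generation figures.
TOKENS_PER_SEC_PER_GPU = 9_200   # reported per-GPU throughput
TOTAL_TOKENS = 486e9             # reported corpus size in tokens
NUM_GPUS = 100                   # reported number of H100 GPUs

tokens_per_gpu_hour = TOKENS_PER_SEC_PER_GPU * 3600
gpu_hours = TOTAL_TOKENS / tokens_per_gpu_hour
gpu_days = gpu_hours / 24

# Assumption: perfect parallelism across all GPUs with no idle time.
wall_clock_days = gpu_days / NUM_GPUS

print(f"tokens per GPU-hour: {tokens_per_gpu_hour / 1e6:.2f}M")
print(f"GPU-hours: {gpu_hours:,.0f}")
print(f"GPU-days: {gpu_days:,.0f}")
print(f"wall-clock days on {NUM_GPUS} GPUs: {wall_clock_days:.1f}")
```

The results (about 33.1M tokens per GPU-hour and a little under 14,700 GPU-hours) agree with the reported values up to rounding, and imply roughly six days of wall-clock time under the full-parallelism assumption.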
3. Source Data Selection and Mixing Regimen
FinePhrase samples are exclusively derived from FineWeb-HQ (FWHQ) web pages, filtered with the FineWeb-Edu classifier (score ≥ 4 on a 0–5 scale) to maximize instructional value and minimize low-quality or irrelevant content. For model pretraining, synthetic FinePhrase tokens are combined with original web data (DCLM, Cosmopedia, FineWeb-LQ, or FWHQ) in a 50/50 mix. Ablation studies show optimal generalization and task coverage when DCLM and FWHQ are used as mixing partners.
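A 50/50 mix by tokens can be approximated by always drawing the next document from whichever source has contributed fewer tokens so far. The sketch below is one possible way to do this, not the authors' implementation; `mix_by_tokens` is a hypothetical helper, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Iterable, Iterator, Tuple


def mix_by_tokens(
    synthetic: Iterable[str], original: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, document) pairs, keeping the two token totals roughly equal.

    Stops as soon as the source that is due next has no documents left.
    """
    streams = {"synthetic": iter(synthetic), "original": iter(original)}
    tokens = {"synthetic": 0, "original": 0}
    while True:
        # Draw from the source with fewer tokens so far (ties go to "synthetic").
        source = min(tokens, key=tokens.get)
        doc = next(streams[source], None)
        if doc is None:
            return
        # Assumption: whitespace token count as a stand-in for a real tokenizer.
        tokens[source] += len(doc.split())
        yield source, doc


if __name__ == "__main__":
    synthetic_docs = ["a b c d", "e f"]
    original_docs = ["g h", "i j", "k l"]
    for source, doc in mix_by_tokens(synthetic_docs, original_docs):
        print(source, "|", doc)
```

Balancing on token counts rather than document counts matters because rephrased outputs and raw web pages can differ substantially in length.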
Dataset quality is quantitatively monitored by:
- FineWeb-Edu classifier scores for educational content.
- Output-length variance and analysis of repeated patterns to detect "template collapse" or diminished sample diversity.
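The two diversity signals can be approximated with very simple statistics. The sketch below is an illustration under stated assumptions, not the monitoring code used for FinePhrase: `length_cv` and `top_prefix_share` are hypothetical helpers, lengths are counted in whitespace tokens, and "repeated patterns" is reduced to shared opening n-grams.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Sequence


def length_cv(outputs: Sequence[str]) -> float:
    """Coefficient of variation (std / mean) of output lengths in whitespace tokens."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    lengths = [len(o.split()) for o in outputs]
    mu = mean(lengths)
    return pstdev(lengths) / mu if mu else 0.0


def top_prefix_share(outputs: Sequence[str], n: int = 5) -> float:
    """Share of outputs that start with the single most common n-token prefix."""
    if not outputs:
        raise ValueError("outputs must be non-empty")
    prefixes = Counter(" ".join(o.split()[:n]) for o in outputs)
    return prefixes.most_common(1)[0][1] / len(outputs)


if __name__ == "__main__":
    samples = [
        "Step 1: do x now please",
        "Step 1: do x now again",
        "Other text here entirely different words",
    ]
    print(f"length CV: {length_cv(samples):.3f}")
    print(f"top prefix share: {top_prefix_share(samples):.3f}")
```

A very low length variation combined with a high shared-prefix share would be one warning sign that the generator is falling back on a fixed template.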
4. Empirical Evaluation and Comparative Results
The efficacy of FinePhrase is assessed by pretraining a 1.2B Qwen 2 model for 21B tokens on a 50/50 synthetic-original token mixture and evaluating on twelve industry-standard natural language understanding and reasoning benchmarks (e.g., ARC, MMLU, SQuAD v2, GSM8K, HellaSwag). Macro-averaged accuracy across benchmarks is reported.
Key downstream outcomes include:
- FinePhrase-Table: 17.18 macro-average accuracy (vs. DCLM baseline 13.77; vs. Nemotron-HQ-Synth 13.54).
- FinePhrase-Math: 15.31 macro-average accuracy (+1.54 vs. DCLM).
- FinePhrase-FAQ: 14.45; FinePhrase-Tutorial: 14.30.
All variants outperform DCLM and prior synthetic data baselines, with structured prompts conferring notable advantages.
| Synthetic Subset | Macro-Average Accuracy | Δ vs. DCLM |
|---|---|---|
| FinePhrase-Table | 17.18 | +3.41 |
| FinePhrase-Math | 15.31 | +1.54 |
| FinePhrase-FAQ | 14.45 | +0.68 |
| FinePhrase-Tutorial | 14.30 | +0.53 |
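The Δ column follows directly from the reported scores and the DCLM baseline of 13.77. The short sketch below recomputes it; the `macro_average` helper shows the unweighted averaging convention assumed here (every benchmark counts equally), since per-benchmark scores are not listed in this summary.

```python
from statistics import mean
from typing import Mapping

DCLM_BASELINE = 13.77  # reported macro-average accuracy of the DCLM baseline

# Reported macro-average accuracy per synthetic subset.
scores = {
    "FinePhrase-Table": 17.18,
    "FinePhrase-Math": 15.31,
    "FinePhrase-FAQ": 14.45,
    "FinePhrase-Tutorial": 14.30,
}


def macro_average(per_benchmark: Mapping[str, float]) -> float:
    """Unweighted mean over benchmarks (assumed convention: equal weight each)."""
    return mean(per_benchmark.values())


# Difference of each subset from the baseline, rounded as in the table.
deltas = {name: round(score - DCLM_BASELINE, 2) for name, score in scores.items()}

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {delta:+.2f}")
```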
5. Cost Efficiency and Computational Footprint
FinePhrase achieves substantial improvements in data-generation throughput and cost efficiency relative to prior synthetic corpora:
- Token-generation throughput reaches 33.1M tokens/GPU-hour, a ~30× speedup relative to the 1.1M tokens/GPU-hour reported for REWIRE (Llama 3.3 70B on 400B tokens).
- Total compute over 486B tokens is ≈14,700 GPU-hours, yielding a ≈24× reduction compared to REWIRE’s ≈352,000 GPU-hours for comparable volume (Niklaus et al., 15 Apr 2026).
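Both ratios can be reproduced from the figures quoted in this section. The sketch below simply divides the reported numbers; it makes no assumption beyond taking them at face value.

```python
# Reported throughput in tokens per GPU-hour.
FINEPHRASE_TOKENS_PER_GPU_HOUR = 33.1e6
REWIRE_TOKENS_PER_GPU_HOUR = 1.1e6

# Reported total generation compute in GPU-hours.
FINEPHRASE_GPU_HOURS = 14_700
REWIRE_GPU_HOURS = 352_000

throughput_speedup = FINEPHRASE_TOKENS_PER_GPU_HOUR / REWIRE_TOKENS_PER_GPU_HOUR
compute_reduction = REWIRE_GPU_HOURS / FINEPHRASE_GPU_HOURS

print(f"throughput speedup: {throughput_speedup:.1f}x")
print(f"compute reduction: {compute_reduction:.1f}x")
```

The two ratios differ (about 30× versus about 24×) because the first compares per-GPU-hour rates, while the second compares total GPU-hours for corpora of different sizes.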
6. Dataset Access, Licensing, and Integration
FinePhrase is openly released for research purposes via Hugging Face Datasets (https://hf.co/datasets/HuggingFaceFW/finephrase). With the Hugging Face datasets library installed, the dataset can be loaded as follows:
```python
from datasets import load_dataset

# Download (or reuse the local cache of) the FinePhrase dataset from the Hugging Face Hub
dataset = load_dataset("HuggingFaceFW/finephrase")
```
7. Context and Research Significance
FinePhrase establishes a high-efficiency paradigm for synthetic pretraining data, demonstrating that structured pedagogical rephrasings (math, tables, FAQ, tutorials) yield superior performance compared to unstructured or less-format-constrained outputs. Exhaustive ablations reveal the importance of both structured prompts and careful source dataset selection, with negligible benefit from increasing generator model scale beyond ~1B parameters for this domain. This suggests diminishing returns in scaling generator networks for the purposes of pedagogical rephrasing (Niklaus et al., 15 Apr 2026).
The empirical gains and cost-efficiency position FinePhrase as an important resource for the next generation of LLM training pipelines, offering a reproducible and scalable framework for constructing high-quality synthetic corpora.